Deployment Case Study: Tesla Dojo
Tesla’s Dojo is a vertically integrated AI training cluster designed for full self-driving (FSD) and autonomy workloads. Unlike hyperscalers that rely almost exclusively on NVIDIA GPU clusters, Tesla built its own custom D1 chip, training tiles, and cabinets, deploying them at facilities like Cortex (Austin, TX) and Colossus (Memphis, TN). This case study explores Tesla’s architectural choices, vertical integration scope, deployments, and how Dojo differs from conventional GPU-based AI factories.
Tesla is unique among enterprises in pursuing full vertical integration of AI infrastructure — spanning hyperscale training clusters, fleet-wide over-the-air (OTA) model distribution, real-time inference on vehicles and robots, and closed-loop telemetry feedback. While most companies rely on hyperscaler APIs for inference or contract data center providers, Tesla builds and operates its own AI “factories” and inference hardware stack. This makes Tesla both a data center operator and an AI endpoint manufacturer.
Overview
- Training Infrastructure: Cortex (Austin) and Colossus 1 (Memphis, operational), Colossus 2 (in development).
- Distribution: OTA model updates pushed directly to millions of vehicles and emerging humanoid robots.
- On-Device Inference: Tesla FSD Computer (HW3–HW5 / AI5) deployed in cars, Semi, and Optimus robots.
- Feedback Loop: Fleet telemetry and video clips uploaded to clusters for continuous retraining.
- Scale: Billions of training video clips, millions of active inference endpoints.
Dojo Architecture
Element | Specs | Role |
---|---|---|
D1 Chip | 7nm TSMC, 362 TFLOPs (BF16), 354 cores | Custom AI accelerator optimized for matrix ops |
Training Tile | 25 D1 chips, 9 PFLOPs per tile | Building block for cabinets |
Cabinet (Dojo ExaCab) | 12 tiles, ~108 PFLOPs per cabinet | Rack-scale integration with liquid cooling |
ExaPOD | 10 cabinets (120 tiles), ~1.1 ExaFLOPs | Cluster-scale deployment, target scale per site |
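The hierarchy compounds multiplicatively, so the per-level throughput follows directly from the unit counts. A quick sketch of the arithmetic, using Tesla's published BF16 figures (362 TFLOPs per D1 chip, 25 chips per tile, 12 tiles per cabinet, 10 cabinets per ExaPOD):

```python
# Dojo compute hierarchy, using Tesla's published BF16 figures.
D1_TFLOPS = 362          # per-chip BF16 throughput
CHIPS_PER_TILE = 25
TILES_PER_CABINET = 12
CABINETS_PER_EXAPOD = 10

tile_pflops = D1_TFLOPS * CHIPS_PER_TILE / 1_000              # TFLOPs -> PFLOPs
cabinet_pflops = tile_pflops * TILES_PER_CABINET
exapod_eflops = cabinet_pflops * CABINETS_PER_EXAPOD / 1_000  # PFLOPs -> EFLOPs

print(f"tile:    {tile_pflops:.2f} PFLOPs")   # ~9 PFLOPs per training tile
print(f"cabinet: {cabinet_pflops:.1f} PFLOPs")
print(f"ExaPOD:  {exapod_eflops:.2f} EFLOPs") # ~1.1 EFLOPs per ExaPOD
```

This is peak arithmetic, not delivered training throughput; real utilization depends on the fabric and compiler stack discussed below.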
Vertical Integration Scope
- In-House: D1 chip design, training tiles, cabinet-level integration, custom liquid cooling.
- Externally Sourced: Advanced packaging (TSMC), networking switches, facility power and cooling plants, site integration.
- Approach: Blend of custom silicon + commodity infrastructure.
Deployment Pipeline
Stage | Location / Platform | Function | Notes |
---|---|---|---|
Training | Cortex (Austin); Colossus 1 (Memphis, live); Colossus 2 (in dev) | GPU clusters train perception and planning models | Dojo program shut down 2025, pivot to GPU-heavy design |
Model Distribution | Tesla Cloud + Enterprise IT | OTA updates pushed to global fleet | Unique: continuous integration to millions of endpoints |
On-Device Inference | Tesla FSD Computer (HW5/AI5 in cars, Semi, Optimus) | Sub-50 ms perception + control inference | Latency-critical, cannot rely on remote DCs |
Telemetry Feedback | Fleet uploads to Cortex/Colossus | Edge cases & disengagements enrich training data | Billions of real-world scenarios processed |
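The telemetry-feedback stage above hinges on selecting the rare clips worth uploading rather than streaming everything. A minimal sketch of that selection step; all names here (`FleetClip`, `is_edge_case`, `feedback_cycle`) are hypothetical stand-ins, not Tesla APIs:

```python
# Hypothetical sketch of the telemetry-feedback selection step:
# only rare or problematic clips are uploaded to enrich the training set.
from dataclasses import dataclass

@dataclass
class FleetClip:
    vehicle_id: str
    disengagement: bool   # driver took over during this clip
    novelty: float        # 0..1 score from an assumed on-device trigger

def is_edge_case(clip: FleetClip, novelty_threshold: float = 0.8) -> bool:
    """A clip is worth uploading if it caused a disengagement or looks novel."""
    return clip.disengagement or clip.novelty >= novelty_threshold

def feedback_cycle(clips: list[FleetClip]) -> list[FleetClip]:
    """Select the telemetry that feeds back into retraining."""
    return [c for c in clips if is_edge_case(c)]

fleet = [
    FleetClip("veh-001", disengagement=True, novelty=0.2),
    FleetClip("veh-002", disengagement=False, novelty=0.95),
    FleetClip("veh-003", disengagement=False, novelty=0.1),
]
selected = feedback_cycle(fleet)
print(len(selected))  # 2: one disengagement, one high-novelty clip
```

The design point is bandwidth: with millions of vehicles, filtering at the edge is what keeps the upload volume tractable.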
Comparison to GPU Clusters
Dimension | Tesla Dojo | NVIDIA GPU Cluster |
---|---|---|
Compute Unit | D1 chip, 25 per tile | NVIDIA H100 GPU |
Integration | Custom tiles → cabinets → ExaPOD | DGX/HGX servers → racks → pods |
Fabric | Custom mesh, 36 TB/s per tile | InfiniBand NDR, 400/800G Ethernet |
Software | Custom compilers, PyTorch integration | CUDA/cuDNN stack |
Cooling | Custom liquid-cooled cabinets | Air/liquid hybrids, immersion emerging |
Supply Model | Tesla vertical integration | Vendor supply (NVIDIA + OEMs) |
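The fabric row mixes units: Dojo's per-tile figure is quoted in terabytes per second, while InfiniBand NDR ports are quoted in gigabits per second, so the raw numbers are not directly comparable without conversion (and aggregate tile edge bandwidth vs. a single NIC port is itself an apples-to-oranges scope). The unit arithmetic:

```python
# Unit conversion behind the Fabric comparison row.
TILE_TBPS = 36        # TB/s aggregate off-tile bandwidth per Dojo training tile
NDR_PORT_GBPS = 400   # Gb/s per InfiniBand NDR port

tile_gbits = TILE_TBPS * 8 * 1_000            # TB/s -> Gb/s
equivalent_ports = tile_gbits / NDR_PORT_GBPS
print(f"{equivalent_ports:.0f} NDR-port equivalents per tile")  # 720
```

This illustrates why Dojo used a custom mesh: matching that aggregate bandwidth with conventional NICs would take hundreds of ports per tile-sized unit of compute.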
Tesla vs Other Automakers and Tech Firms
Tesla’s closed-loop AI pipeline differs from other automakers and big tech players:
Company | Training Data Centers | OTA Model Updates | On-Device Inference | Telemetry Feedback Loop | Vertical Integration Level |
---|---|---|---|---|---|
Tesla | Cortex (Austin), Colossus 1 & 2 (Memphis) | Yes — continuous OTA updates to fleet | Yes — FSD HW5/AI5 in cars, Semi, Optimus | Yes — video + sensor telemetry uploaded for retraining | Full vertical integration |
Waymo (Alphabet) | Google Cloud / TPU clusters | Yes — but limited fleet scale | Yes — on-vehicle inference (custom compute stack) | Yes — telemetry collected, not at Tesla’s scale | Partial (relies on Google Cloud, not independent DCs) |
Cruise (GM) | Mixed colo + hyperscaler (Azure) | Yes — OTA updates to AV fleet | Yes — in-car compute platforms | Yes — telemetry uploaded to cloud | Moderate (outsourced training infra) |
Apple | Apple DCs (Siri/AI services) | Yes — OTA to iPhones/iPads/Macs | Yes — Apple Neural Engine (on-device) | Yes — telemetry for product improvement | High (consumer ecosystem, not mobility) |
Meta | Meta AI Data Centers (training LLaMA) | Limited OTA to apps, not physical devices | No dedicated hardware (runs on general devices) | Partial — app usage telemetry only | Low/Medium (focus on cloud APIs) |
OpenAI + Microsoft | Azure superclusters (GPU-based) | No OTA fleet; API delivery only | No — inference delivered via cloud APIs | Yes — API usage data feeds back into training | Low (cloud service model, not device integration) |
NVIDIA | Selene + partner DCs (training reference) | No OTA fleet; provides SDKs | Yes — Jetson, DGX inference platforms | Indirect — partners handle telemetry | Partial (chip + software vendor role) |
Key Challenges
- Energy Intensity: Training clusters like Colossus draw hundreds of megawatts, trending toward gigawatt scale, and require advanced cooling.
- Dojo Transition: Loss of in-house training silicon raises dependency on GPU supply chains.
- OTA Risk: Live fleet updates must balance innovation with safety/regulatory compliance.
- Edge Case Explosion: Training requires sifting billions of video clips for rare scenarios.
- Regulatory Scrutiny: FSD inference systems under global safety and compliance review.
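The OTA-risk challenge above is typically managed with staged rollouts: widen fleet exposure only while observed safety metrics stay under a threshold. A sketch under assumed stages and thresholds; this illustrates the pattern, not Tesla's actual release process:

```python
# Hypothetical staged OTA rollout gate: advance through fleet fractions
# only while the disengagement rate in the current stage stays acceptable.
STAGES = [0.01, 0.05, 0.25, 1.00]   # assumed fractions of fleet updated

def stage_passes(disengagements: int, miles: float,
                 max_rate_per_1k_miles: float = 0.5) -> bool:
    """Gate on disengagements per 1,000 miles driven in the current stage."""
    if miles == 0:
        return False
    return disengagements / miles * 1_000 <= max_rate_per_1k_miles

def rollout(stage_stats):
    """Advance stage by stage until a gate fails; return fraction reached."""
    reached = 0.0
    for fraction, (dis, miles) in zip(STAGES, stage_stats):
        if not stage_passes(dis, miles):
            break
        reached = fraction
    return reached

# Third stage shows 0.9 disengagements per 1k miles, so rollout halts at 5%.
stats = [(1, 10_000), (4, 40_000), (90, 100_000), (0, 0)]
print(rollout(stats))  # 0.05
```

Gating on fleet-observed metrics is what lets continuous OTA delivery coexist with the safety and regulatory constraints listed above.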
Strategic Importance
- Closed Loop: Tesla uniquely integrates training, inference, and telemetry feedback.
- Vertical Integration: Control from datacenter silicon to in-car inference hardware.
- Fleet Scale: Millions of devices act as both inference endpoints and data generators.
- First Mover: No other automaker operates dedicated AI training campuses of this scale.
The End of Dojo
In August 2025, Tesla officially disbanded the Dojo project, ending its four-year effort to build a vertically integrated AI training supercomputer. Elon Musk described Dojo as an “evolutionary dead end,” citing multiple factors that made it less viable than expected:
- Chip Development Challenges: The custom D1 silicon, manufactured at TSMC, struggled to keep pace with the rapid cadence of NVIDIA’s GPU releases. Yield and packaging complexity further slowed scale-up.
- Software Ecosystem Limitations: CUDA’s dominance across the AI community limited Dojo’s adoption beyond Tesla. Building and maintaining a separate compiler stack proved costly and isolated.
- Performance Parity: By mid-2025, NVIDIA’s H100/H200 and Blackwell-generation platforms, combined with mature InfiniBand/Ethernet fabrics, offered superior performance-per-watt and software support compared to Dojo tiles and ExaPODs.
- Strategic Shift: Tesla chose to redirect resources into Cortex (its GPU-based cluster in Austin) and to accelerate work on AI5/AI6 inference chips for vehicles and robots. This pivot aligned Tesla’s roadmap with broader industry standards and supplier ecosystems.
- Opportunity Cost: Building a custom supercomputer pulled engineering talent and capital away from vehicle programs, Optimus humanoid robot development, and energy products—areas Tesla prioritized for near-term revenue.
The decision underscored a pragmatic shift: instead of competing head-to-head with NVIDIA in training silicon, Tesla would focus on optimizing inference silicon for its fleets, while relying on GPU-based clusters like Cortex for training. Dojo’s facilities and lessons learned were partially absorbed into Cortex and Colossus deployments, but the custom D1 chip and ExaPOD vision were retired.