AI Training Superclusters
An AI training supercluster is a named, integrated GPU compute fabric purpose-built for training frontier foundation models. Superclusters differ from general-purpose hyperscale data centers in three ways: extreme accelerator density (tens of thousands to hundreds of thousands of GPUs in a single coherent training fabric), specialized interconnect topologies (NVLink Switch domains, InfiniBand or Spectrum-X Ethernet, rail-optimized fabrics), and power and cooling envelopes engineered for sustained high-utilization training workloads rather than diversified cloud serving.
This page covers the named superclusters currently operating or in commissioning. Each entry tracks operator, location, accelerator count and generation, interconnect, power class, and the frontier training programs the cluster has hosted. The companion Frontier Training Runs page covers the events that ran on these clusters - the training jobs themselves with their compute, parameters, and outcomes. Read the two together to map infrastructure to events.
Master ranking
| Supercluster | Operator | Location | Scale | Notable runs hosted |
|---|---|---|---|---|
| Colossus | xAI | Memphis, TN | 100K H100 (built in 122 days), expanded to 200K, public roadmap to 1M+ | Grok 2, Grok 3, Grok 3 Reasoning, Grok 4 |
| Colossus 2 | xAI | Memphis, TN (Phase 2 site) | First gigawatt-class training cluster operational; combined Colossus 1+2 exceed 1M H100-equivalent GPUs; 1.5 GW upgrade roadmap | Grok 5 training underway |
| Stargate (Abilene) | OpenAI / Oracle / SoftBank | Abilene, TX (anchor site, multi-site expansion) | Multi-gigawatt program; first phase operational; expansion sites announced across US | GPT-5, ongoing OpenAI frontier training |
| Microsoft Eagle / Azure AI superclusters | Microsoft | Multiple Azure regions; Mt. Pleasant WI; Quincy WA expansion | Multi-cluster fleet; original GPT-4 training cluster ~25K A100s; Blackwell generation deploying at scale | GPT-4, GPT-4o, GPT-4.5 training (OpenAI partnership era); internal Microsoft AI workloads |
| Meta Hyperion | Meta | Richland Parish, LA | 5 GW+ planned; renewable-first hyperscale AI build | Llama 4 family (Scout, Maverick, Behemoth) |
| Meta Research SuperCluster (RSC) | Meta | Multiple Meta data centers | Originally 16K A100s; expanded with H100 capacity; >100K H100 equivalent across Llama 4 training campaign | Llama 2, Llama 3, Llama 3.1 405B, Llama 4 Behemoth (32K GPUs FP8) |
| Google TPU pods (multi-site) | Google | Council Bluffs IA, The Dalles OR, Hamina FI, multiple regions | TPU v4, v5e, v5p, v6 (Trillium), v7 generations; pod sizes scaling with each generation | Gemini Ultra, Gemini 1.5, Gemini 2.0, Gemini 2.5, internal Google research models |
| Tesla Dojo / Cortex | Tesla | Giga Texas (Austin) | GPU-heavy training capacity; D3/Dojo3 silicon roadmap from Terafab | FSD foundation models; Optimus humanoid models |
| Anthropic training infrastructure (AWS / Google) | Anthropic (on AWS Trainium and Google TPU) | AWS regions (Trainium); Google regions (TPU) | Multi-billion-dollar AWS commitment; Project Rainier with AWS targets 1M+ Trainium2 chips; Google TPU access for parallel training | Claude family (Sonnet, Opus generations through Opus 4.7) |
| CoreWeave AI clusters | CoreWeave | Multiple US data centers | Largest AI neo-cloud; H100, H200, B200 fleets; multi-billion-dollar Microsoft and OpenAI commitments | Microsoft AI workloads, OpenAI training capacity, enterprise AI training tenants |
| Oracle OCI AI clusters | Oracle | Multiple OCI regions; Stargate Abilene partnership | Large H100/H200/B200 deployments; Stargate co-developer | OpenAI training capacity (Stargate Abilene), enterprise tenants |
| Lambda AI clusters | Lambda | Multiple US data centers | GPU cloud focused on AI training and inference | AI startup and enterprise training tenants |
| Crusoe AI clusters | Crusoe Energy | Stranded-energy sites; Stargate Abilene partnership | Energy-integrated AI infrastructure; flared-gas-to-compute origins; Stargate site co-development | OpenAI Stargate workloads; enterprise AI training |
| Nebius AI clusters | Nebius (ex-Yandex spinout) | Europe (Finland anchor) | European AI cloud; NVIDIA partnership; H100 fleet expansion | European AI training tenants; sovereign-aligned workloads |
By interconnect topology
Interconnect choice shapes what training jobs a supercluster can run efficiently. Three NVIDIA architectures dominate at frontier scale: NVLink Switch domains for tightly coupled multi-GPU coherence, InfiniBand for traditional HPC-style high-bandwidth fabrics, and Spectrum-X Ethernet for cost-optimized AI fabrics at extreme scale. Google and AWS run proprietary fabrics (ICI and EFA) alongside these.
| Topology | Where it runs | Distinctive |
|---|---|---|
| NVLink Switch | GB200 NVL72 reference designs; Rubin reference designs; tightly-coupled multi-GPU domains | Memory-coherent across 72 GPUs at rack scale; baseline for new Blackwell and Rubin deployments |
| InfiniBand (Quantum-2 400G, Quantum-X 800G) | Most major H100 superclusters; OpenAI Stargate; Microsoft Azure AI; Meta RSC | Native RDMA, low latency; mature ecosystem at HPC and AI training scale |
| Spectrum-X Ethernet | xAI Colossus (100K-200K H100 fabric); some Microsoft and CoreWeave deployments | Standards-based 800G Ethernet with RDMA; xAI achieves 95% data throughput vs ~60% for standard Ethernet |
| Google ICI (Inter-Chip Interconnect) | Google TPU pods (all generations) | Google-internal proprietary interconnect; tight coupling across thousands of TPU dies |
| AWS EFA (Elastic Fabric Adapter) | AWS Trainium clusters; some H100 capacity on AWS | AWS-internal RDMA-equivalent fabric; pairs with Trainium2 for Anthropic Project Rainier |
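Why fabric bandwidth dominates these designs can be seen with the standard ring all-reduce cost model for data-parallel gradient synchronization. The sketch below is a back-of-envelope illustration; the parameter count, group size, and link rate are assumed example values, not figures from this page.

```python
# Back-of-envelope: per-GPU communication volume for one data-parallel
# gradient all-reduce, using the standard ring all-reduce cost model:
# each GPU sends and receives 2*(N-1)/N times the gradient size.

def ring_allreduce_bytes_per_gpu(grad_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

# Assumed example: 70B parameters, FP16 gradients (2 bytes each),
# one hypothetical data-parallel group of 1024 GPUs.
grad_bytes = 70e9 * 2
traffic = ring_allreduce_bytes_per_gpu(grad_bytes, 1024)
print(f"{traffic / 1e9:.1f} GB per GPU per step")  # ~280 GB

# At an assumed 50 GB/s effective per-GPU fabric rate, this exchange
# alone costs several seconds per step unless overlapped with compute.
print(f"{traffic / 50e9:.2f} s if fully serialized")
```

The 2*(N-1)/N factor is nearly constant in N, so the per-step traffic is set by model size, which is why effective fabric throughput (the 95% vs ~60% Ethernet gap noted above) translates directly into training efficiency.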
By accelerator generation
Most operating frontier-tier superclusters were built around NVIDIA H100. Blackwell (B200, GB200) deployment is now the dominant new-build generation, with Rubin reference designs entering customer hands in 2026. Google operates a parallel TPU track. AWS Trainium and a small number of internal hyperscaler accelerators (Microsoft Maia, Meta MTIA) are also at scale though typically not for the largest frontier training runs.
| Accelerator | Vendor | Where deployed at scale |
|---|---|---|
| A100 | NVIDIA | Original GPT-4 training cluster (Microsoft Azure ~25K A100); Meta RSC; Google legacy AI; widely retired or repurposed |
| H100 | NVIDIA | xAI Colossus 100K-200K; Meta Llama 4 cluster (>100K); Microsoft Azure; Google Cloud GPU; CoreWeave; Lambda; broadly the H100 era cluster baseline |
| H200 | NVIDIA | Mid-cycle upgrade between H100 and Blackwell; CoreWeave, Lambda, Oracle deployments; some hyperscaler upgrades |
| B200 / GB200 (Blackwell) | NVIDIA | Stargate Abilene; Colossus 2; Microsoft Azure; CoreWeave; broad hyperscaler buildout 2025-2026 |
| Rubin (R100) | NVIDIA | Reference designs entering customer hands; 288 GB HBM4 per GPU; Q1 2026 production start |
| TPU v4 / v5p / Trillium / v7 | Google | Google internal training; Anthropic via Google Cloud; select external customers |
| Trainium2 | AWS | Anthropic Project Rainier (~1M+ chip target); growing AWS internal AI workloads |
| MI300X / MI325X / MI350 | AMD | Microsoft Azure (selected); Oracle; Meta inference; growing AI training adoption |
| Maia 100 / Maia 200 | Microsoft | Microsoft internal AI workloads; Maia 200 on TSMC 3nm with 216 GB HBM3e |
| MTIA | Meta | Meta internal inference and growing training workloads |
What makes a supercluster distinct from a hyperscaler region
A traditional hyperscaler region is optimized for diversified cloud workloads with N+1 redundancy, multiple availability zones, and a customer mix across compute, storage, networking, and PaaS services. A training supercluster is optimized differently. The compute is bursty and high-utilization rather than diversified. The fabric is engineered to synchronize gradient exchange across tens of thousands of GPUs every training step, not for inter-tenant isolation. The power envelope is concentrated at the rack level (50-130 kW per Blackwell rack vs single-digit kW for traditional racks). The cooling is direct-to-chip liquid in most new builds, not air. The result is a fundamentally different facility class even when located within or adjacent to a hyperscaler campus.
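The rack-level power figures compound quickly at cluster scale. A minimal sizing sketch, with GPU count, rack density, and PUE as illustrative assumptions (NVL72-style 72-GPU racks at 120 kW, PUE 1.2):

```python
# Rough facility-power sizing from the rack-level figures above.
# All inputs here are illustrative assumptions, not measured values.

def cluster_power_mw(n_gpus: int, gpus_per_rack: int,
                     kw_per_rack: float, pue: float) -> float:
    """Total facility power in MW for a given rack density and PUE."""
    racks = -(-n_gpus // gpus_per_rack)  # ceiling division
    return racks * kw_per_rack * pue / 1000

# Assumed: 100K GPUs, 72 GPUs per rack, 120 kW per rack, PUE 1.2
print(f"{cluster_power_mw(100_000, 72, 120.0, 1.2):.0f} MW")  # ~200 MW
```

Even under these modest assumptions, a 100K-GPU Blackwell-class build lands in the hundreds of megawatts, which is why the million-GPU roadmaps in the table above are described in gigawatts.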
The DX Types pillar has a dedicated AI Factory child that covers the broader category of facilities optimized for AI training and inference at scale. AI training superclusters are the densest instances of the AI Factory category - the configuration where the architectural lessons get learned first.
Where this fits
This page covers infrastructure (named superclusters with stable attributes). The Frontier Training Runs page covers events (specific training jobs that consumed compute on these clusters). Reading the two together maps the AI infrastructure to the AI events. Cross-pillar references run through Types:AI Factory for the facility class, Sites for the specific named campuses, Bottleneck Atlas for the supply chain dependencies, and SX:NVIDIA Spotlight for the silicon side.
Related coverage
Frontier Training Runs | AI Factory | Sites | xAI Colossus | Stargate | Meta Hyperion | Tesla Dojo | Bottleneck Atlas | SX:NVIDIA Spotlight | SX:HBM | SX:CoWoS