

AI Training Superclusters


An AI training supercluster is a named, integrated GPU compute fabric purpose-built for training frontier foundation models. Superclusters differ from general-purpose hyperscale data centers in three ways: extreme accelerator density (tens of thousands to hundreds of thousands of GPUs in a single coherent training fabric), specialized interconnect topologies (NVLink Switch domains, InfiniBand or Spectrum-X Ethernet, rail-optimized fabrics), and power and cooling envelopes engineered for sustained high-utilization training workloads rather than diversified cloud serving.

This page covers the named superclusters currently operating or in commissioning. Each entry tracks operator, location, accelerator count and generation, interconnect, power class, and the frontier training programs the cluster has hosted. The companion Frontier Training Runs page covers the events that ran on these clusters - the training jobs themselves with their compute, parameters, and outcomes. Read the two together to map infrastructure to events.
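
The per-entry attributes are regular enough to model directly. Below is a minimal sketch of the schema this page tracks, as a Python dataclass; the field names and the example power figure are our own illustration, not an official DataCentersX schema or a confirmed spec.

```python
from dataclasses import dataclass, field

@dataclass
class Supercluster:
    """One master-ranking entry: a named training fabric with stable
    attributes. Field names are illustrative, not an official schema."""
    name: str                     # e.g. "Colossus"
    operator: str                 # e.g. "xAI"
    location: str                 # e.g. "Memphis, TN"
    accelerator: str              # generation, e.g. "H100"
    accelerator_count: int        # installed accelerators
    interconnect: str             # e.g. "Spectrum-X Ethernet"
    power_class_mw: float         # facility power envelope, in MW
    runs_hosted: list[str] = field(default_factory=list)  # Frontier Training Runs entries

colossus = Supercluster(
    name="Colossus", operator="xAI", location="Memphis, TN",
    accelerator="H100", accelerator_count=200_000,
    interconnect="Spectrum-X Ethernet",
    power_class_mw=300.0,  # placeholder assumption, not a confirmed figure
    runs_hosted=["Grok 2", "Grok 3", "Grok 3 Reasoning", "Grok 4"],
)
```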


Master ranking

Supercluster | Operator | Location | Scale | Notable runs hosted
Colossus | xAI | Memphis, TN | 100K H100 (built in 122 days); expanded to 200K; public roadmap to 1M+ | Grok 2, Grok 3, Grok 3 Reasoning, Grok 4
Colossus 2 | xAI | Memphis, TN (Phase 2 site) | First gigawatt-class training cluster operational; combined Colossus 1+2 exceed 1M H100-equivalent GPUs; 1.5 GW upgrade roadmap | Grok 5 training underway
Stargate (Abilene) | OpenAI / Oracle / SoftBank | Abilene, TX (anchor site; multi-site expansion) | Multi-gigawatt program; first phase operational; expansion sites announced across the US | GPT-5; ongoing OpenAI frontier training
Microsoft Eagle / Azure AI superclusters | Microsoft | Multiple Azure regions; Mt. Pleasant, WI; Quincy, WA expansion | Multi-cluster fleet; original GPT-4 training cluster ~25K A100s; Blackwell generation deploying at scale | GPT-4, GPT-4o, GPT-4.5 training (OpenAI partnership era); internal Microsoft AI workloads
Meta Hyperion | Meta | Richland Parish, LA | 5 GW+ planned; renewable-first hyperscale AI build | Llama 4 family (Scout, Maverick, Behemoth)
Meta Research SuperCluster (RSC) | Meta | Multiple Meta data centers | Originally 16K A100s; expanded with H100 capacity; >100K H100-equivalent across the Llama 4 training campaign | Llama 2, Llama 3, Llama 3.1 405B, Llama 4 Behemoth (32K GPUs, FP8)
Google TPU pods (multi-site) | Google | Council Bluffs, IA; The Dalles, OR; Hamina, FI; multiple regions | TPU v4, v5e, v5p, v6 (Trillium), v7 generations; pod sizes scaling with each generation | Gemini Ultra, Gemini 1.5, Gemini 2.0, Gemini 2.5; internal Google research models
Tesla Dojo / Cortex | Tesla | Giga Texas (Austin) | GPU-heavy training capacity; D3/Dojo3 silicon roadmap from Terafab | FSD foundation models; Optimus humanoid models
Anthropic training infrastructure (AWS / Google) | Anthropic (on AWS Trainium and Google TPU) | AWS regions (Trainium); Google regions (TPU) | Multi-billion-dollar AWS commitment; Project Rainier with AWS targets 1M+ Trainium2 chips; Google TPU access for parallel training | Claude family (Sonnet and Opus generations through Opus 4.7)
CoreWeave AI clusters | CoreWeave | Multiple US data centers | Largest AI neo-cloud; H100, H200, B200 fleets; multi-billion-dollar Microsoft and OpenAI commitments | Microsoft AI workloads; OpenAI training capacity; enterprise AI training tenants
Oracle OCI AI clusters | Oracle | Multiple OCI regions; Stargate Abilene partnership | Large H100/H200/B200 deployments; Stargate co-developer | OpenAI training capacity (Stargate Abilene); enterprise tenants
Lambda AI clusters | Lambda | Multiple US data centers | GPU cloud focused on AI training and inference | AI startup and enterprise training tenants
Crusoe AI clusters | Crusoe Energy | Stranded-energy sites; Stargate Abilene partnership | Energy-integrated AI infrastructure; flared-gas-to-compute origins; Stargate site co-development | OpenAI Stargate workloads; enterprise AI training
Nebius AI clusters | Nebius (ex-Yandex spinout) | Europe (Finland anchor) | European AI cloud; NVIDIA partnership; H100 fleet expansion | European AI training tenants; sovereign-aligned workloads

By interconnect topology

Interconnect choice shapes which training jobs a supercluster can run efficiently. Three architectures dominate at frontier scale: NVIDIA NVLink Switch domains for tightly coupled multi-GPU coherence, InfiniBand for traditional HPC-style high-bandwidth fabrics, and NVIDIA Spectrum-X Ethernet for cost-optimized AI fabrics at extreme scale. A worked effective-bandwidth comparison follows the table.

Topology | Where it runs | Distinctive characteristics
NVLink Switch | GB200 NVL72 reference designs; Rubin reference designs; tightly coupled multi-GPU domains | Memory-coherent across 72 GPUs at rack scale; baseline for new Blackwell and Rubin deployments
InfiniBand (Quantum-2 400G, Quantum-X 800G) | Most major H100 superclusters; OpenAI Stargate; Microsoft Azure AI; Meta RSC | Native RDMA with low latency; mature ecosystem at HPC and AI training scale
Spectrum-X Ethernet | xAI Colossus (100K-200K H100 fabric); some Microsoft and CoreWeave deployments | Standards-based 800G Ethernet with RDMA; xAI reports 95% data throughput vs ~60% for standard Ethernet
Google ICI (Inter-Chip Interconnect) | Google TPU pods (all generations) | Proprietary Google interconnect; tight coupling across thousands of TPU dies
AWS EFA (Elastic Fabric Adapter) | AWS Trainium clusters; some H100 capacity on AWS | AWS-internal RDMA-equivalent fabric; pairs with Trainium2 for Anthropic Project Rainier
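
The throughput percentages in the Spectrum-X row translate directly into effective per-link bandwidth. A back-of-envelope sketch using the table's figures (the 95% and ~60% numbers are the xAI-reported comparison; the 800 Gb/s line rate comes from the 800G Ethernet row):

```python
# Effective per-link bandwidth = line rate x achieved data throughput.
LINK_GBPS = 800  # 800G Ethernet line rate, per the Spectrum-X row above

for fabric, utilization in [("Spectrum-X (xAI-reported)", 0.95),
                            ("Standard Ethernet (approx.)", 0.60)]:
    effective = LINK_GBPS * utilization
    print(f"{fabric}: {effective:.0f} Gb/s effective per 800G link")

# Spectrum-X: 760 Gb/s effective; standard Ethernet: ~480 Gb/s.
# That ~58% per-link advantage compounds across every all-reduce
# step of a synchronous training run at cluster scale.
```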

By accelerator generation

Most operating frontier-tier superclusters were built around NVIDIA H100. Blackwell (B200, GB200) is now the dominant new-build generation, with Rubin reference designs entering customer hands in 2026. Google operates a parallel TPU track. AWS Trainium and a small number of internal hyperscaler accelerators (Microsoft Maia, Meta MTIA) are also deployed at scale, though typically not for the largest frontier training runs. A sketch of the H100-equivalent conversion used in the master ranking follows the table.

Accelerator | Vendor | Where deployed at scale
A100 | NVIDIA | Original GPT-4 training cluster (Microsoft Azure, ~25K A100); Meta RSC; Google legacy AI; widely retired or repurposed
H100 | NVIDIA | xAI Colossus (100K-200K); Meta Llama 4 cluster (>100K); Microsoft Azure; Google Cloud GPU; CoreWeave; Lambda; broadly the H100-era cluster baseline
H200 | NVIDIA | Mid-cycle upgrade between H100 and Blackwell; CoreWeave, Lambda, and Oracle deployments; some hyperscaler upgrades
B200 / GB200 (Blackwell) | NVIDIA | Stargate Abilene; Colossus 2; Microsoft Azure; CoreWeave; broad hyperscaler buildout 2025-2026
Rubin (R100) | NVIDIA | Reference designs entering customer hands; 288 GB HBM4 per GPU; Q1 2026 production start
TPU v4 / v5p / Trillium / v7 | Google | Google internal training; Anthropic via Google Cloud; select external customers
Trainium2 | AWS | Anthropic Project Rainier (1M+ chip target); growing AWS internal AI workloads
MI300X / MI325X / MI350 | AMD | Microsoft Azure (selected); Oracle; Meta inference; growing AI training adoption
Maia 100 / Maia 200 | Microsoft | Microsoft internal AI workloads; Maia 200 on TSMC 3nm with 216 GB HBM3e
MTIA | Meta | Meta internal inference and growing training workloads
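
The "H100-equivalent" counts in the master ranking imply a conversion across generations. A hedged sketch of that arithmetic follows; the ratios below are illustrative placeholders, not vendor-verified throughput figures, so substitute published FLOPS specs for any real comparison.

```python
# Illustrative H100-equivalent conversion. Ratios are placeholder
# assumptions for this sketch, NOT vendor-verified figures.
H100_EQUIV_RATIO = {
    "A100": 0.33,   # placeholder: roughly a third of an H100
    "H100": 1.0,
    "H200": 1.1,    # placeholder: modest memory-driven uplift
    "B200": 2.0,    # placeholder: assumed generational step
}

def h100_equivalents(fleet: dict[str, int]) -> float:
    """Convert a mixed fleet {generation: count} to H100-equivalents."""
    return sum(H100_EQUIV_RATIO[gen] * n for gen, n in fleet.items())

# A hypothetical mixed fleet:
print(h100_equivalents({"H100": 150_000, "B200": 50_000}))  # -> 250000.0
```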

What makes a supercluster distinct from a hyperscaler region

A traditional hyperscaler region is optimized for diversified cloud workloads: N+1 redundancy, multiple availability zones, and a customer mix across compute, storage, networking, and PaaS services. A training supercluster is optimized differently. The compute runs at sustained high utilization on a single workload class rather than a diversified mix. The fabric is engineered for synchronized gradient exchange across tens of thousands of GPUs, not for inter-tenant isolation. The power envelope is concentrated at the rack level (50-130 kW per Blackwell rack vs single-digit kW for traditional racks). The cooling is direct-to-chip liquid in most new builds, not air. The result is a fundamentally different facility class even when located within or adjacent to a hyperscaler campus.
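
The rack-level numbers make the facility math concrete. A rough sizing sketch using the figures above (130 kW per Blackwell rack; 72 GPUs per NVL72 rack from the interconnect table) plus an assumed PUE, which this page does not specify:

```python
# How many GPUs fit inside a gigawatt-class power envelope?
FACILITY_MW = 1_000    # gigawatt-class, per the Colossus 2 entry
RACK_KW = 130          # upper end of the Blackwell rack range above
GPUS_PER_RACK = 72     # GB200 NVL72 rack-scale domain
PUE = 1.2              # assumed power usage effectiveness (not from this page)

it_power_kw = FACILITY_MW * 1_000 / PUE   # power remaining for IT load
racks = it_power_kw / RACK_KW
gpus = racks * GPUS_PER_RACK
print(f"~{racks:,.0f} racks, ~{gpus:,.0f} GPUs")  # ~6,410 racks, ~461,538 GPUs
```

The sketch ignores networking, storage, and facility overhead beyond PUE; it is sizing intuition, not a design number.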

The DX Types pillar has a dedicated AI Factory child that covers the broader category of facilities optimized for AI training and inference at scale. AI training superclusters are the densest instances of that category - the maximum-density configuration where the architectural lessons are learned first.


Where this fits

This page covers infrastructure (named superclusters with stable attributes). The Frontier Training Runs page covers events (specific training jobs that consumed compute on these clusters). Reading the two together maps the AI infrastructure to the AI events. Cross-pillar references run through Types:AI Factory for the facility class, Sites for the specific named campuses, Bottleneck Atlas for the supply chain dependencies, and SX:NVIDIA Spotlight for the silicon side.


Related coverage

Frontier Training Runs | AI Factory | Sites | xAI Colossus | Stargate | Meta Hyperion | Tesla Dojo | Bottleneck Atlas | SX:NVIDIA Spotlight | SX:HBM | SX:CoWoS