

AI Training Superclusters


An AI training supercluster is a named, integrated GPU compute fabric purpose-built for training frontier foundation models. Superclusters differ from general-purpose hyperscale data centers in three ways: extreme accelerator density (tens of thousands to hundreds of thousands of GPUs in a single coherent training fabric), specialized interconnect topologies (NVLink Switch domains, InfiniBand or Spectrum-X Ethernet, rail-optimized fabrics), and power and cooling envelopes engineered for sustained high-utilization training workloads rather than diversified cloud serving.

This page covers the named superclusters currently operating or in commissioning. Each entry tracks operator, location, accelerator count and generation, interconnect, power class, and the frontier training programs the cluster has hosted. The companion Frontier Training Runs page covers the events that ran on these clusters - the training jobs themselves with their compute, parameters, and outcomes. Read the two together to map infrastructure to events.
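
The per-entry attributes are regular enough to model directly. Below is a minimal sketch of the schema this page tracks, as a Python dataclass; the field names and the example power figure are our own illustration, not an official DataCentersX schema or a confirmed spec.

```python
from dataclasses import dataclass, field

@dataclass
class Supercluster:
    """One master-ranking entry: a named training fabric with stable
    attributes. Field names are illustrative, not an official schema."""
    name: str                     # e.g. "Colossus"
    operator: str                 # e.g. "xAI"
    location: str                 # e.g. "Memphis, TN"
    accelerator: str              # generation, e.g. "H100"
    accelerator_count: int        # installed accelerators
    interconnect: str             # e.g. "Spectrum-X Ethernet"
    power_class_mw: float         # facility power envelope, in MW
    runs_hosted: list[str] = field(default_factory=list)  # Frontier Training Runs entries

colossus = Supercluster(
    name="Colossus", operator="xAI", location="Memphis, TN",
    accelerator="H100", accelerator_count=200_000,
    interconnect="Spectrum-X Ethernet",
    power_class_mw=300.0,  # placeholder assumption, not a confirmed figure
    runs_hosted=["Grok 2", "Grok 3", "Grok 3 Reasoning", "Grok 4"],
)
```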


Master ranking

Supercluster | Operator | Location | Scale | Notable runs hosted
Colossus | xAI | Memphis, TN | 100K H100 (built in 122 days); expanded to 200K; public roadmap to 1M+ | Grok 2, Grok 3, Grok 3 Reasoning, Grok 4
Colossus 2 | xAI | Memphis, TN (Phase 2 site) | First gigawatt-class training cluster operational; combined Colossus 1+2 exceed 1M H100-equivalent GPUs; 1.5 GW upgrade roadmap | Grok 5 training underway
Stargate (Abilene) | OpenAI / Oracle / SoftBank | Abilene, TX (anchor site; multi-site expansion) | Multi-gigawatt program; first phase operational; expansion sites announced across the US | GPT-5; ongoing OpenAI frontier training
Microsoft Eagle / Azure AI superclusters | Microsoft | Multiple Azure regions; Mt. Pleasant, WI; Quincy, WA expansion | Multi-cluster fleet; original GPT-4 training cluster ~25K A100s; Blackwell generation deploying at scale | GPT-4, GPT-4o, GPT-4.5 training (OpenAI partnership era); internal Microsoft AI workloads
Meta Hyperion | Meta | Richland Parish, LA | 5 GW+ planned; renewable-first hyperscale AI build | Llama 4 family (Scout, Maverick, Behemoth)
Meta Research SuperCluster (RSC) | Meta | Multiple Meta data centers | Originally 16K A100s; expanded with H100 capacity; >100K H100-equivalent across the Llama 4 training campaign | Llama 2, Llama 3, Llama 3.1 405B, Llama 4 Behemoth (32K GPUs, FP8)
Google TPU pods (multi-site) | Google | Council Bluffs, IA; The Dalles, OR; Hamina, FI; multiple regions | TPU v4, v5e, v5p, v6 (Trillium), v7 generations; pod sizes scaling with each generation | Gemini Ultra, Gemini 1.5, Gemini 2.0, Gemini 2.5; internal Google research models
Tesla Dojo / Cortex | Tesla | Giga Texas (Austin) | GPU-heavy training capacity; D3/Dojo3 silicon roadmap from Terafab | FSD foundation models; Optimus humanoid models
Anthropic training infrastructure (AWS / Google) | Anthropic (on AWS Trainium and Google TPU) | AWS regions (Trainium); Google regions (TPU) | Multi-billion-dollar AWS commitment; Project Rainier with AWS targets 1M+ Trainium2 chips; Google TPU access for parallel training | Claude family (Sonnet and Opus generations through Opus 4.7)
CoreWeave AI clusters | CoreWeave | Multiple US data centers | Largest AI neo-cloud; H100, H200, B200 fleets; multi-billion-dollar Microsoft and OpenAI commitments | Microsoft AI workloads; OpenAI training capacity; enterprise AI training tenants
Oracle OCI AI clusters | Oracle | Multiple OCI regions; Stargate Abilene partnership | Large H100/H200/B200 deployments; Stargate co-developer | OpenAI training capacity (Stargate Abilene); enterprise tenants
Lambda AI clusters | Lambda | Multiple US data centers | GPU cloud focused on AI training and inference | AI startup and enterprise training tenants
Crusoe AI clusters | Crusoe Energy | Stranded-energy sites; Stargate Abilene partnership | Energy-integrated AI infrastructure; flared-gas-to-compute origins; Stargate site co-development | OpenAI Stargate workloads; enterprise AI training
Nebius AI clusters | Nebius (ex-Yandex spinout) | Europe (Finland anchor) | European AI cloud; NVIDIA partnership; H100 fleet expansion | European AI training tenants; sovereign-aligned workloads

By interconnect topology

Interconnect choice shapes which training jobs a supercluster can run efficiently. Three architectures dominate at frontier scale: NVIDIA NVLink Switch domains for tightly coupled multi-GPU coherence, InfiniBand for traditional HPC-style high-bandwidth fabrics, and NVIDIA Spectrum-X Ethernet for cost-optimized AI fabrics at extreme scale. A worked effective-bandwidth comparison follows the table.

Topology | Where it runs | Distinctive characteristics
NVLink Switch | GB200 NVL72 reference designs; Rubin reference designs; tightly coupled multi-GPU domains | Memory-coherent across 72 GPUs at rack scale; baseline for new Blackwell and Rubin deployments
InfiniBand (Quantum-2 400G, Quantum-X 800G) | Most major H100 superclusters; OpenAI Stargate; Microsoft Azure AI; Meta RSC | Native RDMA with low latency; mature ecosystem at HPC and AI training scale
Spectrum-X Ethernet | xAI Colossus (100K-200K H100 fabric); some Microsoft and CoreWeave deployments | Standards-based 800G Ethernet with RDMA; xAI reports 95% data throughput vs ~60% for standard Ethernet
Google ICI (Inter-Chip Interconnect) | Google TPU pods (all generations) | Proprietary Google interconnect; tight coupling across thousands of TPU dies
AWS EFA (Elastic Fabric Adapter) | AWS Trainium clusters; some H100 capacity on AWS | AWS-internal RDMA-equivalent fabric; pairs with Trainium2 for Anthropic Project Rainier
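
The throughput percentages in the Spectrum-X row translate directly into effective per-link bandwidth. A back-of-envelope sketch using the table's figures (the 95% and ~60% numbers are the xAI-reported comparison; the 800 Gb/s line rate comes from the 800G Ethernet row):

```python
# Effective per-link bandwidth = line rate x achieved data throughput.
LINK_GBPS = 800  # 800G Ethernet line rate, per the Spectrum-X row above

for fabric, utilization in [("Spectrum-X (xAI-reported)", 0.95),
                            ("Standard Ethernet (approx.)", 0.60)]:
    effective = LINK_GBPS * utilization
    print(f"{fabric}: {effective:.0f} Gb/s effective per 800G link")

# Spectrum-X: 760 Gb/s effective; standard Ethernet: ~480 Gb/s.
# That ~58% per-link advantage compounds across every all-reduce
# step of a synchronous training run at cluster scale.
```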

By accelerator generation

Most operating frontier-tier superclusters were built around NVIDIA H100. Blackwell (B200, GB200) is now the dominant new-build generation, with Rubin reference designs entering customer hands in 2026. Google operates a parallel TPU track. AWS Trainium and a small number of internal hyperscaler accelerators (Microsoft Maia, Meta MTIA) are also deployed at scale, though typically not for the largest frontier training runs. A sketch of the H100-equivalent conversion used in the master ranking follows the table.

Accelerator | Vendor | Where deployed at scale
A100 | NVIDIA | Original GPT-4 training cluster (Microsoft Azure, ~25K A100); Meta RSC; Google legacy AI; widely retired or repurposed
H100 | NVIDIA | xAI Colossus (100K-200K); Meta Llama 4 cluster (>100K); Microsoft Azure; Google Cloud GPU; CoreWeave; Lambda; broadly the H100-era cluster baseline
H200 | NVIDIA | Mid-cycle upgrade between H100 and Blackwell; CoreWeave, Lambda, and Oracle deployments; some hyperscaler upgrades
B200 / GB200 (Blackwell) | NVIDIA | Stargate Abilene; Colossus 2; Microsoft Azure; CoreWeave; broad hyperscaler buildout 2025-2026
Rubin (R100) | NVIDIA | Reference designs entering customer hands; 288 GB HBM4 per GPU; Q1 2026 production start
TPU v4 / v5p / Trillium / v7 | Google | Google internal training; Anthropic via Google Cloud; select external customers
Trainium2 | AWS | Anthropic Project Rainier (1M+ chip target); growing AWS internal AI workloads
MI300X / MI325X / MI350 | AMD | Microsoft Azure (selected); Oracle; Meta inference; growing AI training adoption
Maia 100 / Maia 200 | Microsoft | Microsoft internal AI workloads; Maia 200 on TSMC 3nm with 216 GB HBM3e
MTIA | Meta | Meta internal inference and growing training workloads
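
The "H100-equivalent" counts in the master ranking imply a conversion across generations. A hedged sketch of that arithmetic follows; the ratios below are illustrative placeholders, not vendor-verified throughput figures, so substitute published FLOPS specs for any real comparison.

```python
# Illustrative H100-equivalent conversion. Ratios are placeholder
# assumptions for this sketch, NOT vendor-verified figures.
H100_EQUIV_RATIO = {
    "A100": 0.33,   # placeholder: roughly a third of an H100
    "H100": 1.0,
    "H200": 1.1,    # placeholder: modest memory-driven uplift
    "B200": 2.0,    # placeholder: assumed generational step
}

def h100_equivalents(fleet: dict[str, int]) -> float:
    """Convert a mixed fleet {generation: count} to H100-equivalents."""
    return sum(H100_EQUIV_RATIO[gen] * n for gen, n in fleet.items())

# A hypothetical mixed fleet:
print(h100_equivalents({"H100": 150_000, "B200": 50_000}))  # -> 250000.0
```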

What makes a supercluster distinct from a hyperscaler region

A traditional hyperscaler region is optimized for diversified cloud workloads: N+1 redundancy, multiple availability zones, and a customer mix across compute, storage, networking, and PaaS services. A training supercluster is optimized differently. The compute runs at sustained high utilization on a single workload class rather than a diversified mix. The fabric is engineered for synchronized gradient exchange across tens of thousands of GPUs, not for inter-tenant isolation. The power envelope is concentrated at the rack level (50-130 kW per Blackwell rack vs single-digit kW for traditional racks). The cooling is direct-to-chip liquid in most new builds, not air. The result is a fundamentally different facility class even when located within or adjacent to a hyperscaler campus.
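
The rack-level numbers make the facility math concrete. A rough sizing sketch using the figures above (130 kW per Blackwell rack; 72 GPUs per NVL72 rack from the interconnect table) plus an assumed PUE, which this page does not specify:

```python
# How many GPUs fit inside a gigawatt-class power envelope?
FACILITY_MW = 1_000    # gigawatt-class, per the Colossus 2 entry
RACK_KW = 130          # upper end of the Blackwell rack range above
GPUS_PER_RACK = 72     # GB200 NVL72 rack-scale domain
PUE = 1.2              # assumed power usage effectiveness (not from this page)

it_power_kw = FACILITY_MW * 1_000 / PUE   # power remaining for IT load
racks = it_power_kw / RACK_KW
gpus = racks * GPUS_PER_RACK
print(f"~{racks:,.0f} racks, ~{gpus:,.0f} GPUs")  # ~6,410 racks, ~461,538 GPUs
```

The sketch ignores networking, storage, and facility overhead beyond PUE; it is sizing intuition, not a design number.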

The DX Types pillar has a dedicated AI Factory child that covers the broader category of facilities optimized for AI training and inference at scale. AI training superclusters are the densest instances of that category - the maximum-density configuration where the architectural lessons are learned first.


Where this fits

This page covers infrastructure (named superclusters with stable attributes). The Frontier Training Runs page covers events (specific training jobs that consumed compute on these clusters). Reading the two together maps the AI infrastructure to the AI events. Cross-pillar references run through Types:AI Factory for the facility class, Sites for the specific named campuses, Bottleneck Atlas for the supply chain dependencies, and SX:NVIDIA Spotlight for the silicon side.


Related coverage

Frontier Training Runs | AI Factory | Sites | xAI Colossus | Stargate | Meta Hyperion | Tesla Dojo | Bottleneck Atlas | SX:NVIDIA Spotlight | SX:HBM | SX:CoWoS