Frontier AI Training Runs
Frontier training runs are the events through which the largest AI infrastructure in the world gets used. A training run is time-bounded - it has a start, a duration, a compute cost, and an outcome (a model with measurable capabilities). Unlike the superclusters that host them, training runs are not enduring entities; they happen, consume their compute budget, produce their model, and end. The cluster persists; the run does not.
This page tracks the major training runs of the foundation model era - their compute, parameters, training tokens, host cluster, and capability outcomes. The companion AI Training Superclusters page covers the infrastructure that hosted these runs. New runs get added as they conclude and as details become public. Many training run details remain undisclosed by their operators; this page documents what is publicly known and clearly labels estimates where official figures are not available.
The 1e25 FLOP era
GPT-4 was the first model trained above 1e25 FLOP, a threshold that has since become the industry's rough boundary for "frontier" training. As of mid-2025, more than 30 publicly announced models from at least 12 developers have crossed it, with new models added at a rate of roughly two per month through 2024. The same 1e25 FLOP threshold defines the regulatory boundary for the EU AI Act's general-purpose AI model with systemic risk classification, which adds compliance obligations for models above this scale.
The runs
| Model | Operator | Released | Cluster / Hardware | Training compute | Notes |
|---|---|---|---|---|---|
| GPT-3 | OpenAI | June 2020 | Microsoft Azure V100 cluster | ~3.14e23 FLOP; 175B parameters; ~300B training tokens | First model to demonstrate emergent capabilities at scale; established the foundation model paradigm |
| PaLM | Google | April 2022 | Two TPU v4 pods, ~6,144 TPU v4 chips | ~2.5e24 FLOP; 540B dense parameters; 780B tokens | First publicly disclosed run at TPU v4 pod scale; demonstrated cross-pod training feasibility |
| GPT-4 | OpenAI | March 2023 | Microsoft Azure A100 cluster, ~25K A100s | ~2.15e25 FLOP; 90-100 days; ~$63M hardware cost; reported as 1.8T parameter MoE with ~280B active (per leaked SemiAnalysis details); 13T training tokens | First model crossing 1e25 FLOP threshold; first widely deployed MoE foundation model; defined the modern frontier era |
| Gemini Ultra (1.0) | Google | December 2023 | TPU v4 / v5e pods | Estimated above 1e25 FLOP; natively multimodal from training | First Google response at the GPT-4 capability tier; natively multimodal architecture |
| Claude 3 Opus | Anthropic | March 2024 | AWS and Google Cloud (TPU and GPU) | Estimated above 1e25 FLOP; details not publicly disclosed | Anthropic's first frontier-tier model; established Claude as a frontier competitor |
| Llama 3 405B | Meta | July 2024 | Meta Research SuperCluster H100 fleet (~16K H100s) | ~3.8e25 FLOP; 405B dense parameters; 15T training tokens; 54-day cumulative pretraining | Largest open-weight model at release; established Llama 3 as competitive with closed frontier models |
| GPT-4o | OpenAI | May 2024 | Microsoft Azure H100 capacity | Comparable to GPT-4 scale; native multimodal training | First OpenAI native multimodal model; voice and vision integrated from training |
| Claude 3.5 Sonnet | Anthropic | June 2024 | AWS and Google Cloud | Details not publicly disclosed | Demonstrated that smaller-than-Opus models could match or exceed frontier capability through training improvements |
| Grok 2 | xAI | August 2024 | Colossus 100K H100 | Estimated above 1e25 FLOP; first major model trained on Colossus | First xAI model at frontier capability tier |
| o1 / o1-preview | OpenAI | September 2024 | Microsoft Azure | Test-time compute scaling rather than pure pretraining scaling | First widely deployed reasoning model; established that capability gains could come from inference-time compute, not just pretraining |
| Gemini 2.0 | Google | December 2024 | TPU v5p / Trillium | Estimated above 1e25 FLOP; multimodal with agentic capabilities | Google's response to o1 and o3; agentic capabilities natively integrated |
| DeepSeek-V3 | DeepSeek | December 2024 | ~2,048 H800 GPUs | ~5.6e24 FLOP; 671B MoE (37B active); ~$5.6M reported pretraining cost | Demonstrated dramatic cost reduction in frontier training; sparked industry-wide reassessment of compute efficiency |
| DeepSeek-R1 | DeepSeek | January 2025 | DeepSeek H800 cluster | Reasoning-focused training on top of V3 base | First open-weight reasoning model at frontier capability tier; competitive with o1 on reasoning benchmarks |
| Grok 3 / Grok 3 Reasoning | xAI | February 2025 | Colossus 100K-200K H100 | Pretraining at unprecedented scale per xAI; reasoning variant trained with reinforcement learning | Leveraged Colossus expansion to 200K GPUs; reasoning variant established RL scaling pattern xAI extended into Grok 4 |
| GPT-4.5 | OpenAI | February 2025 | Microsoft Azure capacity | Estimated ~10x GPT-4 compute (~2e26 FLOP) | Mixed reception sparked industry-wide debate on diminishing returns from pure pretraining scaling |
| Claude 3.7 Sonnet | Anthropic | February 2025 | AWS and Google Cloud | Details not publicly disclosed | Anthropic's first hybrid reasoning model; established competitive coding performance vs OpenAI |
| Gemini 2.5 Pro | Google | March 2025 | TPU v5p / Trillium | Estimated above 1e25 FLOP | Established Gemini as a competitive frontier model on long-context and reasoning |
| Llama 4 Behemoth | Meta | April 2025 (in training) | Meta GPU cluster, 32K GPUs FP8 pretraining | >30T training tokens (more than 2x Llama 3); ~2T parameter MoE; achieved 390 TFLOPs/GPU FP8 utilization | Meta's largest training run; teacher model for Llama 4 family distillation |
| Llama 4 Scout / Maverick | Meta | April 2025 | Meta H100 cluster (>100K H100s announced) | Scout: 17B active / 109B total; ~40T training tokens. Maverick: 17B active / 402B total; ~22T training tokens | First Meta production-grade MoE; 10M context window in Scout; open-weight at frontier capability tier |
| Claude Opus 4 | Anthropic | May 2025 | AWS Trainium2 (Project Rainier) and Google TPU | Details not publicly disclosed | First major Anthropic model trained substantially on AWS Trainium; established Project Rainier infrastructure |
| Grok 4 / Grok 4 Heavy | xAI | July 2025 | Colossus 200K H100 | RL training at pretraining scale; xAI scaled RL compute to match pretraining compute; 6x efficiency improvement reported | First major run scaling RL compute to match pretraining compute; Grok 4 Heavy variant uses parallel test-time compute with multiple agents |
| GPT-5 | OpenAI | Summer 2025 | Stargate Abilene + Microsoft Azure | Not publicly disclosed; reportedly below the trajectory predicted by 100x-per-generation scaling | Reversed the prior 100x scaling trend; reflected the industry pivot from pretraining-dominated to mixed pretraining/RL/test-time compute; established the post-pretraining-scaling era |
| Claude Opus 4.5 / 4.6 / 4.7 | Anthropic | 2025-2026 | AWS Trainium2 (Project Rainier expansion) and Google TPU | Details not publicly disclosed | Iterative Opus generation; Project Rainier scaled to multi-hundred-thousand-Trainium2 capacity for the campaign |
| Grok 5 | xAI | In training (2026) | Colossus 1 + Colossus 2 (combined >1M H100-equivalent; first gigawatt-scale training cluster operational) | Not publicly disclosed; ongoing as of early 2026 | First training run on a gigawatt-class cluster; targets capabilities including scientific discovery and autonomous engineering per xAI public statements |
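The compute figures above can be cross-checked with the widely used C ≈ 6·N·D approximation (training FLOP ≈ 6 × active parameters × training tokens). Below is a minimal sketch using only figures from the table; none of these labs has confirmed this accounting for their runs, and the heuristic ignores attention FLOPs and activation recomputation, so agreement within a few tens of percent is the most one should expect.

```python
# Sanity-check disclosed training-compute figures with the standard
# C ~= 6 * N * D approximation (FLOP ~= 6 x active params x tokens).
# All figures come from the runs table above.

RUNS = {
    # name: (active parameters, training tokens, reported/estimated FLOP)
    "GPT-3":        (175e9, 300e9, 3.14e23),
    "GPT-4":        (280e9, 13e12, 2.15e25),  # MoE: only active params count
    "Llama 3 405B": (405e9, 15e12, 3.8e25),
}

for name, (n_active, tokens, reported) in RUNS.items():
    estimate = 6 * n_active * tokens
    print(f"{name:>12}: 6ND = {estimate:.2e} FLOP "
          f"(reported {reported:.2e}, ratio {estimate / reported:.2f})")
```

For MoE models only the active parameter count enters the estimate, which is why GPT-4's reported 1.8T total parameters play no role in its ~2.15e25 FLOP figure.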
Inflection points
| Inflection | Marker | What changed |
|---|---|---|
| The 1e25 FLOP threshold | GPT-4 (March 2023) | Established the modern frontier; later codified into EU AI Act regulatory threshold |
| MoE goes mainstream | GPT-4 (leaked architecture); DeepSeek-V3, Llama 4 | Sparse activation became the default architecture for cost-efficient scaling |
| Reasoning models | o1 (September 2024); DeepSeek-R1 (January 2025) | Test-time compute scaling proved capability gains beyond pretraining-only scaling; reasoning RL became a separate training discipline |
| 100K GPU cluster operational | Colossus (built in 122 days, summer 2024) | Demonstrated that single-cluster builds at 100K GPU scale were achievable in months rather than years |
| Cost compression | DeepSeek-V3 ($5.6M pretraining) | Demonstrated frontier capabilities at a fraction of US lab costs; sparked industry reassessment of capital intensity assumptions |
| Pretraining scaling pause | GPT-4.5 reception (February 2025); GPT-5 deviation from 100x trajectory | Industry-wide pivot from pure pretraining scaling to mixed pretraining + RL + test-time compute; "pretraining isn't dead but isn't sufficient" became the operational consensus |
| RL at pretraining scale | Grok 4 (July 2025) | First major run scaling RL compute to match pretraining compute; new architectural pattern |
| Gigawatt-class cluster | Colossus 2 (operational January 2026) | First training cluster crossing 1 GW; established the next infrastructure tier |
What the run history reveals
Three patterns run through the training-run timeline. First, compute per run grew by roughly two orders of magnitude per generation from GPT-3 (3.14e23 FLOP) to GPT-4 (2.15e25 FLOP, a ~68x jump) - then deviated. GPT-4.5 was approximately 10x GPT-4's compute, and GPT-5 reportedly came in below the 100x trajectory entirely. The 100x-per-generation scaling pattern that defined the 2020-2023 era visibly broke around 2024-2025, replaced by mixed strategies combining pretraining, RL, and test-time compute.
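The break is visible in simple ratio arithmetic over the figures above; a quick sketch (GPT-4.5's ~2e26 FLOP is an estimate, and GPT-5's compute is undisclosed, so it is omitted rather than guessed):

```python
# Generation-over-generation compute ratios, using figures from the
# runs table. GPT-4.5's ~2e26 FLOP is an estimate, so treat its ratio
# as indicative only.
GENERATIONS = [
    ("GPT-3", 3.14e23),
    ("GPT-4", 2.15e25),
    ("GPT-4.5", 2e26),
]

for (prev, c_prev), (curr, c_curr) in zip(GENERATIONS, GENERATIONS[1:]):
    print(f"{prev} -> {curr}: {c_curr / c_prev:.0f}x")
# GPT-3 -> GPT-4:   ~68x (near the 100x-per-generation pattern)
# GPT-4 -> GPT-4.5: ~9x  (the visible break in the trend)
```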
Second, the cost-to-capability ratio compressed dramatically over the same window. GPT-4 cost approximately $63M in compute. DeepSeek-V3 reported approximately $5.6M for comparable benchmark performance two years later. The compression came from architectural innovation (MoE, sparse activation), training efficiency improvements (FP8, better data curation), and hardware progress. The implication is that frontier capability is becoming achievable at smaller compute budgets, even as the absolute compute frontier keeps moving up.
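The compression decomposes into two multiplicative factors: needing fewer FLOPs for comparable capability, and paying less per FLOP. A rough sketch using the reported figures above, which carry significant uncertainty (the GPT-4 cost is an estimate and "comparable performance" is a benchmark-level claim):

```python
# Decompose the cost compression between GPT-4 and DeepSeek-V3 into
# (a) fewer FLOPs needed and (b) cheaper FLOPs, using the reported
# figures from this page. Both inputs are approximate.
gpt4_cost, gpt4_flop = 63e6, 2.15e25
dsv3_cost, dsv3_flop = 5.6e6, 5.6e24

total = gpt4_cost / dsv3_cost                                    # ~11x cheaper overall
fewer_flops = gpt4_flop / dsv3_flop                              # ~3.8x fewer FLOPs
cheaper_flops = (gpt4_cost / gpt4_flop) / (dsv3_cost / dsv3_flop)  # ~2.9x cheaper per FLOP

print(f"total: {total:.1f}x = {fewer_flops:.1f}x fewer FLOPs "
      f"x {cheaper_flops:.1f}x cheaper per FLOP")
```

On these figures, the larger factor is needing fewer FLOPs (architecture and data efficiency), with the remainder from cheaper FLOPs (hardware generation and FP8 training) - consistent with the attribution in the paragraph above.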
Third, the infrastructure has consolidated around fewer, larger clusters. The early frontier era used multiple smaller clusters across multiple operators. The 2024-2026 era has shifted toward named flagship clusters (Colossus, Stargate Abilene, Meta Hyperion) where billions of dollars of capex concentrate at a single site or a small number of sites. The trend reflects both technical advantages of training at single-site scale and the practical reality that gigawatt-class deployments cannot be replicated at every operator's site simultaneously.
Where this fits
This page covers events. The AI Training Superclusters page covers the infrastructure that hosted them. The Sites pillar covers the named campuses that contain those superclusters. The Bottleneck Atlas covers the supply chain dependencies (HBM, CoWoS, GPUs, transformers) that gate which runs can actually happen. Cross-network references run to SX:NVIDIA Spotlight for the silicon side and EX:Nuclear Energy for the power infrastructure that increasingly anchors training cluster siting.
Related coverage
AI Training Superclusters | AI Factory | Sites | Bottleneck Atlas | xAI Colossus | Stargate | Meta Hyperion | Tesla Dojo | SX:NVIDIA Spotlight | SX:HBM | SX:CoWoS