Frontier AI Training Runs


Frontier training runs are the events through which the world's largest AI infrastructure gets put to use. A training run is time-bounded: it has a start, a duration, a compute budget, and an outcome (a model with measurable capabilities). Unlike the superclusters that host them, training runs are not enduring entities; they happen, consume their compute budget, produce their model, and end. The cluster persists; the run does not.

This page tracks the major training runs of the foundation model era - their compute, parameters, training tokens, host cluster, and capability outcomes. The companion AI Training Superclusters page covers the infrastructure that hosted these runs. New runs get added as they conclude and as details become public. Many training run details remain undisclosed by their operators; this page documents what is publicly known and clearly labels estimates where official figures are not available.


The 1e25 FLOP era

GPT-4 was the first model trained above 1e25 FLOP, a threshold that has since become the industry's rough boundary for "frontier" training. As of mid-2025, more than 30 publicly announced models from at least 12 developers have crossed it, with new models arriving at a rate of roughly two per month through 2024. The same 1e25 FLOP line defines the regulatory boundary for the EU AI Act's classification of general-purpose AI models with systemic risk, which adds compliance obligations for models above this scale.
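The training-compute figures cited throughout this page can be sanity-checked with the standard dense-transformer approximation C ≈ 6·N·D (roughly 6 FLOP per parameter per training token, covering forward and backward passes). A minimal sketch, using GPT-3's published parameter and token counts; the approximation is an industry rule of thumb, not an exact accounting:

```python
def training_flop(params: float, tokens: float) -> float:
    """Dense-transformer estimate: ~6 FLOP per parameter per token
    (forward pass + backward pass)."""
    return 6 * params * tokens

# GPT-3: 175B parameters, ~300B training tokens
gpt3 = training_flop(175e9, 300e9)
print(f"{gpt3:.2e}")  # 3.15e+23, matching the ~3.14e23 FLOP figure below
```

Note the approximation only covers dense pretraining; for MoE models the relevant N is the active parameter count, and RL or test-time compute falls outside it entirely.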


The runs

Model | Operator | Released | Cluster / Hardware | Training compute | Notes
GPT-3 | OpenAI | June 2020 | Microsoft Azure V100 cluster | ~3.14e23 FLOP; 175B parameters; ~300B training tokens | First model to demonstrate emergent capabilities at scale; established the foundation model paradigm
PaLM | Google | April 2022 | Two TPU v4 pods, ~6,144 TPU v4 chips | ~2.5e24 FLOP; 540B dense parameters; 780B tokens | First publicly disclosed run at TPU v4 pod scale; demonstrated cross-pod training feasibility
GPT-4 | OpenAI | March 2023 | Microsoft Azure A100 cluster, ~25K A100s | ~2.15e25 FLOP; 90-100 days; ~$63M hardware cost; reported as a 1.8T-parameter MoE with ~280B active (per leaked SemiAnalysis details); 13T training tokens | First model to cross the 1e25 FLOP threshold; first widely deployed MoE foundation model; defined the modern frontier era
Gemini Ultra (1.0) | Google | December 2023 | Google TPU v4 / v5e pods | Estimated above 1e25 FLOP; natively multimodal from training | First Google response at the GPT-4 capability tier; natively multimodal architecture
Claude 3 Opus | Anthropic | March 2024 | AWS and Google Cloud (TPU and GPU) | Estimated above 1e25 FLOP; details not publicly disclosed | Anthropic's first frontier-tier model; established Claude as a frontier competitor
Llama 3 405B | Meta | July 2024 | Meta production H100 cluster (~16K H100s) | ~3.8e25 FLOP; 405B dense parameters; 15T training tokens; 54 days of cumulative pretraining | Largest open-weight model at release; established Llama 3 as competitive with closed frontier models
GPT-4o | OpenAI | May 2024 | Microsoft Azure H100 capacity | Comparable to GPT-4 scale; native multimodal training | First natively multimodal OpenAI model; voice and vision integrated from training
Claude 3.5 Sonnet | Anthropic | June 2024 | AWS and Google Cloud | Details not publicly disclosed | Demonstrated that smaller-than-Opus models could match or exceed frontier capability through training improvements
Grok 2 | xAI | August 2024 | Colossus (100K H100) | Estimated above 1e25 FLOP; first major model trained on Colossus | First xAI model at the frontier capability tier
o1 / o1-preview | OpenAI | September 2024 | Microsoft Azure | Test-time compute scaling rather than pure pretraining scaling | First widely deployed reasoning model; established that capability gains could come from inference-time compute, not just pretraining
Gemini 2.0 | Google | December 2024 | Google TPU v5p / Trillium | Estimated above 1e25 FLOP; multimodal with agentic capabilities | Google's response to o1 and o3; agentic capabilities natively integrated
DeepSeek-V3 | DeepSeek | December 2024 | ~2,048 H800 GPUs | ~5.6e24 FLOP; 671B-parameter MoE (37B active); ~$5.6M reported pretraining cost | Demonstrated dramatic cost reduction in frontier training; sparked industry-wide reassessment of compute efficiency
DeepSeek-R1 | DeepSeek | January 2025 | DeepSeek H800 cluster | Reasoning-focused training on top of the V3 base | First open-weight reasoning model at the frontier capability tier; competitive with o1 on reasoning benchmarks
Grok 3 / Grok 3 Reasoning | xAI | February 2025 | Colossus (100K-200K H100) | Pretraining at unprecedented scale per xAI; reasoning variant trained with reinforcement learning | Leveraged the Colossus expansion to 200K GPUs; the reasoning variant established the RL scaling pattern xAI extended into Grok 4
GPT-4.5 | OpenAI | February 2025 | Microsoft Azure capacity | Estimated ~10x GPT-4 compute (~2e26 FLOP) | Mixed reception sparked industry-wide debate on diminishing returns from pure pretraining scaling
Claude 3.7 Sonnet | Anthropic | February 2025 | AWS and Google Cloud | Details not publicly disclosed | Anthropic's first hybrid reasoning model; established competitive coding performance vs OpenAI
Gemini 2.5 Pro | Google | March 2025 | Google TPU v5p / Trillium | Estimated above 1e25 FLOP | Established Gemini as a competitive frontier model on long-context and reasoning tasks
Llama 4 Behemoth | Meta | April 2025 (in training) | Meta GPU cluster, 32K GPUs, FP8 pretraining | >30T training tokens (more than 2x Llama 3); ~2T-parameter MoE; achieved 390 TFLOPs/GPU FP8 utilization | Meta's largest training run; teacher model for Llama 4 family distillation
Llama 4 Scout / Maverick | Meta | April 2025 | Meta H100 cluster (>100K H100s announced) | Scout: 17B active / 109B total, ~40T training tokens; Maverick: 17B active / 402B total, ~22T training tokens | First Meta production-grade MoE; 10M-token context window in Scout; open-weight at the frontier capability tier
Claude Opus 4 | Anthropic | May 2025 | AWS Trainium2 (Project Rainier) and Google TPU | Details not publicly disclosed | First major Anthropic model trained substantially on AWS Trainium; established the Project Rainier infrastructure
Grok 4 / Grok 4 Heavy | xAI | July 2025 | Colossus (200K H100) | RL training at pretraining scale; xAI scaled RL compute to match pretraining compute; 6x efficiency improvement reported | First major run to scale RL compute to match pretraining compute; the Grok 4 Heavy variant uses parallel test-time compute with multiple agents
GPT-5 | OpenAI | Summer 2025 | Stargate Abilene + Microsoft Azure | Smaller than the trajectory predicted by 100x-per-generation scaling; reflected the industry pivot from pretraining-dominated to mixed pretraining/RL/test-time compute | Reversed the prior 100x scaling trend; marked the start of the post-pretraining-scaling era
Claude Opus 4.5 / 4.6 / 4.7 | Anthropic | 2025-2026 | AWS Trainium2 (Project Rainier expansion) and Google TPU | Details not publicly disclosed | Iterative Opus generation; Project Rainier scaled to multi-hundred-thousand-Trainium2 capacity for the campaign
Grok 5 | xAI | In training (2026) | Colossus 1 + Colossus 2 (combined >1M H100-equivalent) | Details not yet disclosed; ongoing as of early 2026 | First training run on a gigawatt-class cluster; targets capabilities including scientific discovery and autonomous engineering per xAI public statements

Inflection points

Inflection | Marker | What changed
The 1e25 FLOP threshold | GPT-4 (March 2023) | Established the modern frontier; later codified as the EU AI Act regulatory threshold
MoE goes mainstream | GPT-4 (leaked architecture); DeepSeek-V3; Llama 4 | Sparse activation became the default architecture for cost-efficient scaling
Reasoning models | o1 (September 2024); DeepSeek-R1 (January 2025) | Test-time compute scaling proved capability gains beyond pretraining-only scaling; reasoning RL became a separate training discipline
100K GPU cluster operational | Colossus (built in 122 days, summer 2024) | Demonstrated that single-cluster builds at 100K GPU scale were achievable in months rather than years
Cost compression | DeepSeek-V3 ($5.6M pretraining) | Demonstrated frontier capabilities at a fraction of US lab costs; sparked industry reassessment of capital-intensity assumptions
Pretraining scaling pause | GPT-4.5 reception (February 2025); GPT-5 deviation from the 100x trajectory | Industry-wide pivot from pure pretraining scaling to mixed pretraining + RL + test-time compute; "pretraining isn't dead but isn't sufficient" became the operational consensus
RL at pretraining scale | Grok 4 (July 2025) | First major run scaling RL compute to match pretraining compute; a new architectural pattern
Gigawatt-class cluster | Colossus 2 (operational January 2026) | First training cluster crossing 1 GW; established the next infrastructure tier

What the run history reveals

Three patterns run through the training-run timeline. First, compute per run grew roughly 100x per generation from GPT-3 (3.14e23 FLOP) through GPT-4 (2.15e25 FLOP, a ~68x jump) - then deviated. GPT-4.5 was approximately 10x GPT-4, and GPT-5 reportedly came in below the 100x trajectory entirely. The 100x-per-generation scaling pattern that defined the 2020-2023 era visibly broke around 2024-2025, replaced by mixed strategies combining pretraining, RL, and test-time compute.
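The break in the trend falls out of simple division of the per-run figures above. A minimal sketch using this page's own numbers (the GPT-4.5 figure is itself an estimate):

```python
# Training compute per OpenAI flagship generation, in FLOP,
# taken from the table on this page (GPT-4.5 is an estimate).
runs = {"GPT-3": 3.14e23, "GPT-4": 2.15e25, "GPT-4.5": 2e26}

names = list(runs)
for prev, cur in zip(names, names[1:]):
    ratio = runs[cur] / runs[prev]
    print(f"{prev} -> {cur}: ~{ratio:.0f}x")
# GPT-3 -> GPT-4:   ~68x (near the 100x-per-generation trend)
# GPT-4 -> GPT-4.5: ~9x  (well below it)
```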

Second, the cost-to-capability ratio compressed dramatically over the same window. GPT-4 cost approximately $63M in compute. DeepSeek-V3 reported approximately $5.6M for comparable benchmark performance two years later. The compression came from architectural innovation (MoE, sparse activation), training efficiency improvements (FP8, better data curation), and hardware progress. The implication is that frontier capability is becoming achievable at smaller compute budgets, even as the absolute compute frontier keeps moving up.
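The compression can be quantified roughly from the two reported cost figures above; a back-of-envelope sketch, noting that both dollar amounts are reported estimates and "comparable benchmark performance" is itself approximate:

```python
# Reported figures from this page (both are estimates).
gpt4_cost, gpt4_flop = 63e6, 2.15e25   # GPT-4: ~$63M, ~2.15e25 FLOP
dsv3_cost, dsv3_flop = 5.6e6, 5.6e24   # DeepSeek-V3: ~$5.6M, ~5.6e24 FLOP

print(f"cost ratio:    ~{gpt4_cost / dsv3_cost:.1f}x")   # ~11x cheaper
print(f"compute ratio: ~{gpt4_flop / dsv3_flop:.1f}x")   # ~3.8x less compute
# Residual improvement in cost per FLOP (hardware + efficiency gains):
print(f"$/FLOP ratio:  ~{(gpt4_cost / gpt4_flop) / (dsv3_cost / dsv3_flop):.1f}x")
```

The split suggests the ~11x headline compression decomposes into roughly 3.8x less compute consumed (architecture and data efficiency) and roughly 3x cheaper compute per FLOP, consistent with the three sources of compression named above.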

Third, the infrastructure has consolidated around fewer, larger clusters. The early frontier era used multiple smaller clusters across multiple operators. The 2024-2026 era has shifted toward named flagship clusters (Colossus, Stargate Abilene, Meta Hyperion) where billions of dollars of capex concentrate at a single site or a small number of sites. The trend reflects both technical advantages of training at single-site scale and the practical reality that gigawatt-class deployments cannot be replicated at every operator's site simultaneously.



Where this fits

This page covers events. The AI Training Superclusters page covers the infrastructure that hosted them. The Sites pillar covers the named campuses that contain those superclusters. The Bottleneck Atlas covers the supply chain dependencies (HBM, CoWoS, GPUs, transformers) that gate which runs can actually happen. Cross-network references run to SX:NVIDIA Spotlight for the silicon side and EX:Nuclear Energy for the power infrastructure that increasingly anchors training cluster siting.


Related coverage

AI Training Superclusters | AI Factory | Sites | Bottleneck Atlas | xAI Colossus | Stargate | Meta Hyperion | Tesla Dojo | SX:NVIDIA Spotlight | SX:HBM | SX:CoWoS