Frontier AI Training Runs
Frontier training runs are the events through which the largest AI infrastructure in the world gets used. A training run is time-bounded - it has a start, a duration, a compute cost, and an outcome (a model with measurable capabilities). Unlike the superclusters that host them, training runs are not enduring entities; they happen, consume their compute budget, produce their model, and end. The cluster persists; the run does not.
This page tracks the major training runs of the foundation model era - their compute, parameters, training tokens, host cluster, and capability outcomes. The companion AI Training Superclusters page covers the infrastructure that hosted these runs. New runs get added as they conclude and as details become public. Many training run details remain undisclosed by their operators; this page documents what is publicly known and clearly labels estimates where official figures are not available.
The 1e25 FLOP era
GPT-4 was the first model trained above 1e25 FLOP, a threshold that has since become the industry's rough boundary for "frontier" training. As of mid-2025, more than 30 publicly announced models from at least 12 developers have crossed it, with new models added at a rate of roughly two per month through 2024. The same 1e25 FLOP threshold defines the regulatory boundary for the EU AI Act's general-purpose AI model with systemic risk classification, which adds compliance obligations for models above this scale.
The runs
| Model | Operator | Released | Cluster / Hardware | Training compute | Notes |
|---|---|---|---|---|---|
| GPT-3 | OpenAI | June 2020 | Microsoft Azure V100 cluster | ~3.14e23 FLOP; 175B parameters; ~300B training tokens | First model to demonstrate emergent capabilities at scale; established the foundation model paradigm |
| PaLM | Google | April 2022 | Two TPU v4 pods, ~6,144 TPU v4 chips | ~2.5e24 FLOP; 540B dense parameters; 780B tokens | First publicly disclosed run at TPU v4 pod scale; demonstrated cross-pod training feasibility |
| GPT-4 | OpenAI | March 2023 | Microsoft Azure A100 cluster, ~25K A100s | ~2.15e25 FLOP; 90-100 days; ~$63M hardware cost; reported as 1.8T parameter MoE with ~280B active (per leaked SemiAnalysis details); 13T training tokens | First model crossing 1e25 FLOP threshold; first widely deployed MoE foundation model; defined the modern frontier era |
| Gemini Ultra (1.0) | Google | December 2023 | TPU v4 / v5e pods | Estimated above 1e25 FLOP; natively multimodal from training | First Google response at the GPT-4 capability tier; natively multimodal architecture |
| Claude 3 Opus | Anthropic | March 2024 | AWS and Google Cloud (TPU and GPU) | Estimated above 1e25 FLOP; details not publicly disclosed | Anthropic's first frontier-tier model; established Claude as a frontier competitor |
| Llama 3 405B | Meta | July 2024 | Meta Research SuperCluster H100 fleet (~16K H100s) | ~3.8e25 FLOP; 405B dense parameters; 15T training tokens; 54-day cumulative pretraining | Largest open-weight model at release; established Llama 3 as competitive with closed frontier models |
| GPT-4o | OpenAI | May 2024 | Microsoft Azure H100 capacity | Comparable to GPT-4 scale; native multimodal training | First OpenAI native multimodal model; voice and vision integrated from training |
| Claude 3.5 Sonnet | Anthropic | June 2024 | AWS and Google Cloud | Details not publicly disclosed | Demonstrated that smaller-than-Opus models could match or exceed frontier capability through training improvements |
| Grok 2 | xAI | August 2024 | Colossus 100K H100 | Estimated above 1e25 FLOP; first major model trained on Colossus | First xAI model at frontier capability tier |
| o1 / o1-preview | OpenAI | September 2024 | Microsoft Azure | Test-time compute scaling rather than pure pretraining scaling | First widely deployed reasoning model; established that capability gains could come from inference-time compute, not just pretraining |
| Gemini 2.0 | Google | December 2024 | TPU v5p / Trillium | Estimated above 1e25 FLOP; multimodal with agentic capabilities | Google's response to o1 and o3; agentic capabilities natively integrated |
| DeepSeek-V3 | DeepSeek | December 2024 | ~2,048 H800 GPUs | ~5.6e24 FLOP; 671B MoE (37B active); ~$5.6M reported pretraining cost | Demonstrated dramatic cost reduction in frontier training; sparked industry-wide reassessment of compute efficiency |
| DeepSeek-R1 | DeepSeek | January 2025 | DeepSeek H800 cluster | Reasoning-focused training on top of V3 base | First open-weight reasoning model at frontier capability tier; competitive with o1 on reasoning benchmarks |
| Grok 3 / Grok 3 Reasoning | xAI | February 2025 | Colossus 100K-200K H100 | Pretraining at unprecedented scale per xAI; reasoning variant trained with reinforcement learning | Leveraged Colossus expansion to 200K GPUs; reasoning variant established RL scaling pattern xAI extended into Grok 4 |
| GPT-4.5 | OpenAI | February 2025 | Microsoft Azure capacity | Estimated ~10x GPT-4 compute (~2e26 FLOP) | Mixed reception sparked industry-wide debate on diminishing returns from pure pretraining scaling |
| Claude 3.7 Sonnet | Anthropic | February 2025 | AWS and Google Cloud | Details not publicly disclosed | Anthropic's first hybrid reasoning model; established competitive coding performance vs OpenAI |
| Gemini 2.5 Pro | Google | March 2025 | TPU v5p / Trillium | Estimated above 1e25 FLOP | Established Gemini as a competitive frontier model on long-context and reasoning |
| Llama 4 Behemoth | Meta | April 2025 (in training) | Meta GPU cluster, 32K GPUs FP8 pretraining | >30T training tokens (more than 2x Llama 3); ~2T parameter MoE; achieved 390 TFLOPs/GPU FP8 utilization | Meta's largest training run; teacher model for Llama 4 family distillation |
| Llama 4 Scout / Maverick | Meta | April 2025 | Meta H100 cluster (>100K H100s announced) | Scout: 17B active / 109B total; ~40T training tokens. Maverick: 17B active / 402B total; ~22T training tokens | First Meta production-grade MoE; 10M context window in Scout; open-weight at frontier capability tier |
| Claude Opus 4 | Anthropic | May 2025 | AWS Trainium2 (Project Rainier) and Google TPU | Details not publicly disclosed | First major Anthropic model trained substantially on AWS Trainium; established Project Rainier infrastructure |
| Grok 4 / Grok 4 Heavy | xAI | July 2025 | Colossus 200K H100 | RL training at pretraining scale; xAI scaled RL compute to match pretraining compute; 6x efficiency improvement reported | First major run scaling RL compute to match pretraining compute; Grok 4 Heavy variant uses parallel test-time compute with multiple agents |
| GPT-5 | OpenAI | Summer 2025 | Stargate Abilene + Microsoft Azure | Not publicly disclosed; reportedly below the trajectory predicted by 100x-per-generation scaling | Reversed the prior 100x scaling trend; reflected the industry pivot from pretraining-dominated to mixed pretraining/RL/test-time compute; established the post-pretraining-scaling era |
| Claude Opus 4.5 / 4.6 / 4.7 | Anthropic | 2025-2026 | AWS Trainium2 (Project Rainier expansion) and Google TPU | Details not publicly disclosed | Iterative Opus generation; Project Rainier scaled to multi-hundred-thousand-Trainium2 capacity for the campaign |
| Grok 5 | xAI | In training (2026) | Colossus 1 + Colossus 2 (combined >1M H100-equivalent; first gigawatt-scale training cluster operational) | Not publicly disclosed; ongoing as of early 2026 | First training run on a gigawatt-class cluster; targets capabilities including scientific discovery and autonomous engineering per xAI public statements |
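The compute figures above can be cross-checked with the widely used C ≈ 6·N·D approximation (training FLOP ≈ 6 × active parameters × training tokens). Below is a minimal sketch using only figures from the table; none of these labs has confirmed this accounting for their runs, and the heuristic ignores attention FLOPs and activation recomputation, so agreement within a few tens of percent is the most one should expect.

```python
# Sanity-check disclosed training-compute figures with the standard
# C ~= 6 * N * D approximation (FLOP ~= 6 x active params x tokens).
# All figures come from the runs table above.

RUNS = {
    # name: (active parameters, training tokens, reported/estimated FLOP)
    "GPT-3":        (175e9, 300e9, 3.14e23),
    "GPT-4":        (280e9, 13e12, 2.15e25),  # MoE: only active params count
    "Llama 3 405B": (405e9, 15e12, 3.8e25),
}

for name, (n_active, tokens, reported) in RUNS.items():
    estimate = 6 * n_active * tokens
    print(f"{name:>12}: 6ND = {estimate:.2e} FLOP "
          f"(reported {reported:.2e}, ratio {estimate / reported:.2f})")
```

For MoE models only the active parameter count enters the estimate, which is why GPT-4's reported 1.8T total parameters play no role in its ~2.15e25 FLOP figure.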
Inflection points
| Inflection | Marker | What changed |
|---|---|---|
| The 1e25 FLOP threshold | GPT-4 (March 2023) | Established the modern frontier; later codified into EU AI Act regulatory threshold |
| MoE goes mainstream | GPT-4 (leaked architecture); DeepSeek-V3, Llama 4 | Sparse activation became the default architecture for cost-efficient scaling |
| Reasoning models | o1 (September 2024); DeepSeek-R1 (January 2025) | Test-time compute scaling proved capability gains beyond pretraining-only scaling; reasoning RL became a separate training discipline |
| 100K GPU cluster operational | Colossus (built in 122 days, summer 2024) | Demonstrated that single-cluster builds at 100K GPU scale were achievable in months rather than years |
| Cost compression | DeepSeek-V3 ($5.6M pretraining) | Demonstrated frontier capabilities at a fraction of US lab costs; sparked industry reassessment of capital intensity assumptions |
| Pretraining scaling pause | GPT-4.5 reception (February 2025); GPT-5 deviation from 100x trajectory | Industry-wide pivot from pure pretraining scaling to mixed pretraining + RL + test-time compute; "pretraining isn't dead but isn't sufficient" became the operational consensus |
| RL at pretraining scale | Grok 4 (July 2025) | First major run scaling RL compute to match pretraining compute; new architectural pattern |
| Gigawatt-class cluster | Colossus 2 (operational January 2026) | First training cluster crossing 1 GW; established the next infrastructure tier |
What the run history reveals
Three patterns run through the training-run timeline. First, compute per run grew by roughly two orders of magnitude per generation from GPT-3 (3.14e23 FLOP) to GPT-4 (2.15e25 FLOP, a ~68x jump) - then deviated. GPT-4.5 was approximately 10x GPT-4's compute, and GPT-5 reportedly came in below the 100x trajectory entirely. The 100x-per-generation scaling pattern that defined the 2020-2023 era visibly broke around 2024-2025, replaced by mixed strategies combining pretraining, RL, and test-time compute.
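The break is visible in simple ratio arithmetic over the figures above; a quick sketch (GPT-4.5's ~2e26 FLOP is an estimate, and GPT-5's compute is undisclosed, so it is omitted rather than guessed):

```python
# Generation-over-generation compute ratios, using figures from the
# runs table. GPT-4.5's ~2e26 FLOP is an estimate, so treat its ratio
# as indicative only.
GENERATIONS = [
    ("GPT-3", 3.14e23),
    ("GPT-4", 2.15e25),
    ("GPT-4.5", 2e26),
]

for (prev, c_prev), (curr, c_curr) in zip(GENERATIONS, GENERATIONS[1:]):
    print(f"{prev} -> {curr}: {c_curr / c_prev:.0f}x")
# GPT-3 -> GPT-4:   ~68x (near the 100x-per-generation pattern)
# GPT-4 -> GPT-4.5: ~9x  (the visible break in the trend)
```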
Second, the cost-to-capability ratio compressed dramatically over the same window. GPT-4 cost approximately $63M in compute. DeepSeek-V3 reported approximately $5.6M for comparable benchmark performance two years later. The compression came from architectural innovation (MoE, sparse activation), training efficiency improvements (FP8, better data curation), and hardware progress. The implication is that frontier capability is becoming achievable at smaller compute budgets, even as the absolute compute frontier keeps moving up.
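The compression decomposes into two multiplicative factors: needing fewer FLOPs for comparable capability, and paying less per FLOP. A rough sketch using the reported figures above, which carry significant uncertainty (the GPT-4 cost is an estimate and "comparable performance" is a benchmark-level claim):

```python
# Decompose the cost compression between GPT-4 and DeepSeek-V3 into
# (a) fewer FLOPs needed and (b) cheaper FLOPs, using the reported
# figures from this page. Both inputs are approximate.
gpt4_cost, gpt4_flop = 63e6, 2.15e25
dsv3_cost, dsv3_flop = 5.6e6, 5.6e24

total = gpt4_cost / dsv3_cost                                    # ~11x cheaper overall
fewer_flops = gpt4_flop / dsv3_flop                              # ~3.8x fewer FLOPs
cheaper_flops = (gpt4_cost / gpt4_flop) / (dsv3_cost / dsv3_flop)  # ~2.9x cheaper per FLOP

print(f"total: {total:.1f}x = {fewer_flops:.1f}x fewer FLOPs "
      f"x {cheaper_flops:.1f}x cheaper per FLOP")
```

On these figures, the larger factor is needing fewer FLOPs (architecture and data efficiency), with the remainder from cheaper FLOPs (hardware generation and FP8 training) - consistent with the attribution in the paragraph above.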
Third, the infrastructure has consolidated around fewer, larger clusters. The early frontier era used multiple smaller clusters across multiple operators. The 2024-2026 era has shifted toward named flagship clusters (Colossus, Stargate Abilene, Meta Hyperion) where billions of dollars of capex concentrate at a single site or a small number of sites. The trend reflects both technical advantages of training at single-site scale and the practical reality that gigawatt-class deployments cannot be replicated at every operator's site simultaneously.
Where this fits
This page covers events. The AI Training Superclusters page covers the infrastructure that hosted them. The Sites pillar covers the named campuses that contain those superclusters. The Bottleneck Atlas covers the supply chain dependencies (HBM, CoWoS, GPUs, transformers) that gate which runs can actually happen. Cross-network references run to SX:NVIDIA Spotlight for the silicon side and EX:Nuclear Energy for the power infrastructure that increasingly anchors training cluster siting.
Related coverage
AI Training Superclusters | AI Factory | Sites | Bottleneck Atlas | xAI Colossus | Stargate | Meta Hyperion | Tesla Dojo | SX:NVIDIA Spotlight | SX:HBM | SX:CoWoS