Data Center Industry Glossary


This analyst-level glossary collects key terms, acronyms, and concepts used across DatacentersX, focusing on facility architecture, cooling, power, AI infrastructure, networking, operations, sustainability, and the supply chain that builds and powers data centers.


2N
Power and cooling redundancy scheme with fully duplicated capacity; either side can carry full load. Standard for Tier IV facilities and mission-critical electrical distribution.

800G optical transceiver
Optical module providing 800 gigabits per second of network bandwidth, common in modern AI training cluster fabric and high-speed inter-cluster links. Supply was structurally constrained through 2024-2025; Coherent, Lumentum, Innolight, Eoptolink, and Accelink dominate. Co-packaged optics is the next evolution.

Active-active
Resilience pattern where multiple sites or systems run simultaneously, all serving load, with automatic failover between them. Increasingly preferred over active-passive for cloud workloads.

AI Factory
A data center purpose-built for AI training or large-scale inference, featuring exceptional GPU density, liquid cooling, high-bandwidth fabric, and multi-hundred-megawatt to multi-gigawatt power. Examples include xAI Colossus, Tesla Cortex, Stargate Abilene, Meta Hyperion. A first-class facility category alongside hyperscaler, HPC, enterprise, and colocation.

AI Inference
The runtime application of trained AI models to produce outputs. Distinct from AI Training. Sub-categories include hyperscale inference, on-prem inference, edge inference, and on-device inference. Has its own SLO categories (TTFT, TPOT, throughput, P99 latency).

AI Training
The compute-intensive process of producing AI models from data. Tightly-coupled, high-bandwidth, multi-GPU workloads run on AI Training Superclusters. Distinct from inference in workload character (batch vs request-driven), latency requirements, and resilience patterns (checkpointing makes training tolerant of facility downtime that inference cannot accept).

AIOps
AI for IT Operations. Machine learning applied to operational telemetry to detect anomalies, predict failures, correlate alerts, and automate remediation. Lives between Telemetry & Observability and Platform Reliability Engineering. Trust gradient runs detection (high-trust) → correlation → prediction → automated remediation (low-trust).
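The detection end of that gradient can be as simple as statistical thresholding on telemetry. A minimal sketch (window size, threshold, and readings are illustrative, not any vendor's algorithm):

```python
# Hedged sketch: flag telemetry points whose z-score against a trailing
# window exceeds a threshold -- the "detection" stage of an AIOps pipeline.
from statistics import mean, stdev

def anomalies(samples, window=10, z_threshold=3.0):
    """Return indices of points far outside the trailing-window distribution."""
    flagged = []
    for i in range(window, len(samples)):
        trailing = samples[i - window:i]
        mu, sigma = mean(trailing), stdev(trailing)
        if sigma > 0 and abs(samples[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# Steady inlet-temperature readings with one spike at index 12.
readings = [22.0, 22.1, 21.9, 22.0, 22.2, 21.8, 22.0,
            22.1, 21.9, 22.0, 22.1, 21.9, 30.0]
print(anomalies(readings))  # [12] -- the spike is flagged
```

Correlation, prediction, and automated remediation build on detections like these, which is why trust declines along the gradient.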

Air-cooled
Cooling architecture using air as the primary heat-removal medium. Adequate to ~30-50 kW per rack with hot/cold aisle containment; insufficient for modern AI training densities.

Anti-passback
Physical security rule that a credential cannot enter a space without first having exited it, preventing one person from using the same badge to admit a second. Enforced by access control systems through reader pairing.

ASHRAE TC 9.9
The American Society of Heating, Refrigerating and Air-Conditioning Engineers Technical Committee 9.9, which publishes thermal guidelines for data center environments. Recommended envelope (18-27°C inlet, 60% RH max) and allowable envelope (15-32°C for class A1) are de facto industry standards.

Availability
Uptime expressed as percentage of total time. Common targets: 99.9% (8.76 hours/year downtime), 99.99% (52.6 minutes/year), 99.999% (5.26 minutes/year), 99.9999% (31.5 seconds/year).
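The downtime figures follow directly from the arithmetic; a quick sketch:

```python
# Convert an availability percentage into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/yr")
# 99.9%  -> 525.60 min/yr (8.76 hours)
# 99.99% -> 52.56 min/yr, and so on down the table above
```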

Availability Zone (AZ)
A logically isolated facility (or set of facilities) within a cloud provider's region, with independent power and cooling. Multi-AZ deployment is the standard pattern for higher cloud SLA tiers.

BACnet
Building Automation and Control Networks protocol (ANSI/ASHRAE Standard 135; ISO 16484-5). Common protocol for BMS integration with sensors and actuators across cooling, lighting, and life safety systems.

Base isolation
Seismic design technique placing flexible bearings between building foundation and superstructure, decoupling the structure from ground motion. Reduces transmitted accelerations 4-10x; common at high-availability facilities in Japan, Taiwan, New Zealand, and selected California sites.

BCDR — Business Continuity and Disaster Recovery
The discipline of recovering services after catastrophic events affecting entire facilities or regions. Includes documented procedures, recovery time objectives (RTO), recovery point objectives (RPO), and secondary site arrangements. ISO 22301 provides international framework.

Behind-the-meter (BTM)
Power generation or storage located on the customer side of the utility meter, bypassing grid interconnection for some or all consumption. Increasingly strategic for AI factories given multi-year grid interconnection queues. Examples include Talen Susquehanna / AWS and the Fermi Hypergrid SMR roadmap; the Three Mile Island Unit 1 / Microsoft deal is, by contrast, a grid-side PPA.

BESS — Battery Energy Storage System
Stationary battery installation providing peak shaving, frequency regulation, ride-through, energy arbitrage, or renewable firming. Lithium-ion dominant; flow batteries and emerging chemistries in selected applications. Tesla Megapack, Fluence, BYD, CATL, Wartsila, Saft are major integrators. NFPA 855 governs installation safety in the US.

BGP — Border Gateway Protocol
The internet's primary routing protocol, used between autonomous systems for path advertisement and selection. Modern data center fabrics increasingly run BGP on every host (BGP-on-host) for fast convergence. RPKI and ROV provide BGP route security.

Blackwell (B200, GB200)
NVIDIA's GPU architecture generation succeeding Hopper (H100/H200). GB200 pairs Blackwell GPUs with Grace CPUs in NVL72 racks providing 72-GPU NVLink coherence at rack scale. Production ramp 2024-2026.

Blameless postmortem
Post-incident review focused on systemic and process factors rather than individual blame. Originated at Google; foundational practice in Platform Reliability Engineering. The blameless framing encourages honest disclosure of what actually happened.

BMC — Baseboard Management Controller
Server-integrated controller operating independently of main processors, providing out-of-band remote management including power control, KVM redirect, firmware updates, and hardware health monitoring. Implementations include HPE iLO, Dell iDRAC, Lenovo XClarity, Supermicro IPMI, OpenBMC.

BMS — Building Management System
The integrated platform controlling building HVAC, cooling, lighting, fire, and life safety systems. Operates as the active control layer for facility infrastructure. Common platforms include Honeywell, Johnson Controls, Siemens, Schneider EcoStruxure Building.

Bottleneck Atlas
DatacentersX flagship analytical reference mapping the 15 most concentrated chokepoints in data center buildout, ranked by cross-program leverage. Top constraints are power equipment and grid interconnection rather than compute, reflecting the post-2023 shift from compute-limited to power-limited industry growth.

Bus duct / Busway
Prefabricated electrical distribution system using rigid copper or aluminum conductors in protective enclosure. Used for high-current power distribution to PDUs and large equipment.

Capacity factor
The ratio of actual energy production over time to maximum possible production. Used for renewable resource economics; nuclear typically 90%+, solar 20-30% depending on geography, wind 30-50%.
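The ratio itself is simple arithmetic; a sketch with illustrative numbers:

```python
# Capacity factor: actual energy produced over nameplate maximum for the period.
def capacity_factor(actual_mwh, nameplate_mw, hours):
    return actual_mwh / (nameplate_mw * hours)

# A 100 MW solar farm producing 219,000 MWh over a year (8,760 hours):
print(capacity_factor(219_000, 100, 8_760))  # 0.25, i.e. 25%
```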

Carrier hotel
A building (or set of buildings) where many telecom carriers terminate networks specifically for interconnection. Examples: 60 Hudson NYC, 1 Wilshire Los Angeles, Telehouse North London, Equinix TY2 Tokyo, AMS-5 Amsterdam.

Carrier-neutral
Operator stance of providing facility access to multiple carriers without preferential relationships. Standard for retail colocation; differentiates carrier-hotel-style facilities.

CDU — Coolant Distribution Unit
Hydraulic equipment distributing coolant to liquid-cooled racks, separating facility water from rack-level coolant. Critical bottleneck for AI factory deployment; Vertiv expanded CDU capacity 45x in 2024-2025; Schneider acquired Motivair for $850M.

CEMS — Continuous Emissions Monitoring System
Regulatory instrument for emission sources subject to Title V permitting and NSPS requirements. Includes sample probes, conditioning, analyzers (NOx, CO, SO2, CO2, O2), and reporting infrastructure. 40 CFR 75 governs in the US.

CFD — Computational Fluid Dynamics
Engineering simulation technique solving fluid flow and heat transfer equations across discretized space. Foundational technique for data center thermal analysis, used in design-time validation and operational digital twins.

CFE — Carbon-Free Energy
Energy supply matched to carbon-free generation. 24/7 hourly carbon-free matching is the high-integrity standard; Google's commitment to 24/7 CFE by 2030 is the leading public target. Annual matching is the legacy approach.

Change failure rate
One of the four PRE metrics. Percentage of changes (deployments, configurations) that cause incidents. DORA framework benchmark.

Chaos engineering
Discipline of deliberately introducing failures into production systems to verify resilience actually works. Originated at Netflix (Chaos Monkey); now standard practice at hyperscalers. Tools: Gremlin, Steadybit, AWS Fault Injection Service, Azure Chaos Studio.

Checkpointing
AI training resilience technique saving model state every N minutes, allowing run resumption after node or facility failure with limited compute waste. Enables "Tier I/II acceptable for training" pattern at xAI Colossus, Stargate, and Meta AI campuses.
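The pattern can be sketched in a few lines (the file path, interval, and simulated failure are illustrative, not any framework's API):

```python
# Hedged sketch of checkpoint/resume: save state periodically so that a
# failure loses at most one interval's worth of compute.
import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt_demo.json")
CHECKPOINT_EVERY = 100  # steps; production systems save on a wall-clock cadence

def run(total_steps, crash_at=None):
    """Resume from the last checkpoint if present; optionally simulate a failure."""
    step = json.load(open(CKPT))["step"] if os.path.exists(CKPT) else 0
    while step < total_steps:
        step += 1  # stand-in for one training step
        if crash_at is not None and step == crash_at:
            return step  # failure: progress since the last save is lost
        if step % CHECKPOINT_EVERY == 0:
            with open(CKPT, "w") as f:
                json.dump({"step": step}, f)
    return step

if os.path.exists(CKPT):
    os.remove(CKPT)                  # start the demonstration fresh
run(1000, crash_at=450)              # failure at step 450
resumed_from = json.load(open(CKPT))["step"]
print(resumed_from)                  # 400 -> only 50 steps of compute are wasted
os.remove(CKPT)
```

Bounding lost work this way is what lets training tolerate facility downtime that inference cannot.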

Chiller
Refrigeration equipment producing chilled water for facility cooling. Major manufacturers include Trane, Carrier, Daikin, Johnson Controls (York), Mitsubishi Electric. Chiller plant trip is one of the dominant facility outage modes.

Clean agent suppression
Fire suppression using gaseous agents that extinguish fires without water damage to equipment. Common agents: FK-5-1-12 (Novec 1230), HFC-227ea (FM-200), inert gas (IG-541, argon, nitrogen). PFAS scrutiny is reshaping agent selection.

Cleanroom (data center context)
Particulate-controlled environment within a data center, typically applied to telecom rooms and electrical equipment areas. Less stringent than semiconductor fab cleanrooms.

Clos topology
Multi-stage switching architecture providing scalable, non-blocking connectivity. Modern data center fabrics use Clos as spine-leaf topology; replaced legacy three-tier (core-aggregation-access) Ethernet design.

Cluster
Within the Stack pillar, the layer between rack and facility, comprising integrated multi-rack compute systems. NVL72 is a 72-GPU rack-scale cluster; superclusters span thousands to hundreds of thousands of GPUs across multiple racks and rooms.

CMDB — Configuration Management Database
Authoritative inventory of IT assets and their relationships. Often integrated with DCIM for facility-side awareness.

Colocation (colo)
Operator model providing facility infrastructure (power, cooling, space) for tenants who bring their own IT hardware. Sub-types: retail (small footprints, many tenants, carrier neutrality) and wholesale (large blocks, few tenants, hyperscale-focused). Major operators: Equinix, Digital Realty, CyrusOne, QTS, CoreSite.

Concentration test
Analytical discipline from the SiliconPlans network for evaluating claims of supply chain chokepoints: is the claimed bottleneck actually concentrated, genuinely critical, and structurally difficult to substitute?

Containment (hot aisle / cold aisle)
Physical separation of hot and cold air streams in air-cooled data halls, preventing mixing that wastes cooling energy. Containment effectiveness depends on a maintained pressure differential, which is a leading indicator for capacity issues.

Cortex (Tesla)
Tesla's AI inference compute campus at Giga Texas, supporting FSD fleet operations and humanoid robotics. Co-located with Dojo training. Part of Tesla's vertically integrated silicon-to-deployment AI architecture.

CoWoS — Chip on Wafer on Substrate
TSMC advanced packaging integrating logic dies with HBM memory via silicon interposer. Essential for NVIDIA H100/H200/B200/Rubin and similar AI accelerators. Structural supply chain bottleneck.

Cross-connect
Physical connection between tenants and carriers (or between tenants) within a colocation facility, typically copper or fiber jumper at meet-me room. Cross-connect inventory is core operational data for retail colocation.

CUE — Carbon Usage Effectiveness
CO2 emissions per unit IT energy. Varies with grid carbon intensity; site-specific and time-varying. Companion metric to PUE and WUE.

DCIM — Data Center Infrastructure Management
Integration platform holding the asset inventory, capacity model, change history, and operational telemetry. Distinct from BMS (active cooling control) and EPMS (electrical monitoring); DCIM is the asset and capacity authority. Major platforms: Schneider EcoStruxure IT, Vertiv Trellis, Eaton Brightlayer, Nlyte, Sunbird, FNT, Cadence Future Facilities (CFD-integrated).

DCM — Direct Current Microgrid
Microgrid architecture using DC distribution rather than AC, reducing conversion losses. Emerging pattern in some specialty data center deployments.

DCGM — NVIDIA Data Center GPU Manager
NVIDIA's GPU telemetry and management platform. Provides per-GPU health, ECC error trending, NVLink connectivity verification, thermal monitoring. Foundational tooling for AI training cluster operations.

DDoS — Distributed Denial of Service
Cyber attack pattern overwhelming target with traffic from many sources. Modern attacks regularly exceed 1 Tbps. Mitigation through upstream scrubbing, in-line appliances, cloud-based services (Cloudflare Magic Transit, AWS Shield, Akamai).

Demarc — Demarcation room
Termination point in a facility where carrier responsibility ends and operator responsibility begins. Houses optical distribution frames and cross-connect cabinets.

Dojo (Tesla)
Tesla's AI training campus at Giga Texas. Custom silicon (D1 original, D3/Dojo3 successor) targeting Terafab production. Vertically integrated training-and-deployment architecture.

DORA — DevOps Research and Assessment
Industry research framework defining four (later six) metrics for engineering organization performance: deployment frequency, lead time for changes, change failure rate, MTTR, plus reliability and availability. Standard reporting framework for Platform Reliability Engineering.

DPU — Data Processing Unit
Specialized programmable processor offloading networking, security, and storage functions from main CPUs. NVIDIA BlueField, AMD Pensando (acquired), Intel IPU are major implementations. Increasingly used for management plane in addition to traditional offload.

Dry cooling
Cooling architecture using air-cooled heat exchangers, eliminating water consumption at a PUE penalty. Increasingly required in water-stressed jurisdictions.

eBPF — Extended Berkeley Packet Filter
Linux kernel technology allowing observability and security tooling to run in kernel space without kernel modification. Enables deep system visibility with minimal overhead. Major eBPF-native platforms: Cilium, Pixie, Tetragon.

Edge data center
Small-footprint data center deployed close to data sources or end users, supporting low-latency applications. Often modular, lights-out, with smart-hands on-call rather than continuous staffing.

EMS — Energy Management System
Control platform orchestrating energy portfolio in real time: DER dispatch, BESS charging/discharging, microgrid islanding, market participation. Distinct from EPMS. Vendors include Siemens Spectrum Power, GE Grid Solutions, Schneider EcoStruxure Microgrid Advisor.

EPMS — Electrical Power Monitoring System
Real-time monitoring platform for in-facility electrical infrastructure: UPS health, PDU loading, transformer telemetry, arc flash risk. Distinct from EMS (which orchestrates energy portfolio outside the facility). Vendors include Schneider PowerLogic, Eaton Foreseer, ABB Ability EPMS.

Error budget
The gap between 100% availability and the SLO target, treated as a finite engineering resource. Originated at Google. Burning through the error budget triggers a release freeze. Converts reliability into a concrete tradeoff with feature velocity.
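The budget arithmetic is simple; a sketch over a 30-day window with illustrative numbers:

```python
# Error budget: the downtime allowance implied by an SLO over a rolling window.
def error_budget_minutes(slo_pct, window_days=30):
    return (1 - slo_pct / 100) * window_days * 24 * 60

budget = error_budget_minutes(99.9)                   # 99.9% SLO, 30-day window
print(f"{budget:.1f} minutes allowed")                # 43.2 minutes
consumed = 30.0                                       # minutes burned by incidents
print(f"{consumed / budget:.0%} of budget burned")    # 69% -- freeze approaching
```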

EU AI Act
European Union regulation establishing risk-tiered AI compliance framework. General-Purpose AI Models with Systemic Risk threshold is 1e25 FLOP cumulative training compute. Drives sovereignty and localization concerns for AI-hosting operators.

EU CSRD — Corporate Sustainability Reporting Directive
EU mandatory ESG disclosure framework for large operators. Drives sustainability reporting infrastructure investment.

Evaporative cooling
Cooling technique using water evaporation for heat rejection. Energy-efficient but water-intensive. Adiabatic and direct evaporative variants common in modern data centers.

FedRAMP
Federal Risk and Authorization Management Program. US federal cloud authorization at Low, Moderate, High, and Tailored levels. Required for federal agency cloud procurement.

FFU — Fan Filter Unit
Combined fan and filter assembly used in cleanroom and electrical room ventilation. Provides controlled airflow with particulate filtration.

Five nines
99.999% availability (5.26 minutes/year downtime). Common target for multi-region cloud services, financial services platforms, telecom critical paths.

Floor PDU
Power Distribution Unit at floor level, distributing power from electrical room to row PDUs and rack PDUs. Distinguished from rack PDU (rPDU) at the rack level.

Free cooling
Cooling mode using outside air or water without active refrigeration when ambient conditions allow. Extends effective cooling capacity and reduces compressor energy. Climate-dependent feasibility.

Gang scheduling
Workload scheduling pattern requiring all-or-nothing placement of multi-node jobs. Essential for AI training where partial placement wastes capacity. Native to Slurm and HPC schedulers; requires extensions (Volcano, Kueue, Run:ai) for Kubernetes.
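The all-or-nothing rule itself is simple; a minimal sketch (node names and the function are illustrative, not Slurm's or Kueue's API):

```python
# Hedged sketch of gang placement: admit a multi-node job only if the
# whole gang fits; otherwise place nothing and let the job wait.
def gang_place(free_nodes, job_size):
    if job_size <= len(free_nodes):
        placed, remaining = free_nodes[:job_size], free_nodes[job_size:]
        return placed, remaining
    return None, free_nodes  # partial placement would strand capacity

free = ["n1", "n2", "n3"]
placed, free = gang_place(free, 2)
print(placed)               # ['n1', 'n2']
placed, free = gang_place(free, 4)
print(placed)               # None -> job queues until 4 nodes are free
```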

GitOps
Operational pattern where desired state of deployed infrastructure is held in version-controlled repositories and orchestrators continuously reconcile running state to match. Argo CD and Flux CD are dominant Kubernetes platforms.

Greenfield
New data center construction on previously undeveloped or non-data-center land. Distinguished from brownfield (existing site repurposing).

Grid-tie
Utility interconnection connecting facility load to the bulk power system. Includes substations, transmission tie-ins, and PPA contracts. Interconnection queue length is the primary growth constraint; PJM's queue implies an ~8-year horizon for new generation.

HBM — High Bandwidth Memory
Stacked DRAM with through-silicon-via interconnect providing substantially higher bandwidth than DDR-class DRAM. Essential for AI accelerators. Production dominated by SK Hynix, Samsung, Micron. Sold out through 2026 at SK Hynix.

HPC — High Performance Computing
Tightly-coupled scientific computing for simulation, modeling, and research. Distinct workload character from AI training (similar tight coupling but different programming models). Examples: Oak Ridge Frontier, El Capitan, scientific clusters.

Hot/Cold Aisle
Air-cooled rack arrangement with alternating cold (supply) and hot (return) aisles. Containment systems separate the two airstreams.

Hot site / Warm site / Cold site
BCDR site classification by readiness. Hot site: fully equipped and ready to take over immediately. Warm site: equipped but not running. Cold site: facility space ready but no equipment installed.

Hyperscaler
Cloud provider operating at extreme scale (typically multi-million-server fleet across global regions). AWS, Microsoft Azure, Google Cloud, Meta in the West; Alibaba, Tencent, Baidu in China. Operator-class with distinct DCIM, networking, and operational requirements.

iDRAC — Integrated Dell Remote Access Controller
Dell's BMC implementation. iDRAC 9 and 10 generations in current production.

iLO — Integrated Lights-Out
HPE's BMC implementation. iLO 5 and iLO 6 generations in current use.

Immersion cooling
Cooling architecture submerging IT equipment in dielectric fluid. Single-phase uses synthetic dielectric or mineral oil; two-phase uses engineered fluorocarbons that boil and condense. Highest density cooling option. PFAS regulatory concerns affect two-phase deployment.

InfiniBand
High-bandwidth, low-latency interconnect dominant for AI training fabric. NVIDIA Quantum-2 (400G) and Quantum-X800 (800G) are current generations. Native RDMA, mature ecosystem at HPC and AI scale.

Internet Exchange Point (IXP)
Layer-2 fabric inside major carrier hotels allowing participants to peer directly without paying for transit. Major IXPs: DE-CIX (Frankfurt, world's largest by traffic), AMS-IX (Amsterdam), LINX (London), JPNAP (Tokyo), Equinix Internet Exchange (multi-region).

ISO 22301
International standard for business continuity management. Foundation framework for BCDR programs.

ISO 27001
International information security management system standard. Foundational baseline for most data center security compliance programs.

ITIC / CBEMA curve
Information Technology Industry Council voltage tolerance curve defining acceptable voltage deviation envelope for IT equipment. Reference for power quality compliance.

JWCC — Joint Warfighting Cloud Capability
US DoD cloud procurement vehicle (~$9B) succeeding JEDI. Covers FedRAMP and DoD Impact Level requirements.

Lights-out
Operational model where facility runs without continuous on-site staff. Spectrum from skeleton crew (minimal staff for physical work) to fully lights-out (no human presence except for installation and major maintenance).

Liquid cooling
Cooling architecture using liquid (water-glycol or specialty coolant) as primary heat removal medium. Categories: rear-door heat exchangers, direct-to-chip, immersion. Required at AI training densities (50+ kW per rack).

MMR — Meet-me Room
Neutral interconnection space within a colocation facility where tenants and carriers cross-connect. MMR interconnection density is the primary value driver in retail colocation.

MTBF — Mean Time Between Failures
Average time between component failures. Used for equipment lifecycle planning and reliability engineering.

MTTR — Mean Time To Recovery
Average time from incident detection to service restoration. One of the four PRE metrics. DORA framework benchmark.
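MTBF and MTTR combine into steady-state availability; a sketch with illustrative numbers:

```python
# Steady-state availability: uptime fraction implied by failure and repair rates.
#   availability = MTBF / (MTBF + MTTR)
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A component failing every ~10,000 hours with 4-hour repairs:
print(f"{availability(10_000, 4):.5f}")  # 0.99960
```

The same relation explains why driving MTTR down pays off as much as driving MTBF up.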

N+1, N+2
Power and cooling redundancy schemes adding one or two capacity units beyond minimum load requirement. Standard for Tier II and III facilities.

NEM — Net Energy Metering
Utility billing arrangement crediting customer-generated electricity exported to grid. Used for some onsite solar arrangements.

NERC CIP
North American Electric Reliability Corporation Critical Infrastructure Protection standards. Cybersecurity framework for bulk electric system; applies to grid-adjacent compute and behind-the-meter coupling.

NFPA 855
National Fire Protection Association Standard for Installation of Stationary Energy Storage Systems. Governs BESS siting, separation, ventilation, gas detection, and suppression in the US.

NIS2 Directive
EU Network and Information Security Directive (second iteration). Cybersecurity requirements for essential and important entities including data centers.

NOC — Network Operations Center
Centralized monitoring and operations facility, typically with multi-screen monitoring walls and 24/7 staffing. Hyperscalers operate global NOCs covering hundreds of facilities.

NVL72
NVIDIA reference design connecting 72 Blackwell GPUs in a single rack-scale NVLink Switch domain with memory coherence. Defines new operational paradigm beyond traditional 8-GPU server scale.

NVLink Switch
NVIDIA's high-bandwidth GPU-to-GPU interconnect. Provides memory-coherent access across multi-GPU domains. NVL72 is the rack-scale implementation.

OCP — Open Compute Project
Open hardware design community founded by Facebook (Meta). Publishes specifications for hyperscaler-style server, rack, and facility hardware. Drives standardization that smaller operators benefit from.

OOB — Out-of-Band Management
Network and access path operating independently of primary operational network, providing remote control even when primary network fails. Critical for recovery from network failures without on-site intervention.

OpenTelemetry
CNCF project providing vendor-neutral observability instrumentation for metrics, logs, and traces. OTLP is the standard wire protocol. Decouples instrumentation from observability platform.

Operator pattern (Kubernetes)
Pattern extending Kubernetes with workload-specific automation via custom controllers. Major examples: Postgres operators, Cassandra Operator, Strimzi (Kafka), NVIDIA GPU Operator, Run:ai. Dominant approach for stateful workloads on Kubernetes.

P99 latency
99th percentile request completion time. Captures tail latency that average doesn't reveal. Critical SLO metric for inference and customer-facing services.
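Computed from raw samples with the nearest-rank method (production systems usually derive percentiles from histograms in the observability platform); sample latencies are illustrative:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 requests: 98 fast, 2 pathological. The mean hides the tail; p99 shows it.
latencies_ms = [12] * 98 + [900, 900]
print(sum(latencies_ms) / len(latencies_ms))  # mean ~29.8 ms
print(percentile(latencies_ms, 99))           # p99 = 900 ms
```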

PDU — Power Distribution Unit
Equipment distributing electrical power from upstream source to downstream loads. Floor PDU distributes to row level; rack PDU (rPDU) distributes to individual servers within a rack.

Peering
Network arrangement where two networks exchange traffic directly without paying transit. Bilateral peering and IXP-mediated peering are common patterns.

PFAS — Per- and polyfluoroalkyl substances
Class of chemicals subject to intensifying regulatory scrutiny. Affects firefighting foam (AFFF transition to fluorine-free), clean agent suppression (FK-5-1-12 / Novec 1230 under scrutiny), and immersion cooling (two-phase fluorocarbon dielectrics).

PJM
Pennsylvania-New Jersey-Maryland Interconnection. The largest US regional transmission organization by load served. ~8-year horizon for new generation; 130+ GW pre-2024 queue. The 2026/27 capacity auction cleared at the FERC cap of $333.44/MW-day.

Postmortem
Post-incident review documenting timeline, contributing factors, what went well, what didn't, and action items. Blameless framing is standard PRE practice.

PPA — Power Purchase Agreement
Long-term contract between energy buyer and generator. Physical PPAs deliver electricity to the grid where buyer consumes; virtual (VPPA) settle financially against market prices. Hyperscaler-scale PPAs increasingly include nuclear (TMI/Microsoft) and SMR commitments.

PRR — Production Readiness Review
Gate that new services pass before reaching production. Covers SLO definition, observability, runbooks, capacity planning, security, on-call coverage. Standard PRE practice.

PUE — Power Usage Effectiveness
Total facility power divided by IT equipment power. Standard data center efficiency metric. Legacy enterprise 1.5-2.0; modern hyperscale 1.2-1.3; liquid-cooled AI 1.05-1.15.
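The metric is a simple ratio, and CUE (above) layers grid carbon intensity on top of it; numbers here are illustrative:

```python
# PUE: total facility power over IT power. CUE: grid carbon intensity * PUE,
# giving kgCO2 per IT kWh.
def pue(total_kw, it_kw):
    return total_kw / it_kw

def cue(grid_kgco2_per_kwh, pue_value):
    return grid_kgco2_per_kwh * pue_value

print(pue(12_000, 10_000))           # 1.2 -> modern hyperscale range
print(f"{cue(0.35, 1.2):.2f}")       # 0.42 kgCO2/kWh on an illustrative grid
```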

Rack
The atomic deployment unit of the data center industry. Standard 42U or 48U height; modern AI training racks (NVL72) at 72 GPUs and 130+ kW per rack redefine rack-scale economics.

Rack PDU (rPDU)
Power distribution at rack level providing per-outlet metering and control. Vendors: Vertiv Geist, Server Technology, Raritan, Eaton ePDU, APC by Schneider.

RDIMM / LRDIMM / UDIMM
DDR memory module variants. RDIMM (registered) and LRDIMM (load-reduced) for servers; UDIMM (unbuffered) for desktop and lower-end systems.

RDMA — Remote Direct Memory Access
Network capability enabling direct memory-to-memory transfer between nodes without CPU involvement. InfiniBand and RoCE provide RDMA; AWS EFA is RDMA-equivalent.

Reclaim (cloud)
Cloud provider event in which spot/preemptible capacity is reclaimed on short notice. Spot-tolerant workloads include checkpointed training and resumable batch jobs.

Region (cloud)
Geographic deployment of cloud infrastructure containing one or more Availability Zones. Multi-region deployment is standard pattern for highest cloud SLA tiers.

RoT — Root of Trust
Hardware-anchored security foundation providing cryptographic attestation, secure boot, key storage. Implementations: AWS Nitro, Google Titan, Microsoft Pluton, Apple Secure Enclave, TPM modules.

RPKI — Resource Public Key Infrastructure
Cryptographic certificate framework for BGP route security. ROV (Route Origin Validation) filters unauthenticated BGP announcements. Standard mitigation against route hijacks.

RPO — Recovery Point Objective
Maximum acceptable data loss in disaster scenario. Drives replication strategy and backup frequency.

RTO — Recovery Time Objective
Maximum acceptable time to restore service after disaster. Drives BCDR site readiness tier.

Rubin (NVIDIA)
NVIDIA's GPU architecture generation succeeding Blackwell. R100 specifications include 288 GB HBM4 per GPU at 22 TB/s aggregate bandwidth. Production start Q1 2026; reference designs entering customer hands.

Run:ai
AI workload scheduler, originally an Israeli startup, acquired by NVIDIA in 2024. GPU-focused fractional and shared scheduling with preemptive multi-tenancy. Production AI training scheduler.

Runbook
Documented procedure for handling specific operational scenarios. Foundational tooling for incident response and operations consistency.

SecNumCloud
French ANSSI cloud security certification. Baseline for French public sector procurement; underpins Bleu and S3NS sovereign clouds.

Service mesh
Infrastructure layer handling service-to-service communication: traffic management, security (mTLS), observability, resilience patterns. Major platforms: Istio, Linkerd, Consul Connect.

Skeleton crew
Minimal on-site staffing model where facility operations are handled remotely with limited local presence for physical work. Common at hyperscale facilities.

SLA — Service Level Agreement
Contractual performance commitment with service credit consequences for breach. Hyperscaler SLAs are tiered by deployment architecture (single instance, multi-AZ, multi-region).

SLI — Service Level Indicator
The actual measured performance compared against SLOs and SLAs, sourced from observability platforms.

SLO — Service Level Objective
Internal performance target, typically tighter than SLA. Provides operational margin. Basis for error budget calculation.

SMR — Small Modular Reactor
Nuclear reactor design at <300 MW per unit, factory-fabricated and transported to site. Vendors: Oklo, X-energy, Kairos, Westinghouse, GE Hitachi, NuScale. Hyperscaler PPAs targeting 2028-2032 commercial operation.

SOC 2
System and Organization Controls report covering the Trust Services Criteria (security, availability, processing integrity, confidentiality, privacy). Type II is the version most requested by enterprise customers.

SONiC — Software for Open Networking in the Cloud
Open-source network operating system for switch hardware, originated at Microsoft and now a Linux Foundation project. Disaggregated NOS approach popular at hyperscalers.

Sovereign cloud
Cloud arrangement ensuring data and workloads remain under national jurisdiction. Examples: Bleu (France), Delos (Germany), S3NS (France), Khazna/G42 (UAE), Stargate UAE.

Spectrum-X Ethernet
NVIDIA's standards-based 800G Ethernet fabric for AI training. Used at xAI Colossus 100K-200K H100 fabric. Achieves 95% data throughput vs ~60% for standard Ethernet.

Spine-leaf
Modern data center fabric topology (Clos network) replacing legacy three-tier core-aggregation-access design. Every leaf connects to every spine; predictable latency and bandwidth.

Spot pricing / Preemptible
Cloud capacity at low cost that can be reclaimed by provider on short notice. 40-90% savings off on-demand pricing for spot-tolerant workloads.

SRE — Site Reliability Engineering
Operational discipline originated at Google applying software engineering practices to operations. Foundational to Platform Reliability Engineering.

Stargate
OpenAI / Oracle / SoftBank multi-gigawatt AI training program. Anchor site at Abilene, TX with multi-site expansion announced. Largest publicly-announced AI infrastructure buildout.

Substation
Electrical facility transforming voltage between transmission and distribution levels. Modern AI factories often have dedicated substations connecting to high-voltage transmission directly.

Supercluster (AI training)
Named, integrated GPU compute fabric purpose-built for training frontier foundation models. Distinguished by extreme accelerator density (10K-200K+ GPUs), specialized interconnect, and engineered power/cooling envelope. Examples: Colossus, Stargate Abilene, Meta Hyperion.

Telemetry
Operational data emitted by infrastructure for monitoring and analysis. Modern practice organizes around three pillars: metrics, logs, and traces (plus events as a fourth category).

Tier (Uptime Institute)
Facility availability classification, I through IV. Tier I: basic, non-redundant (28.8 hr/yr expected downtime). Tier II: redundant capacity components (22 hr/yr). Tier III: concurrently maintainable (1.6 hr/yr). Tier IV: fault tolerant (26 min/yr). AI training operators increasingly accept below-Tier III designs for training-only sites.
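The downtime figures above follow from each tier's nominal availability percentage; the percentages below are the commonly cited Uptime Institute figures (an assumption, since the entry itself states only hours):

```python
HOURS_PER_YEAR = 8766  # average calendar year, including leap days

def downtime_hours(availability_pct: float) -> float:
    """Expected annual downtime implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

tiers = {"Tier I": 99.671, "Tier II": 99.741, "Tier III": 99.982, "Tier IV": 99.995}
for tier, avail in tiers.items():
    print(f"{tier}: {downtime_hours(avail):.1f} h/yr")
```

Tier IV's 0.44 h/yr works out to the ~26 minutes quoted in the entry.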

TPM — Trusted Platform Module
Hardware security component providing cryptographic services, platform attestation, key storage per TCG specifications.

TPU — Tensor Processing Unit
Google's custom AI accelerator. TPU v4, v5e, v5p, v6 (Trillium), v7 generations. Pod-based architecture with proprietary ICI interconnect. Used for Google internal training and Anthropic via Google Cloud.

TPOT — Time Per Output Token
Inference SLO measuring time between consecutive output tokens. Drives perceived response speed for chat and streaming inference. Tens of milliseconds typical target.

Trainium2 (AWS)
AWS's second-generation custom AI training accelerator. Project Rainier (Anthropic partnership) targets 1M+ Trainium2 chips. Pairs with AWS EFA fabric.

TTFT — Time to First Token
Inference SLO measuring latency from request arrival to the first output token. Dominant user-experience metric for chat and streaming inference. Sub-second targets are standard.
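TTFT and TPOT (defined above) compose into end-to-end response time for a streamed reply. A sketch, with illustrative numbers of my own choosing:

```python
def response_latency_s(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end streaming response time: first token, then per-token decode steps."""
    return (ttft_ms + (output_tokens - 1) * tpot_ms) / 1000

# 500 ms TTFT and 40 ms TPOT over a 256-token reply: ~10.7 s total,
# but the user sees output begin after half a second.
t = response_latency_s(500, 40, 256)
```

This is why TTFT dominates perceived responsiveness while TPOT governs reading-speed smoothness.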

Two-person rule
Physical security policy requiring two authorized individuals, each presenting distinct credentials, to access certain spaces, often with the constraint that the two hold different roles or belong to different organizations. Standard for sensitive halls, government workloads, and vault rooms.

UPS — Uninterruptible Power Supply
Equipment providing battery-backed power during utility outages. Modern UPS systems run at 96-99% efficiency. Battery monitoring (BTECH, Albér, EnerSys, Eagle Eye) is among the highest-value applications of facility AIOps.

vCPU
Virtual CPU, the standard cloud unit of compute capacity. Maps to a physical CPU thread or a fraction of one, depending on provider and instance configuration.

VESDA — Very Early Smoke Detection Apparatus
Air-sampling smoke detection system providing incipient-stage fire detection in data halls. Major platforms: Honeywell Xtralis (original VESDA), Siemens FDA, Wagner Titanus. Standard primary detection in modern data halls.

VPPA — Virtual Power Purchase Agreement
Financial PPA for renewable energy generated in a different region from the facility's consumption; settled against wholesale market prices. Used where a physical PPA is not available.

VXLAN — Virtual Extensible LAN
Network virtualization protocol that tunnels Layer 2 segments over a Layer 3 underlay. EVPN-VXLAN is the standard pattern for multi-tenancy on a data center fabric.

WAN — Wide Area Network
Network spanning broad geographic areas, typically connecting data centers to the internet and to each other. Hyperscaler-operated WANs (Google B4, Meta Express Backbone, Microsoft WAN, AWS global backbone) are private global networks.

WUE — Water Usage Effectiveness
Annual site water usage divided by annual IT energy. Wet-cooled facilities run 1.0-2.0 L/kWh; hybrid 0.2-0.5 L/kWh; dry-cooled near zero, at a PUE penalty. Companion metric to PUE.
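The metric is a straight ratio. A worked sketch (the 10 MW load and water volume are hypothetical numbers of my own, chosen to land in the wet-cooled range):

```python
def wue(annual_water_liters: float, it_energy_kwh: float) -> float:
    """Water Usage Effectiveness: liters of site water per kWh of IT energy."""
    return annual_water_liters / it_energy_kwh

# A 10 MW IT load running all year using 100 million liters of water:
it_kwh = 10_000 * 8_760        # 10,000 kW * 8,760 h = 87.6 GWh of IT energy
w = wue(100_000_000, it_kwh)   # ~1.14 L/kWh, within the wet-cooled band
```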

xAI Colossus
xAI's Memphis, TN AI training supercluster. Built in 122 days at 100K H100 GPUs, expanded to 200K, with a public roadmap to 1M+ GPUs. Colossus 2, the first gigawatt-class training cluster, became operational in January 2026.

Zero Trust
Security architecture replacing perimeter-based trust with continuous verification of every access request. BeyondCorp at Google was the original large-scale implementation; now the standard pattern for modern remote access.

Cross-network glossaries

SemiconductorX Glossary covers the chip and fab terminology that anchors the upstream side of the data center supply chain. ElectronsX Glossary covers electrification, EV, and grid terminology relevant to data center energy sourcing and behind-the-meter coupling.