DC Resilience & Reliability
Resilience and reliability is the operational discipline that ensures the facility and its workloads survive failures and continue delivering services. The discipline operates across two distinct layers that are often conflated. Facility resilience covers physical infrastructure - power, cooling, networking, life safety - with redundancy classified by Uptime Institute Tier and TIA-942 Rated frameworks. IT/application high availability covers compute, storage, and application-layer fault tolerance with cluster failover, replication, and geographic distribution. The two layers operate under different teams, different vendor ecosystems, and different design principles, even though they share the goal of uninterrupted service delivery.
Facility tier classifications
| Tier | Design | Annual availability | Typical use |
|---|---|---|---|
| Tier I / Rated 1 | Basic non-redundant capacity | 99.671% (28.8 hours/year downtime) | Small enterprise; AI training where checkpoints absorb interruption |
| Tier II / Rated 2 | Redundant capacity components, single delivery path | 99.741% (22.7 hours/year downtime) | Mid-market enterprise; some AI factory builds |
| Tier III / Rated 3 | Concurrently maintainable; redundant capacity and delivery paths | 99.982% (1.6 hours/year downtime) | Standard for hyperscale and colocation; baseline for enterprise SLAs |
| Tier IV / Rated 4 | Fault-tolerant; redundant capacity and delivery paths active simultaneously | 99.995% (26 minutes/year downtime) | Financial services, government, mission-critical enterprise |
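The downtime figures in the table follow directly from the availability percentages. The sketch below shows the arithmetic, assuming a non-leap year of 8,760 hours; the function name `annual_downtime_hours` is illustrative and not part of any standard.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)


def annual_downtime_hours(availability_pct: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# Availability figures taken from the tier table above.
tiers = {"Tier I": 99.671, "Tier II": 99.741, "Tier III": 99.982, "Tier IV": 99.995}

for name, pct in tiers.items():
    hours = annual_downtime_hours(pct)
    print(f"{name}: {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

These are statistical expectations over long periods, not guarantees for any single year.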
The AI factory tradeoff
AI training workloads have created a new resilience calculus that the traditional Tier framework wasn't designed for. Frontier training runs that checkpoint every few minutes can absorb a 30-minute facility interruption at negligible cost - the run resumes from the last checkpoint with limited compute waste. Inference workloads continue to demand high availability because each interruption affects user experience. The implication is that some operators are deliberately accepting facility-level resilience below traditional Tier III for training-only campuses, trading lower facility capex for higher compute capex, while maintaining Tier III+ for inference and hybrid sites. xAI Colossus, several Stargate sites, and some Meta AI campuses have publicly disclosed this design pattern. The traditional resilience hierarchy has therefore split into a "training-acceptable" tier and an "inference-required" tier with different optimization objectives.
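A rough way to size the cost of such an interruption is to add up the outage itself, the work redone since the last checkpoint, and the restart time. The sketch below does that arithmetic. Every number in it (cluster size, checkpoint interval, restart time, run length) is a hypothetical input chosen for illustration, and the model assumes the interruption lands, on average, halfway between two checkpoints.

```python
def lost_gpu_hours(gpus: int, outage_min: float,
                   checkpoint_interval_min: float, restart_min: float) -> float:
    """Estimate GPU-hours lost to one facility interruption."""
    # On average the failure arrives halfway through a checkpoint interval,
    # so half an interval of progress has to be recomputed (assumption).
    rework_min = checkpoint_interval_min / 2
    return gpus * (outage_min + rework_min + restart_min) / 60


# Hypothetical scenario: 100,000 GPUs, 30-minute outage, checkpoints every
# 10 minutes, 15 minutes to reload state and resume, 90-day training run.
lost = lost_gpu_hours(gpus=100_000, outage_min=30,
                      checkpoint_interval_min=10, restart_min=15)
run_total = 100_000 * 90 * 24  # total GPU-hours in the run

print(f"Lost: {lost:,.0f} GPU-hours ({lost / run_total:.3%} of the run)")
```

Under these assumed inputs a single interruption costs a small fraction of a percent of the run, which is the reasoning behind accepting lower facility resilience for training-only sites.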
Redundancy schemes
| Scheme | Configuration | Where used |
|---|---|---|
| N | Bare capacity to meet load; no redundancy | Tier I; training-only sites accepting downtime |
| N+1 | One additional capacity unit beyond load requirement | Tier II/III; cooling, generators, UPS modules |
| N+2 | Two additional capacity units | Higher-availability designs; concurrent maintenance with margin |
| 2N | Fully duplicated capacity; either side can carry full load | Tier IV; mission-critical electrical distribution |
| 2N+1 | Duplicated capacity with one extra unit per side | Highest-availability deployments where concurrent maintenance with margin during failure is required |
| Distributed redundancy | Capacity distributed across N units where any K can fail | Modern UPS and PDU architectures; battery strings |
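The unit counts behind these schemes can be sketched as simple arithmetic. The example below derives N from a load and a unit rating, then checks whether the remaining units still carry the load after some number fail. The load, unit size, and function names are hypothetical, and the model treats all units as one shared pool, which ignores the delivery-path separation that distinguishes real 2N designs.

```python
import math


def installed_units(n: int, scheme: str) -> int:
    """Units installed for a base requirement of n units under a scheme."""
    # Extra units added on top of N; 2N adds a second full set of N.
    extra = {"N": 0, "N+1": 1, "N+2": 2, "2N": n}
    return n + extra[scheme]


def carries_load(n: int, scheme: str, failed_units: int) -> bool:
    """True if the surviving units still meet the base requirement."""
    return installed_units(n, scheme) - failed_units >= n


# Hypothetical example: 1,800 kW of load served by 500 kW units.
load_kw, unit_kw = 1800, 500
n = math.ceil(load_kw / unit_kw)  # base requirement N

for scheme in ("N", "N+1", "N+2", "2N"):
    total = installed_units(n, scheme)
    ok_one = carries_load(n, scheme, failed_units=1)
    print(f"{scheme}: {total} units installed, survives one failure: {ok_one}")
```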
Failure modes that actually matter
Resilience design that focuses on theoretical equipment failure can miss the operational reality that most actual outages come from a small set of recurring failure modes. Modern incident analysis at hyperscale operators consistently identifies the categories below as the dominant operational risks.
| Failure mode | What happens | Mitigation |
|---|---|---|
| Utility power events | Substation faults, transmission events, grid frequency excursions | UPS ride-through; generator backup; behind-the-meter capacity for severe events |
| Cooling failures | Chiller plant trip, CDU failure, refrigerant leak, controls fault | N+1 cooling capacity; thermal mass for ride-through; orderly shutdown sequences |
| Software and configuration errors | BMS, EPMS, orchestration platform, network controller errors causing cascading failures | Change management, blast radius limitation, canary deployments, automated rollback |
| Network fiber cuts | Construction damage to external fiber paths; rare internal cable damage | Diverse fiber entries; multiple carriers; route diversity from at least two building entrance facilities (BEFs) |
| Human error in maintenance | Wrong breaker operated; equipment de-energized incorrectly; pumps left isolated | Method of procedure (MOP) discipline, two-person rule, lockout/tagout, automated interlocks |
| Battery and UPS failures | Battery string failure during transfer; UPS module failure; thermal runaway in extreme cases | Battery monitoring; periodic discharge testing; predictive maintenance; replacement scheduling |
| Environmental events | Flood, wildfire, extreme heat, severe weather, seismic events | Site selection; structural design; geographic diversity for critical workloads |
| Cyber events | Ransomware affecting facility systems, control system compromise, malicious insider | OT-IT segmentation, BMS/EPMS hardening, monitoring; covered in Security pillar |
IT and application high availability
The IT/application HA layer is operationally distinct from facility resilience and is operated under Compute Ops rather than Facility Ops, but the two disciplines coordinate during incident response.
| Layer | HA technique | Failure handled |
|---|---|---|
| Compute | Cluster failover, live migration, Kubernetes self-healing pods | Single-server failure transparent to workload |
| Storage | RAID, erasure coding, distributed file systems with multi-replica | Disk and node failures handled without data loss |
| Network | Dual-homed paths, redundant switches, ECMP, fast reroute | Switch and link failures rerouted in seconds or sub-second |
| Application | Load balancers, auto-scaling, health checking, canary deployments | Instance failures and bad deployments isolated automatically |
| Data | Multi-region replication, eventual consistency, conflict resolution | Regional outages handled with failover or degraded service |
AI training resilience
AI training is its own resilience discipline because failure of any single GPU or node in a tightly coupled training job typically halts the entire run. Resilience strategies include checkpointing (saving model state every N minutes; the canonical mitigation that turns hours of lost work into minutes of restart), redundant nodes (some training frameworks support hot spares that take over for failed nodes), elastic training (frameworks that can resume on different cluster configurations), and multi-cluster training (splitting a run across multiple data centers, with checkpoint synchronization). The frequency of node failures at 100K+ GPU scale (typically multiple per day) makes resilience a primary engineering concern - the original GPT-4 training reportedly experienced "an absurd number of failures requiring checkpoints that needed to be restarted from." The resilience discipline at training scale is fundamentally different from inference resilience and is engineered into the training stack rather than imposed by facility design.
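How often to checkpoint is a tradeoff between time spent writing checkpoints and work lost at each failure. One common rule of thumb, which is not taken from this article, is Young's approximation: the interval that minimizes expected overhead is roughly the square root of twice the checkpoint write time multiplied by the mean time between failures. The sketch below applies it with hypothetical inputs.

```python
import math


def optimal_checkpoint_interval_min(checkpoint_cost_min: float,
                                    mtbf_min: float) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_min * mtbf_min)


# Hypothetical cluster: 4 job-halting failures per day and a checkpoint
# that takes 2 minutes to write.
failures_per_day = 4
mtbf_min = 24 * 60 / failures_per_day  # 360 minutes between failures
interval = optimal_checkpoint_interval_min(checkpoint_cost_min=2, mtbf_min=mtbf_min)

print(f"Cluster MTBF: {mtbf_min:.0f} min, suggested interval: {interval:.1f} min")
```

The approximation assumes failures arrive independently and that the checkpoint cost is small relative to the MTBF; asynchronous or incremental checkpointing changes the cost term and shifts the answer.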
BCDR (Business Continuity and Disaster Recovery)
BCDR planning extends resilience beyond same-site failure handling to recovery from catastrophic events that affect entire facilities or regions. The discipline produces documented recovery procedures, recovery time objectives (RTO) and recovery point objectives (RPO) for each workload category, secondary site arrangements, and data replication strategies. ISO 22301 provides the international framework for BCDR; financial services and government workloads typically have specific BCDR requirements above the baseline. Modern BCDR practice has shifted from cold-standby DR sites to warm-standby and active-active multi-region deployments, particularly for cloud workloads where the application architecture supports it.
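RTO and RPO become useful once they are compared against what the recovery arrangement can actually deliver. The sketch below is a minimal check, assuming a hypothetical `RecoveryPlan` record in which restore time and replication lag come from DR tests or monitoring; the workloads and numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    workload: str
    rto_min: float              # maximum acceptable time to restore service
    rpo_min: float              # maximum acceptable window of data loss
    restore_time_min: float     # measured or estimated restore time (assumed input)
    replication_lag_min: float  # worst-case age of newest recoverable data (assumed input)


def find_gaps(plan: RecoveryPlan) -> list[str]:
    """Return a description of each objective the plan fails to meet."""
    gaps = []
    if plan.restore_time_min > plan.rto_min:
        gaps.append(f"{plan.workload}: restore takes {plan.restore_time_min} min, "
                    f"RTO is {plan.rto_min} min")
    if plan.replication_lag_min > plan.rpo_min:
        gaps.append(f"{plan.workload}: replication lag is {plan.replication_lag_min} min, "
                    f"RPO is {plan.rpo_min} min")
    return gaps


# Hypothetical workloads: one warm-standby, one active-active.
plans = [
    RecoveryPlan("payments", rto_min=60, rpo_min=5,
                 restore_time_min=45, replication_lag_min=15),
    RecoveryPlan("catalog", rto_min=240, rpo_min=60,
                 restore_time_min=30, replication_lag_min=1),
]

for plan in plans:
    for gap in find_gaps(plan):
        print(gap)
```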
Chaos engineering and continuous resilience testing
Chaos engineering is the discipline of deliberately introducing failures into production systems to verify that resilience actually works. The technique originated at Netflix (Chaos Monkey) and has spread across hyperscalers and major enterprises. Major facility operators run chaos exercises that include actual generator transfers under load, BMS controller failover, network path failures, and selected equipment shutdowns - on production systems, on schedule, with documented procedures. The discipline is operationally important because resilience that has not been tested cannot be assumed to work; many high-profile outages have been caused by failover mechanisms that worked in design but failed in production execution. Vendors and managed services such as Gremlin, Steadybit, AWS Fault Injection Service, and Azure Chaos Studio provide tooling; operator-internal frameworks are common at hyperscalers.
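The shape of a chaos experiment can be sketched in a few steps: confirm steady state, inject one fault, check that the hypothesis still holds, and always restore. The example below runs that loop against a toy in-memory model of a dual-fed rack. The `DualFeedRack` class and `run_experiment` function are hypothetical and do not represent the API of any tool named above.

```python
class DualFeedRack:
    """Toy model: the rack stays powered while at least one feed is live."""

    def __init__(self) -> None:
        self.feeds = {"A": True, "B": True}

    def powered(self) -> bool:
        return any(self.feeds.values())


def run_experiment(rack: DualFeedRack, fault_feed: str) -> str:
    """Drop one feed and report whether the rack stayed powered."""
    # Step 1: never inject a fault into a system that is already unhealthy.
    if not rack.powered():
        return "aborted"
    original = rack.feeds[fault_feed]
    rack.feeds[fault_feed] = False  # Step 2: inject the fault
    try:
        # Step 3: test the steady-state hypothesis under the fault.
        return "passed" if rack.powered() else "failed"
    finally:
        # Step 4: restore the feed's prior state even if the check raised.
        rack.feeds[fault_feed] = original


print(run_experiment(DualFeedRack(), fault_feed="A"))

# A rack that has already lost feed B fails the same experiment,
# which is the kind of hidden single point of failure the test exposes.
degraded = DualFeedRack()
degraded.feeds["B"] = False
print(run_experiment(degraded, fault_feed="A"))
```

In a real facility the fault injection and the restore step would be governed by a method of procedure rather than by code alone.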
Reliability metrics
| Metric | What it measures | Use |
|---|---|---|
| Availability percentage | Uptime as percentage of total time | SLA / SLO commitment to customers |
| MTBF (Mean Time Between Failures) | Average time between component failures | Equipment lifecycle planning; reliability engineering |
| MTTR (Mean Time To Repair) | Average time to restore service after failure | Operational responsiveness; staffing decisions |
| RTO (Recovery Time Objective) | Maximum acceptable time to restore service after disaster | BCDR planning; tier of recovery infrastructure |
| RPO (Recovery Point Objective) | Maximum acceptable data loss in disaster scenario | Replication strategy; backup frequency |
| Error budget | Allowed downtime within SLO; consumed by incidents | SRE practice; reliability vs feature-velocity tradeoff |
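Several of these metrics are linked by simple arithmetic. The sketch below uses the standard steady-state relation, availability = MTBF / (MTBF + MTTR), and derives an error budget from an SLO. The input values and the 30-day window are illustrative assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Downtime allowed by an SLO over a window (30 days assumed by default)."""
    return (1 - slo_pct / 100) * window_days * 24 * 60


def budget_remaining(slo_pct: float, downtime_min: float, window_days: int = 30) -> float:
    """Error budget left after the incidents recorded so far in the window."""
    return error_budget_minutes(slo_pct, window_days) - downtime_min


# Hypothetical component: fails every 1,000 hours, takes 1 hour to repair.
print(f"Availability: {availability(1000, 1):.4%}")

# Hypothetical service: 99.9% SLO with 30 minutes of incidents this window.
print(f"Budget: {error_budget_minutes(99.9):.1f} min, "
      f"remaining: {budget_remaining(99.9, 30):.1f} min")
```

Halving MTTR improves availability about as much as doubling MTBF, which is why the table ties MTTR to operational responsiveness and staffing.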
Where this fits
Facility resilience is operated under Facility Ops as physical infrastructure design and operational practice. IT and application HA is operated under Compute Ops via Orchestration, Production Reliability Engineering, and Network Operations. BCDR planning sits at the intersection and connects to GRC:Risk Management. The compliance frameworks for resilience (ISO 22301, NIST SP 800-34, FFIEC for financial services) flow to GRC:Compliance. The Tier framework itself is a compliance-relevant design decision under GRC:Controls.