

DC SLA & SLO Management


SLA and SLO management is the commitment-and-measurement discipline that translates the underlying operational performance of the data center and its services into contractual promises and measurable targets. The discipline lives in Compute Ops rather than facility operations because SLAs are typically expressed at the service layer (IaaS, SaaS, AI inference, colocation tenancy) rather than the facility infrastructure layer. The underlying resilience that delivers the commitments is operated separately under Resilience & Reliability; the compliance evidence of SLA performance flows to GRC:Compliance. This page covers how SLAs and SLOs are structured, measured, and managed at modern operator scale.


SLA, SLO, SLI

Term | What it is | Audience
SLA (Service Level Agreement) | Contractual commitment to customers; specifies performance, breach consequences, exclusions | External; customer-facing
SLO (Service Level Objective) | Internal target, typically tighter than SLA; provides operational margin | Internal; engineering and operations
SLI (Service Level Indicator) | Measurement; the actual performance number against SLO/SLA | Internal observability and external dashboards
Error budget | Allowed underperformance within SLO; consumed by incidents and feature deployments | SRE, engineering management

Uptime targets

Availability | Annual downtime allowed | Where used
99.9% (three nines) | 8.76 hours | Single-instance enterprise IT; some single-region cloud services
99.95% (three-and-a-half nines) | 4.38 hours | Single-region cloud compute (Azure availability set); enterprise SaaS
99.99% (four nines) | 52.6 minutes | Standard hyperscaler regional services (AWS EC2 region, GCP regional, Azure availability zones)
99.999% (five nines) | 5.26 minutes | Multi-region cloud services; financial services platforms; telecom critical paths
99.9999% (six nines) | 31.5 seconds | Rarely formal SLA; some defense and trading internal targets; aspirational rather than contracted
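
The downtime figures in the table follow directly from the availability percentage. As a minimal sketch (the function name and the 8,760-hour non-leap year are assumptions made for illustration), the arithmetic can be reproduced as:

```python
def allowed_downtime_seconds(availability_pct: float, period_hours: float = 8760.0) -> float:
    """Seconds of downtime permitted by an availability target over a period.

    period_hours defaults to a non-leap year (365 * 24 = 8760 hours).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_hours * 3600.0


if __name__ == "__main__":
    for pct in (99.9, 99.95, 99.99, 99.999, 99.9999):
        seconds = allowed_downtime_seconds(pct)
        print(f"{pct}%: {seconds / 3600:.2f} h = {seconds / 60:.2f} min = {seconds:.1f} s")
```

Most cloud SLAs are measured per billing month rather than per year; passing `period_hours=720` (a 30-day month) to the same function gives about 43.2 minutes of allowed downtime at 99.9%.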

Hyperscaler SLA structures

Real hyperscaler SLAs are tiered by service architecture choice. The customer's deployment pattern (single instance, multiple instances in one zone, multiple zones, multiple regions) determines the SLA they receive, with progressive tiers reflecting the resilience that the deployment pattern provides. The structure is consistent across major cloud providers though specific percentages vary.

Service tier | Typical SLA | Service credit structure
Single VM / instance | 99.5-99.9% (varies by provider) | 10% of monthly bill credit at first breach band; 25% at second; 100% at severe
Multiple instances in single zone | 99.95% common | Tiered service credits scaling with breach severity
Multi-AZ deployment | 99.99% standard | Tiered credits; reduced exclusions
Multi-region deployment | 99.99% or higher; some services 99.999% with multi-region active-active | Highest credit tier; broadest coverage
Object storage | 99.9% availability; 99.999999999% durability ("eleven nines") for stored data | Service credits for availability; durability is a separate commitment
DNS | 100% availability claim by major providers | Service credits triggered on any unavailability
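
One way to see why redundant deployment patterns earn higher tiers is the textbook availability model for parallel and serial components. The sketch below is an idealized illustration, not how any provider sets its SLA: it assumes replica failures are fully independent, which real zones and regions only approximate, and the function names are hypothetical.

```python
def parallel_availability(single: float, replicas: int) -> float:
    """Availability of a group that is up while at least one replica is up.

    Assumes independent failures (an idealization; correlated failures lower the result).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Two independent 99.9% replicas behind a 99.99% load balancer (illustrative numbers).
    pair = parallel_availability(0.999, 2)
    print(f"replica pair: {pair:.6f}")
    print(f"with load balancer: {serial_availability(pair, 0.9999):.6f}")
```

The serial case is a reminder that every shared dependency in the request path pulls the composite figure back down, which is part of why contracted tiers sit below what the parallel formula alone would suggest.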

Service credits are the standard remedy for SLA breaches and are typically structured as a percentage of the monthly bill for the affected service. Credits cap at some percentage (often 100%, sometimes lower) regardless of how severe the breach was. Most enterprise customers find the credit value substantially less than the actual cost of an outage, which is why customers with critical workloads architect for resilience above the SLA tier rather than relying on credits for protection.
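
A tiered credit schedule of this kind can be sketched as a simple lookup. The availability thresholds below are hypothetical values chosen for illustration; only the 10% / 25% / 100% credit percentages echo the example in the table above, and real schedules differ by provider and service.

```python
from typing import List, Tuple

# (availability threshold in %, credit as % of monthly bill).
# Thresholds are hypothetical; a measurement below a threshold falls into that band.
CREDIT_BANDS: List[Tuple[float, float]] = [
    (99.9, 10.0),   # first breach band
    (99.0, 25.0),   # second breach band
    (95.0, 100.0),  # severe breach
]


def service_credit(monthly_bill: float,
                   measured_availability_pct: float,
                   bands: List[Tuple[float, float]] = CREDIT_BANDS,
                   cap_pct: float = 100.0) -> float:
    """Credit owed for one billing month, capped at cap_pct of the bill."""
    # Check the lowest threshold first so the most severe matching band wins.
    for threshold, credit_pct in sorted(bands):
        if measured_availability_pct < threshold:
            return monthly_bill * min(credit_pct, cap_pct) / 100.0
    return 0.0  # SLA met; no credit


if __name__ == "__main__":
    for measured in (99.95, 99.5, 98.0, 90.0):
        print(measured, service_credit(10_000.0, measured))
```

Because the credit is bounded by the bill for the affected service, the function can never return more than `monthly_bill`, regardless of what the outage cost the customer.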


SLAs by operator class

Operator type | Primary SLA scope | Distinctive concerns
Retail colocation | Facility power and cooling availability; cross-connect uptime | Tenant-installed IT is the tenant's responsibility; SLA covers the box, not what's in it
Wholesale colocation | Block power, cooling, network infrastructure | Hyperscale tenants typically negotiate custom SLAs above provider baseline
Hyperscaler IaaS | Compute, storage, network availability per service tier | Tiered by deployment pattern; service credits as remedy
Hyperscaler PaaS / SaaS | Service-level availability and performance | SLAs at higher abstraction; broader scope; complex breach attribution
AI inference services | Token throughput, time-to-first-token, p99 latency, availability | Latency-sensitive; capacity-constrained; new SLO categories without long history
AI training services (neo-cloud) | GPU allocation availability; cluster networking performance | Capacity contracts more important than uptime; lengthy outages may be acceptable if amortized over training run
Sovereign cloud | Standard cloud SLAs plus residency, access control, citizenship guarantees | Compliance commitments often more critical than availability commitments
Enterprise self-hosted | Internal SLAs to business units; no external customer SLA | Internal "credits" are organizational; politics and accountability rather than financial

AI inference SLOs

AI inference services have introduced SLO categories that traditional cloud computing SLAs didn't address. The metrics that matter for inference are not just availability but also throughput, latency at multiple percentiles, and capacity guarantees - and the SLOs vary substantially by model size, hardware tier, and customer commitment level.

SLO | What it measures | Why it matters for inference
Time to first token (TTFT) | Latency from request to first output token | Dominant user-experience metric for chat and streaming inference; sub-second target standard
Inter-token latency / TPOT | Time between consecutive output tokens | Drives perceived response speed; tens of milliseconds typical target
Throughput (tokens per second) | Tokens delivered per second per request or aggregate | Capacity planning; per-customer commitment; concurrent request handling
P99 latency | 99th percentile request completion time | Captures tail latency; matters for time-sensitive applications
Capacity availability | Percentage of requests served vs throttled or rejected | During high-demand periods; capacity-constrained models often have explicit throttling
Provisioned throughput SLA | Reserved capacity commitment for enterprise customers | Premium tier; bypasses general capacity throttling
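
The latency SLIs in the table can be derived from per-token timestamps. The following is a minimal sketch under stated assumptions: timestamps are in seconds on a single clock, the function names are hypothetical, and the percentile uses the nearest-rank method (monitoring tools may interpolate differently).

```python
import math
from typing import Sequence


def time_to_first_token(request_time: float, token_times: Sequence[float]) -> float:
    """TTFT: delay between sending the request and receiving the first output token."""
    if not token_times:
        raise ValueError("no tokens were produced")
    return token_times[0] - request_time


def mean_inter_token_latency(token_times: Sequence[float]) -> float:
    """Average gap between consecutive output tokens (TPOT) for one response."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for P99 latency."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


if __name__ == "__main__":
    request_sent = 0.0
    tokens_at = [0.40, 0.45, 0.50, 0.55]  # illustrative arrival times in seconds
    print("TTFT:", time_to_first_token(request_sent, tokens_at))
    print("TPOT:", mean_inter_token_latency(tokens_at))
    print("P99 of 1..100:", percentile(list(range(1, 101)), 99))
```

TTFT and TPOT are computed per request; an SLO would then typically be stated over a percentile of those per-request values across a measurement window.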

Error budgets and SRE practice

The error budget concept (originated at Google and codified in its Site Reliability Engineering book) treats the gap between 100% availability and the SLO as a finite resource that engineering can spend. If the SLO is 99.9% (allowing ~8.76 hours of downtime per year), the team has an "error budget" of those 8.76 hours to consume through deployments, feature changes, or incidents. Exhausting the error budget typically triggers a freeze on new deployments until reliability is restored. The framework converts the abstract reliability conversation into a concrete tradeoff between feature velocity and operational stability.
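
The budget arithmetic can be sketched in a few lines. This is an illustration rather than a prescribed policy: the 30-day rolling window, the function names, and the rule of freezing exactly when the budget reaches zero are all assumptions.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO permits over the window."""
    if not 0.0 <= slo_pct < 100.0:
        raise ValueError("slo_pct must be in [0, 100); a 100% SLO leaves no budget")
    return (1.0 - slo_pct / 100.0) * window_days * 24 * 60


def budget_remaining_fraction(slo_pct: float, downtime_minutes: float,
                              window_days: int = 30) -> float:
    """Fraction of the budget still unspent; negative when the SLO is already missed."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo_pct, window_days)


def should_freeze_deployments(slo_pct: float, downtime_minutes: float,
                              window_days: int = 30) -> bool:
    """Hypothetical policy: freeze releases once the budget is fully consumed."""
    return budget_remaining_fraction(slo_pct, downtime_minutes, window_days) <= 0.0


if __name__ == "__main__":
    print(error_budget_minutes(99.9))               # budget for a 30-day window
    print(budget_remaining_fraction(99.9, 21.6))    # half the budget spent
    print(should_freeze_deployments(99.9, 50.0))    # budget exhausted
```

Teams that negotiate exceptions or gate releases on budget status would replace the boolean with a threshold or an approval step, but the underlying quantity is the same.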

SRE practice has spread well beyond Google. Most hyperscalers, large neo-clouds, and major SaaS providers operate SRE teams with formalized error budget policies. The specific implementations vary: some teams operate strict freezes, others negotiate exceptions, and some use the error budget as an input to release approval processes. The core idea (a concrete reliability target as currency for engineering decisions) is now standard practice. Vendors such as Nobl9, Datadog, Grafana, and Honeycomb provide SLO tooling; OpenSLO offers a vendor-neutral specification for declaring SLOs, and OpenTelemetry is a common source of the telemetry from which SLIs are computed.


Measurement and telemetry

SLI accuracy depends on telemetry quality. The infrastructure that produces SLIs spans multiple monitoring layers: Power Monitoring for facility availability evidence, Cooling Monitoring for thermal envelope compliance, Telemetry & Observability for application-layer measurement, Network Operations for network-layer measurement, and customer-side synthetic probes for end-to-end measurement that captures the customer experience rather than the operator's internal view. The relationship between operator-internal and customer-side measurement is itself a source of SLA disputes - the operator's metrics may show SLA compliance while the customer's experience shows breach, often due to network paths or client-side issues that the operator's measurement doesn't capture.
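
The divergence between the two vantage points can be made concrete by computing the same availability SLI from each set of probes. This is a minimal sketch: the probe results are boolean success flags, the names are hypothetical, and real measurement would also weight by time window and probe location.

```python
from typing import Dict, Sequence, Union


def availability_sli(probe_ok: Sequence[bool]) -> float:
    """Percentage of probes that succeeded."""
    if not probe_ok:
        raise ValueError("no probe results")
    return 100.0 * sum(probe_ok) / len(probe_ok)


def compare_views(operator_probes: Sequence[bool],
                  customer_probes: Sequence[bool],
                  sla_pct: float) -> Dict[str, Union[float, bool]]:
    """Evaluate the same SLA against operator-internal and customer-side probes."""
    operator_sli = availability_sli(operator_probes)
    customer_sli = availability_sli(customer_probes)
    operator_ok = operator_sli >= sla_pct
    customer_ok = customer_sli >= sla_pct
    return {
        "operator_sli": operator_sli,
        "customer_sli": customer_sli,
        "operator_meets_sla": operator_ok,
        "customer_meets_sla": customer_ok,
        "disputed": operator_ok != customer_ok,  # the two views disagree on breach
    }


if __name__ == "__main__":
    internal = [True] * 1000                 # operator sees every probe succeed
    external = [True] * 995 + [False] * 5    # customer-side probes see failures
    print(compare_views(internal, external, sla_pct=99.9))
```

Which measurement governs a credit claim is a contractual question; the sketch only shows how the same period can read as compliant from one side and as a breach from the other.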


SLA exclusions and force majeure

SLAs typically include exclusions that can substantially limit when service credits actually apply. Standard exclusions include scheduled maintenance (announced in advance), force majeure (natural disasters, pandemics, war), customer-side issues (configuration errors, network problems on the customer's side), DDoS attacks against the customer's specific resources, and third-party service outages (DNS, certificate authorities, upstream providers). The exclusion language is heavily negotiated for enterprise contracts and represents a real source of dispute when major incidents occur. Customer-side legal review of SLA exclusions is a standard part of cloud procurement, and the gap between marketed SLA percentages and effective post-exclusion coverage is often substantial.
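
The effect of exclusions on the reported number can be sketched by computing availability twice: once over all downtime and once over only the downtime the contract counts. The convention used here (excluded minutes are dropped from downtime but the measurement period is unchanged) is an assumption; some contracts also remove excluded time from the period itself.

```python
from typing import Iterable, Tuple

# (duration in minutes, True if the outage falls under a contractual exclusion)
Outage = Tuple[float, bool]


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability as a percentage of the measurement period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


def raw_and_contractual(total_minutes: float,
                        outages: Iterable[Outage]) -> Tuple[float, float]:
    """Return (experienced availability, availability as counted under the SLA)."""
    outage_list = list(outages)
    all_downtime = sum(minutes for minutes, _ in outage_list)
    counted_downtime = sum(minutes for minutes, excluded in outage_list if not excluded)
    return (availability_pct(total_minutes, all_downtime),
            availability_pct(total_minutes, counted_downtime))


if __name__ == "__main__":
    month = 30 * 24 * 60  # 43,200 minutes
    outages = [
        (60.0, True),   # scheduled maintenance, excluded
        (30.0, False),  # unplanned outage, counted
    ]
    raw, contractual = raw_and_contractual(month, outages)
    print(f"experienced: {raw:.3f}%  contractual: {contractual:.3f}%")
```

In this illustrative month the customer experiences roughly 99.79% availability while the contractual figure stays above 99.9%, so a 99.9% SLA would not be breached.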


SLA reporting and compliance

SLA performance becomes audit evidence in multiple compliance frameworks. SOC 2 expects documented service performance where its Availability criteria are in scope; ISO 27001 covers information security management, of which availability is one component; ISO/IEC 20000 specifically covers IT service management with SLA discipline as a core requirement. SEC Climate Disclosure is sometimes cited alongside these, although its focus is climate-related risk and emissions rather than service performance. Compliance customers (financial services, government, healthcare) often require SLA reporting cadence and auditability above what general-purpose customers receive. The reporting and audit-evidence flow connects to GRC:Compliance and GRC:Auditability; the specific control catalogs that require SLA evidence connect to GRC:Controls.


Where this fits

SLA/SLO management is the commitment-and-measurement discipline operated under Compute Ops. The underlying resilience that delivers commitments is operated under Resilience & Reliability, in Facility Ops for facility-layer concerns and in Compute Ops for the IT and application layer. The telemetry that produces SLIs flows from Telemetry & Observability. Predictive operations (preventing breaches before they occur) is the role of AIOps and Production Reliability Engineering. SLA reporting becomes audit evidence in GRC:Compliance and GRC:Auditability. The contractual structure of SLAs and exclusions ties to commercial operations and is part of customer onboarding under Business Models.


Related coverage

Compute Ops | Production Reliability Engineering | AIOps | Telemetry & Observability | Resilience & Reliability | AI Inference | Compliance | Auditability | Controls | Business Models