DC SLA & SLO Management

SLA and SLO management is the commitment-and-measurement discipline that translates the underlying operational performance of the data center and its services into contractual promises and measurable targets. The discipline lives in Compute Ops rather than facility operations because SLAs are typically expressed at the service layer (IaaS, SaaS, AI inference, colocation tenancy) rather than the facility infrastructure layer. The underlying resilience that delivers the commitments is operated separately under Resilience & Reliability; the compliance evidence of SLA performance flows to GRC:Compliance. This page covers how SLAs and SLOs are structured, measured, and managed at modern operator scale.

SLA, SLO, SLI

Term	What it is	Audience
SLA (Service Level Agreement)	Contractual commitment to customers; specifies performance, breach consequences, exclusions	External; customer-facing
SLO (Service Level Objective)	Internal target, typically tighter than SLA; provides operational margin	Internal; engineering and operations
SLI (Service Level Indicator)	The measured value (for example, observed availability or latency) that is compared against the SLO and SLA	Internal observability and external dashboards
Error budget	Allowed underperformance within SLO; consumed by incidents and feature deployments	SRE, engineering management

Uptime targets

Availability	Annual downtime allowed	Where used
99.9% (three nines)	8.76 hours	Single-instance enterprise IT; some single-region cloud services
99.95% (three-and-a-half nines)	4.38 hours	Single-region cloud compute (Azure availability set); enterprise SaaS
99.99% (four nines)	52.6 minutes	Standard hyperscaler regional services (AWS EC2 region, GCP regional, Azure availability zones)
99.999% (five nines)	5.26 minutes	Multi-region cloud services; financial services platforms; telecom critical paths
99.9999% (six nines)	31.5 seconds	Rarely formal SLA; some defense and trading internal targets; aspirational rather than contracted
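
The downtime column above is plain arithmetic on the availability percentage. A minimal sketch of that calculation, assuming a 365-day year (the function name is illustrative, not taken from any provider's SLA):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; a non-leap year is assumed


def annual_downtime_hours(availability_pct: float) -> float:
    """Return the downtime per year that an availability target permits."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# Print the allowance for each tier in hours, minutes, and seconds.
for target in (99.9, 99.95, 99.99, 99.999, 99.9999):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.4f} h = {hours * 60:.2f} min = {hours * 3600:.1f} s")
```

Many cloud SLAs are measured per monthly billing period rather than per year, so the same percentage yields a proportionally smaller allowance in any single month.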

Hyperscaler SLA structures

Real hyperscaler SLAs are tiered by service architecture choice. The customer's deployment pattern (single instance, multiple instances in one zone, multiple zones, multiple regions) determines the SLA they receive, with progressive tiers reflecting the resilience that the deployment pattern provides. The structure is consistent across major cloud providers though specific percentages vary.

Service tier	Typical SLA	Service credit structure
Single VM / instance	99.5-99.9% (varies by provider)	10% of monthly bill credit at first breach band; 25% at second; 100% at severe
Multiple instances in single zone	99.95% common	Tiered service credits scaling with breach severity
Multi-AZ deployment	99.99% standard	Tiered credits; reduced exclusions
Multi-region deployment	99.99% or higher; some services 99.999% with multi-region active-active	Highest credit tier; broadest coverage
Object storage	99.9% availability; 99.999999999% durability ("eleven nines") for stored data	Service credits for availability; durability is a separate commitment
DNS	100% availability claim by major providers	Service credits triggered on any unavailability
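
One way to see why the tiers rise with the deployment pattern is the textbook availability model for redundant and chained components. The sketch below assumes failures are statistically independent, which real zones and regions only approximate; it is an illustration of the reasoning, not how any provider derives its published SLA figures.

```python
def parallel_availability(per_replica: float, replicas: int) -> float:
    """Availability of redundant replicas where any single one can serve traffic.

    Assumes independent failures (an optimistic simplification).
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1 - (1 - per_replica) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Two independent zones at 99.9% each, then the same pair behind a 99.99% load balancer.
two_zones = parallel_availability(0.999, 2)
print(f"two zones: {two_zones:.6f}")
print(f"behind load balancer: {serial_availability(two_zones, 0.9999):.6f}")
```

Correlated failures (a shared control plane, a bad regional deployment) make real-world figures lower than this model suggests, which is one reason contracted SLAs sit below the theoretical number.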

Service credits are the standard remedy for SLA breaches and are typically structured as a percentage of the monthly bill for the affected service. Credits cap at some percentage (often 100%, sometimes lower) regardless of how severe the breach was. Most enterprise customers find the credit value substantially less than the actual cost of an outage, which is why customers with critical workloads architect for resilience above the SLA tier rather than relying on credits for protection.
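
The tiered credit structure can be sketched as a lookup over breach bands. The credit percentages below mirror the 10% / 25% / 100% pattern in the table above; the availability thresholds that separate the bands are hypothetical placeholders, since each provider publishes its own.

```python
# (measured monthly availability below this %, credit as % of monthly bill)
# Thresholds are illustrative assumptions, not any provider's actual bands.
CREDIT_BANDS = [
    (95.0, 100.0),   # severe breach
    (99.0, 25.0),    # second breach band
    (99.99, 10.0),   # first breach band
]


def service_credit(monthly_bill: float, measured_availability_pct: float,
                   bands=CREDIT_BANDS, cap_pct: float = 100.0) -> float:
    """Return the credit owed for one billing month, capped at cap_pct of the bill."""
    # Check the most severe band first so the largest applicable credit wins.
    for threshold, credit_pct in sorted(bands):
        if measured_availability_pct < threshold:
            return monthly_bill * min(credit_pct, cap_pct) / 100
    return 0.0  # SLA met, no credit


print(service_credit(20_000, 99.5))    # first breach band
print(service_credit(20_000, 98.0))    # second breach band
print(service_credit(20_000, 99.995))  # SLA met
```

Comparing the returned credit with an estimate of revenue lost during the same outage makes the point in the paragraph above concrete: the credit is bounded by the bill, while the cost of the outage is not.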

SLAs by operator class

Operator type	Primary SLA scope	Distinctive concerns
Retail colocation	Facility power and cooling availability; cross-connect uptime	Tenant-installed IT is the tenant's responsibility; SLA covers the box, not what's in it
Wholesale colocation	Block power, cooling, network infrastructure	Hyperscale tenants typically negotiate custom SLAs above provider baseline
Hyperscaler IaaS	Compute, storage, network availability per service tier	Tiered by deployment pattern; service credits as remedy
Hyperscaler PaaS / SaaS	Service-level availability and performance	SLAs at higher abstraction; broader scope; complex breach attribution
AI inference services	Token throughput, time-to-first-token, p99 latency, availability	Latency-sensitive; capacity-constrained; new SLO categories without long history
AI training services (neo-cloud)	GPU allocation availability; cluster networking performance	Capacity contracts more important than uptime; lengthy outages may be acceptable if amortized over a training run
Sovereign cloud	Standard cloud SLAs plus residency, access control, citizenship guarantees	Compliance commitments often more critical than availability commitments
Enterprise self-hosted	Internal SLAs to business units; no external customer SLA	Internal "credits" are organizational; the consequences of a breach are political and about accountability rather than financial

AI inference SLOs

AI inference services have introduced SLO categories that traditional cloud computing SLAs didn't address. The metrics that matter for inference are not just availability but also throughput, latency at multiple percentiles, and capacity guarantees - and the SLOs vary substantially by model size, hardware tier, and customer commitment level.

SLO	What it measures	Why it matters for inference
Time to first token (TTFT)	Latency from request to first output token	Dominant user-experience metric for chat and streaming inference; sub-second target standard
Inter-token latency / TPOT	Time between consecutive output tokens	Drives perceived response speed; tens of milliseconds typical target
Throughput (tokens per second)	Tokens delivered per second per request or aggregate	Capacity planning; per-customer commitment; concurrent request handling
P99 latency	99th percentile request completion time	Captures tail latency; matters for time-sensitive applications
Capacity availability	Percentage of requests served vs throttled or rejected	During high-demand periods; capacity-constrained models often have explicit throttling
Provisioned throughput SLA	Reserved capacity commitment for enterprise customers	Premium tier; bypasses general capacity throttling
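
The latency metrics in the table above can be derived from per-token timestamps recorded for each request. A minimal sketch, assuming timestamps in seconds and a nearest-rank percentile definition (monitoring tools differ in how they interpolate percentiles); all function names are illustrative.

```python
import math


def ttft(request_time: float, token_times: list[float]) -> float:
    """Time to first token: delay between the request and the first output token."""
    return token_times[0] - request_time


def mean_inter_token_latency(token_times: list[float]) -> float:
    """Average gap between consecutive output tokens (TPOT)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Smallest sample such that at least pct% of samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# One streamed request: sent at t=0.0, tokens arriving at these times.
token_times = [0.40, 0.45, 0.50, 0.55]
print(f"TTFT: {ttft(0.0, token_times):.2f} s")
print(f"mean inter-token latency: {mean_inter_token_latency(token_times) * 1000:.0f} ms")

# Completion times for several requests; one slow outlier dominates the tail.
completion_times = [1.2, 0.9, 1.1, 4.8, 1.0]
print(f"p99 completion: {percentile_nearest_rank(completion_times, 99)} s")
```

With small sample counts the p99 is simply the worst request, which is why tail-latency SLOs are normally evaluated over windows containing thousands of requests.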

Error budgets and SRE practice

The error budget concept (originated at Google, codified in the SRE book) treats the gap between 100% availability and the SLO as a finite resource that engineering can spend. If the SLO is 99.9% (allowing ~8.76 hours of downtime per year), the team has an "error budget" of those 8.76 hours that it can consume through deployments, feature changes, or incidents. Exhausting the error budget typically triggers a freeze on new deployments until reliability is restored. The framework converts the abstract reliability conversation into a concrete tradeoff between feature velocity and operational stability.
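
The budget arithmetic and the freeze rule described above can be sketched in a few lines. The 30-day default window and the function names are assumptions for illustration; teams choose their own window and policy.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Total downtime the SLO permits over the window, in minutes."""
    if not 0 < slo_pct < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return (1 - slo_pct / 100) * window_days * 24 * 60


def budget_remaining_fraction(slo_pct: float, downtime_minutes: float,
                              window_days: int = 30) -> float:
    """Fraction of the budget still unspent; negative once the SLO is breached."""
    return 1 - downtime_minutes / error_budget_minutes(slo_pct, window_days)


def deploys_frozen(slo_pct: float, downtime_minutes: float,
                   window_days: int = 30) -> bool:
    """Strict policy: freeze deployments as soon as the budget is exhausted."""
    return budget_remaining_fraction(slo_pct, downtime_minutes, window_days) <= 0


print(f"99.9% over 365 days: {error_budget_minutes(99.9, 365) / 60:.2f} h")
print(f"99.9% over 30 days: {error_budget_minutes(99.9):.1f} min")
print(f"after 10 min of downtime: {budget_remaining_fraction(99.9, 10):.0%} left")
print(f"frozen after 50 min? {deploys_frozen(99.9, 50)}")
```

A strict freeze is only one possible policy; the same remaining-fraction value can instead feed a release approval step or an alert threshold.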

SRE practice has spread well beyond Google. Most hyperscalers, large neo-clouds, and major SaaS providers operate SRE teams with formalized error budget policies. The specific implementations vary: some teams operate strict freezes, others negotiate exceptions, and some use the error budget as one input to release approval processes. The core idea (a concrete reliability target as currency for engineering decisions) is now standard practice. Vendors such as Nobl9, Datadog, Grafana, and Honeycomb provide SLO tooling; OpenSLO is a vendor-neutral specification for declaring SLOs, and OpenTelemetry is a common source of the telemetry from which SLIs are computed.

Measurement and telemetry

SLI accuracy depends on telemetry quality. The infrastructure that produces SLIs spans multiple monitoring layers: Power Monitoring for facility availability evidence, Cooling Monitoring for thermal envelope compliance, Telemetry & Observability for application-layer measurement, Network Operations for network-layer measurement, and customer-side synthetic probes for end-to-end measurement that captures the customer experience rather than the operator's internal view. The relationship between operator-internal and customer-side measurement is itself a source of SLA disputes - the operator's metrics may show SLA compliance while the customer's experience shows breach, often due to network paths or client-side issues that the operator's measurement doesn't capture.
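
The gap between operator-internal and customer-side measurement can be made visible by computing the same availability SLI from both probe sources and listing the minutes where they disagree. A minimal sketch, assuming one probe result per minute from each side; the `ProbeResult` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    minute: int  # minute offset within the measurement window
    ok: bool     # True if the probe succeeded


def availability_pct(results: list[ProbeResult]) -> float:
    """Availability SLI: share of successful probes, as a percentage."""
    if not results:
        raise ValueError("no probe results")
    good = sum(1 for r in results if r.ok)
    return 100 * good / len(results)


def disputed_minutes(operator: list[ProbeResult],
                     customer: list[ProbeResult]) -> list[int]:
    """Minutes the operator recorded as healthy but the customer probe recorded as failed."""
    operator_ok = {r.minute for r in operator if r.ok}
    return sorted(r.minute for r in customer if not r.ok and r.minute in operator_ok)


# Operator view: healthy for all ten minutes. Customer view: failures at minutes 3 and 4.
operator = [ProbeResult(m, True) for m in range(10)]
customer = [ProbeResult(m, m not in (3, 4)) for m in range(10)]

print(availability_pct(operator), availability_pct(customer))
print(disputed_minutes(operator, customer))
```

The disputed minutes are the ones worth investigating first, since they usually point at network paths or client-side conditions outside the operator's measurement boundary.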

SLA exclusions and force majeure

SLAs typically include exclusions that can substantially limit when service credits actually apply. Standard exclusions include scheduled maintenance (announced in advance), force majeure (natural disasters, pandemic, war), customer-side issues (configuration errors, network problems on the customer's side), DDoS attacks against the customer's specific resources, and third-party service outages (DNS, certificate authorities, upstream providers). The exclusion language is heavily negotiated for enterprise contracts and represents a real source of dispute when major incidents occur. Customer-side legal review of SLA exclusions is a standard part of cloud procurement, and the gap between marketed SLA percentages and effective post-exclusion coverage is often substantial.
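
The difference between raw and post-exclusion availability can be illustrated by subtracting excluded windows from recorded outages before computing the percentage. This is a simplified sketch: intervals are (start, end) minute offsets, exclusion windows are assumed not to overlap each other, and the helper names are illustrative rather than drawn from any contract.

```python
Interval = tuple[int, int]  # (start_minute, end_minute), end exclusive


def overlap_minutes(a: Interval, b: Interval) -> int:
    """Length of the intersection of two intervals, in minutes."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def creditable_downtime(outages: list[Interval], exclusions: list[Interval]) -> int:
    """Downtime minutes left after removing time covered by exclusion windows.

    Assumes exclusion windows do not overlap one another.
    """
    total = 0
    for outage in outages:
        excluded = sum(overlap_minutes(outage, ex) for ex in exclusions)
        total += (outage[1] - outage[0]) - excluded
    return total


def availability_from_downtime(downtime_minutes: float, window_minutes: float) -> float:
    """Availability percentage for a window with the given downtime."""
    return 100 * (1 - downtime_minutes / window_minutes)


WINDOW = 30 * 24 * 60                  # one 30-day month in minutes
outages = [(100, 160), (1000, 1090)]   # 150 minutes of raw downtime
maintenance = [(90, 130)]              # announced maintenance window

raw = sum(end - start for start, end in outages)
effective = creditable_downtime(outages, maintenance)
print(f"raw: {availability_from_downtime(raw, WINDOW):.4f}%")
print(f"after exclusions: {availability_from_downtime(effective, WINDOW):.4f}%")
```

The post-exclusion figure is the one the contract evaluates against the SLA threshold, so a month that looks like a breach on raw downtime may not qualify for a credit.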

SLA reporting and compliance

SLA performance becomes audit evidence in multiple compliance frameworks. SOC 2 expects documented service performance; ISO/IEC 27001 covers information security management, which includes availability; ISO/IEC 20000 specifically covers IT service management with SLA discipline as a core requirement; SEC Climate Disclosure includes operational performance disclosures for material services. Compliance customers (financial services, government, healthcare) often require SLA reporting cadence and auditability above what general-purpose customers receive. The reporting and audit-evidence flow connects to GRC:Compliance and GRC:Auditability; the specific control catalogs that require SLA evidence connect to GRC:Controls.

Where this fits

SLA/SLO management is the commitment-and-measurement discipline operated under Compute Ops. The underlying resilience that delivers commitments is operated under Resilience & Reliability in FACILITY OPS for facility-layer concerns and Compute Ops for IT/application layer. The telemetry that produces SLIs flows from Telemetry & Observability. Predictive operations (preventing breaches before they occur) is the role of AIOps and Production Reliability Engineering. SLA reporting becomes audit evidence in GRC:Compliance and GRC:Auditability. The contractual structure of SLAs and exclusions ties to commercial operations and is part of customer onboarding under Business Models.
