Data Center Inference Overview
Inference is the real-time or near–real-time execution of trained models to produce predictions, generations, or decisions. Unlike training (long-lived, batch-optimized), inference is user-facing and SLO-driven, with tight latency, cost, and reliability constraints across web, mobile, enterprise, and edge applications.
Scope & Definitions
- Online inference: Single-request, low-latency (interactive apps, APIs).
- Streaming inference: Token-by-token or event streams (chat, transcription, RTP/RTMP).
- Batch/offline inference: High-throughput, relaxed latency (ETL scoring, nightly refresh).
- Near-edge/edge inference: Close to devices/users for sub-20 ms needs (AR/VR, robotics).
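In code terms, the first three modes differ mainly in call shape. A minimal sketch (the model calls are placeholder stand-ins, not a real serving API):

```python
# Minimal sketch of the serving shapes above; model_call / model_stream are
# placeholder stand-ins, not a real inference API.
from typing import Iterable, Iterator

def model_call(prompt: str) -> str:
    """Stand-in for a single forward pass / generation call."""
    return f"prediction for: {prompt}"

def model_stream(prompt: str) -> Iterator[str]:
    """Stand-in for token-by-token generation."""
    for token in ("partial", "tokens", "as", "they", "arrive"):
        yield token

def online_inference(request: str) -> str:
    """Online: one request in, one low-latency response out."""
    return model_call(request)

def streaming_inference(request: str) -> Iterator[str]:
    """Streaming: yield tokens/events as produced (chat, transcription)."""
    yield from model_stream(request)

def batch_inference(requests: Iterable[str]) -> list[str]:
    """Batch/offline: throughput-first scoring with relaxed latency."""
    return [model_call(r) for r in requests]

if __name__ == "__main__":
    print(online_inference("hello"))
    print(" ".join(streaming_inference("hello")))
    print(batch_inference(["a", "b", "c"]))
```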
Inference Deployments
AI inference workloads can be deployed across multiple environments — from hyperscale data centers to edge nodes and end-user devices. Each deployment model balances latency, cost, compliance, and scalability differently. This page provides a high-level comparison of inference across hyperscale, on-prem, edge, and device contexts.
| Deployment Context | Focus | Latency Target | Scale | Best-Fit Use Cases | Examples |
|---|---|---|---|---|---|
| Hyperscale Data Centers | Global inference APIs, elastic scale | 100–200 ms | Thousands of racks, millions of req/s | LLM APIs, SaaS copilots, search & ads | OpenAI API, Anthropic Claude, Google Vertex AI, Meta |
| Local / On-Prem DCs | Private inference, compliance-driven | 10–100 ms | Enterprise-scale clusters (100s–1000s GPUs) | Healthcare, finance, gov, legal | JP Morgan Athena AI, Epic EHR AI Assist, DoD AI pilots |
| Edge Data Centers | Low-latency inference close to users | <20 ms | 50 kW–multi-MW metro nodes | AR/VR, robotics, V2X, smart cities | AWS Wavelength, Cloudflare Workers AI, NVIDIA Metropolis |
| On Devices | Ultra-low latency, offline privacy | <10 ms | Billions of devices worldwide | Smartphones, robotaxis, humanoids, IoT | Apple Neural Engine, Tesla FSD, NVIDIA Jetson, Qualcomm NPUs |
Key Takeaways
- Hyperscale: Best for elastic workloads and global APIs.
- On-Prem: Best for compliance, data sovereignty, and integration with enterprise IT.
- Edge: Best for latency-critical use cases (mobility, AR/VR, industrial IoT).
- Devices: Best for offline, privacy-preserving, and ultra-fast inference.
Latency Tiers & SLO Heuristics
| Tier | Target p95 | Typical Use | Infra Implications |
|---|---|---|---|
| Ultra-low | < 20 ms | AR/VR, industrial control, trading | Edge GPUs/ASICs, on-die/near-memory, lightweight models |
| Interactive | 20–200 ms | Search, recsys, autocomplete, fraud checks | In-memory features/vector DB, CPU+GPU mix, caching |
| Conversational | 200–1000 ms TTFB; streaming tokens | Chat, copilots, agents | High-throughput GPUs, KV-cache, quantized models, speculative decoding |
| Batch | Seconds–hours | Backfills, risk scoring, embeddings refresh | Throughput-first schedulers, cheaper accelerators/CPU fleets |
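The tier boundaries above can double as routing data. The sketch below is a hypothetical helper (not part of any serving stack) that buckets a workload into a tier from the p95 latency it requires; the bounds mirror the table, and the backend strings paraphrase the "Infra Implications" column:

```python
# Illustrative only: encode the tier table above and classify a workload by
# the p95 latency it requires.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    p95_ceiling_ms: float   # upper edge of the tier's target p95 range
    backend: str            # shorthand infrastructure choice

TIERS = [
    Tier("ultra-low", 20, "edge GPU/ASIC, lightweight model"),
    Tier("interactive", 200, "CPU+GPU mix, in-memory features, caching"),
    Tier("conversational", 1000, "GPU pool, KV cache, streaming decode"),
    Tier("batch", float("inf"), "throughput-first scheduler"),
]

def classify(required_p95_ms: float) -> Tier:
    """Return the first tier whose p95 ceiling covers the requirement."""
    for tier in TIERS:
        if required_p95_ms <= tier.p95_ceiling_ms:
            return tier
    return TIERS[-1]

if __name__ == "__main__":
    for need in (15, 150, 600, 60_000):
        tier = classify(need)
        print(f"needs p95 <= {need} ms -> {tier.name} ({tier.backend})")
```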
SLA, SLO, and SLI Explained
Inference workloads are often governed by strict performance and reliability targets. These are described using three related concepts:
| Term | Definition | Example in Inference |
|---|---|---|
| SLI (Service Level Indicator) | The actual metric being measured. | p95 latency of inference API calls |
| SLO (Service Level Objective) | The internal engineering target for the SLI. | p95 latency should be < 200 ms |
| SLA (Service Level Agreement) | The external customer-facing contract or promise, often with penalties. | 99.9% of inference API calls under 200 ms per month |
Key point: SLIs are metrics, SLOs are goals, and SLAs are binding commitments. Inference systems are usually engineered to meet SLOs so that SLAs are never breached.
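A minimal sketch, assuming per-request latency samples and the illustrative 200 ms / 99.9% figures from the table above, of how the three terms map onto code:

```python
# SLI = measured metric, SLO = internal target, SLA = external commitment.
# Latency samples are synthetic; thresholds are the illustrative ones above.
import random

def p95(samples: list[float]) -> float:
    """SLI: measured 95th-percentile latency (nearest-rank method)."""
    ordered = sorted(samples)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]

SLO_P95_MS = 200.0        # internal engineering target for the SLI
SLA_THRESHOLD_MS = 200.0  # customer-facing latency bound
SLA_FRACTION = 0.999      # 99.9% of calls must stay under the bound

if __name__ == "__main__":
    random.seed(0)
    latencies = [random.gauss(100, 20) for _ in range(10_000)]  # fake samples

    sli = p95(latencies)
    slo_met = sli < SLO_P95_MS
    sla_ratio = sum(l < SLA_THRESHOLD_MS for l in latencies) / len(latencies)
    sla_met = sla_ratio >= SLA_FRACTION

    print(f"SLI  p95 = {sli:.1f} ms")
    print(f"SLO  met = {slo_met} (target < {SLO_P95_MS} ms)")
    print(f"SLA  {sla_ratio:.2%} under {SLA_THRESHOLD_MS} ms -> met = {sla_met}")
```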
Reference Architecture
- Ingress/API: Gateways, auth, rate limits, schema validation.
- Router/Orchestrator: Policy-based model selection (cost, latency, accuracy), A/B and canary.
- Feature/Context Layer: Vector DB + RAG, feature store, session/KV cache, retrieval pipelines.
- Model Serving: GPU/CPU/ASIC backends; tensor/LLM serving with tensor, pipeline, and sequence parallelism (TP/PP/SP) via runtimes such as vLLM and TensorRT-LLM.
- Post-processing: Re-ranking, safety filters, tool-use, function calling.
- Observability: Latency/throughput, token/utilization, quality evals, drift detection.
- Cost Control: Token budgets, max context, batching, quantization/distillation, autoscaling.
- Resilience: Multi-region, circuit breakers, shadow traffic, graceful degradation (fallback models).
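The sketch below strings these layers into one request path. Every component is a stub (hypothetical helpers, not real gateway, vector-DB, or serving APIs); it only shows the order of operations and where observability hooks in:

```python
# Simplified request flow: ingress -> router -> retrieval -> serving ->
# post-processing, with a latency metric emitted at the end. All stubs.
import time

def ingress(request: dict) -> dict:
    """Gateway: auth + schema validation (stubbed)."""
    assert "user" in request and "prompt" in request, "schema validation failed"
    return request

def route(request: dict) -> str:
    """Router: crude policy-based model selection on prompt length."""
    return "small-model" if len(request["prompt"]) < 200 else "large-model"

def retrieve_context(prompt: str) -> list[str]:
    """Feature/context layer: stand-in for a vector-DB / RAG lookup."""
    return [f"doc snippet relevant to: {prompt[:30]}"]

def serve(model: str, prompt: str, context: list[str]) -> str:
    """Model serving backend (stubbed forward pass)."""
    return f"[{model}] answer using {len(context)} retrieved snippet(s)"

def post_process(text: str) -> str:
    """Post-processing: a toy safety filter."""
    return text.replace("password", "[redacted]")

def handle(request: dict) -> str:
    start = time.perf_counter()
    request = ingress(request)
    model = route(request)
    context = retrieve_context(request["prompt"])
    answer = post_process(serve(model, request["prompt"], context))
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"observability: model={model} latency_ms={latency_ms:.2f}")  # metrics hook
    return answer

if __name__ == "__main__":
    print(handle({"user": "u1", "prompt": "Summarize the quarterly report."}))
```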
Patterns & Design Choices
- Model selection: Small, fast default → escalate to larger models on demand (“cascade”; sketched after this list).
- Compression: Quantization (FP8/INT8/INT4), pruned adapters, distillation for throughput.
- Context mgmt: RAG with domain indexes; cache KV states for long dialogs.
- Batching vs latency: Dynamic batching for throughput; micro-batching under latency SLAs.
- Speculative decoding: Draft with a small model, then verify with the large model to cut latency.
- Caching: Prompt/result caching; embedding cache for frequent queries.
- Placement: Central (hyperscale/colo) for scale; edge/micro for deterministic low latency.
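A minimal sketch of the cascade pattern, assuming a cheap model that reports a confidence score and an expensive model used only on escalation; both models and the threshold are placeholders:

```python
# Cascade: serve most traffic from the cheap model, escalate the hard minority.
CONFIDENCE_THRESHOLD = 0.8   # escalate below this (tunable, illustrative)

def small_model(prompt: str) -> tuple[str, float]:
    """Fast, cheap model returning (answer, self-reported confidence)."""
    confidence = 0.9 if len(prompt) < 80 else 0.5   # toy heuristic
    return f"small-model answer to: {prompt}", confidence

def large_model(prompt: str) -> str:
    """Slower, more accurate model used only on escalation."""
    return f"large-model answer to: {prompt}"

def cascade(prompt: str) -> str:
    answer, confidence = small_model(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer                      # cheap path
    return large_model(prompt)             # escalation path

if __name__ == "__main__":
    print(cascade("short question"))
    print(cascade("a much longer and more ambiguous question " * 5))
```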
Bill of Materials (BOM)
| Domain | Examples | Role |
|---|---|---|
| Accelerators | NVIDIA L40S/H100/H200, AMD MI300, Intel Gaudi, Edge TPUs/NPUs | Token and tensor throughput, batch efficiency |
| Serving Runtimes | vLLM, TensorRT-LLM, Triton Inference Server, ONNX Runtime | Kernel fusion, paged KV cache, scheduling |
| Routers/Orchestrators | Custom policy engines, gateway LBs, canary controllers | Model selection, A/B, fallback, cost governance |
| Retrieval | Vector DBs (FAISS, Milvus, pgvector), feature stores | Context/RAG augmentation, personalization |
| Observability | Token meters, tracing (OpenTelemetry), eval pipelines | SLOs, drift/guardrail monitoring, cost tracking |
| Networking | RoCE/IB for GPU pools; high-fanout Ethernet for API edges | Low tail latency, efficient batching |
| Cooling | Rear-door HX, D2C liquid for dense inference racks | Thermal stability at high utilization |
Where Inference Runs (Facility Fit)
| Workload Mode | Best-Fit Types | Also Works | Notes |
|---|---|---|---|
| Interactive API | Hyperscale, Colo | AI Factory (shared), Enterprise | Global anycast, multi-region, cost-sensitive |
| Conversational/Agents | Hyperscale | Colo, Enterprise | Streaming + KV cache, safety/filter stacks |
| Edge Realtime | Edge/Micro | Metro Colo | Deterministic latency, small models, NPU/ASIC |
| Batch Scoring | Hyperscale, Enterprise | HPC (when co-located) | Throughput-first, spot capacity, CPU-heavy OK |
Training Cluster <--> Device Communication
| Direction | Flow | Mechanism | Purpose |
|---|---|---|---|
| Upstream | Sensor data, telemetry, failure cases | 5G/LTE uplink, Wi-Fi offload, batch uploads | Collect edge cases to improve central training datasets |
| Downstream | Model updates, weights, safety patches | OTA (over-the-air) updates via secure channels | Push retrained/improved models to fleets |
| Bi-Directional (Realtime) | Optional fleet coordination, map updates | Low-latency V2X, edge cache sync | Enhance situational awareness without compromising autonomy |
Challenges
- Bandwidth: Raw sensor/video data is too large for continuous uplink; selective logging and compression required.
- Latency: Safety functions must run locally; remote inference is not viable for control loops.
- Security: OTA updates require strong cryptographic signing; telemetry uplinks must protect PII/driver data.
- Versioning: Fleets may run mixed model versions; orchestration needed to manage rollouts safely.
- Feedback Loop: Fleet data → training clusters → OTA deployment is the backbone of autonomy improvement (the OTA verification step is sketched below).
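A simplified sketch of the downstream verification step: check a model artifact's integrity before installing it on a device. For brevity this uses an HMAC with a shared key from the Python standard library; real fleets would use asymmetric signatures (e.g., Ed25519) anchored in a hardware root of trust:

```python
# Simplified OTA integrity check; HMAC with a shared key is a stand-in for
# proper asymmetric code signing.
import hashlib
import hmac

FLEET_KEY = b"shared-secret-for-illustration-only"

def sign_artifact(weights: bytes) -> str:
    """Release side: produce an integrity tag for the artifact."""
    return hmac.new(FLEET_KEY, weights, hashlib.sha256).hexdigest()

def verify_and_install(weights: bytes, tag: str) -> bool:
    """Device side: install only if the tag checks out."""
    expected = hmac.new(FLEET_KEY, weights, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        print("update rejected: signature mismatch")
        return False
    print(f"installing model update ({len(weights)} bytes)")
    return True

if __name__ == "__main__":
    weights = b"\x00" * 1024                    # stand-in for model weights
    tag = sign_artifact(weights)
    verify_and_install(weights, tag)            # accepted
    verify_and_install(weights + b"x", tag)     # tampered -> rejected
```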
Cost & Efficiency Levers
- Model right-sizing: Distilled/quantized models keep most of the accuracy at 20–70% of the cost.
- Context control: Prompt compression, retrieval filters, max tokens, caching.
- Autoscaling: Predictive scale-up; protect against cold-start with warm pools.
- Batching & TPUs/ASICs: Match hardware to traffic patterns and model kernels.
- Carbon-aware routing: Shift non-urgent inference to cleaner regions/hours.
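Two of these levers, prompt/result caching and context caps, fit in a few lines. The whitespace "tokenizer", the limits, and the generation call below are all stand-ins:

```python
# Sketch of prompt/result caching plus a hard context cap.
from functools import lru_cache

MAX_CONTEXT_TOKENS = 2048      # illustrative context cap

def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: whitespace split."""
    return len(text.split())

def expensive_generate(prompt: str) -> str:
    """Stand-in for a real accelerator-backed generation call."""
    return f"generated answer for {count_tokens(prompt)} prompt tokens"

@lru_cache(maxsize=10_000)
def cached_generate(prompt: str) -> str:
    """Identical prompts hit the cache instead of the accelerator."""
    return expensive_generate(prompt)

def handle(prompt: str) -> str:
    if count_tokens(prompt) > MAX_CONTEXT_TOKENS:
        # Context control: truncate (or compress/summarize) oversized prompts.
        prompt = " ".join(prompt.split()[:MAX_CONTEXT_TOKENS])
    return cached_generate(prompt)

if __name__ == "__main__":
    print(handle("What is the refund policy?"))
    print(handle("What is the refund policy?"))   # second call served from cache
    print(cached_generate.cache_info())           # hits=1, misses=1
```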
Quality, Safety, & Evaluation
- Online evals: A/B, interleaving, reward models, human feedback loops.
- Guardrails: Content filters, jailbreak detection, PII/PHI redaction, policy prompts.
- Drift & regressions: Data shift alarms, sampling, golden sets, shadow deployments.
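A minimal golden-set regression check, as one concrete form of the drift/regression monitoring above. The golden set, the model stub, and the accuracy floor are illustrative:

```python
# Re-run a fixed prompt set with known-good answers; alarm if accuracy drops.
GOLDEN_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]
ACCURACY_FLOOR = 0.9   # alarm below this

def model(prompt: str) -> str:
    """Stand-in for the deployed model."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unsure")

def golden_eval() -> float:
    correct = sum(model(p) == expected for p, expected in GOLDEN_SET)
    return correct / len(GOLDEN_SET)

if __name__ == "__main__":
    accuracy = golden_eval()
    print(f"golden-set accuracy: {accuracy:.2f}")
    if accuracy < ACCURACY_FLOOR:
        print("ALERT: possible regression or drift -> hold the rollout")
```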
Resilience & Operations
- Multi-region active/active: Anycast, state-light design, replicated indexes/caches.
- Fallback ladders: Primary → distilled backup → template-based or rule-based last resort (sketched after this list).
- Circuit breakers: Shed load gracefully, token/time caps, backpressure.
- SLO policy: Per-route budgets for latency, cost, and quality; SRE error budgets apply.
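A sketch of a fallback ladder fronted by a crude circuit breaker: after repeated failures of the primary model, calls trip straight to the distilled backup, with a template answer as the last resort. The failure threshold and all three backends are stand-ins:

```python
# Fallback ladder with a simple failure-count circuit breaker.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures   # open = stop calling primary

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

breaker = CircuitBreaker()

def primary_model(prompt: str) -> str:
    raise TimeoutError("simulated overload")        # force the fallback path

def distilled_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"

def template_fallback(prompt: str) -> str:
    return "We can't answer right now; please try again shortly."

def handle(prompt: str) -> str:
    if not breaker.open:
        try:
            answer = primary_model(prompt)
            breaker.record(True)
            return answer
        except Exception:
            breaker.record(False)
    try:
        return distilled_backup(prompt)
    except Exception:
        return template_fallback(prompt)

if __name__ == "__main__":
    for _ in range(5):
        print(handle("status?"), "| breaker open:", breaker.open)
```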
Vendors & Ecosystem (Illustrative)
| Category | Examples | Notes |
|---|---|---|
| Model Serving | vLLM, TensorRT-LLM, Triton, ONNX Runtime | Throughput & KV-cache efficiency |
| Vector / Retrieval | FAISS, Milvus, Weaviate, pgvector, Elastic | RAG building blocks |
| Observability | OpenTelemetry, Prometheus, tracing + eval suites | Latency, tokens, quality metrics |
| Hardware | NVIDIA, AMD, Intel, edge NPUs/ASICs | Right-size per tier/region |
Sustainability
- Energy profile: Higher QPS → sustained high utilization; optimize PUE and liquid cooling.
- CFE alignment: Route batch/elastic inference to cleaner grids/hours; disclose Scope 2 per workload.
- Thermal design: D2C/immersion for dense inference fleets; rear-door HX for mixed racks.
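A sketch of carbon-aware placement for deferrable batch inference: choose the cleanest region whose queue delay still meets the deadline. Regions, intensities (gCO2/kWh), and queue times are made up for illustration:

```python
# Pick the lowest-carbon region that still meets a batch job's deadline.
REGIONS = {
    "us-east":  {"carbon_g_per_kwh": 420, "queue_hours": 1},
    "eu-north": {"carbon_g_per_kwh": 45,  "queue_hours": 6},
    "us-west":  {"carbon_g_per_kwh": 210, "queue_hours": 2},
}

def place_batch_job(deadline_hours: float) -> str:
    """Cleanest region whose queue delay still fits the deadline."""
    feasible = {
        name: info for name, info in REGIONS.items()
        if info["queue_hours"] <= deadline_hours
    }
    return min(feasible, key=lambda name: feasible[name]["carbon_g_per_kwh"])

if __name__ == "__main__":
    print(place_batch_job(deadline_hours=12))   # -> eu-north (cleanest that fits)
    print(place_batch_job(deadline_hours=1.5))  # -> us-east (only one that fits)
```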
Future Outlook
- Hybrid Inference: Workloads dynamically split between cloud, edge, and devices.
- Federated Learning & Inference: Models improving locally without centralizing data.
- ASIC Acceleration: Rise of device- and edge-optimized chips for inference.
- Sovereign AI: Nations and enterprises requiring inference within local boundaries.
- Energy Optimization: Efficiency-first design across all inference contexts.
FAQ
- How is inference different from training? Inference is SLO-driven and latency-sensitive; training is throughput- and time-to-train–driven.
- Do I need GPUs for all inference? No—CPUs can serve small models/batch; GPUs/ASICs shine for LLMs, vision, and high QPS.
- Central vs edge? Central maximizes scale/utilization; edge meets strict latency and data-locality needs.
- How to cut cost without killing quality? Distill/quantize, cache, cascade models, and use RAG to shrink prompts.
- How to keep it safe? Layer retrieval filters, prompts/guardrails, and online evals with human oversight for sensitive domains.