Data Center Inference Overview
Inference is the real-time or near–real-time execution of trained models to produce predictions, generations, or decisions. Unlike training (long-lived, batch-optimized), inference is user-facing and SLO-driven, with tight latency, cost, and reliability constraints across web, mobile, enterprise, and edge applications.
Scope & Definitions
- Online inference: Single-request, low-latency (interactive apps, APIs).
- Streaming inference: Token-by-token or event streams (chat, transcription, RTP/RTMP).
- Batch/offline inference: High-throughput, relaxed latency (ETL scoring, nightly refresh).
- Near-edge/edge inference: Close to devices/users for sub-20 ms needs (AR/VR, robotics).
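In code terms, the first three modes differ mainly in call shape. A minimal sketch (the model calls are placeholder stand-ins, not a real serving API):

```python
# Minimal sketch of the serving shapes above; model_call / model_stream are
# placeholder stand-ins, not a real inference API.
from typing import Iterable, Iterator

def model_call(prompt: str) -> str:
    """Stand-in for a single forward pass / generation call."""
    return f"prediction for: {prompt}"

def model_stream(prompt: str) -> Iterator[str]:
    """Stand-in for token-by-token generation."""
    for token in ("partial", "tokens", "as", "they", "arrive"):
        yield token

def online_inference(request: str) -> str:
    """Online: one request in, one low-latency response out."""
    return model_call(request)

def streaming_inference(request: str) -> Iterator[str]:
    """Streaming: yield tokens/events as produced (chat, transcription)."""
    yield from model_stream(request)

def batch_inference(requests: Iterable[str]) -> list[str]:
    """Batch/offline: throughput-first scoring with relaxed latency."""
    return [model_call(r) for r in requests]

if __name__ == "__main__":
    print(online_inference("hello"))
    print(" ".join(streaming_inference("hello")))
    print(batch_inference(["a", "b", "c"]))
```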
Inference Deployments
AI inference workloads can be deployed across multiple environments — from hyperscale data centers to edge nodes and end-user devices. Each deployment model balances latency, cost, compliance, and scalability differently. This page provides a high-level comparison of inference across hyperscale, on-prem, edge, and device contexts.
| Deployment Context | Focus | Latency Target | Scale | Best-Fit Use Cases | Examples |
|---|---|---|---|---|---|
| Hyperscale Data Centers | Global inference APIs, elastic scale | 100–200 ms | Thousands of racks, millions of req/s | LLM APIs, SaaS copilots, search & ads | OpenAI API, Anthropic Claude, Google Vertex AI, Meta |
| Local / On-Prem DCs | Private inference, compliance-driven | 10–100 ms | Enterprise-scale clusters (100s–1000s GPUs) | Healthcare, finance, gov, legal | JP Morgan Athena AI, Epic EHR AI Assist, DoD AI pilots |
| Edge Data Centers | Low-latency inference close to users | <20 ms | 50 kW–multi-MW metro nodes | AR/VR, robotics, V2X, smart cities | AWS Wavelength, Cloudflare Workers AI, NVIDIA Metropolis |
| On Devices | Ultra-low latency, offline privacy | <10 ms | Billions of devices worldwide | Smartphones, robotaxis, humanoids, IoT | Apple Neural Engine, Tesla FSD, NVIDIA Jetson, Qualcomm NPUs |
Key Takeaways
- Hyperscale: Best for elastic workloads and global APIs.
- On-Prem: Best for compliance, data sovereignty, and integration with enterprise IT.
- Edge: Best for latency-critical use cases (mobility, AR/VR, industrial IoT).
- Devices: Best for offline, privacy-preserving, and ultra-fast inference.
Latency Tiers & SLO Heuristics
| Tier | Target p95 | Typical Use | Infra Implications |
|---|---|---|---|
| Ultra-low | < 20 ms | AR/VR, industrial control, trading | Edge GPUs/ASICs, on-die/near-memory, lightweight models |
| Interactive | 20–200 ms | Search, recsys, autocomplete, fraud checks | In-memory features/vector DB, CPU+GPU mix, caching |
| Conversational | 200–1000 ms TTFB; streaming tokens | Chat, copilots, agents | High-throughput GPUs, KV-cache, quantized models, speculative decoding |
| Batch | Seconds–hours | Backfills, risk scoring, embeddings refresh | Throughput-first schedulers, cheaper accelerators/CPU fleets |
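The tier boundaries above can double as routing data. The sketch below is a hypothetical helper (not part of any serving stack) that buckets a workload into a tier from the p95 latency it requires; the bounds mirror the table, and the backend strings paraphrase the "Infra Implications" column:

```python
# Illustrative only: encode the tier table above and classify a workload by
# the p95 latency it requires.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    p95_ceiling_ms: float   # upper edge of the tier's target p95 range
    backend: str            # shorthand infrastructure choice

TIERS = [
    Tier("ultra-low", 20, "edge GPU/ASIC, lightweight model"),
    Tier("interactive", 200, "CPU+GPU mix, in-memory features, caching"),
    Tier("conversational", 1000, "GPU pool, KV cache, streaming decode"),
    Tier("batch", float("inf"), "throughput-first scheduler"),
]

def classify(required_p95_ms: float) -> Tier:
    """Return the first tier whose p95 ceiling covers the requirement."""
    for tier in TIERS:
        if required_p95_ms <= tier.p95_ceiling_ms:
            return tier
    return TIERS[-1]

if __name__ == "__main__":
    for need in (15, 150, 600, 60_000):
        tier = classify(need)
        print(f"needs p95 <= {need} ms -> {tier.name} ({tier.backend})")
```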
SLA, SLO, and SLI Explained
Inference workloads are often governed by strict performance and reliability targets. These are described using three related concepts:
| Term | Definition | Example in Inference |
|---|---|---|
| SLI (Service Level Indicator) | The actual metric being measured. | p95 latency of inference API calls |
| SLO (Service Level Objective) | The internal engineering target for the SLI. | p95 latency should be < 200 ms |
| SLA (Service Level Agreement) | The external customer-facing contract or promise, often with penalties. | 99.9% of inference API calls under 200 ms per month |
Key point: SLIs are metrics, SLOs are goals, and SLAs are binding commitments. Inference systems are usually engineered to meet SLOs so that SLAs are never breached.
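A minimal sketch, assuming per-request latency samples and the illustrative 200 ms / 99.9% figures from the table above, of how the three terms map onto code:

```python
# SLI = measured metric, SLO = internal target, SLA = external commitment.
# Latency samples are synthetic; thresholds are the illustrative ones above.
import random

def p95(samples: list[float]) -> float:
    """SLI: measured 95th-percentile latency (nearest-rank method)."""
    ordered = sorted(samples)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]

SLO_P95_MS = 200.0        # internal engineering target for the SLI
SLA_THRESHOLD_MS = 200.0  # customer-facing latency bound
SLA_FRACTION = 0.999      # 99.9% of calls must stay under the bound

if __name__ == "__main__":
    random.seed(0)
    latencies = [random.gauss(100, 20) for _ in range(10_000)]  # fake samples

    sli = p95(latencies)
    slo_met = sli < SLO_P95_MS
    sla_ratio = sum(l < SLA_THRESHOLD_MS for l in latencies) / len(latencies)
    sla_met = sla_ratio >= SLA_FRACTION

    print(f"SLI  p95 = {sli:.1f} ms")
    print(f"SLO  met = {slo_met} (target < {SLO_P95_MS} ms)")
    print(f"SLA  {sla_ratio:.2%} under {SLA_THRESHOLD_MS} ms -> met = {sla_met}")
```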
Reference Architecture
- Ingress/API: Gateways, auth, rate limits, schema validation.
- Router/Orchestrator: Policy-based model selection (cost, latency, accuracy), A/B and canary.
- Feature/Context Layer: Vector DB + RAG, feature store, session/KV cache, retrieval pipelines.
- Model Serving: GPU/CPU/ASIC backends; tensor/LLM serving with tensor, pipeline, and sequence parallelism (TP/PP/SP) via runtimes such as vLLM and TensorRT-LLM.
- Post-processing: Re-ranking, safety filters, tool-use, function calling.
- Observability: Latency/throughput, token/utilization, quality evals, drift detection.
- Cost Control: Token budgets, max context, batching, quantization/distillation, autoscaling.
- Resilience: Multi-region, circuit breakers, shadow traffic, graceful degradation (fallback models).
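The sketch below strings these layers into one request path. Every component is a stub (hypothetical helpers, not real gateway, vector-DB, or serving APIs); it only shows the order of operations and where observability hooks in:

```python
# Simplified request flow: ingress -> router -> retrieval -> serving ->
# post-processing, with a latency metric emitted at the end. All stubs.
import time

def ingress(request: dict) -> dict:
    """Gateway: auth + schema validation (stubbed)."""
    assert "user" in request and "prompt" in request, "schema validation failed"
    return request

def route(request: dict) -> str:
    """Router: crude policy-based model selection on prompt length."""
    return "small-model" if len(request["prompt"]) < 200 else "large-model"

def retrieve_context(prompt: str) -> list[str]:
    """Feature/context layer: stand-in for a vector-DB / RAG lookup."""
    return [f"doc snippet relevant to: {prompt[:30]}"]

def serve(model: str, prompt: str, context: list[str]) -> str:
    """Model serving backend (stubbed forward pass)."""
    return f"[{model}] answer using {len(context)} retrieved snippet(s)"

def post_process(text: str) -> str:
    """Post-processing: a toy safety filter."""
    return text.replace("password", "[redacted]")

def handle(request: dict) -> str:
    start = time.perf_counter()
    request = ingress(request)
    model = route(request)
    context = retrieve_context(request["prompt"])
    answer = post_process(serve(model, request["prompt"], context))
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"observability: model={model} latency_ms={latency_ms:.2f}")  # metrics hook
    return answer

if __name__ == "__main__":
    print(handle({"user": "u1", "prompt": "Summarize the quarterly report."}))
```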
Patterns & Design Choices
- Model selection: Small, fast default → escalate to larger models on demand (“cascade”; sketched after this list).
- Compression: Quantization (FP8/INT8/INT4), pruned adapters, distillation for throughput.
- Context mgmt: RAG with domain indexes; cache KV states for long dialogs.
- Batching vs latency: Dynamic batching for throughput; micro-batching under latency SLAs.
- Speculative decoding: Draft with a small model, then verify with the large model to cut latency.
- Caching: Prompt/result caching; embedding cache for frequent queries.
- Placement: Central (hyperscale/colo) for scale; edge/micro for deterministic low latency.
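A minimal sketch of the cascade pattern, assuming a cheap model that reports a confidence score and an expensive model used only on escalation; both models and the threshold are placeholders:

```python
# Cascade: serve most traffic from the cheap model, escalate the hard minority.
CONFIDENCE_THRESHOLD = 0.8   # escalate below this (tunable, illustrative)

def small_model(prompt: str) -> tuple[str, float]:
    """Fast, cheap model returning (answer, self-reported confidence)."""
    confidence = 0.9 if len(prompt) < 80 else 0.5   # toy heuristic
    return f"small-model answer to: {prompt}", confidence

def large_model(prompt: str) -> str:
    """Slower, more accurate model used only on escalation."""
    return f"large-model answer to: {prompt}"

def cascade(prompt: str) -> str:
    answer, confidence = small_model(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer                      # cheap path
    return large_model(prompt)             # escalation path

if __name__ == "__main__":
    print(cascade("short question"))
    print(cascade("a much longer and more ambiguous question " * 5))
```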
Bill of Materials (BOM)
| Domain | Examples | Role |
|---|---|---|
| Accelerators | NVIDIA L40S/H100/H200, AMD MI300, Intel Gaudi, Edge TPUs/NPUs | Token and tensor throughput, batch efficiency |
| Serving Runtimes | vLLM, TensorRT-LLM, Triton Inference Server, ONNX Runtime | Kernel fusion, paged KV cache, scheduling |
| Routers/Orchestrators | Custom policy engines, gateway LBs, canary controllers | Model selection, A/B, fallback, cost governance |
| Retrieval | Vector DBs (FAISS, Milvus, pgvector), feature stores | Context/RAG augmentation, personalization |
| Observability | Token meters, tracing (OpenTelemetry), eval pipelines | SLOs, drift/guardrail monitoring, cost tracking |
| Networking | RoCE/IB for GPU pools; high-fanout Ethernet for API edges | Low tail latency, efficient batching |
| Cooling | Rear-door HX, D2C liquid for dense inference racks | Thermal stability at high utilization |
Where Inference Runs (Facility Fit)
| Workload Mode | Best-Fit Types | Also Works | Notes |
|---|---|---|---|
| Interactive API | Hyperscale, Colo | AI Factory (shared), Enterprise | Global anycast, multi-region, cost-sensitive |
| Conversational/Agents | Hyperscale | Colo, Enterprise | Streaming + KV cache, safety/filter stacks |
| Edge Realtime | Edge/Micro | Metro Colo | Deterministic latency, small models, NPU/ASIC |
| Batch Scoring | Hyperscale, Enterprise | HPC (when co-located) | Throughput-first, spot capacity, CPU-heavy OK |
Training Cluster <--> Device Communication
| Direction | Flow | Mechanism | Purpose |
|---|---|---|---|
| Upstream | Sensor data, telemetry, failure cases | 5G/LTE uplink, Wi-Fi offload, batch uploads | Collect edge cases to improve central training datasets |
| Downstream | Model updates, weights, safety patches | OTA (over-the-air) updates via secure channels | Push retrained/improved models to fleets |
| Bi-Directional (Realtime) | Optional fleet coordination, map updates | Low-latency V2X, edge cache sync | Enhance situational awareness without compromising autonomy |
Challenges
- Bandwidth: Raw sensor/video data is too large for continuous uplink; selective logging and compression required.
- Latency: Safety functions must run locally; remote inference is not viable for control loops.
- Security: OTA updates require strong cryptographic signing; telemetry uplinks must protect PII/driver data.
- Versioning: Fleets may run mixed model versions; orchestration needed to manage rollouts safely.
- Feedback Loop: Fleet data → training clusters → OTA deployment is the backbone of autonomy improvement (the OTA verification step is sketched below).
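A simplified sketch of the downstream verification step: check a model artifact's integrity before installing it on a device. For brevity this uses an HMAC with a shared key from the Python standard library; real fleets would use asymmetric signatures (e.g., Ed25519) anchored in a hardware root of trust:

```python
# Simplified OTA integrity check; HMAC with a shared key is a stand-in for
# proper asymmetric code signing.
import hashlib
import hmac

FLEET_KEY = b"shared-secret-for-illustration-only"

def sign_artifact(weights: bytes) -> str:
    """Release side: produce an integrity tag for the artifact."""
    return hmac.new(FLEET_KEY, weights, hashlib.sha256).hexdigest()

def verify_and_install(weights: bytes, tag: str) -> bool:
    """Device side: install only if the tag checks out."""
    expected = hmac.new(FLEET_KEY, weights, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        print("update rejected: signature mismatch")
        return False
    print(f"installing model update ({len(weights)} bytes)")
    return True

if __name__ == "__main__":
    weights = b"\x00" * 1024                    # stand-in for model weights
    tag = sign_artifact(weights)
    verify_and_install(weights, tag)            # accepted
    verify_and_install(weights + b"x", tag)     # tampered -> rejected
```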
Cost & Efficiency Levers
- Model right-sizing: Distilled/quantized models keep most of the accuracy at 20–70% of the cost.
- Context control: Prompt compression, retrieval filters, max tokens, caching.
- Autoscaling: Predictive scale-up; protect against cold-start with warm pools.
- Batching & TPUs/ASICs: Match hardware to traffic patterns and model kernels.
- Carbon-aware routing: Shift non-urgent inference to cleaner regions/hours.
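Two of these levers, prompt/result caching and context caps, fit in a few lines. The whitespace "tokenizer", the limits, and the generation call below are all stand-ins:

```python
# Sketch of prompt/result caching plus a hard context cap.
from functools import lru_cache

MAX_CONTEXT_TOKENS = 2048      # illustrative context cap

def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: whitespace split."""
    return len(text.split())

def expensive_generate(prompt: str) -> str:
    """Stand-in for a real accelerator-backed generation call."""
    return f"generated answer for {count_tokens(prompt)} prompt tokens"

@lru_cache(maxsize=10_000)
def cached_generate(prompt: str) -> str:
    """Identical prompts hit the cache instead of the accelerator."""
    return expensive_generate(prompt)

def handle(prompt: str) -> str:
    if count_tokens(prompt) > MAX_CONTEXT_TOKENS:
        # Context control: truncate (or compress/summarize) oversized prompts.
        prompt = " ".join(prompt.split()[:MAX_CONTEXT_TOKENS])
    return cached_generate(prompt)

if __name__ == "__main__":
    print(handle("What is the refund policy?"))
    print(handle("What is the refund policy?"))   # second call served from cache
    print(cached_generate.cache_info())           # hits=1, misses=1
```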
Quality, Safety, & Evaluation
- Online evals: A/B, interleaving, reward models, human feedback loops.
- Guardrails: Content filters, jailbreak detection, PII/PHI redaction, policy prompts.
- Drift & regressions: Data shift alarms, sampling, golden sets, shadow deployments.
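A minimal golden-set regression check, as one concrete form of the drift/regression monitoring above. The golden set, the model stub, and the accuracy floor are illustrative:

```python
# Re-run a fixed prompt set with known-good answers; alarm if accuracy drops.
GOLDEN_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]
ACCURACY_FLOOR = 0.9   # alarm below this

def model(prompt: str) -> str:
    """Stand-in for the deployed model."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unsure")

def golden_eval() -> float:
    correct = sum(model(p) == expected for p, expected in GOLDEN_SET)
    return correct / len(GOLDEN_SET)

if __name__ == "__main__":
    accuracy = golden_eval()
    print(f"golden-set accuracy: {accuracy:.2f}")
    if accuracy < ACCURACY_FLOOR:
        print("ALERT: possible regression or drift -> hold the rollout")
```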
Resilience & Operations
- Multi-region active/active: Anycast, state-light design, replicated indexes/caches.
- Fallback ladders: Primary → distilled backup → template-based or rule-based last resort (sketched after this list).
- Circuit breakers: Shed load gracefully, token/time caps, backpressure.
- SLO policy: Per-route budgets for latency, cost, and quality; SRE error budgets apply.
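A sketch of a fallback ladder fronted by a crude circuit breaker: after repeated failures of the primary model, calls trip straight to the distilled backup, with a template answer as the last resort. The failure threshold and all three backends are stand-ins:

```python
# Fallback ladder with a simple failure-count circuit breaker.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures   # open = stop calling primary

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

breaker = CircuitBreaker()

def primary_model(prompt: str) -> str:
    raise TimeoutError("simulated overload")        # force the fallback path

def distilled_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"

def template_fallback(prompt: str) -> str:
    return "We can't answer right now; please try again shortly."

def handle(prompt: str) -> str:
    if not breaker.open:
        try:
            answer = primary_model(prompt)
            breaker.record(True)
            return answer
        except Exception:
            breaker.record(False)
    try:
        return distilled_backup(prompt)
    except Exception:
        return template_fallback(prompt)

if __name__ == "__main__":
    for _ in range(5):
        print(handle("status?"), "| breaker open:", breaker.open)
```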
Vendors & Ecosystem (Illustrative)
| Category | Examples | Notes |
|---|---|---|
| Model Serving | vLLM, TensorRT-LLM, Triton, ONNX Runtime | Throughput & KV-cache efficiency |
| Vector / Retrieval | FAISS, Milvus, Weaviate, pgvector, Elastic | RAG building blocks |
| Observability | OpenTelemetry, Prometheus, tracing + eval suites | Latency, tokens, quality metrics |
| Hardware | NVIDIA, AMD, Intel, edge NPUs/ASICs | Right-size per tier/region |
Sustainability
- Energy profile: Higher QPS → sustained high utilization; optimize PUE and liquid cooling.
- CFE alignment: Route batch/elastic inference to cleaner grids/hours; disclose Scope 2 per workload.
- Thermal design: D2C/immersion for dense inference fleets; rear-door HX for mixed racks.
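A sketch of carbon-aware placement for deferrable batch inference: choose the cleanest region whose queue delay still meets the deadline. Regions, intensities (gCO2/kWh), and queue times are made up for illustration:

```python
# Pick the lowest-carbon region that still meets a batch job's deadline.
REGIONS = {
    "us-east":  {"carbon_g_per_kwh": 420, "queue_hours": 1},
    "eu-north": {"carbon_g_per_kwh": 45,  "queue_hours": 6},
    "us-west":  {"carbon_g_per_kwh": 210, "queue_hours": 2},
}

def place_batch_job(deadline_hours: float) -> str:
    """Cleanest region whose queue delay still fits the deadline."""
    feasible = {
        name: info for name, info in REGIONS.items()
        if info["queue_hours"] <= deadline_hours
    }
    return min(feasible, key=lambda name: feasible[name]["carbon_g_per_kwh"])

if __name__ == "__main__":
    print(place_batch_job(deadline_hours=12))   # -> eu-north (cleanest that fits)
    print(place_batch_job(deadline_hours=1.5))  # -> us-east (only one that fits)
```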
Future Outlook
- Hybrid Inference: Workloads dynamically split between cloud, edge, and devices.
- Federated Learning & Inference: Models improving locally without centralizing data.
- ASIC Acceleration: Rise of device- and edge-optimized chips for inference.
- Sovereign AI: Nations and enterprises requiring inference within local boundaries.
- Energy Optimization: Efficiency-first design across all inference contexts.
FAQ
- How is inference different from training? Inference is SLO-driven and latency-sensitive; training is throughput- and time-to-train–driven.
- Do I need GPUs for all inference? No—CPUs can serve small models/batch; GPUs/ASICs shine for LLMs, vision, and high QPS.
- Central vs edge? Central maximizes scale/utilization; edge meets strict latency and data-locality needs.
- How to cut cost without killing quality? Distill/quantize, cache, cascade models, and use RAG to shrink prompts.
- How to keep it safe? Layer retrieval filters, prompts/guardrails, and online evals with human oversight for sensitive domains.