Data Center Inference Overview


Inference is the real-time or near–real-time execution of trained models to produce predictions, generations, or decisions. Unlike training (long-lived, batch-optimized), inference is user-facing and SLO-driven, with tight latency, cost, and reliability constraints across web, mobile, enterprise, and edge applications.


Scope & Definitions

  • Online inference: Single-request, low-latency (interactive apps, APIs).
  • Streaming inference: Token-by-token or event streams (chat, transcription, RTP/RTMP).
  • Batch/offline inference: High-throughput, relaxed latency (ETL scoring, nightly refresh).
  • Near-edge/edge inference: Close to devices/users for sub-20 ms needs (AR/VR, robotics).
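
A minimal sketch of how these modes differ at the call site; `generate()` is a hypothetical stand-in for any model backend:

```python
from typing import Iterator, List

def generate(prompt: str) -> str:
    """Stand-in for a real model call; here it just echoes a canned reply."""
    return f"answer to: {prompt}"

def online_infer(prompt: str) -> str:
    # One request in, one response out; the latency budget covers the whole call.
    return generate(prompt)

def streaming_infer(prompt: str) -> Iterator[str]:
    # Yield tokens as they become available; the first token defines TTFB.
    for token in generate(prompt).split():
        yield token

def batch_infer(prompts: List[str]) -> List[str]:
    # Throughput-first: score a whole list with no per-request deadline.
    return [generate(p) for p in prompts]

if __name__ == "__main__":
    print(online_infer("capital of France?"))
    print(list(streaming_infer("capital of France?")))
    print(batch_infer(["a", "b", "c"]))
```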

Inference Deployments

AI inference workloads can be deployed across multiple environments — from hyperscale data centers to edge nodes and end-user devices. Each deployment model balances latency, cost, compliance, and scalability differently. This page provides a high-level comparison of inference across hyperscale, on-prem, edge, and device contexts.

Deployment Context | Focus | Latency Target | Scale | Best-Fit Use Cases | Examples
Hyperscale Data Centers | Global inference APIs, elastic scale | 100–200 ms | Thousands of racks, millions of req/s | LLM APIs, SaaS copilots, search & ads | OpenAI API, Anthropic Claude, Google Vertex AI, Meta
Local / On-Prem DCs | Private inference, compliance-driven | 10–100 ms | Enterprise-scale clusters (100s–1000s GPUs) | Healthcare, finance, gov, legal | JP Morgan Athena AI, Epic EHR AI Assist, DoD AI pilots
Edge Data Centers | Low-latency inference close to users | <20 ms | 50 kW–multi-MW metro nodes | AR/VR, robotics, V2X, smart cities | AWS Wavelength, Cloudflare Workers AI, NVIDIA Metropolis
On Devices | Ultra-low latency, offline privacy | <10 ms | Billions of devices worldwide | Smartphones, robotaxis, humanoids, IoT | Apple Neural Engine, Tesla FSD, NVIDIA Jetson, Qualcomm NPUs

Key Takeaways

  • Hyperscale: Best for elastic workloads and global APIs.
  • On-Prem: Best for compliance, data sovereignty, and integration with enterprise IT.
  • Edge: Best for latency-critical use cases (mobility, AR/VR, industrial IoT).
  • Devices: Best for offline, privacy-preserving, and ultra-fast inference.
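
The sketch below is a rough illustration of these takeaways: it maps a few workload requirements (latency budget, data residency, offline operation) to a deployment tier. The thresholds are assumptions drawn from the comparison above, not fixed rules.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    p95_latency_ms: float        # latency budget the workload must meet
    data_must_stay_onsite: bool  # compliance / sovereignty requirement
    must_work_offline: bool      # e.g. vehicles, remote sites

def pick_deployment(w: Workload) -> str:
    """Very coarse tier selection mirroring the comparison table above."""
    if w.must_work_offline:
        return "on-device"
    if w.data_must_stay_onsite:
        return "on-prem"
    if w.p95_latency_ms < 20:
        return "edge"
    return "hyperscale"

if __name__ == "__main__":
    print(pick_deployment(Workload(150, False, False)))  # hyperscale
    print(pick_deployment(Workload(15, False, False)))   # edge
    print(pick_deployment(Workload(50, True, False)))    # on-prem
```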

Latency Tiers & SLO Heuristics

Tier | Target p95 | Typical Use | Infra Implications
Ultra-low | < 20 ms | AR/VR, industrial control, trading | Edge GPUs/ASICs, on-die/near-memory, lightweight models
Interactive | 20–200 ms | Search, recsys, autocomplete, fraud checks | In-memory features/vector DB, CPU+GPU mix, caching
Conversational | 200–1000 ms TTFB; streaming tokens | Chat, copilots, agents | High-throughput GPUs, KV-cache, quantized models, speculative decoding
Batch | Seconds–hours | Backfills, risk scoring, embeddings refresh | Throughput-first schedulers, cheaper accelerators/CPU fleets

SLA, SLO, and SLI Explained

Inference workloads are often governed by strict performance and reliability targets. These are described using three related concepts:

Term | Definition | Example in Inference
SLI (Service Level Indicator) | The actual metric being measured. | p95 latency of inference API calls
SLO (Service Level Objective) | The internal engineering target for the SLI. | p95 latency should be < 200 ms
SLA (Service Level Agreement) | The external, customer-facing contract or promise, often with penalties. | 99.9% of inference API calls under 200 ms per month

Key point: SLIs are metrics, SLOs are goals, and SLAs are binding commitments. Inference systems are usually engineered to meet SLOs so that SLAs are never breached.
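
A minimal sketch of the distinction in code, assuming a 200 ms p95 SLO and a hypothetical SLA that 99.9% of calls in a window stay under 200 ms:

```python
import random
from statistics import quantiles

SLO_P95_MS = 200.0        # internal engineering target for the SLI
SLA_THRESHOLD_MS = 200.0  # contractual per-call bound (assumed)
SLA_FRACTION = 0.999      # contractual fraction of calls that must meet it (assumed)

def p95(latencies_ms: list[float]) -> float:
    """SLI: 95th-percentile latency of observed inference calls."""
    return quantiles(latencies_ms, n=100)[94]

def slo_met(latencies_ms: list[float]) -> bool:
    return p95(latencies_ms) < SLO_P95_MS

def sla_met(latencies_ms: list[float]) -> bool:
    ok = sum(1 for x in latencies_ms if x < SLA_THRESHOLD_MS)
    return ok / len(latencies_ms) >= SLA_FRACTION

if __name__ == "__main__":
    sample = [random.lognormvariate(4.5, 0.4) for _ in range(10_000)]  # synthetic latencies
    print(f"SLI p95 = {p95(sample):.1f} ms, SLO met = {slo_met(sample)}, SLA met = {sla_met(sample)}")
```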


Reference Architecture

  • Ingress/API: Gateways, auth, rate limits, schema validation.
  • Router/Orchestrator: Policy-based model selection (cost, latency, accuracy), A/B and canary.
  • Feature/Context Layer: Vector DB + RAG, feature store, session/KV cache, retrieval pipelines.
  • Model Serving: GPU/CPU/ASIC backends; tensor and LLM serving with tensor/pipeline/sequence parallelism (TP, PP, SP) on engines such as vLLM and TensorRT-LLM.
  • Post-processing: Re-ranking, safety filters, tool-use, function calling.
  • Observability: Latency/throughput, token/utilization, quality evals, drift detection.
  • Cost Control: Token budgets, max context, batching, quantization/distillation, autoscaling.
  • Resilience: Multi-region, circuit breakers, shadow traffic, graceful degradation (fallback models).
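
The sketch below wires these layers together for a single request; every function is a hypothetical stub standing in for a real gateway, router, retrieval layer, serving backend, or safety filter.

```python
def ingress(request: dict) -> dict:
    # Gateway concerns: auth, rate limits, schema validation (all stubbed here).
    assert "prompt" in request and "user" in request
    return request

def route(request: dict) -> str:
    # Policy-based model selection: short prompts go to a small model.
    return "small-model" if len(request["prompt"]) < 200 else "large-model"

def retrieve_context(prompt: str) -> str:
    # Feature/context layer: vector search / feature store lookup (stubbed).
    return "retrieved docs relevant to: " + prompt[:40]

def serve(model: str, prompt: str, context: str) -> str:
    # Model serving backend (vLLM, TensorRT-LLM, etc. in a real system).
    return f"[{model}] response using ({context})"

def postprocess(text: str) -> str:
    # Safety filters, re-ranking, and tool/function calls would run here.
    return text.strip()

def handle(request: dict) -> str:
    req = ingress(request)
    model = route(req)
    ctx = retrieve_context(req["prompt"])
    return postprocess(serve(model, req["prompt"], ctx))

if __name__ == "__main__":
    print(handle({"user": "u1", "prompt": "Summarize our refund policy."}))
```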

Patterns & Design Choices

  • Model selection: Small, fast default → escalate to larger models on demand (“cascade”; see the sketch after this list).
  • Compression: Quantization (FP8/INT8/4), pruned adapters, distillation for throughput.
  • Context mgmt: RAG with domain indexes; cache KV states for long dialogs.
  • Batching vs latency: Dynamic batching for throughput; micro-batching under latency SLAs.
  • Speculative decoding: Draft small model + verify by large model to cut latency.
  • Caching: Prompt/result caching; embedding cache for frequent queries.
  • Placement: Central (hyperscale/colo) for scale; edge/micro for deterministic low latency.
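
A minimal sketch of the cascade pattern referenced above: serve with a small, fast model by default and escalate to a larger model only when a (hypothetical) confidence score falls below a threshold.

```python
from typing import Callable, Tuple

# Each backend returns (answer, confidence in [0, 1]); both are hypothetical stubs.
def small_model(prompt: str) -> Tuple[str, float]:
    return f"small-model answer to {prompt!r}", 0.62

def large_model(prompt: str) -> Tuple[str, float]:
    return f"large-model answer to {prompt!r}", 0.95

def cascade(prompt: str,
            draft: Callable[[str], Tuple[str, float]] = small_model,
            expert: Callable[[str], Tuple[str, float]] = large_model,
            threshold: float = 0.8) -> str:
    answer, confidence = draft(prompt)
    if confidence >= threshold:
        return answer              # cheap path: small model is confident enough
    answer, _ = expert(prompt)     # escalate: pay for the larger model only when needed
    return answer

if __name__ == "__main__":
    print(cascade("Explain KV-cache paging in one sentence."))
```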

Bill of Materials (BOM)

Domain | Examples | Role
Accelerators | NVIDIA L40S/H100/H200, AMD MI300, Intel Gaudi, edge TPUs/NPUs | Token and tensor throughput, batch efficiency
Serving Runtimes | vLLM, TensorRT-LLM, Triton Inference Server, ONNX Runtime | Kernel fusion, paged KV cache, scheduling
Routers/Orchestrators | Custom policy engines, gateway LBs, canary controllers | Model selection, A/B, fallback, cost governance
Retrieval | Vector DBs (FAISS, Milvus, pgvector), feature stores | Context/RAG augmentation, personalization
Observability | Token meters, tracing (OpenTelemetry), eval pipelines | SLOs, drift/guardrail monitoring, cost tracking
Networking | RoCE/IB for GPU pools; high-fanout Ethernet for API edges | Low tail latency, efficient batching
Cooling | Rear-door HX, D2C liquid for dense inference racks | Thermal stability at high utilization

Where Inference Runs (Facility Fit)

Workload Mode | Best-Fit Types | Also Works | Notes
Interactive API | Hyperscale, Colo | AI Factory (shared), Enterprise | Global anycast, multi-region, cost-sensitive
Conversational/Agents | Hyperscale | Colo, Enterprise | Streaming + KV cache, safety/filter stacks
Edge Realtime | Edge/Micro | Metro Colo | Deterministic latency, small models, NPU/ASIC
Batch Scoring | Hyperscale, Enterprise | HPC (when co-located) | Throughput-first, spot capacity, CPU-heavy OK

Training Cluster ↔ Device Communication

Direction | Flow | Mechanism | Purpose
Upstream | Sensor data, telemetry, failure cases | 5G/LTE uplink, Wi-Fi offload, batch uploads | Collect edge cases to improve central training datasets
Downstream | Model updates, weights, safety patches | OTA (over-the-air) updates via secure channels | Push retrained/improved models to fleets
Bi-Directional (Realtime) | Optional fleet coordination, map updates | Low-latency V2X, edge cache sync | Enhance situational awareness without compromising autonomy

Challenges

  • Bandwidth: Raw sensor/video data is too large for continuous uplink; selective logging and compression required.
  • Latency: Safety functions must run locally; remote inference is not viable for control loops.
  • Security: OTA updates require strong cryptographic signing (see the sketch after this list); telemetry uplinks must protect PII/driver data.
  • Versioning: Fleets may run mixed model versions; orchestration needed to manage rollouts safely.
  • Feedback Loop: Fleet data → training clusters → OTA deployment is the backbone of autonomy improvement.
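
To make the signing requirement concrete, here is a small sketch using Ed25519 from the `cryptography` package (one possible choice, not a prescribed mechanism): devices hold the publisher's public key and refuse model artifacts whose signature does not verify.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Publisher side (training cluster): sign the model artifact before OTA rollout.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()          # baked into devices at provisioning
artifact = b"...serialized model weights..."   # placeholder payload
signature = private_key.sign(artifact)

# Device side: verify before loading the new model.
def accept_update(payload: bytes, sig: bytes) -> bool:
    try:
        public_key.verify(sig, payload)        # raises InvalidSignature if tampered
        return True
    except InvalidSignature:
        return False

print(accept_update(artifact, signature))              # True
print(accept_update(artifact + b"tamper", signature))  # False
```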

Cost & Efficiency Levers

  • Model right-sizing: Distilled/quantized models that retain most of the accuracy at 20–70% of the cost.
  • Context control: Prompt compression, retrieval filters, max tokens, caching.
  • Autoscaling: Predictive scale-up; protect against cold-start with warm pools.
  • Batching & TPUs/ASICs: Match hardware to traffic patterns and model kernels.
  • Carbon-aware routing: Shift non-urgent inference to cleaner regions/hours.
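
As a sketch of the carbon-aware lever above: keep latency-sensitive traffic in place and route batch/elastic inference to whichever region currently reports the lowest carbon intensity. The region names and gCO2/kWh figures below are illustrative assumptions.

```python
# Hypothetical gCO2/kWh readings per candidate region (would come from a grid API).
CARBON_INTENSITY = {"us-east": 420.0, "us-west": 210.0, "eu-north": 45.0}

def pick_region(latency_sensitive: bool, home_region: str = "us-east") -> str:
    """Latency-sensitive traffic stays put; elastic/batch traffic chases cleaner grids."""
    if latency_sensitive:
        return home_region
    return min(CARBON_INTENSITY, key=CARBON_INTENSITY.get)

print(pick_region(latency_sensitive=True))   # us-east
print(pick_region(latency_sensitive=False))  # eu-north
```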

Quality, Safety, & Evaluation

  • Online evals: A/B, interleaving, reward models, human feedback loops.
  • Guardrails: Content filters, jailbreak detection, PII/PHI redaction, policy prompts (a toy redaction sketch follows this list).
  • Drift & regressions: Data shift alarms, sampling, golden sets, shadow deployments.
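
A toy sketch of one guardrail from this list: regex-based redaction of obvious PII (emails, US-style SSNs, phone numbers) before text is logged or returned. Production systems layer NER models, allow-lists, and policy checks on top of pattern matching.

```python
import re

# Deliberately simple patterns; real redaction goes far beyond regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 555-867-5309; SSN 123-45-6789."))
# -> "Reach me at [EMAIL] or [PHONE]; SSN [SSN]."
```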

Resilience & Operations

  • Multi-region active/active: Anycast, state-light design, replicated indexes/caches.
  • Fallback ladders: Primary → distilled backup → template- or rule-based last resort (see the sketch after this list).
  • Circuit breakers: Shed load gracefully, token/time caps, backpressure.
  • SLO policy: Per-route budgets for latency, cost, and quality; SRE error budgets apply.
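
A compact sketch of a fallback ladder: try the primary model, fall back to a distilled backup on error or timeout, and finally return a templated response so the caller always gets an answer. The backends here are hypothetical stubs, with the primary deliberately failing to show the degradation path.

```python
from typing import Callable, List

def primary_model(prompt: str) -> str:
    raise TimeoutError("primary backend overloaded")  # simulate an outage

def distilled_backup(prompt: str) -> str:
    return f"distilled answer to {prompt!r}"

def template_fallback(prompt: str) -> str:
    return "We can't generate a full answer right now; please retry shortly."

# Ordered ladder: primary -> distilled backup -> templated last resort.
FALLBACK_LADDER: List[Callable[[str], str]] = [primary_model, distilled_backup, template_fallback]

def answer(prompt: str) -> str:
    for step in FALLBACK_LADDER:
        try:
            return step(prompt)
        except Exception:
            continue  # degrade gracefully to the next rung
    raise RuntimeError("all fallbacks exhausted")

print(answer("Summarize yesterday's incident report."))
```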

Vendors & Ecosystem (Illustrative)

Category | Examples | Notes
Model Serving | vLLM, TensorRT-LLM, Triton, ONNX Runtime | Throughput & KV-cache efficiency
Vector / Retrieval | FAISS, Milvus, Weaviate, pgvector, Elastic | RAG building blocks
Observability | OpenTelemetry, Prometheus, tracing + eval suites | Latency, tokens, quality metrics
Hardware | NVIDIA, AMD, Intel, edge NPUs/ASICs | Right-size per tier/region

Sustainability

  • Energy profile: Higher QPS → sustained high utilization; optimize PUE and liquid cooling.
  • CFE alignment: Route batch/elastic inference to cleaner grids/hours; disclose Scope 2 per workload.
  • Thermal design: D2C/immersion for dense inference fleets; rear-door HX for mixed racks.

Future Outlook

  • Hybrid Inference: Workloads dynamically split between cloud, edge, and devices.
  • Federated Learning & Inference: Models improving locally without centralizing data.
  • ASIC Acceleration: Rise of device- and edge-optimized chips for inference.
  • Sovereign AI: Nations and enterprises requiring inference within local boundaries.
  • Energy Optimization: Efficiency-first design across all inference contexts.

FAQ

  • How is inference different from training? Inference is SLO-driven and latency-sensitive; training is throughput- and time-to-train–driven.
  • Do I need GPUs for all inference? No—CPUs can serve small models/batch; GPUs/ASICs shine for LLMs, vision, and high QPS.
  • Central vs edge? Central maximizes scale/utilization; edge meets strict latency and data-locality needs.
  • How to cut cost without killing quality? Distill/quantize, cache, cascade models, and use RAG to shrink prompts.
  • How to keep it safe? Layer retrieval filters, prompts/guardrails, and online evals with human oversight for sensitive domains.