AI Inference Workloads


AI inference is the process of executing trained models to produce outputs — predictions, recommendations, classifications, or generations. Inference is latency-sensitive, cost-sensitive, and reliability-critical. Unlike training, which is centralized and batch-driven, inference is distributed across hyperscale data centers, colocation hubs, edge sites, and even embedded devices in vehicles and humanoid robots.


Overview

  • Purpose: Serve trained models to end users and applications in real-time or near real-time.
  • Scale: Billions of daily requests across search, recommendation, chat, and content platforms.
  • Characteristics: Tight p95/p99 latency targets, high QPS throughput, cascading models, caching layers, and cost per token/query (a tail-latency sketch follows this list).
  • Comparison: Inference runs continuously at global scale, while training runs episodically in massive clusters.
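
The p95/p99 targets mentioned above are simply tail percentiles of observed request latency checked against a budget. The sketch below is a minimal, self-contained illustration of that check; the 200 ms and 400 ms budgets and the synthetic sample data are assumptions for demonstration, not figures from this article.

```python
from typing import Sequence

def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile (nearest-rank method) of latency samples."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def check_latency_slo(latencies_ms: Sequence[float],
                      p95_budget_ms: float = 200.0,   # assumed interactive-API budget
                      p99_budget_ms: float = 400.0) -> dict:
    """Compare observed tail latency against hypothetical p95/p99 budgets."""
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    return {
        "p95_ms": p95,
        "p99_ms": p99,
        "p95_ok": p95 <= p95_budget_ms,
        "p99_ok": p99 <= p99_budget_ms,
    }

if __name__ == "__main__":
    # Synthetic latencies: mostly fast requests with a slow tail.
    samples = [35.0] * 900 + [120.0] * 80 + [450.0] * 20
    print(check_latency_slo(samples))
```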

Inference Modes

  • Interactive APIs: Search, autocomplete, fraud checks — 20–200 ms latency budgets.
  • Conversational / Generative: Chat, copilots, assistants — streaming token generation, sub-second time to first token (see the streaming sketch after this list).
  • Batch scoring: Embedding generation, nightly ETL pipelines, recommendation refreshes — relaxed latency, throughput first.
  • Edge inference: Autonomous vehicles, humanoid robots, AR/VR devices — sub-20 ms deterministic cycles, on-device accelerators.
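
Conversational serving is judged partly on how quickly the first token arrives, then on steady token streaming. The sketch below illustrates that pattern with a hypothetical `generate_tokens` stub standing in for a real model backend; only the streaming and timing logic is the point.

```python
import time
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model backend."""
    for word in ["Inference", "is", "served", "token", "by", "token."]:
        time.sleep(0.05)  # simulate per-token decode latency
        yield word

def stream_response(prompt: str) -> None:
    """Stream tokens to the caller while measuring time to first token (TTFT)."""
    start = time.perf_counter()
    first_token_at = None
    for token in generate_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
            print(f"TTFT: {(first_token_at - start) * 1000:.1f} ms")
        print(token, end=" ", flush=True)
    total = time.perf_counter() - start
    print(f"\ntotal generation time: {total * 1000:.1f} ms")

if __name__ == "__main__":
    stream_response("Explain AI inference in one sentence.")
```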

Bill of Materials (BOM)

Components by domain, with representative examples and the role each plays:

  • Accelerators: NVIDIA L40S, H100/H200, AMD MI300A/X, Intel Gaudi, Edge TPUs, NPUs. Token-by-token or batch inference at high throughput.
  • Serving frameworks: TensorRT-LLM, vLLM, Triton, ONNX Runtime. Optimized model execution with batching and KV-cache support (see the serving sketch after this list).
  • Retrieval & context: vector DBs (Milvus, Weaviate, pgvector), Redis, Pinecone. Enable RAG, personalization, and low-latency lookups.
  • Routers & orchestrators: custom policy engines, model cascades, A/B controllers. Select model variants by cost, latency, and accuracy (a toy cascade sketch also follows).
  • Networking: RoCE/InfiniBand for GPU pools; 400G Ethernet for API edges. Ensure low tail latency and efficient batching.
  • Cooling: rear-door heat exchangers, liquid-cooled racks. Manage dense inference clusters at 40–80 kW/rack.
  • Observability: OpenTelemetry, token meters, quality-eval pipelines. Track SLOs, drift, and cost efficiency.
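
As one concrete example of the serving-framework row, vLLM exposes an offline batch-generation API with continuous batching and paged KV-cache management handled internally. The sketch below follows vLLM's documented LLM/SamplingParams usage at the time of writing; the model name is a placeholder and exact signatures may differ across versions.

```python
# Requires: pip install vllm (and a supported GPU for most models).
from vllm import LLM, SamplingParams

# Placeholder model id; swap in whatever checkpoint the deployment actually serves.
llm = LLM(model="facebook/opt-125m")

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

prompts = [
    "Summarize why inference is latency-sensitive.",
    "List two ways to cut cost per token.",
]

# vLLM batches these prompts internally and reuses KV-cache pages across decode steps.
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip())
```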
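
The routers & orchestrators row describes selecting model variants by cost, latency, and accuracy. The sketch below is a toy cascade: try a cheap model first and escalate to a larger one only when the small model's confidence is low. All model names, prices, and thresholds are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float                 # assumed cost, illustrative only
    run: Callable[[str], Tuple[str, float]]   # returns (answer, confidence)

def small_model(prompt: str) -> Tuple[str, float]:
    # Stand-in for a distilled/quantized model: fast, cheap, sometimes unsure.
    return ("short answer", 0.62)

def large_model(prompt: str) -> Tuple[str, float]:
    # Stand-in for the flagship model: slower, pricier, more accurate.
    return ("detailed answer", 0.97)

CASCADE: List[ModelTier] = [
    ModelTier("small-distilled", 0.10, small_model),
    ModelTier("large-flagship", 2.00, large_model),
]

def route(prompt: str, min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try tiers in cost order; escalate while confidence is below the threshold."""
    answer, tier_used = "", CASCADE[-1].name
    for tier in CASCADE:
        answer, confidence = tier.run(prompt)
        tier_used = tier.name
        if confidence >= min_confidence:
            break  # good enough, stop paying for bigger models
    return answer, tier_used

if __name__ == "__main__":
    print(route("Is this transaction fraudulent?"))
```

The design choice here is that escalation is driven by a per-request confidence signal rather than a static routing table, which is one common way cascades trade accuracy against cost per query.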

Facility Alignment

Where each inference mode typically runs:

  • Interactive APIs: best fit in hyperscale and colocation; also runs in enterprise DCs. Global distribution, API latency targets.
  • Conversational / agents: best fit in hyperscale; also runs in colo and enterprise DCs. Streaming token output, context caching.
  • Batch scoring: best fit in hyperscale and enterprise DCs; also runs in co-located HPC. Throughput priority, cost-optimized hardware.
  • Edge realtime: best fit in edge / micro sites; also runs in metro colo. Sub-20 ms cycles for robotics/autonomy.

Inference in Vehicles & Humanoids

Inference is increasingly embedded in autonomous vehicles and humanoid robots. These workloads are safety-critical, deterministic, and run entirely on-device, with upstream and downstream communication to training clusters:

  • Upstream: telemetry, failure cases, and sensor snapshots via 5G/LTE or Wi-Fi batch uploads, to enrich training datasets with edge cases.
  • Downstream: model updates, patches, and weights via OTA updates over secure channels, to deploy improved models to fleets (a minimal integrity-check sketch follows this list).
  • Realtime (optional): map updates and V2X data via low-latency network sync, to augment local inference without cloud dependency.
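
The downstream OTA flow depends on the fleet being able to verify what it receives before swapping models. Below is a minimal sketch of one common pattern: compare the downloaded artifact's SHA-256 digest against a manifest. The manifest format and file names are assumptions, and a real deployment would also verify the manifest's own signature (e.g., Ed25519) before trusting it.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash the downloaded model artifact in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_ota_update(artifact: Path, manifest: Path) -> bool:
    """Accept the update only if the artifact matches the manifest's digest.

    Assumed manifest shape: {"model": "...", "sha256": "<hex digest>", "version": N},
    with its signature already checked before this point.
    """
    meta = json.loads(manifest.read_text())
    return sha256_of(artifact) == meta["sha256"]

if __name__ == "__main__":
    # Tiny self-contained demo with a throwaway artifact and manifest.
    artifact = Path("demo_model.bin")
    artifact.write_bytes(b"pretend these are quantized weights")
    manifest = Path("demo_model.manifest.json")
    manifest.write_text(json.dumps({"model": artifact.name,
                                    "sha256": sha256_of(artifact),
                                    "version": 1}))
    ok = verify_ota_update(artifact, manifest)
    print("activate new model" if ok else "reject update, keep current model")
```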

Key Challenges

  • Latency: Meeting sub-100 ms targets for interactive APIs; sub-20 ms for autonomy/robotics.
  • Cost per query: Serving billions of queries and token generations economically (see the cost sketch after this list).
  • Model scaling: Large models are expensive to serve; keeping them affordable requires quantization, distillation, or cascades.
  • Observability: Monitoring SLOs, drift, and bias at production scale.
  • Security: Protecting model IP, PII, and integrity of OTA updates.
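
Cost per query is, at its core, simple arithmetic: accelerator cost per hour divided by sustained token throughput. The sketch below works that out for made-up numbers; none of the prices or throughputs are figures from this article.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            gpus_per_replica: int,
                            tokens_per_second_per_replica: float) -> float:
    """Serving cost for one million generated tokens, ignoring idle time and overhead."""
    usd_per_second = gpu_hourly_usd * gpus_per_replica / 3600.0
    return usd_per_second / tokens_per_second_per_replica * 1_000_000

if __name__ == "__main__":
    # Illustrative, assumed numbers: a $4/hr accelerator, 2 per replica,
    # sustaining 500 output tokens/s across batched requests.
    cost = cost_per_million_tokens(gpu_hourly_usd=4.0,
                                   gpus_per_replica=2,
                                   tokens_per_second_per_replica=500.0)
    print(f"~${cost:.2f} per million output tokens")
    # A 200-token answer at that rate costs roughly:
    print(f"~${cost / 1_000_000 * 200:.5f} per query")
```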

Notable Deployments

  • ChatGPT API (OpenAI / Microsoft Azure): global, billions of requests/day. Conversational inference at hyperscale.
  • YouTube Recommendations (Google): petabytes/day processed. High-throughput recsys inference.
  • Tesla FSD (Tesla): millions of cars. On-device FSD inference with OTA model updates.
  • NVIDIA Jetson inference (NVIDIA): tens of thousands of edge robots/drones. Embedded inference for robotics and automation.

Future Outlook

  • Edge-first inference: Growth in NPUs and ASICs in cars, robots, and devices.
  • Model optimization: Distillation, quantization, pruning to reduce costs.
  • Federated inference: On-device predictions enhanced by fleet or cloud collaboration.
  • Carbon-aware serving: Routing batch inference to cleaner grids and hours (a toy scheduling sketch follows this list).
  • Security-first OTA: Formal verification of model updates for safety-critical platforms.
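
Carbon-aware serving mostly applies to deferrable batch work: pick the region or time window with the lowest grid carbon intensity that still meets the job's deadline and latency constraints. The sketch below is a toy version of that selection; the region names, intensities, and latencies are invented.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ServingWindow:
    region: str
    start_hour: int              # hour of day the window opens
    grid_gco2_per_kwh: float     # assumed grid carbon intensity
    extra_latency_ms: float      # added network latency vs. the default region

def pick_window(windows: List[ServingWindow],
                deadline_hour: int,
                max_extra_latency_ms: float) -> Optional[ServingWindow]:
    """Choose the cleanest feasible window for a deferrable batch-scoring job."""
    feasible = [w for w in windows
                if w.start_hour <= deadline_hour
                and w.extra_latency_ms <= max_extra_latency_ms]
    return min(feasible, key=lambda w: w.grid_gco2_per_kwh, default=None)

if __name__ == "__main__":
    candidates = [
        ServingWindow("us-central", start_hour=2, grid_gco2_per_kwh=410, extra_latency_ms=0),
        ServingWindow("nordics", start_hour=4, grid_gco2_per_kwh=45, extra_latency_ms=60),
        ServingWindow("ap-south", start_hour=1, grid_gco2_per_kwh=650, extra_latency_ms=120),
    ]
    choice = pick_window(candidates, deadline_hour=6, max_extra_latency_ms=80)
    print(choice.region if choice else "no feasible window, run in default region")
```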

FAQ

  • How is inference different from training? Inference is continuous, latency-driven, and distributed; training is batch, throughput-driven, and centralized.
  • Where does inference run? Hyperscale clouds, colocation, edge sites, enterprise DCs, and embedded devices.
  • Do all inference workloads need GPUs? No — CPUs handle many small models; GPUs/ASICs are required for LLMs, vision, and high-QPS workloads.
  • How do fleets (cars/robots) stay up to date? Via OTA model updates from central training clusters, with telemetry feeding back upstream.
  • What’s the biggest bottleneck? Balancing latency, cost per query, and quality at global scale.