AI Inference Workloads
AI inference is the process of executing trained models to produce outputs — predictions, recommendations, classifications, or generations. Inference is latency-sensitive, cost-sensitive, and reliability-critical. Unlike training, which is centralized and batch-driven, inference is distributed across hyperscale data centers, colocation hubs, edge sites, and even embedded devices in vehicles and humanoid robots.
Overview
- Purpose: Serve trained models to end users and applications in real-time or near real-time.
- Scale: Billions of daily requests across search, recommendation, chat, and content platforms.
- Characteristics: Tight p95/p99 latency targets, high QPS throughput, model cascades, caching layers, cost per token/query (see the metrics sketch after this list).
- Comparison: Inference runs continuously at global scale, while training runs episodically in massive clusters.
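The headline serving metrics above are straightforward to compute once request timings and throughput are instrumented. The sketch below shows a nearest-rank tail percentile and a cost-per-1K-tokens calculation; all numbers, prices, and throughputs are illustrative assumptions, not measurements from any specific deployment.

```python
import random

def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile of request latencies in milliseconds."""
    ordered = sorted(samples_ms)
    rank = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def cost_per_1k_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Accelerator cost amortized over measured serving throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1000

# Hypothetical load-test sample: mostly fast requests plus a slow tail.
latencies_ms = [random.gauss(60, 10) for _ in range(950)] + \
               [random.gauss(400, 50) for _ in range(50)]
print(f"p95 = {percentile(latencies_ms, 95):.0f} ms, "
      f"p99 = {percentile(latencies_ms, 99):.0f} ms")

# Hypothetical: a $4/hr accelerator sustaining 2,500 output tokens/s.
print(f"cost per 1K tokens = ${cost_per_1k_tokens(4.0, 2500):.5f}")
```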
Inference Modes
- Interactive APIs: Search, autocomplete, fraud checks — 20–200 ms latency budgets.
- Conversational / Generative: Chat, copilots, assistants — streaming token generation, sub-second TTFB (see the streaming sketch after this list).
- Batch scoring: Embedding generation, nightly ETL pipelines, recommendation refreshes — relaxed latency, throughput first.
- Edge inference: Autonomous vehicles, humanoid robots, AR/VR devices — sub-20 ms deterministic cycles, on-device accelerators.
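For the conversational mode, time-to-first-byte (here, time to first token) is usually tracked separately from total generation time, because the first token dominates perceived responsiveness. A minimal streaming sketch, where `stream_tokens()` is a hypothetical stand-in for a real streaming client:

```python
import time
from typing import Iterator

def stream_tokens() -> Iterator[str]:
    """Hypothetical stand-in for a streaming inference client (e.g. an SSE or gRPC stream)."""
    time.sleep(0.25)                 # queueing + prefill before the first token
    for word in "inference latency is dominated by the first token".split():
        time.sleep(0.02)             # steady decode rate afterwards
        yield word + " "

start = time.perf_counter()
ttfb = None
token_count = 0
for token in stream_tokens():
    if ttfb is None:
        ttfb = time.perf_counter() - start   # time to first token
    token_count += 1
total = time.perf_counter() - start

# Decode rate excludes the first token, which is dominated by prefill.
print(f"TTFB: {ttfb * 1000:.0f} ms, total: {total * 1000:.0f} ms, "
      f"decode rate: {(token_count - 1) / (total - ttfb):.1f} tokens/s")
```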
Bill of Materials (BOM)
| Domain | Examples | Role |
| --- | --- | --- |
| Accelerators | NVIDIA L40S, H100/H200, AMD MI300A/X, Intel Gaudi, Edge TPUs, NPUs | Token-by-token or batch inference at high throughput |
| Serving Frameworks | TensorRT-LLM, vLLM, Triton, ONNX Runtime | Optimized model execution with batching and KV-cache support |
| Retrieval & Context | Vector DBs (Milvus, Weaviate, pgvector), Redis, Pinecone | Enable RAG, personalization, and low-latency lookups |
| Routers & Orchestrators | Custom policy engines, model cascades, A/B controllers | Select model variants by cost, latency, accuracy |
| Networking | RoCE/InfiniBand for GPU pools; 400G Ethernet for API edges | Ensure low tail latency and efficient batching |
| Cooling | Rear-door HX, liquid-cooled racks | Manage dense inference clusters at 40–80 kW/rack |
| Observability | OpenTelemetry, token meters, quality eval pipelines | Track SLOs, drift, and cost efficiency |
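As a concrete example of the serving-framework row, the sketch below uses vLLM's offline batch interface, which handles request batching and KV-cache management internally. The model name and sampling settings are illustrative placeholders, and this follows the documented quickstart pattern rather than a tuned production configuration.

```python
from vllm import LLM, SamplingParams  # assumes vLLM is installed and a GPU is available

# A small open model as a placeholder; a production deployment would pin a specific,
# validated checkpoint and configure parallelism, quantization, and memory limits.
llm = LLM(model="facebook/opt-125m")

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

prompts = [
    "Summarize the difference between training and inference:",
    "List three latency considerations for a chat API:",
]

# vLLM batches these prompts together and manages the KV cache in paged blocks.
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text.strip())
```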
Facility Alignment
| Inference Mode | Best-Fit Facilities | Also Runs In | Notes |
| --- | --- | --- | --- |
| Interactive APIs | Hyperscale, Colocation | Enterprise | Global distribution, API latency targets |
| Conversational / Agents | Hyperscale | Colo, Enterprise | Streaming token output, context caching |
| Batch Scoring | Hyperscale, Enterprise | HPC (co-located) | Throughput priority, cost-optimized hardware |
| Edge Realtime | Edge / Micro | Metro Colo | Sub-20 ms cycles for robotics/autonomy |
Inference in Vehicles & Humanoids
Inference is increasingly embedded in autonomous vehicles and humanoid robots. These workloads are safety-critical and deterministic; inference runs entirely on-device, while data flows upstream to and downstream from central training clusters:
| Direction | Flow | Mechanism | Purpose |
| --- | --- | --- | --- |
| Upstream | Telemetry, failure cases, sensor snapshots | 5G/LTE, Wi-Fi batch uploads | Enrich training datasets with edge cases |
| Downstream | Model updates, patches, weights | OTA updates via secure channels | Deploy improved models to fleets |
| Realtime (Optional) | Map updates, V2X data | Low-latency network sync | Augment local inference without cloud dependency |
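The downstream OTA path is only safe if the vehicle or robot can verify that a received model artifact is authentic and intact before loading it. A minimal sketch of that check, signing the artifact's SHA-256 digest with Ed25519; the use of the `cryptography` package and all key and file names are assumptions for illustration.

```python
import hashlib
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey, Ed25519PublicKey

def digest(path: Path) -> bytes:
    """SHA-256 digest of the downloaded model artifact."""
    return hashlib.sha256(path.read_bytes()).digest()

def verify_update(artifact: Path, signature: bytes, publisher_key: Ed25519PublicKey) -> bool:
    """Accept the update only if the publisher's signature over the digest verifies."""
    try:
        publisher_key.verify(signature, digest(artifact))
        return True
    except InvalidSignature:
        return False

# Self-contained demo: in a real fleet the private key stays in the publisher's HSM,
# and only the public key ships with the vehicle.
private_key = Ed25519PrivateKey.generate()
artifact = Path("model_update.bin")
artifact.write_bytes(b"\x00" * 1024)               # placeholder weights
signature = private_key.sign(digest(artifact))

print(verify_update(artifact, signature, private_key.public_key()))   # True
artifact.write_bytes(b"\x01" * 1024)               # tampered payload
print(verify_update(artifact, signature, private_key.public_key()))   # False
```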
Key Challenges
- Latency: Meeting sub-100 ms targets for interactive APIs; sub-20 ms for autonomy/robotics.
- Cost per query: Serving billions of queries/token generations economically.
- Model scaling: Large models are expensive to serve and typically require quantization, distillation, or cascades (see the cascade sketch after this list).
- Observability: Monitoring SLOs, drift, and bias at production scale.
- Security: Protecting model IP, PII, and integrity of OTA updates.
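One common answer to the model-scaling challenge is a cascade: try a small, cheap model first and escalate to a larger one only when its confidence is low. A minimal sketch with hypothetical model names, prices, and stand-in confidence scores:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelTier:
    name: str
    cost_per_1k_tokens: float                   # hypothetical prices
    run: Callable[[str], tuple[str, float]]     # returns (answer, confidence in [0, 1])

def cascade(prompt: str, tiers: list[ModelTier], min_confidence: float = 0.8) -> tuple[str, str]:
    """Escalate through tiers (cheapest first) until one is confident enough."""
    for tier in tiers:
        answer, confidence = tier.run(prompt)
        if confidence >= min_confidence:
            return tier.name, answer
    return tiers[-1].name, answer               # fall back to the largest tier's answer

# Stand-in model calls; real ones would hit the serving layer described above.
small = ModelTier("distilled-7b", 0.0002, lambda p: ("short answer", 0.55))
large = ModelTier("frontier-70b", 0.0030, lambda p: ("detailed answer", 0.92))

print(cascade("Explain KV caching in one sentence.", [small, large]))
```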
Notable Deployments
| Deployment | Operator | Scale | Notes |
| --- | --- | --- | --- |
| ChatGPT API | OpenAI / Microsoft Azure | Global, billions of requests/day | Conversational inference at hyperscale |
| YouTube Recommendations | Google | Petabytes/day processed | High-throughput recsys inference |
| Tesla FSD | Tesla/xAI | Millions of cars | On-device FSD inference with OTA model updates |
| NVIDIA Jetson Inference | NVIDIA | Tens of thousands of edge robots/drones | Embedded inference for robotics and automation |
Future Outlook
- Edge-first inference: Growth in NPUs and ASICs in cars, robots, and devices.
- Model optimization: Distillation, quantization, pruning to reduce costs.
- Federated inference: On-device predictions enhanced by fleet or cloud collaboration.
- Carbon-aware serving: Routing batch inference to cleaner grids/hours (see the scheduling sketch after this list).
- Security-first OTA: Formal verification of model updates for safety-critical platforms.
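For carbon-aware serving, deferrable batch jobs can be shifted toward regions and hours with lower grid carbon intensity. A minimal sketch with made-up intensity figures; a real scheduler would pull these from a grid-data provider and respect job deadlines.

```python
# Hypothetical average carbon intensity (gCO2/kWh) per region and hour-of-day window (UTC).
carbon_intensity = {
    ("eu-north", "00-06"): 35,
    ("eu-north", "12-18"): 60,
    ("us-east", "00-06"): 410,
    ("us-east", "12-18"): 320,
}

def pick_slot(candidates: dict[tuple[str, str], int]) -> tuple[str, str]:
    """Choose the (region, window) with the lowest carbon intensity for a deferrable batch job."""
    return min(candidates, key=candidates.get)

region, window = pick_slot(carbon_intensity)
print(f"Schedule nightly embedding refresh in {region} during {window} UTC")
```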
FAQ
- How is inference different from training? Inference is continuous, latency-driven, and distributed; training is batch, throughput-driven, and centralized.
- Where does inference run? Hyperscale clouds, colocation, edge sites, enterprise DCs, and embedded devices.
- Do all inference workloads need GPUs? No — CPUs handle many small models; GPUs/ASICs are required for LLMs, vision, and high-QPS workloads.
- How do fleets (cars/robots) stay up to date? Via OTA model updates from central training clusters, with telemetry feeding back upstream.
- What’s the biggest bottleneck? Balancing latency, cost per query, and quality at global scale.