DC Workload Scheduling


Workload scheduling is the decision layer that places jobs on compute resources. Given a workload (a training job, an inference request, a batch job, a container, a VM) and a pool of resources (clusters, nodes, GPUs, CPUs, memory, network), the scheduler decides what runs where and when. The discipline is operationally distinct from Orchestration Operations, which carries out the placement decisions and manages workload lifecycle. Schedulers decide; orchestrators execute. Kubernetes does both: kube-scheduler makes the placement decisions, and the rest of Kubernetes handles orchestration. The same conceptual split exists in every workload management system.


Scheduling decisions

Decision | What it determines | Why it matters
Resource selection | Which cluster, node, or accelerator runs the workload | Hardware match, locality, capacity availability
Timing | When the workload runs vs queues | Throughput, deadline adherence, latency expectations
Priority and preemption | Which workloads run when capacity is contended | Fair-share allocation, business priority, SLA delivery
Bin-packing | How efficiently workloads pack onto available resources | Capacity utilization; stranded capacity reduction
Affinity and anti-affinity | Co-location preferences (data locality) and separation requirements (HA, fault domain isolation) | Network performance, fault tolerance
Gang scheduling | All-or-nothing placement of multi-node jobs (AI training) | Multi-GPU jobs need all resources simultaneously; partial placement wastes capacity
Topology awareness | Placement aware of NVLink domains, NIC affinity, NUMA topology | AI training performance depends on tight topology coupling
Cost and carbon awareness | Placement aware of spot pricing, region carbon intensity, off-peak scheduling | Cost optimization; sustainability; carbon-aware scheduling
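
To make the bin-packing row concrete, here is a minimal sketch of first-fit decreasing, one common packing heuristic. It is an illustration only: the Node class, the job names, and the GPU counts are hypothetical, and production schedulers weigh many more dimensions than a single GPU count.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node with a fixed number of GPUs."""
    name: str
    capacity: int
    jobs: list = field(default_factory=list)  # list of (job_name, gpus)

    @property
    def free(self) -> int:
        return self.capacity - sum(gpus for _, gpus in self.jobs)


def first_fit_decreasing(jobs: dict, nodes: list) -> list:
    """Place jobs largest-first on the first node with enough free GPUs.

    Returns the names of jobs that could not be placed (they would queue).
    """
    unplaced = []
    for job, gpus in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        target = next((n for n in nodes if n.free >= gpus), None)
        if target is None:
            unplaced.append(job)
        else:
            target.jobs.append((job, gpus))
    return unplaced


# Illustrative inputs, not taken from any real cluster.
nodes = [Node("node-a", 8), Node("node-b", 8)]
jobs = {"train-1": 6, "infer-1": 2, "batch-1": 4, "batch-2": 4}
queued = first_fit_decreasing(jobs, nodes)
stranded = sum(n.free for n in nodes)  # GPUs left idle after packing
```

Sorting largest-first lets the small inference job fill the gap left beside the large training job, so in this example no GPUs are stranded.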

Scheduler categories

Category | Workload type | Typical schedulers
Container scheduler | Containerized services, microservices, stateless workloads | Kubernetes scheduler, OpenShift scheduler, Mesos, Nomad
HPC batch scheduler | Tightly-coupled scientific computing, simulation | Slurm, PBS Pro, LSF, Grid Engine, Torque
AI training scheduler | Multi-node GPU training jobs | NVIDIA Run:ai, Volcano, Kueue, Slurm with AI extensions, Determined AI
Workflow scheduler | Batch pipelines with dependencies; ETL; data engineering | Apache Airflow, Dagster, Prefect, Argo Workflows
Big data scheduler | Hadoop and Spark clusters | YARN, Spark scheduler, Mesos
VM scheduler | Virtualized infrastructure | VMware DRS, OpenStack Nova scheduler, Hyper-V failover clusters
Hyperscaler internal | Custom workload management at fleet scale | Google Borg/Omega/Kubernetes, Meta Twine, Microsoft and AWS internal platforms
Cloud scheduler | Multi-tenant cloud workload placement | AWS, Azure, GCP internal placement engines (proprietary)

AI training scheduling

AI training workloads have created scheduling challenges that traditional Kubernetes and HPC schedulers were not designed for. A frontier training job consuming 10,000+ GPUs needs gang scheduling (all GPUs allocated simultaneously or none), tight topology awareness (placement on adjacent NVLink domains and rail-aligned NICs), sophisticated preemption (low-priority jobs that can be evicted when training jobs need capacity), and checkpoint-aware rescheduling (resuming on different nodes after a node failure without losing training progress). These requirements drove the emergence of AI-specific schedulers - NVIDIA Run:ai (originally an Israeli startup, acquired by NVIDIA in 2024), Volcano (Kubernetes-native batch scheduler with gang scheduling), Kueue (Kubernetes job queueing), and Determined AI (Hewlett Packard Enterprise) all address this niche. The Kubernetes default scheduler does not handle gang scheduling natively; production AI training clusters typically run one of the AI-specific schedulers as an extension or replacement.
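
The all-or-nothing rule behind gang scheduling can be sketched in a few lines. This is a simplified illustration, not how Volcano, Kueue, or Run:ai implement it: the function name, the node names, and the single-resource model (GPUs only) are assumptions made for the example.

```python
def gang_schedule(free_gpus: dict, gpus_per_worker: int, workers: int):
    """Return {node: worker_count} if every worker fits, else None.

    Nothing is reserved unless the whole gang can be placed.
    """
    plan = {}
    remaining = workers
    # Try the nodes with the most free GPUs first.
    for node, free in sorted(free_gpus.items(), key=lambda kv: kv[1], reverse=True):
        fit = min(free // gpus_per_worker, remaining)
        if fit:
            plan[node] = fit
            remaining -= fit
        if remaining == 0:
            break
    if remaining:
        return None  # a partial placement would hold GPUs idle, so reject
    for node, count in plan.items():
        free_gpus[node] -= count * gpus_per_worker  # commit the whole gang
    return plan


# Hypothetical cluster state: free GPUs per node.
cluster = {"n1": 8, "n2": 8, "n3": 4}
too_big = gang_schedule(cluster, gpus_per_worker=8, workers=3)
after_reject = dict(cluster)  # snapshot: a rejected gang leaves state untouched
ok = gang_schedule(cluster, gpus_per_worker=8, workers=2)
```

The three-worker job is rejected outright because only two full 8-GPU slots exist; the two-worker job is then admitted and committed in one step.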

The shift from H100 to GB200 NVL72 has further complicated the picture. NVL72 racks are 72-GPU NVLink domains where memory is coherent across the full 72-GPU fabric; scheduling decisions need to keep tightly-coupled jobs within a single NVL72 domain when possible and across coupled domains when necessary. Rubin reference designs extend this further. Schedulers that were adequate for H100-scale topology are insufficient for NVL72 and Rubin deployments without explicit topology-awareness extensions.
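
The "single domain when possible, multiple domains when necessary" preference can be expressed as a small placement function. This is a sketch under simplifying assumptions: domains are treated as interchangeable pools of free GPUs, the rack names are hypothetical, and real topology-aware schedulers also consider which domains are adjacent to each other.

```python
from typing import Optional


def place_in_domains(free_by_domain: dict, gpus_needed: int) -> Optional[dict]:
    """Prefer a single NVLink domain; otherwise span as few domains as possible."""
    # Best case: the tightest single domain that still fits the whole job.
    fitting = [(free, name) for name, free in free_by_domain.items()
               if free >= gpus_needed]
    if fitting:
        _, name = min(fitting)
        return {name: gpus_needed}

    # Fallback: fill the emptiest domains first to minimize the number spanned.
    plan = {}
    remaining = gpus_needed
    for name, free in sorted(free_by_domain.items(),
                             key=lambda kv: kv[1], reverse=True):
        if free == 0:
            continue
        take = min(free, remaining)
        plan[name] = take
        remaining -= take
        if remaining == 0:
            return plan
    return None  # not enough free GPUs anywhere; the job queues


# Hypothetical free-GPU counts per 72-GPU domain.
domains = {"rack-1": 72, "rack-2": 40, "rack-3": 16}
small = place_in_domains(domains, 32)
large = place_in_domains(domains, 100)
too_large = place_in_domains(domains, 200)
```

Choosing the tightest fitting domain for the small job keeps the fully free rack available for a later job that needs all 72 GPUs.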


Multi-tenancy and fair-share

Multi-tenant scheduling - allocating shared resources across multiple users, teams, or customers - is one of the harder scheduling problems. The classical algorithms include fair-share (Slurm, YARN), dominant resource fairness (DRF, used in Mesos), and various forms of weighted fair queueing. Modern AI infrastructure adds a further challenge: GPUs are usually the dominant resource, but jobs may also need specific CPU, memory, network bandwidth, and storage tiers. Hyperscaler internal schedulers solve this through complex multi-objective optimization that the open-source ecosystem is still catching up to. The customer-visible behavior of cloud GPU allocation - the way A100 capacity becomes harder to obtain at the end of quarters, or how Blackwell allocation goes to priority customers first - is the externally visible result of these internal scheduling decisions.
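
Dominant resource fairness is easier to follow with a worked example. The sketch below repeatedly gives one task to whichever tenant currently has the smallest dominant share (its largest fractional use of any resource). The tenant names and numbers are illustrative. One simplification is labeled in the code: when the chosen tenant's next task does not fit, this version drops that tenant and keeps trying the others rather than stopping.

```python
def drf_allocate(capacity: dict, demands: dict) -> dict:
    """Progressive-filling DRF: returns the number of tasks granted per tenant.

    capacity: total of each resource, e.g. {"cpu": 9, "mem_gb": 18}
    demands:  per-task requirement for each tenant
    """
    used = {r: 0 for r in capacity}
    tasks = {tenant: 0 for tenant in demands}
    active = set(demands)

    def dominant_share(tenant):
        # Largest fraction of any single resource this tenant holds.
        return max(tasks[tenant] * demands[tenant][r] / capacity[r]
                   for r in capacity)

    while active:
        # sorted() makes tie-breaking deterministic.
        tenant = min(sorted(active), key=dominant_share)
        need = demands[tenant]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[tenant] += 1
        else:
            # Simplification for this sketch: stop offering to this tenant.
            active.discard(tenant)
    return tasks


capacity = {"cpu": 9, "mem_gb": 18}
demands = {
    "team-a": {"cpu": 1, "mem_gb": 4},  # memory-heavy tasks
    "team-b": {"cpu": 3, "mem_gb": 1},  # CPU-heavy tasks
}
granted = drf_allocate(capacity, demands)
```

With these inputs both tenants end at a dominant share of two thirds: team-a holds 12 of 18 GB of memory and team-b holds 6 of 9 CPUs.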


Carbon-aware scheduling

Carbon-aware scheduling shifts flexible workloads in time and across regions to align consumption with cleaner grid hours. Implementation requires hourly grid carbon intensity data (Electricity Maps, WattTime, ENTSO-E), workload categorization (which jobs are flexible enough to delay or relocate), and scheduler integration that treats carbon intensity as a placement input. Google has published research on production deployment of carbon-aware scheduling; Microsoft, Meta, and several AI operators have similar capabilities. The technique works for batch and asynchronous workloads (training runs, batch inference, data processing) where latency is not critical; it does not apply to real-time inference. Carbon-aware scheduling is becoming a standard sustainability practice rather than a research curiosity, though the actual carbon savings depend on how much the grid mix varies in the regions involved.
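
The time-shifting half of carbon-aware scheduling reduces to picking the cleanest window that still meets the deadline. The sketch below assumes an hourly forecast is already available as a plain list; the function name and the forecast values are invented for illustration and do not come from any provider's API or real grid data.

```python
def best_start_hour(forecast: list, duration_h: int, deadline_h: int) -> tuple:
    """Pick the start hour with the lowest average carbon intensity.

    forecast:   gCO2/kWh for each upcoming hour, index 0 = now
    duration_h: how long the job runs
    deadline_h: the job must finish within this many hours from now
    Returns (start_hour, average_intensity).
    """
    latest_start = min(deadline_h, len(forecast)) - duration_h
    if latest_start < 0:
        raise ValueError("job cannot finish before the deadline")
    best = None
    for start in range(latest_start + 1):
        window = forecast[start:start + duration_h]
        avg = sum(window) / duration_h
        if best is None or avg < best[1]:
            best = (start, avg)
    return best


# Made-up hourly forecast with a midday dip in carbon intensity.
forecast = [450, 420, 380, 210, 180, 190, 300, 430]
start, avg = best_start_hour(forecast, duration_h=3, deadline_h=8)
run_now_avg = sum(forecast[:3]) / 3
saving = 1 - avg / run_now_avg  # fractional reduction vs starting immediately
```

In this made-up forecast, delaying the job by three hours roughly halves its average carbon intensity. A flat forecast would yield no saving, which is the grid-variability dependence noted above.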


Spot and preemptible scheduling

Cloud providers offer spot pricing or preemptible instances - low-cost capacity that can be reclaimed by the provider on short notice. Scheduling for spot/preemptible workloads adds the challenge of handling reclamation events without losing work. The discipline includes spot-tolerant workload identification (training with frequent checkpointing, batch jobs that can resume), spot-bid optimization (predicting reclaim probability from spot price history), and graceful preemption handling (saving state when the reclaim signal arrives, then requeuing the work). The cost savings are substantial (40-90% off on-demand pricing) but are realized only by workloads that can tolerate the operational complexity. AI training has become a major spot consumer because training jobs that checkpoint every few minutes can absorb reclamation events at limited cost.
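
A back-of-the-envelope model shows why frequent checkpointing keeps reclamation cheap. It is a first-order sketch under stated assumptions, not a pricing formula from any provider: each reclaim is assumed to lose, on average, half a checkpoint interval of work plus a fixed restart time, and all prices and rates below are hypothetical.

```python
def spot_effective_cost(on_demand_per_h: float, discount: float,
                        reclaims_per_h: float, ckpt_interval_h: float,
                        restart_h: float) -> float:
    """Effective cost per useful compute hour on spot capacity.

    Assumption: a reclaim lands uniformly within a checkpoint interval,
    so the expected lost work is half an interval plus restart time.
    """
    spot_per_h = on_demand_per_h * (1 - discount)
    lost_per_reclaim_h = ckpt_interval_h / 2 + restart_h
    overhead = reclaims_per_h * lost_per_reclaim_h  # wasted hours per useful hour
    return spot_per_h * (1 + overhead)


# Hypothetical numbers: 60% discount, one reclaim every 20 hours,
# checkpoints every 6 minutes, 15 minutes to restart on new nodes.
cost = spot_effective_cost(
    on_demand_per_h=10.0,
    discount=0.6,
    reclaims_per_h=0.05,
    ckpt_interval_h=0.1,
    restart_h=0.25,
)
```

Under these assumptions the reclamation overhead is about 1.5%, so the effective rate stays close to the raw spot price. Lengthening the checkpoint interval or the restart time raises the overhead proportionally.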


Schedulers in depth

Scheduler | Origin | Distinctive features
Kubernetes scheduler | Originally Borg-derived; CNCF | Default container scheduler; pluggable via scheduling framework; doesn't natively support gang scheduling
Slurm | SchedMD; HPC origins | Dominant HPC scheduler; mature gang scheduling; widely used in scientific computing and increasingly AI training
NVIDIA Run:ai | Israeli startup; acquired by NVIDIA 2024 | GPU-focused fractional and shared scheduling; preemptive multi-tenancy; AI training optimized
Volcano | Open-source CNCF project | Kubernetes-native batch scheduler; gang scheduling; AI/ML workload focus
Kueue | Kubernetes SIG project | Job queueing for Kubernetes; gang scheduling; quota management
YARN | Hadoop ecosystem | Big data scheduling; capacity scheduler with hierarchical queues
PBS Pro | Altair | Mature HPC scheduler; common in research and government HPC
IBM LSF | IBM Spectrum Computing | Enterprise HPC scheduler; common in financial services and life sciences
Apache Airflow | Airbnb-originated; Apache Foundation | Workflow scheduler with dependency graphs; data engineering pipelines
Determined AI | Acquired by HPE | AI training platform with built-in scheduling and experiment management
Google Borg / Omega | Google internal | Borg paper (2015) defined modern fleet-scale scheduling; informed Kubernetes design
Meta Twine | Meta internal | Hyperscale fleet scheduler at Meta scale; not commercially available

Where this fits

Workload scheduling is the decision layer; Orchestration Operations is the execution layer that carries out the decisions. Both sit under Compute Ops, though at scale they are usually run by different teams. Scheduler decisions feed into capacity planning and are constrained by it in turn, and capacity planning connects to DCIM for capacity authority. Carbon-aware scheduling integrates with Energy: Sustainability practices. SLA delivery depends on scheduling decisions and feeds into SLA/SLO Management. AI training scheduling cross-references Workloads: AI Training and AI Training Superclusters.


Related coverage

Compute Ops | Orchestration Operations | SLA/SLO Management | Production Reliability Engineering | DCIM | AI Training | AI Training Superclusters | AI Inference | Energy: Sustainability