
DC Workload Scheduling

Workload scheduling is the decision layer that places jobs on compute resources. Given a workload (a training job, an inference request, a batch job, a container, a VM) and a pool of resources (clusters, nodes, GPUs, CPUs, memory, network), the scheduler decides what runs where and when. The discipline is operationally distinct from Orchestration Operations, which carries out the placement decisions and manages workload lifecycle. Schedulers decide; orchestrators execute. Kubernetes does both: kube-scheduler makes the placement decision, and the rest of Kubernetes handles orchestration. The same conceptual split exists in every workload management system.

Scheduling decisions

Decision	What it determines	Why it matters
Resource selection	Which cluster, node, or accelerator runs the workload	Hardware match, locality, capacity availability
Timing	When the workload runs vs queues	Throughput, deadline adherence, latency expectations
Priority and preemption	Which workloads run when capacity is contended	Fair-share allocation, business priority, SLA delivery
Bin-packing	How efficiently workloads pack onto available resources	Capacity utilization; stranded capacity reduction
Affinity and anti-affinity	Co-location preferences (data locality) and separation requirements (HA, fault domain isolation)	Network performance, fault tolerance
Gang scheduling	All-or-nothing placement of multi-node jobs (AI training)	Multi-GPU jobs need all resources simultaneously; partial placement wastes capacity
Topology awareness	Placement aware of NVLink domains, NIC affinity, NUMA topology	AI training performance depends on tight topology coupling
Cost and carbon awareness	Placement aware of spot pricing, region carbon intensity, off-peak scheduling	Cost optimization; sustainability; carbon-aware scheduling
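
The bin-packing row can be made concrete with a minimal sketch. It uses first-fit decreasing, one common heuristic (not the method of any particular scheduler), and assumes identical nodes and a single resource dimension (GPUs). The function name and the sample numbers are hypothetical.

```python
def first_fit_decreasing(requests, node_capacity):
    """Pack GPU requests onto identical nodes using first-fit decreasing.

    Hypothetical helper for illustration: one resource dimension only.
    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []  # each entry is [free_capacity, [placed requests]]
    for req in sorted(requests, reverse=True):  # largest requests first
        if req > node_capacity:
            raise ValueError(f"request {req} exceeds node capacity {node_capacity}")
        for node in nodes:
            if node[0] >= req:  # first node with enough free capacity
                node[0] -= req
                node[1].append(req)
                break
        else:
            # no existing node fits, so open a new one
            nodes.append([node_capacity - req, [req]])
    return [placed for _, placed in nodes]


packed = first_fit_decreasing([4, 2, 2, 6, 1, 1], node_capacity=8)
print(packed)  # [[6, 2], [4, 2, 1, 1]]
```

Sixteen requested GPUs fit on two 8-GPU nodes with nothing stranded. Real schedulers pack across several dimensions at once (GPU, CPU, memory), which is harder than this one-dimensional case.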

Scheduler categories

Category	Workload type	Typical schedulers
Container scheduler	Containerized services, microservices, stateless workloads	Kubernetes scheduler, OpenShift scheduler, Mesos, Nomad
HPC batch scheduler	Tightly-coupled scientific computing, simulation	Slurm, PBS Pro, LSF, Grid Engine, Torque
AI training scheduler	Multi-node GPU training jobs	NVIDIA Run:ai, Volcano, Kueue, Slurm with AI extensions, Determined AI
Workflow scheduler	Batch pipelines with dependencies; ETL; data engineering	Apache Airflow, Dagster, Prefect, Argo Workflows
Big data scheduler	Hadoop and Spark clusters	YARN, Spark scheduler, Mesos
VM scheduler	Virtualized infrastructure	VMware DRS, OpenStack Nova scheduler, Hyper-V failover clusters
Hyperscaler internal	Custom workload management at fleet scale	Google Borg/Omega/Kubernetes, Meta Twine, Microsoft and AWS internal platforms
Cloud scheduler	Multi-tenant cloud workload placement	AWS, Azure, GCP internal placement engines (proprietary)

AI training scheduling

AI training workloads have created scheduling challenges that traditional Kubernetes and HPC schedulers were not designed for. A frontier training job consuming 10,000+ GPUs needs gang scheduling (all GPUs allocated simultaneously or none), tight topology awareness (placement on adjacent NVLink domains and rail-aligned NICs), preemption support (lower-priority jobs that can be evicted when training jobs need capacity), and checkpoint-aware rescheduling (resuming on different nodes after a node failure without losing training progress). These requirements drove the emergence of AI-specific schedulers: NVIDIA Run:ai (originally an Israeli startup, acquired by NVIDIA in 2024), Volcano (Kubernetes-native batch scheduler with gang scheduling), Kueue (Kubernetes job queueing), and Determined AI (Hewlett Packard Enterprise) all address this niche. The Kubernetes default scheduler does not handle gang scheduling natively; production AI training clusters typically run one of the AI-specific schedulers as an extension or replacement.
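
The all-or-nothing rule behind gang scheduling can be sketched in a few lines. This is a minimal illustration, not how Volcano, Kueue, or Slurm implement it; the function name, node names, and GPU counts are hypothetical, and it assumes every worker needs the same number of GPUs on a single node.

```python
def gang_schedule(free_gpus, gpus_per_worker, workers):
    """Return {node: worker_count} if every worker fits right now, else None.

    Hypothetical sketch: nothing is reserved unless the whole gang fits,
    so a partially placed job never holds GPUs while it waits.
    """
    if workers <= 0 or gpus_per_worker <= 0:
        raise ValueError("workers and gpus_per_worker must be positive")
    placement = {}
    remaining = workers
    # Visit the emptiest nodes first so the gang spans as few nodes as possible.
    for node, free in sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    return None  # the gang does not fit; the job stays queued


free = {"node-a": 8, "node-b": 4, "node-c": 2}
print(gang_schedule(free, gpus_per_worker=4, workers=3))  # {'node-a': 2, 'node-b': 1}
print(gang_schedule(free, gpus_per_worker=4, workers=4))  # None
```

In the second call three of the four workers would fit, but the function places none of them. That is the point of gang scheduling: three workers holding 12 GPUs while waiting for a fourth would be wasted capacity.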

The shift from H100 to GB200 NVL72 has further complicated the picture. NVL72 racks are 72-GPU NVLink domains where memory is coherent across the full 72-GPU fabric; scheduling decisions need to keep tightly-coupled jobs within a single NVL72 domain when possible and across coupled domains when necessary. Rubin reference designs extend this further. Schedulers that were adequate for H100-scale topology are insufficient for NVL72 and Rubin deployments without explicit topology-awareness extensions.
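
The "single domain when possible, multiple domains when necessary" rule can be sketched as follows. This is an assumption-laden simplification: the function name and rack names are hypothetical, each domain is reduced to a count of free GPUs, and the sketch ignores which domains are actually coupled to each other on the network.

```python
def place_in_domains(free_by_domain, gpus_needed):
    """Return {domain: gpu_count} for a job, or None if it cannot be placed.

    Hypothetical sketch of domain-aware placement:
    1. Prefer a single NVLink domain, picking the tightest fit so that
       large contiguous domains stay free for large jobs.
    2. Otherwise span domains, largest first, to minimise the domain count.
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")
    fitting = [(free, name) for name, free in free_by_domain.items()
               if free >= gpus_needed]
    if fitting:
        _, name = min(fitting)  # smallest domain that still fits the job
        return {name: gpus_needed}

    placement = {}
    remaining = gpus_needed
    for name, free in sorted(free_by_domain.items(), key=lambda kv: (-kv[1], kv[0])):
        if free == 0:
            continue
        take = min(free, remaining)
        placement[name] = take
        remaining -= take
        if remaining == 0:
            return placement
    return None  # not enough free GPUs across all domains


domains = {"rack-a": 72, "rack-b": 40, "rack-c": 16}
print(place_in_domains(domains, 32))   # {'rack-b': 32}
print(place_in_domains(domains, 100))  # {'rack-a': 72, 'rack-b': 28}
```

The 32-GPU job lands in rack-b rather than the fully free rack-a, which keeps a whole 72-GPU domain available for a job that needs it.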

Multi-tenancy and fair-share

Multi-tenant scheduling - allocating shared resources across multiple users, teams, or customers - is one of the harder scheduling problems. The classical algorithms include fair-share (Slurm, YARN), dominant resource fairness (DRF, used in Mesos), and various forms of weighted fair queueing. In modern AI infrastructure the GPU is usually the dominant resource, but jobs may also need specific CPU, memory, network bandwidth, and storage tiers. Hyperscaler internal schedulers handle this through complex multi-objective optimization that the open-source ecosystem is still catching up to. The customer-visible behavior of cloud GPU allocation - the way A100 capacity becomes harder to obtain at the end of quarters, or how Blackwell allocation goes to priority customers first - is the external result of these internal scheduling decisions.
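
Dominant resource fairness is easier to follow with numbers. The sketch below repeatedly gives one task to the user whose dominant share (their largest fractional use of any single resource) is currently lowest. It is a simplified variant written for illustration, not the Mesos implementation: the function name is hypothetical, and a user whose next task no longer fits is dropped while the others continue. The cluster and demand figures are the two-user example commonly used to explain DRF.

```python
def drf_allocate(capacity, demands):
    """Return {user: task_count} under a simplified DRF progressive fill.

    capacity: {resource: total}, demands: {user: {resource: per-task need}}.
    Hypothetical helper for illustration only.
    """
    for user, need in demands.items():
        if all(need.get(r, 0) <= 0 for r in capacity):
            raise ValueError(f"{user} must demand a positive amount of some resource")
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    active = set(demands)

    def dominant_share(user):
        # Largest fraction of any one resource currently held by this user.
        return max(tasks[user] * demands[user].get(r, 0) / capacity[r]
                   for r in capacity)

    while active:
        # The user with the lowest dominant share gets the next task.
        user = min(active, key=lambda u: (dominant_share(u), u))
        need = demands[user]
        if all(used[r] + need.get(r, 0) <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need.get(r, 0)
            tasks[user] += 1
        else:
            active.discard(user)  # this user's next task no longer fits
    return tasks


capacity = {"cpu": 9, "mem_gb": 18}
demands = {"A": {"cpu": 1, "mem_gb": 4}, "B": {"cpu": 3, "mem_gb": 1}}
print(drf_allocate(capacity, demands))  # {'A': 3, 'B': 2}
```

User A ends with 12 of 18 GB and user B with 6 of 9 CPUs, so both hold two thirds of their dominant resource even though their task counts differ.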

Carbon-aware scheduling

Carbon-aware scheduling shifts flexible workloads in time and across regions to align consumption with cleaner grid hours. Implementation requires hourly grid carbon intensity data (electricityMaps, WattTime, ENTSO-E), workload categorization (which jobs are flexible enough to delay or relocate), and scheduler integration that treats carbon intensity as a placement input. Google has published research on production deployment of carbon-aware scheduling; Microsoft, Meta, and several AI operators have similar capabilities. The technique works for batch and asynchronous workloads (training runs, batch inference, data processing) where latency is not critical; it does not apply to real-time inference. Carbon-aware scheduling is becoming a standard sustainability practice rather than a research curiosity, though the actual carbon savings depend on grid mix variability in the regions involved.
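
Treating carbon intensity as a placement input can be as simple as choosing the start hour with the lowest total forecast intensity that still meets the deadline. The sketch below is a minimal single-region illustration. The function name and forecast values are hypothetical, and it does not call any real grid-data API; in practice the forecast would come from a provider such as those listed above.

```python
def best_start_hour(forecast, duration_hours, deadline_hour):
    """Pick the start hour that minimises summed carbon intensity.

    forecast: hourly grid intensity (gCO2/kWh), index 0 = the current hour.
    deadline_hour: the job must finish before this hour index.
    Hypothetical helper; assumes constant power draw for the whole run.
    """
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")
    latest_start = min(deadline_hour, len(forecast)) - duration_hours
    if latest_start < 0:
        raise ValueError("job cannot finish before the deadline")

    def window_intensity(start):
        return sum(forecast[start:start + duration_hours])

    # Ties resolve to the earliest start because min() keeps the first minimum.
    return min(range(latest_start + 1), key=window_intensity)


# Made-up forecast for illustration: a midday dip in grid intensity.
forecast = [450, 420, 300, 180, 160, 210, 380, 500]
print(best_start_hour(forecast, duration_hours=2, deadline_hour=8))  # 3
print(best_start_hour(forecast, duration_hours=2, deadline_hour=4))  # 2
```

With a loose deadline the job waits for the cleanest window (hours 3-4). With a tight deadline it takes the best window it can still reach, which shows why the savings depend on how flexible the workload really is.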

Spot and preemptible scheduling

Cloud providers offer spot pricing or preemptible instances - low-cost capacity that can be reclaimed by the provider on short notice. Scheduling for spot/preemptible workloads adds the challenge of handling reclamation events without losing work. The discipline includes spot-tolerant workload identification (training with frequent checkpointing, batch jobs that can resume), spot-bid optimization (predicting reclaim probability from spot price history), and graceful preemption handling (saving state when the reclaim signal arrives, then requeuing the work). The cost savings are substantial (40-90% off on-demand pricing) but are realized only by workloads that can tolerate the operational complexity. AI training has become a major spot consumer because training jobs that checkpoint every few minutes can absorb reclamation events at limited cost.
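
A back-of-the-envelope model shows why frequent checkpointing makes spot capacity attractive. The sketch below is a first-order estimate under stated assumptions, not a pricing tool. The function name and every number in the usage example are hypothetical. It assumes reclaims arrive at a constant average rate, that a reclaim loses on average half a checkpoint interval of work plus a fixed restart overhead, and it ignores reclaims that occur while lost work is being redone.

```python
def spot_vs_on_demand(job_hours, on_demand_rate, spot_discount,
                      reclaims_per_hour, checkpoint_interval_h,
                      restart_overhead_h):
    """Return (on_demand_cost, expected_spot_cost) for one job.

    Hypothetical first-order model for illustration only.
    """
    spot_rate = on_demand_rate * (1 - spot_discount)
    # Average work lost per reclaim: half an interval, plus restart time.
    lost_per_reclaim = checkpoint_interval_h / 2 + restart_overhead_h
    expected_reclaims = reclaims_per_hour * job_hours
    billed_hours = job_hours + expected_reclaims * lost_per_reclaim
    return on_demand_rate * job_hours, spot_rate * billed_hours


# Illustrative inputs: 100-hour job, 60% discount, one reclaim per 20 hours,
# a checkpoint every 12 minutes (0.2 h), 6 minutes (0.1 h) to restart.
on_demand, spot = spot_vs_on_demand(
    job_hours=100, on_demand_rate=10.0, spot_discount=0.6,
    reclaims_per_hour=0.05, checkpoint_interval_h=0.2, restart_overhead_h=0.1,
)
print(f"on-demand: {on_demand:.0f}, spot: {spot:.0f}")
```

Under these inputs the reclaims add about one billed hour to a 100-hour job, so nearly all of the discount survives. Lengthening the checkpoint interval or raising the reclaim rate in the call shows how quickly that margin erodes.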

Schedulers in depth

Scheduler	Origin	Distinctive
Kubernetes scheduler	Originally Borg-derived; CNCF	Default container scheduler; pluggable via scheduling framework; doesn't natively support gang scheduling
Slurm	SchedMD; HPC origins	Dominant HPC scheduler; mature gang scheduling; widely used in scientific computing and increasingly AI training
NVIDIA Run:ai	Israeli startup; acquired by NVIDIA 2024	GPU-focused fractional and shared scheduling; preemptive multi-tenancy; AI training optimized
Volcano	Open-source CNCF project	Kubernetes-native batch scheduler; gang scheduling; AI/ML workload focus
Kueue	Kubernetes SIG project	Job queueing for Kubernetes; gang scheduling; quota management
YARN	Hadoop ecosystem	Big data scheduling; capacity scheduler with hierarchical queues
PBS Pro	Altair	Mature HPC scheduler; common in research and government HPC
IBM LSF	IBM Spectrum Computing	Enterprise HPC scheduler; common in financial services and life sciences
Apache Airflow	Airbnb-originated; Apache Foundation	Workflow scheduler with dependency graphs; data engineering pipelines
Determined AI	Acquired by HPE	AI training platform with built-in scheduling and experiment management
Google Borg / Omega	Google internal	Borg paper (2015) defined modern fleet-scale scheduling; informed Kubernetes design
Meta Twine	Meta internal	Hyperscale fleet scheduler at Meta scale; not commercially available

Where this fits

Workload scheduling is the decision layer; Orchestration Operations is the execution layer that carries out the decisions. Both sit under Compute Ops, though at scale they are usually run by different teams. Scheduler decisions feed into capacity planning, and capacity plans in turn constrain scheduling; capacity planning connects to DCIM as the authority on capacity. Carbon-aware scheduling integrates with Energy:Sustainability practices. SLA delivery depends on scheduling decisions and feeds into SLA/SLO Management. AI training scheduling cross-references Workloads:AI Training and AI Training Superclusters.
