DC Workload Scheduling
Workload scheduling is the decision layer that places jobs on compute resources. Given a workload (a training job, an inference request, a batch job, a container, a VM) and a pool of resources (clusters, nodes, GPUs, CPUs, memory, network), the scheduler decides what runs where and when. The discipline is operationally distinct from Orchestration Operations, which carries out the placement decisions and manages workload lifecycle. Schedulers decide; orchestrators execute. Kubernetes combines both roles - kube-scheduler makes the scheduling decisions and the rest of Kubernetes handles orchestration - but the conceptual split exists in every workload management system.
Scheduling decisions
| Decision | What it determines | Why it matters |
|---|---|---|
| Resource selection | Which cluster, node, or accelerator runs the workload | Hardware match, locality, capacity availability |
| Timing | When the workload runs vs queues | Throughput, deadline adherence, latency expectations |
| Priority and preemption | Which workloads run when capacity is contended | Fair-share allocation, business priority, SLA delivery |
| Bin-packing | How efficiently workloads pack onto available resources | Capacity utilization; stranded capacity reduction |
| Affinity and anti-affinity | Co-location preferences (data locality) and separation requirements (HA, fault domain isolation) | Network performance, fault tolerance |
| Gang scheduling | All-or-nothing placement of multi-node jobs (AI training) | Multi-GPU jobs need all resources simultaneously; partial placement wastes capacity |
| Topology awareness | Placement aware of NVLink domains, NIC affinity, NUMA topology | AI training performance depends on tight topology coupling |
| Cost and carbon awareness | Placement aware of spot pricing, region carbon intensity, off-peak scheduling | Cost optimization; sustainability; carbon-aware scheduling |
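The bin-packing row in the table above can be illustrated with a minimal sketch. It assumes a single resource dimension (GPUs per node) and uses a best-fit-decreasing heuristic; the `Node` class, the function name, and the job tuples are hypothetical and not taken from any particular scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node record: a name, free GPU count, and placed job names.
    name: str
    free_gpus: int
    jobs: list = field(default_factory=list)


def best_fit_decreasing(jobs, nodes):
    """Place (job_name, gpus) pairs, largest first, on the tightest-fitting node.

    Mutates `nodes` in place and returns the names of jobs that did not fit.
    """
    unplaced = []
    for job_name, gpus in sorted(jobs, key=lambda j: j[1], reverse=True):
        feasible = [n for n in nodes if n.free_gpus >= gpus]
        if not feasible:
            unplaced.append(job_name)
            continue
        # Best fit: the node left with the least spare capacity after placement.
        target = min(feasible, key=lambda n: n.free_gpus - gpus)
        target.free_gpus -= gpus
        target.jobs.append(job_name)
    return unplaced


nodes = [Node("node-a", 8), Node("node-b", 8)]
leftover = best_fit_decreasing([("j1", 4), ("j2", 6), ("j3", 2), ("j4", 4)], nodes)
```

Placing the largest jobs first and choosing the tightest fit tends to leave fewer small, unusable fragments, which is the stranded capacity the table refers to. Production schedulers weigh this against the other rows (affinity, topology, priority) rather than optimizing packing alone.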
Scheduler categories
| Category | Workload type | Typical schedulers |
|---|---|---|
| Container scheduler | Containerized services, microservices, stateless workloads | Kubernetes scheduler, OpenShift scheduler, Mesos, Nomad |
| HPC batch scheduler | Tightly-coupled scientific computing, simulation | Slurm, PBS Pro, LSF, Grid Engine, Torque |
| AI training scheduler | Multi-node GPU training jobs | NVIDIA Run:ai, Volcano, Kueue, Slurm with AI extensions, Determined AI |
| Workflow scheduler | Batch pipelines with dependencies; ETL; data engineering | Apache Airflow, Dagster, Prefect, Argo Workflows |
| Big data scheduler | Hadoop and Spark clusters | YARN, Spark scheduler, Mesos |
| VM scheduler | Virtualized infrastructure | VMware DRS, OpenStack Nova scheduler, Hyper-V failover clusters |
| Hyperscaler internal | Custom workload management at fleet scale | Google Borg/Omega/Kubernetes, Meta Twine, Microsoft and AWS internal platforms |
| Cloud scheduler | Multi-tenant cloud workload placement | AWS, Azure, GCP internal placement engines (proprietary) |
AI training scheduling
AI training workloads have created scheduling challenges that traditional Kubernetes and HPC schedulers were not designed for. A frontier training job consuming 10,000+ GPUs needs gang scheduling (all GPUs allocated simultaneously or none), tight topology awareness (placement on adjacent NVLink domains and rail-aligned NICs), sophisticated preemption (low-priority jobs that can be evicted when training jobs need capacity), and checkpoint-aware rescheduling (resuming on different nodes after a node failure, losing no more than the work done since the last checkpoint). These requirements drove the emergence of AI-specific schedulers: NVIDIA Run:ai (originally an Israeli startup, acquired by NVIDIA in 2024), Volcano (Kubernetes-native batch scheduler with gang scheduling), Kueue (Kubernetes job queueing), and Determined AI (acquired by Hewlett Packard Enterprise) all address this niche. The Kubernetes default scheduler does not handle gang scheduling natively; production AI training clusters typically run one of the AI-specific schedulers as an extension or replacement.
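The all-or-nothing property of gang scheduling can be shown with a small sketch. It assumes each worker needs a fixed number of GPUs on a single node and that free capacity is known per node; the function name and node names are hypothetical, and real schedulers such as Volcano or Slurm implement this with far more state.

```python
def gang_schedule(free_gpus_by_node, workers, gpus_per_worker):
    """Return {node: worker_count} if every worker fits, otherwise None.

    The input mapping is never modified, so a failed attempt reserves nothing.
    """
    if workers <= 0 or gpus_per_worker <= 0:
        raise ValueError("workers and gpus_per_worker must be positive")
    plan = {}
    remaining = workers
    # Fill the emptiest nodes first to keep the gang on as few nodes as possible.
    for node, free in sorted(free_gpus_by_node.items(),
                             key=lambda kv: kv[1], reverse=True):
        take = min(free // gpus_per_worker, remaining)
        if take:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            return plan
    return None  # partial placement is rejected rather than committed


free = {"n1": 8, "n2": 8, "n3": 4}
plan = gang_schedule(free, workers=2, gpus_per_worker=8)
```

The key design point is that the plan is computed before anything is committed: a job that can only be partly placed returns `None` and holds no GPUs while it waits.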
The shift from H100 to GB200 NVL72 has further complicated the picture. NVL72 racks are 72-GPU NVLink domains where memory is coherent across the full 72-GPU fabric; scheduling decisions need to keep tightly-coupled jobs within a single NVL72 domain when possible and across coupled domains when necessary. Rubin reference designs extend this further. Schedulers that were adequate for H100-scale topology are insufficient for NVL72 and Rubin deployments without explicit topology-awareness extensions.
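A minimal sketch of the domain preference described above, assuming the scheduler tracks free GPUs per NVLink domain (for example, per NVL72 rack). The function name, the domain labels, and the policy of choosing the tightest-fitting single domain are illustrative assumptions, not the behavior of a specific product.

```python
def place_by_domain(free_by_domain, gpus_needed):
    """Return {domain: gpus}, preferring one domain, else as few as possible.

    Returns None when the total free capacity is insufficient.
    """
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")
    fits = {d: f for d, f in free_by_domain.items() if f >= gpus_needed}
    if fits:
        # Tightest single-domain fit keeps larger domains free for larger jobs.
        best = min(fits, key=fits.get)
        return {best: gpus_needed}
    if sum(free_by_domain.values()) < gpus_needed:
        return None
    # No single domain fits: take from the largest domains first, which
    # minimizes the number of domains the job has to span.
    plan, remaining = {}, gpus_needed
    for domain, free in sorted(free_by_domain.items(),
                               key=lambda kv: kv[1], reverse=True):
        if free == 0:
            continue
        take = min(free, remaining)
        plan[domain] = take
        remaining -= take
        if remaining == 0:
            break
    return plan


racks = {"rack-a": 72, "rack-b": 40, "rack-c": 16}
single = place_by_domain(racks, 32)
spanning = place_by_domain(racks, 100)
```

In this example the 32-GPU job lands entirely in `rack-b`, while the 100-GPU job has to span `rack-a` and `rack-b`. A real topology-aware scheduler would also consider which domains are coupled to each other, which this sketch ignores.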
Multi-tenancy and fair-share
Multi-tenant scheduling - allocating shared resources across multiple users, teams, or customers - is one of the harder scheduling problems. The classical algorithms include fair-share (Slurm, YARN), dominant resource fairness (DRF, used in Mesos), and various forms of weighted fair queueing. Modern AI infrastructure adds the complication that GPUs are usually the dominant resource, but jobs may also need specific CPU, memory, network bandwidth, and storage tiers. Hyperscaler internal schedulers handle this through complex multi-objective optimization that the open-source ecosystem is still catching up to. What customers see of cloud GPU allocation - A100 capacity becoming harder to obtain at the end of quarters, or Blackwell allocation going to priority customers first - is the externally visible result of these internal scheduling decisions.
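Dominant resource fairness can be sketched as progressive filling: repeatedly give one more task to the tenant whose dominant share (its largest fractional use of any single resource) is currently lowest. The version below is a simplified illustration with hypothetical names and numbers; in particular, dropping a tenant once its next task no longer fits is a simplification chosen here, not a description of how Mesos implements DRF.

```python
def drf_allocate(capacity, demands):
    """Allocate whole tasks by dominant resource fairness.

    capacity: {resource: total}; demands: {tenant: {resource: per-task need}}.
    Returns {tenant: number_of_tasks}.
    """
    used = {r: 0 for r in capacity}
    tasks = {t: 0 for t in demands}
    active = set(demands)

    def dominant_share(tenant):
        # Largest fraction of any one resource this tenant currently holds.
        return max(tasks[tenant] * demands[tenant][r] / capacity[r]
                   for r in capacity)

    while active:
        # Sorted for deterministic tie-breaking between equal shares.
        tenant = min(sorted(active), key=dominant_share)
        need = demands[tenant]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[tenant] += 1
        else:
            active.discard(tenant)  # simplification: stop offering to this tenant
    return tasks


capacity = {"cpu": 9, "mem_gb": 18}
demands = {"A": {"cpu": 1, "mem_gb": 4}, "B": {"cpu": 3, "mem_gb": 1}}
result = drf_allocate(capacity, demands)
```

With these illustrative numbers, tenant A is memory-dominant and tenant B is CPU-dominant; the loop ends with A holding 3 tasks and B holding 2, so each holds two thirds of its own dominant resource. The same structure extends to GPUs by adding a `gpu` key to both mappings.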
Carbon-aware scheduling
Carbon-aware scheduling shifts flexible workloads in time and across regions to align consumption with cleaner grid hours. Implementation requires hourly grid carbon intensity data (Electricity Maps, WattTime, ENTSO-E), workload categorization (which jobs are flexible enough to delay or relocate), and scheduler integration that treats carbon intensity as a placement input. Google has published research on production deployment of carbon-aware scheduling; Microsoft, Meta, and several AI operators have similar capabilities. The technique works for batch and asynchronous workloads (training runs, batch inference, data processing) where latency is not critical; it does not suit real-time inference, which cannot be delayed. Carbon-aware scheduling is becoming a standard sustainability practice rather than a research curiosity, though the actual carbon savings depend on how much the grid mix varies in the regions involved.
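The time-shifting half of this idea can be sketched as a window search: given an hourly carbon intensity forecast, pick the start hour whose run window has the lowest average intensity while still finishing by the deadline. The forecast values below are made up for illustration, and the function name and parameters are hypothetical; a real integration would pull the forecast from a data provider.

```python
def best_start_hour(forecast, duration_h, deadline_h):
    """Return (start_hour, mean_intensity) for the cleanest feasible window.

    forecast: hourly gCO2/kWh values starting at hour 0.
    The job must occupy `duration_h` consecutive hours and end by `deadline_h`.
    """
    latest_start = min(deadline_h, len(forecast)) - duration_h
    if duration_h <= 0 or latest_start < 0:
        raise ValueError("job cannot finish within the forecast and deadline")
    best = None
    for start in range(latest_start + 1):
        window = forecast[start:start + duration_h]
        mean = sum(window) / duration_h
        # Strict '<' keeps the earliest window when two are equally clean.
        if best is None or mean < best[1]:
            best = (start, mean)
    return best


# Illustrative hourly forecast in gCO2/kWh (not real data).
forecast = [400, 380, 300, 200, 180, 220, 350, 420]
start, mean = best_start_hour(forecast, duration_h=3, deadline_h=8)
```

For this made-up forecast, delaying a 3-hour job to start at hour 3 averages 200 gCO2/kWh, versus 360 if it started immediately. Region shifting works the same way with one forecast per candidate region.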
Spot and preemptible scheduling
Cloud providers offer spot pricing or preemptible instances - low-cost capacity that can be reclaimed by the provider on short notice. Scheduling for spot/preemptible workloads adds the challenge of handling reclamation events without losing work. The discipline includes spot-tolerant workload identification (training with frequent checkpointing, batch jobs that can resume), spot-bid optimization (predicting reclaim probability from spot price history), and graceful preemption handling (saving state when the reclaim signal arrives, then requeuing the work). The cost savings are substantial (40-90% off on-demand pricing) but are realized only by workloads that can tolerate the operational complexity. AI training has become a major spot consumer because training jobs that checkpoint every few minutes can absorb reclamation events at limited cost.
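The trade-off between discount and reclamation overhead can be made concrete with a rough first-order cost model. It assumes each reclaim loses, on average, half a checkpoint interval of work plus a fixed restart time, and that reclaims arrive at a steady rate; the function name, parameters, and all numbers are hypothetical.

```python
def spot_effective_cost(on_demand_rate, discount, job_hours,
                        reclaims_per_hour, checkpoint_interval_h, restart_h):
    """Estimate (spot_cost, on_demand_cost) for a checkpointing job.

    First-order model: every reclaim wastes half a checkpoint interval
    (expected work since the last checkpoint) plus the restart time.
    """
    lost_per_reclaim = checkpoint_interval_h / 2 + restart_h
    # Extra billed hours per productive hour caused by reclaims.
    overhead = reclaims_per_hour * lost_per_reclaim
    billed_hours = job_hours * (1 + overhead)
    spot_cost = billed_hours * on_demand_rate * (1 - discount)
    on_demand_cost = job_hours * on_demand_rate
    return spot_cost, on_demand_cost


# Hypothetical job: 100 productive hours at 10.0 per hour on demand,
# 60% spot discount, one reclaim every 20 hours on average,
# checkpoints every 12 minutes, 6 minutes to restart.
spot, on_demand = spot_effective_cost(
    on_demand_rate=10.0, discount=0.6, job_hours=100,
    reclaims_per_hour=0.05, checkpoint_interval_h=0.2, restart_h=0.1)
```

With frequent checkpoints the rework overhead in this example is about 1% of run time, so nearly the full discount is retained. Raising `checkpoint_interval_h` to several hours in the same model shows how quickly the savings erode for workloads that checkpoint rarely.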
Schedulers in depth
| Scheduler | Origin | Distinctive |
|---|---|---|
| Kubernetes scheduler | Originally Borg-derived; CNCF | Default container scheduler; pluggable via scheduling framework; doesn't natively support gang scheduling |
| Slurm | SchedMD; HPC origins | Dominant HPC scheduler; mature gang scheduling; widely used in scientific computing and increasingly AI training |
| NVIDIA Run:ai | Israeli startup; acquired by NVIDIA 2024 | GPU-focused fractional and shared scheduling; preemptive multi-tenancy; AI training optimized |
| Volcano | Open-source CNCF project | Kubernetes-native batch scheduler; gang scheduling; AI/ML workload focus |
| Kueue | Kubernetes SIG project | Job queueing for Kubernetes; gang scheduling; quota management |
| YARN | Hadoop ecosystem | Big data scheduling; capacity scheduler with hierarchical queues |
| PBS Pro | Altair | Mature HPC scheduler; common in research and government HPC |
| IBM LSF | IBM Spectrum Computing | Enterprise HPC scheduler; common in financial services and life sciences |
| Apache Airflow | Airbnb-originated; Apache Foundation | Workflow scheduler with dependency graphs; data engineering pipelines |
| Determined AI | Acquired by HPE | AI training platform with built-in scheduling and experiment management |
| Google Borg / Omega | Google internal | Borg paper (2015) defined modern fleet-scale scheduling; informed Kubernetes design |
| Meta Twine | Meta internal | Hyperscale fleet scheduler at Meta scale; not commercially available |
Where this fits
Workload scheduling is the decision layer; Orchestration Operations is the execution layer that carries out the decisions. Both sit under Compute Ops, though at scale they are usually run by separate teams. Scheduler decisions feed into capacity planning and are in turn constrained by it; capacity planning connects to DCIM as the capacity authority. Carbon-aware scheduling integrates with Energy: Sustainability practices. SLA delivery depends on scheduling decisions and feeds into SLA/SLO Management. AI training scheduling cross-references Workloads: AI Training and AI Training Superclusters.
Related coverage
Compute Ops | Orchestration Operations | SLA/SLO Management | Production Reliability Engineering | DCIM | AI Training | AI Training Superclusters | AI Inference | Energy: Sustainability