Direct-to-Chip Cooling


Direct-to-chip (DTC) cooling, sometimes called direct liquid cooling (DLC), places a water-cooled cold plate in direct thermal contact with the integrated heat spreader of each CPU and accelerator package. Water flows through microchannels machined into the plate, picks up heat conducted from the die, and returns to the coolant distribution unit (CDU). The architecture exists because it is the shortest thermal path available at scale: die to TIM to cold plate to coolant, with no air in between.

DTC is the mainstream approach for current-generation AI accelerators. The NVIDIA GB200 NVL72 reference design is direct-to-chip. AMD MI300X and MI325X deployments are direct-to-chip. Intel Gaudi deployments at hyperscale are direct-to-chip. The entire frontier training stack has standardized on DTC as the equilibrium between thermal capacity, serviceability, and integration complexity. The density envelope DTC serves today spans roughly 50 to 250 kilowatts per rack, with next-generation reference designs pushing toward 600 kilowatts.


The thermal path

The physical stack between transistor junction and coolant in a DTC system has five conductive layers, each contributing thermal resistance.

Layer | Material | Function
Silicon die | Silicon | Heat source at the transistor junctions; typical junction temperature 85 to 105 degrees C
TIM1 | Solder, polymer TIM, or liquid metal | Bonds the die to the integrated heat spreader inside the package
Integrated heat spreader (IHS) | Copper, often nickel-plated | Spreads heat from the die across the package lid
TIM2 | Thermal grease, pad, or phase-change material | Couples the IHS to the cold plate baseplate
Cold plate | Copper or copper alloy with machined microchannels | Transfers heat from the baseplate into the flowing TCS water

The total thermal resistance from junction to coolant is the sum of resistances across these layers plus the convective resistance at the microchannel wall. Cold plate vendors compete primarily on two numbers: thermal resistance at a given flow rate, and pressure drop at that flow rate. Lower resistance lets silicon run cooler at the same coolant supply temperature; lower pressure drop lets the CDU pump less hard to maintain flow across a rack.
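A lumped-resistance model makes the arithmetic concrete: junction temperature is coolant supply temperature plus dissipated power times the summed junction-to-coolant resistance. The Python sketch below uses illustrative per-layer resistance values, not measured vendor data.

```python
# Sketch: junction temperature from a lumped junction-to-coolant
# resistance stack. All layer values are illustrative assumptions,
# not vendor specifications.

def junction_temp(p_watts, t_coolant_c, resistances_k_per_w):
    """Junction temp = coolant supply temp + power * total resistance."""
    r_total = sum(resistances_k_per_w.values())
    return t_coolant_c + p_watts * r_total

stack = {                          # K/W, illustrative
    "tim1": 0.005,
    "ihs_spreading": 0.010,
    "tim2": 0.008,
    "cold_plate_conduction": 0.007,
    "microchannel_convection": 0.020,
}

tj = junction_temp(p_watts=1000, t_coolant_c=40, resistances_k_per_w=stack)
print(f"Estimated junction temperature: {tj:.1f} C")  # 40 + 1000 * 0.050 = 90.0 C
```

At a 1000-watt package and 40 degrees C supply, a total resistance of 0.050 K/W lands the junction at 90 degrees C, inside the 85 to 105 degrees C window in the table above; shaving even a few mK/W off any layer buys real supply-temperature headroom.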


Cold plate designs

Cold plate architecture has converged on two dominant patterns with a third emerging for the highest-density accelerators.

Microchannel cold plates machine or etch parallel channels into a copper baseplate, typically 100 to 500 microns wide. Water flows through the channels under the die footprint, transferring heat through the thin metal wall. Simple to manufacture, well-characterized, dominant in current deployments.

Skived-fin and pin-fin cold plates increase surface area inside the flow path through fine vertical features. Pin-fin designs break up the boundary layer and raise the heat transfer coefficient at the cost of higher pressure drop, which makes them preferred for the highest-dissipation packages, where the extra pumping power is justified.

Manifold microchannel (3D) cold plates combine a vertical manifold with a microchannel layer, forcing coolant to enter and exit the channels at multiple points rather than flowing end-to-end. This shortens the hydraulic path, reduces pressure drop at high flow, and improves temperature uniformity across a large die. Emerging approach for packages dissipating above 1000 watts.
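The dependence on channel geometry shows up in a first-order estimate: in fully developed laminar flow the Nusselt number is roughly constant, so the heat transfer coefficient scales inversely with hydraulic diameter, and narrower channels cool better at the price of pressure drop. The sketch below uses illustrative channel counts and dimensions.

```python
# Sketch: convective resistance vs channel width for laminar flow.
# Channel count, height, and flow length are illustrative assumptions.

K_WATER = 0.6       # W/(m*K), thermal conductivity of water near room temp
NU_LAMINAR = 4.36   # Nusselt number, fully developed laminar, constant heat flux

def hydraulic_diameter(width_m, height_m):
    """D_h = 4 * area / perimeter for a rectangular channel."""
    return 4 * (width_m * height_m) / (2 * (width_m + height_m))

def convective_resistance(width_m, height_m, n_channels, length_m):
    d_h = hydraulic_diameter(width_m, height_m)
    h = NU_LAMINAR * K_WATER / d_h                        # W/(m^2*K)
    a_wetted = n_channels * 2 * (width_m + height_m) * length_m
    return 1.0 / (h * a_wetted)                           # K/W

# 100 channels, 500 um deep, 40 mm flow length under a large die
for width_um in (500, 250, 100):
    r = convective_resistance(width_um * 1e-6, 500e-6, 100, 0.04)
    print(f"{width_um} um channels: R_conv ~ {r * 1000:.1f} mK/W")
```

Halving channel width roughly halves hydraulic diameter and with it the convective resistance per unit wetted area, which is why microchannel widths sit in the 100 to 500 micron range rather than the millimeter scale of conventional water blocks.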


Thermal interface materials

The two TIM layers in the thermal stack are small contributors to absolute resistance but large contributors to variability. A poorly applied TIM layer with voids or thickness errors can add several degrees to junction temperature at full load, and TIM degradation over thermal cycling is a known reliability concern.

TIM2, the layer between package IHS and cold plate, is the one operators and system integrators select and apply. The dominant options are thermal greases (low resistance, but can pump out over time), thermal pads (easier to apply, higher bulk resistance), phase-change materials (melt on the first heat cycle and fill voids, good performance), and at the extreme end liquid metal (lowest resistance, but requires careful containment because gallium alloys corrode aluminum). Frontier AI reference designs have moved toward phase-change TIMs for the balance of performance and serviceability.
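A slab model shows how sensitive the stack is to bond-line quality. The sketch below treats TIM2 as a uniform layer (R = thickness / (conductivity * area)) and compares a nominal grease line against a degraded one; the thickness, conductivity, and footprint values are illustrative assumptions.

```python
# Sketch: junction temperature sensitivity to the TIM2 bond line.
# R_tim = thickness / (conductivity * area); all values illustrative.

def tim_resistance(thickness_m, k_w_mk, area_m2):
    return thickness_m / (k_w_mk * area_m2)

AREA = 0.04 * 0.06    # 40 x 60 mm lid footprint, assumed
POWER = 1000.0        # watts dissipated through the lid, assumed

nominal = tim_resistance(50e-6, 5.0, AREA)    # 50 um grease line, k = 5 W/(m*K)
degraded = tim_resistance(150e-6, 5.0, AREA)  # pumped-out or voided line, ~3x thicker

print(f"nominal:  {POWER * nominal:.1f} K rise across TIM2")
print(f"degraded: {POWER * degraded:.1f} K rise, "
      f"a {POWER * (degraded - nominal):.1f} K penalty")
```

Tripling the effective bond-line thickness at a kilowatt of dissipation costs roughly 8 K of junction headroom in this model, which is the "several degrees" of variability the preceding paragraph warns about.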

TIM1, the layer inside the package between die and IHS, is selected by the silicon vendor and is not field-serviceable. Its quality is a reliability parameter for the chip, not an operator decision.


The coverage gap

Cold plates on the CPU and accelerator packages capture roughly 75 to 85 percent of the rack's heat load. The remaining 15 to 25 percent is generated by components that either cannot accept a cold plate or have not been engineered to do so: DRAM modules, VRM power stages, retimers and PCIe switches, NICs and optics, NVMe drives, and in some designs the baseboard itself. That residual heat has to go somewhere, and every DTC deployment has to answer where it goes; the table below summarizes the options, and a residual-load estimate follows it.

Approach | Residual heat path | Density ceiling | Typical use
Hybrid air + DTC | Chassis fans exhaust residual heat into the hot aisle; hall air handling removes it | ~80 kW per rack before hall air becomes limiting | Early DTC deployments and enterprise HPC
DTC + rear-door heat exchanger | Chassis exhaust passes through a water-cooled rear door before leaving the rack | ~150 kW per rack | Current mainstream AI training racks
Extended cold plate coverage | Cold plates extended to memory, VRMs, and retimers in addition to CPU and accelerator | ~250 kW per rack | Frontier AI reference designs (GB200 NVL72 class)
Full liquid chassis | Every significant dissipator is liquid-cooled; chassis fans minimal or absent | ~600 kW per rack, approaching immersion territory | Next-generation Rubin-class and post-Rubin reference designs
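To see why each step up the table demands higher cold-plate coverage, put numbers on the residual load. The capture fractions below are illustrative assumptions paired with the table's density ceilings.

```python
# Sketch: residual air load at different cold-plate capture fractions.
# Rack powers and capture fractions are illustrative assumptions.

def residual_air_kw(rack_kw, capture_fraction):
    """Heat the cold plates miss, which must leave as air or via a rear door."""
    return rack_kw * (1.0 - capture_fraction)

for rack_kw, capture in ((80, 0.75), (150, 0.80), (250, 0.90), (600, 0.98)):
    print(f"{rack_kw:>4} kW rack, {capture:.0%} capture -> "
          f"{residual_air_kw(rack_kw, capture):.0f} kW residual air load")
```

The pattern is that the absolute air load stays bounded at a few tens of kilowatts per rack only because coverage rises in step with density; a 600 kW rack at 80 percent capture would dump 120 kW into the room, which no hall air system handles.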

The progression from hybrid to full liquid chassis is the industry's incremental path for absorbing density without switching to immersion. Each step extends cold-plate coverage to more components and reduces the residual air load correspondingly. The engineering question at the frontier is whether this progression can keep up with accelerator TDP growth, or whether immersion eventually wins by treating the whole board as a thermal volume rather than a collection of discrete dissipators.


Supply temperature and free cooling

A property of DTC that distinguishes it from air cooling is its tolerance for warm supply water. Cold plates operate acceptably with TCS supply water in the 30 to 45 degrees C range; ASHRAE's W4 liquid-cooled class permits facility supply up to 45 degrees C, and W5 covers temperatures above 45 degrees C. Air-cooled systems need 18 to 24 degrees C supply air, which in most climates requires mechanical chilling for a substantial portion of the year.

Raising the supply temperature past roughly 32 degrees C allows the facility water loop to be cooled by dry coolers or waterside economizers for most or all of the year, eliminating compressor energy. This is the mechanism by which DTC deployments reach PUE values below 1.1 at hyperscale: not because liquid is inherently more efficient than air, but because liquid supply can run hot enough to use ambient air as the heat sink.
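A dry cooler can deliver facility water at roughly ambient dry-bulb temperature plus an approach; adding the CDU heat exchanger approach gives the coolest TCS supply achievable without compressors. The approach values in the sketch below are assumptions chosen for illustration.

```python
# Sketch: compressor-free TCS supply vs ambient dry-bulb temperature.
# Both approach values are illustrative design assumptions.

DRY_COOLER_APPROACH_C = 8.0   # dry cooler water-to-air approach, assumed
CDU_APPROACH_C = 4.0          # facility-to-TCS heat exchanger approach, assumed

def max_tcs_supply_from_ambient(ambient_dry_bulb_c):
    """Coolest TCS supply achievable without compressors at this ambient."""
    return ambient_dry_bulb_c + DRY_COOLER_APPROACH_C + CDU_APPROACH_C

for ambient in (10, 20, 28, 35):
    tcs = max_tcs_supply_from_ambient(ambient)
    ok = "free cooling" if tcs <= 40.0 else "needs trim chilling"
    print(f"ambient {ambient:>2} C -> TCS supply {tcs:.0f} C ({ok} at 40 C setpoint)")
```

With these assumptions, a 40 degrees C TCS setpoint is satisfiable compressor-free at ambients up to about 28 degrees C, which in most climates covers the large majority of annual hours; an 18 to 24 degrees C air setpoint never gets that margin.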


Serviceability and operations

Servicing a DTC rack requires a different operational model than an air-cooled hall. Every server removal involves disconnecting two quick-disconnect fittings, every cold-plate replacement involves TIM application with clean-room discipline, and leak detection is a first-class monitoring concern rather than an afterthought. Facility staff training, spare parts inventory, and hot-swap procedures all expand to cover the liquid side of the stack.
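Leak detection in practice correlates several signals rather than trusting a single sensor. The sketch below is a hypothetical heuristic over CDU telemetry: the field names and thresholds are invented for illustration, and real CDUs expose vendor-specific points such as loop pressure, reservoir level, and makeup-water counters.

```python
# Sketch: a minimal leak heuristic on hypothetical CDU telemetry.
# Field names and thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class CduSample:
    loop_pressure_kpa: float
    reservoir_level_pct: float

def leak_suspected(prev: CduSample, curr: CduSample,
                   dp_limit_kpa: float = 10.0, dl_limit_pct: float = 1.0) -> bool:
    """Flag a pressure drop paired with a falling reservoir level.

    Either signal alone is noisy (pump transients, thermal expansion);
    together over a sample interval they point at coolant leaving the loop.
    """
    return (prev.loop_pressure_kpa - curr.loop_pressure_kpa > dp_limit_kpa
            and prev.reservoir_level_pct - curr.reservoir_level_pct > dl_limit_pct)

print(leak_suspected(CduSample(310.0, 62.0), CduSample(296.0, 60.4)))  # True
```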

The operational cost is offset by density: a single liquid-cooled rack can host what previously required five air-cooled racks, reducing floor space, cabling, and power distribution per unit of compute. For AI training at scale, the math has already resolved in liquid's favor. For lower-density workloads the operational overhead of DTC is not justified, and those rows remain on air.


Related coverage

Cooling and Thermal Management | Liquid Cooling | Immersion Cooling | HVAC and Air Handling | UPW and Cooling Water Systems | Chips and Silicon | Server Layer | Rack Layer | Cooling Monitoring