
Liquid-cooled AI rack

Definition

A liquid-cooled AI rack is a rack-scale compute unit in which heat from the GPUs/accelerators is removed by a liquid coolant loop (water, dielectric fluid, or a two-phase refrigerant) rather than by forced-air convection. Liquid cooling becomes necessary at rack-level power envelopes above roughly 30–50 kW, which is beyond the sustained air-cooling envelope even with aggressive CRAH airflow and hot-aisle containment.
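The physics behind that threshold can be sketched with the heat-transport relation Q = ṁ·cp·ΔT: air's low specific heat and density force enormous volumetric flows at high rack power, while water carries the same heat in a modest pipe flow. The temperature rises and fluid properties below are illustrative assumptions, not vendor specs.

```python
# Back-of-envelope comparison of air vs. water heat removal,
# using Q = m_dot * c_p * delta_T. All values are assumptions.

AIR_CP = 1005.0         # J/(kg*K), specific heat of air
AIR_DENSITY = 1.2       # kg/m^3 at ~20 C
WATER_CP = 4186.0       # J/(kg*K)
WATER_DENSITY = 1000.0  # kg/m^3

def air_flow_m3s(rack_kw, delta_t_k=15.0):
    """Volumetric airflow (m^3/s) to absorb rack_kw with a delta_t_k rise."""
    m_dot = rack_kw * 1000.0 / (AIR_CP * delta_t_k)
    return m_dot / AIR_DENSITY

def water_flow_lpm(rack_kw, delta_t_k=10.0):
    """Water flow (litres/min) to absorb rack_kw with a delta_t_k rise."""
    m_dot = rack_kw * 1000.0 / (WATER_CP * delta_t_k)
    return m_dot / WATER_DENSITY * 1000.0 * 60.0

for kw in (16, 50, 140):
    print(f"{kw:>4} kW: air {air_flow_m3s(kw):5.1f} m^3/s "
          f"vs water {water_flow_lpm(kw):6.1f} L/min")
```

At 140 kW the air side needs on the order of 7–8 m³/s of airflow through a single rack, which is why the transition stops being an airflow-optimization problem and becomes a plumbing problem.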

Why AI racks force the transition

Per-GPU TDP has risen from ~700 W for H100-class parts to more than 1 kW for Blackwell-class parts. At 36–72 GPUs per rack, plus host CPUs, switches, and NICs, the rack-level power envelope exceeds 100 kW, firmly past the air-cooling limit.
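To make that arithmetic concrete, here is a minimal rack power budget, assuming 72 Blackwell-class GPUs at ~1.2 kW each plus illustrative figures for host CPUs, networking, and overhead (none of these component draws come from the source):

```python
# Rough rack power budget for a 72-GPU Blackwell-class rack.
# All component draws are assumptions for illustration.

gpus = 72
gpu_tdp_w = 1200        # Blackwell-class accelerator, ~1.2 kW (assumed)
cpu_w = 36 * 300        # host CPUs, assumed 36 sockets @ ~300 W
nic_switch_w = 5000     # NICs + in-rack switching (assumed)
overhead = 1.10         # pumps/fans, VRM losses, misc (~10%, assumed)

rack_kw = (gpus * gpu_tdp_w + cpu_w + nic_switch_w) * overhead / 1000
print(f"Estimated rack envelope: {rack_kw:.0f} kW")
```

Even with conservative assumptions the total lands comfortably above 100 kW, in the same regime as Meta's 140 kW Catalina figure.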

Meta's 2024-06 Grand Teton H100 training post explicitly named the cooling-infrastructure constraint:

"Since we did not have time to change the cooling infrastructure, we had to remain in an air-cooled environment. The mechanical design had to change… All of these hardware-related changes were challenging because we had to find a solution that fit within the existing resource constraints."

Four months later, Meta's 2024-10 OCP AI-hardware vision post announced the break: Catalina, a 140 kW liquid-cooled rack built on the ORv3 high-power rack (HPR) chassis.

The power-density shift

Deployment                           Cooling   Rack envelope
Dropbox 7th-gen storage (2025-08)    Air       ~16 kW
Meta Grand Teton H100 (2024-06)      Air       ~40 kW (est.)
Meta Catalina GB200 (2024-10)        Liquid    140 kW

(Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware / sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale / sources/2024-10-15-meta-metas-open-ai-hardware-vision)

Infrastructure implications

Transitioning a data hall to liquid cooling is not incremental — it requires:

  • Coolant distribution units (CDUs) — pump + heat-exchanger units that interface with the facility chilled-water loop.
  • Rack manifolds — distribute coolant to each compute tray; contain leak-detection sensors.
  • Quick-disconnect fittings — allow hot-swap of compute trays without draining the loop.
  • Facility-level retrofits — chilled-water supply, leak containment, redundancy.
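The CDU's job can be sketched numerically: it couples a secondary (rack) loop to the facility chilled-water loop, and each side is sized by Q = ṁ·cp·ΔT. The loop temperature rises below are assumptions for illustration, not figures from the source.

```python
# Sketch of CDU flow sizing for a 140 kW rack. The heat exchanger
# moves rack heat from the secondary loop to the facility loop;
# each loop's flow follows Q = m_dot * c_p * delta_T.

WATER_CP = 4186.0  # J/(kg*K)

def loop_flow_lpm(heat_kw, delta_t_k):
    """Water flow (L/min) needed to carry heat_kw across delta_t_k."""
    m_dot_kg_s = heat_kw * 1000.0 / (WATER_CP * delta_t_k)
    return m_dot_kg_s * 60.0  # 1 kg of water ~ 1 litre

rack_kw = 140.0
secondary_lpm = loop_flow_lpm(rack_kw, delta_t_k=10.0)  # rack loop, assumed 10 K rise
facility_lpm = loop_flow_lpm(rack_kw, delta_t_k=6.0)    # chilled-water side, assumed 6 K

print(f"Secondary loop: {secondary_lpm:.0f} L/min")
print(f"Facility loop:  {facility_lpm:.0f} L/min")
```

Note the facility side needs more flow because chilled-water loops typically run a smaller temperature delta; this is one reason the retrofit reaches all the way back to facility piping and pump capacity.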

This is why Meta could not "just switch" for the H100 generation: the facility constraint had to be retired through capex-scale redesign, not a firmware update.
