
Pattern · Cited by 1 source

Data-center density optimization for GPU clusters

Context

At GPU-cluster scale (~10K+ GPUs), the bottleneck shifts from compute silicon and fabric switches to the data hall itself: power capacity, cooling capacity, physical square footage, and — critically — how tightly the GPU racks can be packed into a single coherent network cluster.

Unlike silicon lead time (months), data-center power and cooling infrastructure upgrades are measured in years. You cannot add megawatts of cooling in response to a new GPU SKU; you must fit the SKU into the existing envelope, or wait years for a purpose-built build-out.

Meta's framing:

"Data center power and cooling infrastructure cannot be changed quickly (or easily) and we had to find an optimal layout that allowed maximum compute capability within a data hall." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

Pattern

When power and cooling are fixed, maximise compute density within the data hall by aggressively rearranging what is inside it:

  1. Evict non-compute services from the data hall. Meta names readers specifically: "This required relocating supporting services such as readers out of the data hall." Readers occupy power and space that GPUs could use.
  2. Pack GPU racks maximally within one network cluster. "Packing as many GPU racks as possible to maximize the power and network capability for highest compute density with the largest network cluster." A single network cluster is a strictly-more-useful unit than two half-size clusters — parallel training workloads that span two clusters pay an inter-cluster-fabric penalty.
  3. Accept mechanical/thermal design constraints as load-bearing. The H100-at-700W platform (Grand Teton modified) was kept on air cooling because cooling infrastructure couldn't be changed in time. The pattern is: change everything inside the fixed-power-and-cooling envelope; change the envelope on a longer timeline.
  4. Treat the validation cycle as load-bearing. Mechanical/thermal redesign at scale requires a validation cycle before fleet deployment. Meta explicitly calls this out.
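Steps 1 and 2 reduce to a constrained-packing calculation: the rack count is capped by the tightest of the fixed resources, and evicting supporting services moves the power and cooling caps. A minimal back-of-envelope sketch, with entirely hypothetical hall and rack figures (these are not Meta's numbers):

```python
# Hypothetical sketch: how many GPU racks fit inside a fixed
# power / cooling / floor-space envelope. All figures illustrative.

def max_gpu_racks(hall_power_kw, hall_cooling_kw, floor_tiles,
                  rack_power_kw, rack_heat_kw, tiles_per_rack,
                  reclaimed_kw=0.0):
    """Rack count is the minimum over each fixed constraint.
    Evicting services (e.g. readers) reclaims both power and the
    cooling load those services were dissipating."""
    by_power   = (hall_power_kw + reclaimed_kw) // rack_power_kw
    by_cooling = (hall_cooling_kw + reclaimed_kw) // rack_heat_kw
    by_space   = floor_tiles // tiles_per_rack
    return int(min(by_power, by_cooling, by_space))

# Before evicting supporting services from the hall:
baseline = max_gpu_racks(2500, 2400, 400, 33.6, 33.6, 4)
# After relocating readers frees (say) 200 kW of power and cooling:
packed = max_gpu_racks(2500, 2400, 400, 33.6, 33.6, 4, reclaimed_kw=200)
print(baseline, packed)  # → 71 77
```

Note the structure of the answer: whichever constraint binds (here, cooling) is the one the eviction has to relax; freeing floor space alone would change nothing.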

When to use it

  • Existing data-center footprint where infrastructure changes carry 2-3+ year lead times.
  • New GPU generation with power/thermal requirements exceeding the previous envelope.
  • Network-cluster coherence matters — your workloads are jobs that span the full cluster (LLM training, large-batch inference), not many independent tenants.
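The cluster-coherence condition can be made concrete with a toy per-step model: a job that spans two clusters pays its gradient all-reduce over the thinner inter-cluster fabric. A simplified ring-all-reduce sketch with hypothetical bandwidth and gradient-size numbers:

```python
# Illustrative only: why one coherent cluster beats two half-size clusters
# for a job spanning all GPUs. Bandwidths and sizes are hypothetical.

def step_time_ms(gpus, intra_bw_gbps, inter_bw_gbps=None,
                 grad_gb=10.0, compute_ms=100.0):
    """Rough per-step time: compute + ring all-reduce of grad_gb gradients.
    A job spanning two clusters is bottlenecked by the thinner
    inter-cluster links; otherwise by the intra-cluster fabric."""
    bottleneck = inter_bw_gbps if inter_bw_gbps else intra_bw_gbps
    # Ring all-reduce moves ~2*(n-1)/n of the data over the slowest link.
    comm_ms = (2 * (gpus - 1) / gpus) * grad_gb * 8 / bottleneck * 1000
    return compute_ms + comm_ms

one_cluster  = step_time_ms(1024, intra_bw_gbps=400)
two_clusters = step_time_ms(1024, intra_bw_gbps=400, inter_bw_gbps=100)
print(one_cluster < two_clusters)  # the spanning job pays the fabric penalty
```

Under these assumptions the spanning job's communication time scales with the inter-cluster bandwidth deficit, which is why packing into one large cluster is "strictly more useful" than two half-size ones for full-cluster jobs.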

When not to use it

  • Greenfield data-center build — design the envelope around the GPU, not the other way round.
  • Small GPU fleets — density optimisation at rack scale is not the bottleneck.
  • Heterogeneous tenancy — if your data hall hosts many unrelated workloads, evicting them all to pack GPUs may not be the right tradeoff.

Tradeoffs this pattern accepts

  • Higher thermal density → higher hardware failure rate (especially early-life failures). Meta accepts this because failure-remediation automation and spare capacity are already part of their reliability story.
  • Less room for future-proofing. Packing to the limit means no headroom for the next GPU generation's power draw — which is why Meta frames air cooling as a transitional constraint.
  • Capital-scheduling rigidity. Evicting readers assumes you have somewhere to put them; this is an org-wide capacity problem.

Adjacent patterns

  • Rack-level power-density redesign (see concepts/rack-level-power-density) — the cousin concept on the rack-internal side.
  • Liquid cooling retrofit — the eventual solution to the 700 W+ per-GPU thermal problem; this pattern is the interim answer when cooling can't change in time.
