Meta 24K GenAI Cluster — InfiniBand¶
One of two 24,000-GPU H100 clusters Meta built for Llama-3-era GenAI training. This one uses InfiniBand as the inter-node fabric; its sibling uses RoCE. Llama 3 was trained on both clusters.
Why InfiniBand¶
Meta had previously built the AI Research SuperCluster (RSC) — up to 16K GPUs on InfiniBand — but that fleet was not tightly integrated into Meta's production environment and was not built for the latest generation of GPUs/networking. The InfiniBand 24K-GPU cluster is the production-integrated, H100-generation evolution. It was optimised for full-bisection bandwidth — the comparative advantage over RoCE for this specific build.
"Meta had built research clusters with InfiniBand as large as 16K GPUs. However, those clusters were not tightly integrated into Meta's production environment, nor were they built for the latest generation of GPUs/networking." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Profile¶
| Attribute | Value |
|---|---|
| GPU count | 24,000 H100 (80 GB, HBM3, 700 W) |
| Platform | Modified Grand Teton |
| Inter-node fabric | InfiniBand |
| Optimised for | Full-bisection bandwidth |
| Cooling | Air |
| Prior Meta InfiniBand scale | 16K GPUs (AI Research SuperCluster, non-production) |
| Hosted | Llama 3 training (among others) |
Three network optimisations (shared with the RoCE cluster)¶
The same three stack-level optimisations were applied to both 24K clusters. On InfiniBand specifically, the fabric's built-in adaptive routing and hardware-offloaded collectives make fat-flow load balancing less critical than on RoCE, but all three were still applied to reach equivalent GenAI performance:
- Parallelism-axis → topology-layer mapping.
- Topology-aware collectives (recursive doubling / halving replacing default rings).
- Fat-flow load balancing + routing.
(See concepts/collective-communication-topology-awareness, concepts/fat-flow-load-balancing.)
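To make the second optimisation concrete, here is a minimal sketch of a recursive-doubling allreduce, the kind of collective named above as replacing default rings. This is an illustrative simulation, not Meta's or any library's implementation: at step k, each rank exchanges its partial sum with the rank whose ID differs in bit k, so a power-of-two group converges in log2(p) steps where a default ring needs p-1.

```python
def recursive_doubling_allreduce(values):
    """Simulate allreduce over p ranks; values[i] is rank i's local
    contribution. Returns the per-rank results (all equal at the end).

    Hypothetical helper for illustration only.
    """
    p = len(values)
    assert p & (p - 1) == 0, "recursive doubling assumes a power-of-two group"
    bufs = list(values)
    step = 1
    while step < p:
        # Pairwise exchange: rank i and rank i ^ step swap partial sums
        # and combine, doubling the span each rank has reduced over.
        bufs = [bufs[i] + bufs[i ^ step] for i in range(p)]
        step <<= 1
    return bufs

# 8 simulated ranks converge in log2(8) = 3 exchange steps.
print(recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))
# every rank ends with 36
```

The topology-aware variant additionally chooses exchange partners so that early (large-span) steps stay within a low-latency layer of the fabric; the sketch above shows only the communication pattern itself.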
Seen in (wiki)¶
- Meta — How Meta trains large language models at scale. Post names the InfiniBand cluster at 24K GPUs, the 16K-GPU research-cluster predecessor, and its co-use with the RoCE sibling for Llama 3. (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Why the "build both" decision matters¶
Meta's decision to build the InfiniBand cluster alongside a RoCE cluster at the same scale expresses patterns/build-both-fabric-alternatives: the tradeoff between the two fabrics could not be settled ahead of at-scale operational experience, so Meta built both and compared.
Related¶
- systems/meta-genai-cluster-roce — sibling cluster (same scale, RoCE fabric, build-time-optimised).
- systems/infiniband — the fabric technology.
- systems/nvidia-h100 / systems/grand-teton — substrate and platform.
- systems/llama-3 — the canonical workload.
- patterns/build-both-fabric-alternatives — the architectural pattern.
- companies/meta