
Meta 24K GenAI Cluster — RoCE

One of two 24,000-GPU H100 clusters Meta built for Llama-3-era GenAI training. This one uses RoCE (RDMA over Converged Ethernet) as the inter-node fabric; its sibling uses InfiniBand. Meta trained the largest Llama 3 model on this RoCE cluster.

Why RoCE

Meta had four years of production RoCE experience, but only at scales up to 4K GPUs; it needed significantly larger clusters. The RoCE cluster was optimised for fast build time: its comparative advantage over InfiniBand was deployment speed, leveraging Meta's existing Ethernet operational tooling.

"Meta had built RoCE clusters for the past four years, but the largest of those clusters only supported 4K GPUs. We needed significantly larger RoCE clusters." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

Profile

| Attribute | Value |
| --- | --- |
| GPU count | 24,000 × H100 (80 GB HBM3, 700 W) |
| Platform | Modified Grand Teton |
| Inter-node fabric | RoCE |
| Optimised for | Fast build time |
| Cooling | Air |
| Hosted | Largest Llama 3 training run |

Three network optimisations (shared with the InfiniBand cluster)

Meta describes three stack-level optimisations applied to make network communication for GenAI workloads performant on both 24K clusters; these are especially load-bearing on RoCE, where ECMP-path hashing is the default:

  1. Parallelism-axis → topology-layer mapping. Communication patterns from different model / data / pipeline parallelisms are assigned to different layers of the network topology, so topology bandwidth is effectively exploited. (See concepts/3d-parallelism.)
  2. Topology-aware collectives. Default ring-based collectives were replaced with custom algorithms (e.g. recursive doubling / halving) that are less latency-sensitive. (See concepts/collective-communication-topology-awareness.)
  3. Fat-flow load balancing. GenAI training, like ranking jobs, produces fat flows that do not distribute across network paths via default ECMP. Meta invested further in network load balancing and routing to spread these across available paths. (See concepts/fat-flow-load-balancing; also Meta's Networking @Scale 2023 talk.)
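Optimisation 2 can be sketched in miniature. The snippet below simulates a recursive-doubling all-reduce across ranks in-process (real clusters run this over NCCL/RDMA); it is an illustrative sketch of the algorithm family Meta names, not Meta's implementation:

```python
def recursive_doubling_allreduce(values):
    """All-reduce (sum) across n = 2^k simulated ranks.

    At step s, rank r exchanges partial sums with rank r XOR 2^s, so every
    rank holds the full sum after log2(n) steps -- versus the n-1 steps of a
    default ring all-reduce, which is far more sensitive to per-hop latency.
    """
    n = len(values)
    assert n & (n - 1) == 0, "rank count must be a power of two"
    partial = list(values)
    step = 1
    while step < n:
        # Each rank pairs with its partner and adds the partner's partial sum.
        partial = [partial[r] + partial[r ^ step] for r in range(n)]
        step <<= 1
    return partial

print(recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))
# every rank ends with the global sum, 36, after log2(8) = 3 steps
```

The latency win is the point: 3 exchange steps instead of 7 for 8 ranks, and the gap widens with scale.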
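Optimisation 3 addresses a property of default ECMP worth making concrete: a path is chosen once per flow by hashing the 5-tuple, so a single fat flow occupies one path no matter how many bytes it carries. A minimal sketch (the hash function, addresses, and ports here are made up for illustration, not Meta's routing logic):

```python
import hashlib

def ecmp_path(src, dst, src_port, dst_port, proto="udp", n_paths=8):
    """Pick one of n_paths equal-cost paths by hashing the flow 5-tuple."""
    key = f"{src}|{dst}|{src_port}|{dst_port}|{proto}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_paths

# Many small flows (ranking-style traffic) spread across the paths:
small = {ecmp_path("10.0.0.1", "10.0.1.1", 40000 + i, 4791) for i in range(64)}
print(len(small))  # several of the 8 paths are in use

# A single GPU-pair RDMA stream is ONE flow: it hashes to one path and
# pins that path for its whole lifetime, however fat it is.
print(ecmp_path("10.0.0.1", "10.0.1.1", 40000, 4791))
```

GenAI training produces a few such elephant flows rather than many mice, so default hashing leaves most paths idle while one saturates; hence the extra investment in load balancing and routing.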


Why the "build both" decision matters

Meta intentionally built both a RoCE 24K-GPU cluster and an InfiniBand 24K-GPU cluster, tuned each, and deployed Llama 3 training onto both. Both reached equivalent performance on the workload after tuning. This is an architectural-choice pattern in its own right — see patterns/build-both-fabric-alternatives. The decision defers a tradeoff evaluation that cannot be made by forecasting: build both at scale, learn operationally, carry forward the learnings.
