Meta 24K GenAI Cluster — RoCE¶
One of two 24,000-GPU H100 clusters Meta built for Llama-3-era GenAI training. This one uses RoCE (RDMA over Converged Ethernet) as the inter-node fabric; its sibling uses InfiniBand. Meta trained the largest Llama 3 model on this RoCE cluster.
Why RoCE¶
Meta had four years of production RoCE experience, but its largest RoCE clusters topped out at 4K GPUs, and it now needed significantly larger ones. The RoCE cluster was optimised for fast build time: its comparative advantage over InfiniBand was deployment speed, leveraging Meta's existing Ethernet operational tooling.
"Meta had built RoCE clusters for the past four years, but the largest of those clusters only supported 4K GPUs. We needed significantly larger RoCE clusters." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
Profile¶
| Attribute | Value |
|---|---|
| GPU count | 24,000 H100 (80 GB, HBM3, 700 W) |
| Platform | Modified Grand Teton |
| Inter-node fabric | RoCE |
| Optimised for | Fast build time |
| Cooling | Air |
| Hosted | Largest Llama 3 training run |
Three network optimisations (shared with the InfiniBand cluster)¶
Meta describes three stack-level optimisations, applied to both 24K clusters, that make network communication performant for GenAI workloads. They are especially load-bearing on RoCE, where ECMP path hashing is the default routing mechanism:
- Parallelism-axis → topology-layer mapping. Communication patterns from the different parallelism axes (model / data / pipeline) are mapped onto different layers of the network topology, so each layer's bandwidth is exploited effectively. (See concepts/3d-parallelism.)
- Topology-aware collectives. Default ring-based collectives were replaced with custom algorithms (e.g. recursive doubling / halving) that are less latency-sensitive. (See concepts/collective-communication-topology-awareness.)
- Fat-flow load balancing. GenAI training, like ranking jobs, produces fat flows that do not distribute across network paths via default ECMP. Meta invested further in network load balancing and routing to spread these across available paths. (See concepts/fat-flow-load-balancing; also Meta's Networking @Scale 2023 talk.)
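To make the second optimisation concrete, here is a minimal in-memory sketch of a recursive-doubling allreduce, one of the latency-friendly algorithms alluded to above (an illustration only, not Meta's NCCL implementation). Each rank exchanges with a partner at doubling distances, so every rank holds the global sum after log2(p) rounds instead of the 2(p-1) steps of a ring:

```python
def allreduce_recursive_doubling(values):
    """Simulated allreduce: values[i] is rank i's local value.

    After log2(p) pairwise-exchange rounds, every rank holds the
    global sum. Fewer rounds than a ring means less sensitivity to
    per-hop latency on a large fabric.
    """
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two rank count for simplicity"
    vals = list(values)
    dist = 1
    while dist < p:
        nxt = list(vals)
        for rank in range(p):
            peer = rank ^ dist  # exchange partner at this distance
            nxt[rank] = vals[rank] + vals[peer]
        vals = nxt
        dist <<= 1
    return vals

print(allreduce_recursive_doubling([1, 2, 3, 4]))  # every rank holds 10
```

For p = 4 this takes 2 rounds versus a ring's 6 steps; at 24K-GPU scale the gap in round count is what makes such algorithms "less latency-sensitive".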
Seen in (wiki)¶
- Meta — How Meta trains large language models at scale. The post is the canonical Meta statement on the RoCE cluster at 24K-GPU scale and on its use for Llama 3 (including the largest model). (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
- Meta SIGCOMM 2024 RoCE paper. Engineering deep-dive on the fabric underneath this cluster — two-stage Clos AI Zone topology, routing evolution from baseline ECMP to E-ECMP + QP scaling (+40% AllReduce), congestion-control posture (DCQCN off, PFC + NCCL CTS admission), and topology-aware scheduling. Flagship workload is Llama 3.1 405B. (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
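The routing evolution in the SIGCOMM paper can be illustrated with a toy model of ECMP hashing (the path count, addresses, ports, and hash function here are all illustrative, not Meta's). A handful of fat flows gives the hash little entropy, so several flows can land on the same path; splitting each logical flow across multiple queue pairs with distinct source ports, the idea behind QP scaling, lets the same hash spread the bytes:

```python
import hashlib

NUM_PATHS = 8  # equal-cost uplinks from a switch (illustrative)

def ecmp_path(src, dst, sport, dport, proto="udp"):
    """Pick an uplink by hashing the flow 5-tuple, as ECMP does."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_PATHS

# A few fat RDMA flows: default ECMP may pile several onto one
# path while other uplinks sit idle (low flow entropy).
flows = [("10.0.0.1", "10.0.1.1", 50000 + i, 4791) for i in range(4)]
default_paths = {ecmp_path(*f) for f in flows}

# "QP scaling": split each logical flow across 4 queue pairs, each
# with a distinct source port, so the hash sees more flows to spread.
scaled_paths = {
    ecmp_path(src, dst, sport + 100 * qp, dport)
    for (src, dst, sport, dport) in flows
    for qp in range(4)
}

print(len(default_paths), len(scaled_paths))
```

Because qp = 0 reproduces each original flow, the scaled set of used paths is always a superset of the default set; the paper's E-ECMP + QP scaling combination is the production version of this idea, credited with roughly +40% AllReduce performance.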
Why the "build both" decision matters¶
Meta intentionally built both a RoCE 24K-GPU cluster and an InfiniBand 24K-GPU cluster, tuned each, and ran Llama 3 training on both; after tuning, the two reached equivalent performance on the workload. This is an architectural-choice pattern in its own right (see patterns/build-both-fabric-alternatives): when a tradeoff cannot be resolved by forecasting, build both at scale, learn from operating them, and carry the learnings forward.
Related¶
- systems/meta-genai-cluster-infiniband — sibling cluster (same scale, InfiniBand fabric, bisection-bandwidth-optimised).
- systems/roce-rdma-over-converged-ethernet — the fabric technology.
- systems/ai-zone — the two-stage Clos topology template this cluster is built from.
- systems/nvidia-h100 — the GPU substrate.
- systems/grand-teton — the server platform.
- systems/llama-3 / systems/llama-3-1 — workloads; largest Llama 3 trained here; Llama 3.1 405B is the SIGCOMM 2024 flagship.
- patterns/build-both-fabric-alternatives — the architectural pattern expressed by this + sibling cluster.
- patterns/dedicated-backend-training-fabric / patterns/collective-library-transport-codesign / patterns/minimum-cut-training-job-placement — supporting architectural patterns.
- concepts/collective-communication-topology-awareness / concepts/fat-flow-load-balancing / concepts/enhanced-ecmp-qp-scaling / concepts/dcqcn / concepts/priority-flow-control / concepts/receiver-driven-traffic-admission / concepts/backend-frontend-network-separation — optimisations making the fabric performant.
- companies/meta