META 2024-08-05

Meta — A RoCE network for distributed AI training at scale (SIGCOMM 2024)

Summary

The SIGCOMM-2024 companion to Meta's 2024-06-12 training overview: an engineering deep-dive on the RoCE (RDMA over Converged Ethernet v2) backend fabric Meta built over four years of production evolution to host distributed AI training — up to the 24K-GPU RoCE GenAI cluster and Llama 3.1 405B training. The post (summarising the full ACM paper) covers three layers in order: topology (dedicated backend, two-stage Clos "AI Zone", aggregator layer), routing (ECMP → path-pinning → E-ECMP with QP scaling), and congestion control (no DCQCN; PFC only, plus receiver-driven admission co-designed with the NCCL collective library). The headline counterintuitive finding: Meta runs its 400G RoCE training fabric without DCQCN, relying on PFC for flow control and NCCL-level admission for congestion control, and reports stable performance over a year of operation.

Key takeaways

  1. Dedicated backend (BE) training fabric, physically separated from the frontend (FE) data-ingest network. A training rack is connected to both: the FE carries data ingestion / checkpointing / logging (storage-warehouse traffic on a standard RSW→FSW→… hierarchy); the BE is a "specialised fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster." The BE uses RoCEv2 (RDMA-in-UDP). (Source text; see concepts/backend-frontend-network-separation and patterns/dedicated-backend-training-fabric.)
  2. Two-stage Clos "AI Zone" topology — evolved from an early non-routable-RoCEv1 star. Leaf RTSW (rack training switch) connects GPUs intra-rack over copper DAC. Spine CTSW (cluster training switch) with deep buffers statically partitioned per port connects racks cluster-wide over single-mode fiber + 400G pluggable transceivers. Non-blocking inside the Zone. (Source text; systems/ai-zone.)
  3. Aggregator Training Switch (ATSW) layer breaks the single-Zone cap, stitching multiple AI Zones together inside a data-center building to host LLM-scale jobs that overflow one Zone. Cross-Zone links are deliberately oversubscribed; the fabric uses ECMP across them — but to minimise the cross-Zone cost Meta extended the training scheduler to compute a "minimum cut" when splitting a job across Zones and to learn the logical topology position of each GPU server to recommend rank assignment. (Source text; patterns/minimum-cut-training-job-placement.)
  4. AI training traffic has three pathological characteristics for Ethernet load balancing. (a) Low entropy — few, predictable, repetitive flows, unlike traditional DC workloads. (b) Burstiness — on/off at millisecond granularity. (c) Elephant flows — each burst can saturate NIC line rate. Together these break default ECMP — a 5-tuple hash spreading many small flows uniformly instead pins Meta's few long-lived flows to individual paths. (Source text; concepts/fat-flow-load-balancing.)
  5. Path-pinning was the first routing response — and its failure mode was fragmentation. Meta deployed per-RTSW-downlink-index path-pinning, which "worked well if each rack was fully assigned to the same job and there was no failure in the network." In reality racks were partially job-allocated (only one of two hosts using the uplink), creating uneven uplink utilisation and 30%+ training-performance degradation from congestion. Network failures further reshuffled flows unevenly onto the remaining CTSWs, compounding the imbalance. Meta's stopgap was 2×-overprovisioning RTSW uplinks (1:2 under-subscription) — explicitly named a short-term fix because of the 2× capital cost. (Source text; concepts/path-pinning.)
  6. Queue Pair (QP) scaling replaced path-pinning. Switches were reconfigured for E-ECMP — adding the RoCE packet's destination QP field (via the switch ASIC's UDF capability) to the ECMP hash tuple — and the collective library was changed to spread each hierarchical-collective message across multiple QPs to raise entropy. Two QP-scaling strategies were evaluated — split a message across multiple QPs (smaller fabric messages, multiple ACKs) vs round-robin each message onto a different QP — and round-robin won for NCCL message sizes seen in production. Result: up to 40% improvement for AllReduce over baseline ECMP. Caveat: still probabilistic hashing, and the QP-scaling factor has to be tuned per workload — a long-term operational tax. (Source text.)
  7. At 400G Meta turned DCQCN off. When moving from 200G to 400G deployments Meta tried to retune DCQCN (ECN-marking-based RDMA congestion control, doubled ECN thresholds) and "performance was degraded." Root cause was firmware-side DCQCN changes that introduced bugs and broke CNP-counting visibility. Meta deployed 400G with PFC only, no transport-level congestion control, and after "over a year of experience" observed stable performance and lack of persistent congestion for training collectives. This is a counterintuitive, industry-notable result — DCQCN is the storage-RDMA gold standard. (Source text.)
  8. Congestion is instead controlled at the collective-library layer via receiver-driven traffic admission. NCCL's GPU-to-GPU communication uses two-stage copy and receiver-initiated transfers: the sender can only RDMA_WRITE after the receiver sends a clear-to-send (CTS) packet (carrying size + memory info). Meta leverages this handshake as admission control: in-flight network traffic is bounded by receivers' CTS issuance. Tuning levers: number of HBM channels (bounded by GPU-thread contention with compute) and channel buffer size (too small → bandwidth loss; too large → congestion spreading under RoCE's coarser flow control). Meta experimentally calibrated both across job sizes + collective types, and prioritized CTS packets at switches to prevent their delay starving the pipeline. (Source text; patterns/collective-library-transport-codesign.)
  9. "We have not encountered a scenario … where production AI training traffic causes the CTSW to send PFCs to RTSWs persistently." Meta's four-year finding: even with DCQCN off and RTSW→CTSW PFCs occasionally fired, the deep-buffer CTSW + NCCL-gated admission prevents congestion from becoming persistent. Explicit caveat: solution "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios" — they encourage further research. (Source text.)
  10. Scheduler is topology-aware to reduce cross-Zone traffic. Because ATSW cross-Zone links are oversubscribed, the training-job scheduler computes a minimum-cut partition of training nodes into Zones, and learns each GPU's logical position to recommend rank assignments — so that collective-communication neighbours land in the same Zone when possible. This is a cluster-level expression of concepts/collective-communication-topology-awareness. (Source text; patterns/minimum-cut-training-job-placement.)
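The entropy problem in takeaway 4 and the E-ECMP fix in takeaway 6 can be sketched numerically. A minimal Python sketch, assuming a generic hash — `sha256` stands in for the switch ASIC's hash function, and the addresses, ports, and QP count are illustrative, not Meta's values (RoCEv2 does use UDP destination port 4791):

```python
import hashlib

def ecmp_path(fields, n_paths):
    """Pick an uplink by hashing header fields (toy stand-in for a switch ASIC hash)."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return digest[0] % n_paths

N_PATHS = 8

# Plain 5-tuple ECMP: a training job has few, long-lived RoCE flows
# (low entropy), so only a handful of distinct tuples exist to hash.
flows = [("10.0.0.1", "10.0.1.1", sport, 4791, "udp") for sport in (49152, 49153)]
plain = {ecmp_path(f, N_PATHS) for f in flows}

# E-ECMP + QP scaling: the collective library round-robins messages across
# many queue pairs, and the switch (via its UDF capability) adds the
# destination QP field to the hash tuple, multiplying the hashable entropy.
QPS = range(16)
eecmp = {ecmp_path(f + (qp,), N_PATHS) for f in flows for qp in QPS}

print(f"uplinks used, 5-tuple ECMP: {len(plain)}/{N_PATHS}")
print(f"uplinks used, E-ECMP + QPs: {len(eecmp)}/{N_PATHS}")
```

With only two flows, plain ECMP can never use more than two uplinks regardless of hash quality; the 32 distinct (flow, QP) tuples give the hash enough entropy to spread load across the fabric — which is the whole point of QP scaling.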
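Takeaway 8's receiver-driven admission can be modelled as a credit scheme. A hypothetical Python sketch — class names, the channel count, and the buffer size are invented for illustration; real NCCL does this over CUDA and IB-verbs internals not shown here:

```python
from collections import deque

class Receiver:
    """Receiver side of the CTS handshake: it owns a fixed pool of channel
    buffers and only grants a clear-to-send when one is free, so in-flight
    traffic is bounded by channels * buffer_size."""
    def __init__(self, channels=8, buffer_size=512 * 1024):
        self.free = deque(range(channels))   # free channel-buffer slots
        self.buffer_size = buffer_size

    def grant_cts(self):
        # A CTS packet carries the slot and its size/memory information.
        return (self.free.popleft(), self.buffer_size) if self.free else None

    def complete(self, slot):
        # GPU has drained the buffer; the slot becomes grantable again.
        self.free.append(slot)

class Sender:
    """The sender may only RDMA_WRITE after receiving a CTS."""
    def __init__(self, rx):
        self.rx = rx

    def try_send(self, nbytes):
        grant = self.rx.grant_cts()
        if grant is None:
            return False                     # no CTS: data waits at the source
        slot, size = grant                   # write at most one buffer's worth
        return True

rx = Receiver(channels=2)
tx = Sender(rx)
sent = [tx.try_send(1 << 20) for _ in range(3)]
rx.complete(0)                               # buffer drained, credit returned
resumed = tx.try_send(1 << 20)
print(sent, resumed)                         # [True, True, False] True
```

This is why the tuning levers are what they are: more channels or bigger buffers raise the in-flight bound (risking congestion spreading), fewer or smaller ones throttle bandwidth — and a delayed CTS stalls the sender entirely, which is why Meta prioritises CTS packets at switches.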
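Takeaways 3 and 10 describe minimum-cut job placement across Zones; since Meta's scheduler details are undisclosed, the idea can be illustrated with a brute-force toy for a small ring collective (the function names and the 8-rank example are assumptions, not Meta's implementation):

```python
from itertools import combinations

def cross_zone_traffic(zone_a_ranks, edges):
    """Count communication edges that cross the Zone boundary."""
    a = set(zone_a_ranks)
    return sum(1 for u, v in edges if (u in a) != (v in a))

def min_cut_split(n_ranks, edges, zone_cap):
    """Brute-force the rank subset for Zone A that minimises cross-Zone
    edges — a toy stand-in for the scheduler's minimum-cut computation."""
    best = None
    for zone_a in combinations(range(n_ranks), zone_cap):
        cut = cross_zone_traffic(zone_a, edges)
        if best is None or cut < best[0]:
            best = (cut, set(zone_a))
    return best

# A ring AllReduce over 8 ranks: each rank talks only to its neighbour.
ring = [(i, (i + 1) % 8) for i in range(8)]
cut, zone_a = min_cut_split(8, ring, zone_cap=4)
print(cut, sorted(zone_a))   # 2 [0, 1, 2, 3]
```

A contiguous half of the ring is optimal: only the two ring links at the partition boundary cross the oversubscribed ATSW layer, while a naive interleaved split would push every link cross-Zone. Rank assignment (placing collective neighbours in the same Zone) is the second half of the same optimisation.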

Systems / hardware extracted

Concepts extracted

New wiki pages:

Existing pages reinforced:

Patterns extracted

New wiki pages:

Existing pages reinforced:

  • patterns/build-both-fabric-alternatives — this paper is the operational-evidence side of that 2024-06-12 architectural bet. RoCE reached parity via explicit routing + CC co-design; InfiniBand via native adaptive routing.

Operational / architectural numbers

| Datum | Value | Notes |
| --- | --- | --- |
| Years of production RoCE | 4+ | at time of writing |
| Initial topology | Star with central Ethernet switch, non-routable RoCEv1 | deprecated |
| Current topology (AI Zone) | Two-stage Clos: RTSW leaf + CTSW spine | non-blocking inside Zone |
| Spine (CTSW) | Modular, deep-buffered, ports statically partitioned | |
| Intra-rack cabling | Copper DAC | RTSW ↔ GPU |
| Spine uplinks | Single-mode fiber + 400G pluggable transceivers | |
| Multi-Zone layer | ATSW (Aggregator Training Switch) | oversubscribed by design, ECMP across |
| RTSW uplink mitigation | 2× bandwidth (1:2 under-subscription) | short-term fix for path-pinning fragmentation |
| Path-pinning failure penalty | >30% training-performance degradation | from uneven uplink utilisation |
| E-ECMP + QP scaling gain | Up to +40% on AllReduce | over baseline ECMP |
| QP scaling strategy choice | Round-robin one message per QP | chosen over split-across-QPs for NCCL message sizes |
| DCQCN status at 400G | Off | for >1 year |
| Congestion-control substrate | PFC only + NCCL receiver-driven admission | |
| Persistent PFC events observed | None | in ~4 years |
| CTS packet priority | High-priority queuing at switches | prevents notification bottleneck |
| Flagship workload | Llama 3.1 405B training | first-paragraph call-out |

Caveats

  • DCQCN-off is Meta-specific. Meta explicitly says the solution "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios." Storage-RDMA workloads, smaller CTSW buffers, or different GPU:network ratios may still need DCQCN.
  • QP-scaling factor is workload-dependent. Meta names this as "long-term operational complexity" — each new collective / job shape may need retuning.
  • Minimum-cut scheduling + rank-assignment details are referenced but not disclosed; the paper promises more in the full ACM version.
  • Absolute goodput, MFU, tail-latency, and PFC-pause-duration numbers are not disclosed in the blog-post summary; for those see the full ACM paper linked at the bottom.
  • The fabric targets training specifically. Inference-serving RDMA needs are likely different (e.g. KV-cache transfer traffic shapes; see concepts/rdma-kv-transfer for Cloudflare's side of that problem).
