Meta — A RoCE network for distributed AI training at scale (SIGCOMM 2024)¶
Summary¶
The SIGCOMM-2024 companion to Meta's 2024-06-12 training overview: an engineering deep-dive on the RoCE (RDMA over Converged Ethernet v2) backend fabric Meta built over four years of production evolution to host distributed AI training — up to the 24K-GPU RoCE GenAI cluster and Llama 3.1 405B training. The post (summarising the full ACM paper) covers three layers in order: topology (dedicated backend, two-stage Clos "AI Zone", aggregator layer), routing (ECMP → path-pinning → E-ECMP with QP scaling), and congestion control (DCQCN → PFC-only + receiver-driven admission co-designed with the NCCL collective library). The headline counterintuitive finding: Meta runs its 400G RoCE training fabric without DCQCN, relying on PFC for flow control and NCCL-level admission for congestion control, and reports stable performance over a year of operation.
Key takeaways¶
- Dedicated backend (BE) training fabric, physically separated from the frontend (FE) data-ingest network. A training rack is connected to both: the FE carries data ingestion / checkpointing / logging (storage-warehouse traffic on a standard RSW→FSW→… hierarchy); the BE is a "specialised fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster." The BE uses RoCEv2 (RDMA-in-UDP). (Source text; see concepts/backend-frontend-network-separation and patterns/dedicated-backend-training-fabric.)
- Two-stage Clos "AI Zone" topology — evolved from an early non-routable-RoCEv1 star. Leaf RTSW (rack training switch) connects GPUs intra-rack over copper DAC. Spine CTSW (cluster training switch) with deep buffers statically partitioned per port connects racks cluster-wide over single-mode fiber + 400G pluggable transceivers. Non-blocking inside the Zone. (Source text; systems/ai-zone.)
- Aggregator Training Switch (ATSW) layer breaks the single-Zone cap, stitching multiple AI Zones together inside a data-center building to host LLM-scale jobs that overflow one Zone. Cross-Zone links are deliberately oversubscribed; the fabric uses ECMP across them — but to minimise the cross-Zone cost Meta extended the training scheduler to compute a "minimum cut" when splitting a job across Zones and to learn the logical topology position of each GPU server to recommend rank assignment. (Source text; patterns/minimum-cut-training-job-placement.)
- AI training traffic has three pathological characteristics for Ethernet load balancing. (a) Low entropy — few, predictable, repetitive flows, unlike traditional DC workloads. (b) Burstiness — on/off at millisecond granularity. (c) Elephant flows — each burst can saturate NIC line rate. Together these break default ECMP — a 5-tuple hash spreading many small flows uniformly instead pins Meta's few long-lived flows to individual paths. (Source text; concepts/fat-flow-load-balancing.)
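Why low entropy breaks default ECMP can be shown in a few lines — a minimal sketch (hypothetical addresses and ports; MD5 stands in for a switch ASIC's hash function) of a handful of elephant flows being hashed onto paths:

```python
import hashlib

def ecmp_path(five_tuple, num_paths):
    """Pick a path by hashing the flow's 5-tuple, as a stock ECMP switch would."""
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# A training job's few, long-lived elephant flows (hypothetical 5-tuples;
# RoCEv2 uses UDP destination port 4791, so only the source port varies):
flows = [("10.0.0.1", "10.0.1.1", 50000 + i, 4791, "udp") for i in range(4)]
paths = [ecmp_path(f, num_paths=16) for f in flows]

# With only 4 flows over 16 equal-cost paths, collisions (two elephants
# pinned to one uplink) are likely while most paths sit idle — the
# low-entropy problem, made worse because each flow runs at NIC line rate.
print(paths, "distinct paths used:", len(set(paths)))
```

ECMP's uniform-spreading guarantee is statistical: it needs many flows to average out. A job with a handful of line-rate flows gets whatever the hash happens to deal it.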
- Path-pinning was the first routing response — and its failure mode was fragmentation. Meta deployed per-RTSW-downlink-index path-pinning, which "worked well if each rack was fully assigned to the same job and there was no failure in the network." In reality racks were partially job-allocated (only one of two hosts using the uplink), creating uneven uplink utilisation and 30%+ training-performance degradation from congestion. Network failures further reshuffled flows unevenly onto the remaining CTSWs, compounding the imbalance. Meta's stopgap was to 2×-overprovision RTSW uplinks (1:2 under-subscription) — explicitly named a short-term fix because of the 2× capital cost. (Source text; concepts/path-pinning.)
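The fragmentation failure mode can be sketched in miniature — an assumed toy model (4 downlinks, 4 uplinks, a static index mapping; numbers are illustrative, not Meta's) of how partial rack allocation leaves pinned uplinks idle while others carry everything:

```python
NUM_UPLINKS = 4  # illustrative RTSW uplink count

def uplink_for_downlink(downlink_idx):
    """Static path-pinning: downlink i always forwards via uplink i % N."""
    return downlink_idx % NUM_UPLINKS

def uplink_load(active_downlinks, flow_gbps=400):
    """Per-uplink load when only some downlinks have job traffic."""
    load = [0] * NUM_UPLINKS
    for d in active_downlinks:
        load[uplink_for_downlink(d)] += flow_gbps
    return load

print(uplink_load([0, 1, 2, 3]))  # rack fully in one job: even load
print(uplink_load([0, 2]))        # rack half-allocated: uplinks 1 and 3
                                  # sit idle, halving usable uplink capacity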
- Queue Pair (QP) scaling replaced path-pinning. Switches were reconfigured to do E-ECMP — adding the RoCE packet's destination QP field (via switch-ASIC UDF capability) to the ECMP hash tuple — and the collective library was changed to spread each hierarchical-collective message across multiple QPs to raise entropy. Two QP-scaling strategies were evaluated — split a message across multiple QPs (smaller fabric messages, multiple ACKs) vs round-robin each message onto a different QP — and round-robin won for NCCL message sizes seen in production. Result: up to 40% improvement for AllReduce over baseline ECMP. Caveat: still probabilistic hashing, and the QP-scaling factor has to be tuned per workload — a long-term operational tax. (Source text.)
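A minimal sketch of the combination described above (same hypothetical 5-tuple and hash stand-in as before; 8 QPs per peer and 16 paths are assumed, not Meta's numbers): the switch hash now also covers the destination QP, and the collective library round-robins whole messages across QPs so one GPU pair exercises many paths:

```python
import hashlib
from itertools import cycle

def eecmp_path(five_tuple, dst_qp, num_paths):
    """E-ECMP: fold the RoCE packet's destination QP into the hash input
    (on real switches, done via the ASIC's UDF match on the BTH header)."""
    key = repr((five_tuple, dst_qp)).encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_paths

flow = ("10.0.0.1", "10.0.1.1", 50000, 4791, "udp")  # one GPU pair, one 5-tuple
qps = cycle(range(8))  # round-robin: each message goes whole onto the next QP

# 8 consecutive collective messages from the same GPU pair now hash to
# (up to) 8 different paths instead of all sharing one.
paths = [eecmp_path(flow, next(qps), num_paths=16) for _ in range(8)]
print("paths used by one GPU pair:", sorted(set(paths)))
```

Round-robin beat message-splitting here because splitting shrinks each fabric message and multiplies ACK traffic; round-robin keeps NCCL's message sizes intact while still raising entropy.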
- At 400G Meta turned DCQCN off. When moving from 200G to 400G deployments Meta tried to retune DCQCN (ECN-marking-based RDMA congestion control, doubled ECN thresholds) and "performance was degraded." Root cause was firmware-side DCQCN changes that introduced bugs and broke CNP-counting visibility. Meta deployed 400G with PFC only, no transport-level congestion control, and after "over a year of experience" observed stable performance and lack of persistent congestion for training collectives. This is a counterintuitive, industry-notable result — DCQCN is the storage-RDMA gold standard. (Source text.)
- Congestion is instead controlled at the collective-library layer via receiver-driven traffic admission. NCCL's GPU-to-GPU communication uses two-stage copy and receiver-initiated transfers: the sender can only RDMA_WRITE after the receiver sends a clear-to-send (CTS) packet (carrying size + memory info). Meta leverages this handshake as admission control: in-flight network traffic is bounded by receivers' CTS issuance. Tuning levers: number of HBM channels (bounded by GPU-thread contention with compute) and channel buffer size (too small → bandwidth loss; too large → congestion spreading under RoCE's coarser flow control). Meta experimentally calibrated both across job sizes + collective types, and prioritised CTS packets at switches to prevent their delay starving the pipeline. (Source text; patterns/collective-library-transport-codesign.)
- "We have not encountered a scenario … where production AI training traffic causes the CTSW to send PFCs to RTSWs persistently." Meta's four-year finding: even with DCQCN off and RTSW→CTSW PFCs occasionally fired, the deep-buffer CTSW + NCCL-gated admission prevents congestion from becoming persistent. Explicit caveat: the solution "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios" — they encourage further research. (Source text.)
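The CTS handshake as admission control can be sketched as a credit scheme — a toy model (class names, buffer counts, and the CTS payload fields are illustrative; NCCL's actual implementation is in CUDA/C++), where in-flight traffic is bounded by the receiver's free channel buffers:

```python
from collections import deque

class Receiver:
    """Receiver-driven admission: issue a CTS only while a channel buffer is free."""
    def __init__(self, num_channel_buffers):
        self.free = num_channel_buffers

    def issue_cts(self):
        if self.free == 0:
            return None                 # no credit: sender must wait
        self.free -= 1
        return {"size": 1 << 20, "addr": 0x1000}  # CTS carries size + memory info

    def on_write_complete(self):
        self.free += 1                  # buffer drained into GPU memory

class Sender:
    def __init__(self):
        self.pending = deque(range(6))  # 6 messages queued locally

    def try_send(self, rx):
        sent = []
        while self.pending:
            cts = rx.issue_cts()
            if cts is None:
                break                   # no CTS -> no RDMA_WRITE on the wire
            sent.append(self.pending.popleft())
        return sent

rx, tx = Receiver(num_channel_buffers=2), Sender()
print(tx.try_send(rx))   # [0, 1] — only 2 of 6 messages admitted to the fabric
rx.on_write_complete()   # one buffer drains...
print(tx.try_send(rx))   # [2] — ...admitting exactly one more
```

This is why channel-buffer size is the key tuning lever: it is the credit window, directly bounding per-flow in-flight bytes — and why CTS packets need switch-level priority, since a delayed credit stalls the whole pipeline.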
- Scheduler is topology-aware to reduce cross-Zone traffic. Because ATSW cross-Zone links are oversubscribed, the training-job scheduler computes a minimum-cut partition of training nodes into Zones, and learns each GPU's logical position to recommend rank assignments — so that collective-communication neighbours land in the same Zone when possible. This is a cluster-level expression of concepts/collective-communication-topology-awareness. (Source text; patterns/minimum-cut-training-job-placement.)
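Meta does not disclose its scheduler's algorithm, but the minimum-cut objective itself is easy to state — a brute-force toy (4 ranks, 2 Zones, made-up traffic weights) that picks the equal-size partition minimising cross-Zone collective traffic:

```python
from itertools import combinations

# Toy communication graph: edge weight = collective traffic between ranks.
# Weights are hypothetical; neighbouring ranks in a ring exchange the most.
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (0, 3): 1, (1, 3): 1}
ranks = {0, 1, 2, 3}

def cut_weight(zone_a):
    """Traffic crossing the oversubscribed ATSW links for a given split."""
    return sum(w for (u, v), w in traffic.items()
               if (u in zone_a) != (v in zone_a))

# Brute-force the minimum cut into two equal-size Zones (2 GPUs each).
# Real schedulers need a scalable min-cut/partitioning heuristic instead.
best = min((frozenset(c) for c in combinations(ranks, 2)), key=cut_weight)
print("Zone A:", set(best), "cross-Zone traffic:", cut_weight(best))
```

Rank assignment matters for the same reason: which logical rank lands on which physical GPU determines the edge weights of this graph, so the scheduler learns each server's topology position before recommending ranks.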
Systems / hardware extracted¶
- systems/roce-rdma-over-converged-ethernet — the fabric Meta matured over four years, from 4K-GPU clusters to the 24K-GPU GenAI cluster and Llama 3.1 405B training; this paper is the canonical deep-dive.
- systems/ai-zone — Meta's new wiki page: two-stage Clos with RTSW/CTSW, non-blocking inside the Zone, 400G single-mode fiber at the spine.
- systems/meta-genai-cluster-roce — the 24K-GPU cluster whose fabric this paper details.
- systems/infiniband — the fabric explicitly contrasted throughout (IB has adaptive routing + richer collective offload; RoCE had to reach parity the hard way).
- systems/llama-3-1 — Llama 3.1 405B is called out in the first paragraph as the workload this substrate supports.
- systems/nvidia-h100 / systems/grand-teton — substrate inherited from the 2024-06-12 post.
Concepts extracted¶
New wiki pages:
- concepts/rdma-queue-pair — the RDMA abstraction Meta hashes on to gain ECMP entropy.
- concepts/ecmp-equal-cost-multipath — the Ethernet load-balancing primitive whose failure modes motivate the whole routing evolution.
- concepts/enhanced-ecmp-qp-scaling — Meta's QP-aware hashing + collective-library-side QP-spreading combination, with a measured +40% AllReduce result.
- concepts/dcqcn — the storage-focused RDMA CC Meta disabled at 400G.
- concepts/priority-flow-control — the link-level lossless primitive that stayed, sufficient by itself in Meta's four-year experience.
- concepts/path-pinning — the per-destination-slice routing scheme that failed under fragmented job placement + link failure.
- concepts/backend-frontend-network-separation — physical-fabric split between training-traffic BE and data-plane FE, letting each evolve independently.
- concepts/receiver-driven-traffic-admission — NCCL CTS-handshake reused as admission control.
Existing pages reinforced:
- concepts/fat-flow-load-balancing — this paper quantifies Meta's fat-flow evolution (ECMP → 2× provisioning → path-pinning → E-ECMP+QP → QP round-robin) with numbers.
- concepts/collective-communication-topology-awareness — the topology-aware rank placement logic (minimum-cut partitioning) is the scheduler-side expression of this concept.
Patterns extracted¶
New wiki pages:
- patterns/dedicated-backend-training-fabric — physically separate the training fabric from the general DC network so it can evolve, operate, and scale on its own schedule.
- patterns/collective-library-transport-codesign — co-design the collective-comm library (NCCL) and the transport (RoCE CC, switch QoS, admission control) because neither is sufficient alone.
- patterns/minimum-cut-training-job-placement — make the training-job scheduler topology-aware so that expensive oversubscribed links are crossed as little as possible.
Existing pages reinforced:
- patterns/build-both-fabric-alternatives — this paper is the operational-evidence side of that 2024-06-12 architectural bet. RoCE reached parity via explicit routing + CC co-design; InfiniBand via native adaptive routing.
Operational / architectural numbers¶
| Datum | Value | Notes |
|---|---|---|
| Years of production RoCE | 4+ | at time of writing |
| Initial topology | Star with central Ethernet switch, non-routable RoCEv1 | deprecated |
| Current topology (AI Zone) | Two-stage Clos — RTSW leaf + CTSW spine | non-blocking inside Zone |
| Spine (CTSW) | Modular, deep-buffered, ports statically partitioned | |
| Intra-rack cabling | Copper DAC (RTSW ↔ GPU) | |
| Spine uplinks | Single-mode fiber + 400G pluggable transceivers | |
| Multi-Zone layer | ATSW — Aggregator Training Switch, oversubscribed by design, ECMP across | |
| RTSW uplink mitigation | 2× bandwidth (1:2 under-subscription) — short-term | due to path-pinning fragmentation |
| Path-pinning failure penalty | >30% training-performance degradation | from uneven uplink utilisation |
| E-ECMP + QP scaling gain | Up to +40% on AllReduce | over baseline ECMP |
| QP scaling strategy choice | Round-robin one message per QP (over split-across-QPs) | for NCCL message sizes |
| DCQCN status at 400G | OFF | for >1 year |
| Congestion-control substrate | PFC only + NCCL receiver-driven admission | |
| Persistent PFC events observed | None in ~4 years | |
| CTS packet priority | High-priority queuing at switches | prevents notification bottleneck |
| Flagship workload | Llama 3.1 405B training | first-paragraph call-out |
Caveats¶
- DCQCN-off is Meta-specific. Meta explicitly says the solution "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios." Storage-RDMA workloads, smaller CTSW buffers, or different GPU:network ratios may still need DCQCN.
- QP-scaling factor is workload-dependent. Meta names this as "long-term operational complexity" — each new collective / job shape may need retuning.
- Minimum-cut scheduling + rank-assignment details are referenced but not disclosed; the paper promises more in the full ACM version.
- Absolute goodput, MFU, tail-latency, and PFC-pause-duration numbers are not disclosed in the blog-post summary; for those see the full ACM paper linked at the bottom.
- The fabric targets training specifically. Inference-serving RDMA needs are likely different (e.g. KV-cache transfer traffic shapes; see concepts/rdma-kv-transfer for Cloudflare's side of that problem).
Source¶
- Original: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/
- Raw markdown: raw/meta/2024-08-05-a-roce-network-for-distributed-ai-training-at-scale-5a14cfa7.md
- Full ACM paper: RDMA over Ethernet for Distributed AI Training at Meta Scale (SIGCOMM 2024)
- HN discussion: https://news.ycombinator.com/item?id=41162664 (81 points)
Related¶
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — the higher-level training-substrate tour this paper is the fabric-layer deep-dive for.
- companies/meta — Meta company page.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the two 24K clusters; this paper focuses on the RoCE side.