META 2024-08-05

Meta — A RoCE network for distributed AI training at scale (SIGCOMM 2024)

Summary

The SIGCOMM-2024 companion to Meta's 2024-06-12 training overview: an engineering deep-dive on the RoCE (RDMA over Converged Ethernet v2) backend fabric Meta built over four years of production evolution to host distributed AI training — up to the 24K-GPU RoCE GenAI cluster and Llama 3.1 405B training. The post (summarising the full ACM paper) covers three layers in order: topology (dedicated backend, two-stage Clos "AI Zone", aggregator layer), routing (ECMP → path-pinning → E-ECMP with QP scaling), and congestion control (no DCQCN; PFC only, plus receiver-driven admission co-designed with the NCCL collective library). The headline counterintuitive finding: Meta runs its 400G RoCE training fabric without DCQCN, relying on PFC for flow control and NCCL-level admission for congestion control, and reports stable performance over a year of operation.

Key takeaways

  1. Dedicated backend (BE) training fabric, physically separated from the frontend (FE) data-ingest network. A training rack is connected to both: the FE carries data ingestion / checkpointing / logging (storage-warehouse traffic on a standard RSW→FSW→… hierarchy); the BE is a "specialised fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster." The BE uses RoCEv2 (RDMA-in-UDP). (Source text; see concepts/backend-frontend-network-separation and patterns/dedicated-backend-training-fabric.)
  2. Two-stage Clos "AI Zone" topology — evolved from an early non-routable-RoCEv1 star. Leaf RTSW (rack training switch) connects GPUs intra-rack over copper DAC. Spine CTSW (cluster training switch) with deep buffers statically partitioned per port connects racks cluster-wide over single-mode fiber + 400G pluggable transceivers. Non-blocking inside the Zone. (Source text; systems/ai-zone.)
  3. Aggregator Training Switch (ATSW) layer breaks the single-Zone cap, stitching multiple AI Zones together inside a data-center building to host LLM-scale jobs that overflow one Zone. Cross-Zone links are deliberately oversubscribed; the fabric uses ECMP across them — but to minimise the cross-Zone cost Meta extended the training scheduler to compute a "minimum cut" when splitting a job across Zones and to learn the logical topology position of each GPU server to recommend rank assignment. (Source text; patterns/minimum-cut-training-job-placement.)
  4. AI training traffic has three pathological characteristics for Ethernet load balancing. (a) Low entropy — few, predictable, repetitive flows, unlike traditional DC workloads. (b) Burstiness — on/off at millisecond granularity. (c) Elephant flows — each burst can saturate NIC line rate. Together these break default ECMP — a 5-tuple hash spreading many small flows uniformly instead pins Meta's few long-lived flows to individual paths. (Source text; concepts/fat-flow-load-balancing.)
  5. Path-pinning was the first routing response — and its failure mode was fragmentation. Meta deployed per-RTSW-downlink-index path-pinning, which "worked well if each rack was fully assigned to the same job and there was no failure in the network." In reality racks were partially job-allocated (only one of two hosts using the uplink), creating uneven uplink utilisation and 30%+ training-performance degradation from congestion. Network failures further reshuffled flows unevenly onto the remaining CTSWs, compounding the imbalance. Meta's stopgap was 2×-overprovisioning RTSW uplinks (1:2 under-subscription) — explicitly named a short-term fix because of the 2× capital cost. (Source text; concepts/path-pinning.)
  6. Queue Pair (QP) scaling replaced path-pinning. Switches were reconfigured for E-ECMP — adding the RoCE packet's destination QP field (via the switch ASIC's UDF capability) to the ECMP hash tuple — and the collective library was changed to spread each hierarchical-collective message across multiple QPs to raise entropy. Two QP-scaling strategies were evaluated — split a message across multiple QPs (smaller fabric messages, multiple ACKs) vs round-robin each message onto a different QP — and round-robin won for NCCL message sizes seen in production. Result: up to 40% improvement for AllReduce over baseline ECMP. Caveat: still probabilistic hashing, and the QP-scaling factor has to be tuned per workload — a long-term operational tax. (Source text.)
  7. At 400G Meta turned DCQCN off. When moving from 200G to 400G deployments Meta tried to retune DCQCN (ECN-marking-based RDMA congestion control, doubled ECN thresholds) and "performance was degraded." Root cause was firmware-side DCQCN changes that introduced bugs and broke CNP-counting visibility. Meta deployed 400G with PFC only, no transport-level congestion control, and after "over a year of experience" observed stable performance and lack of persistent congestion for training collectives. This is a counterintuitive, industry-notable result — DCQCN is the storage-RDMA gold standard. (Source text.)
  8. Congestion is instead controlled at the collective-library layer via receiver-driven traffic admission. NCCL's GPU-to-GPU communication uses two-stage copy and receiver-initiated transfers: the sender can only RDMA_WRITE after the receiver sends a clear-to-send (CTS) packet (carrying size + memory info). Meta leverages this handshake as admission control: in-flight network traffic is bounded by receivers' CTS issuance. Tuning levers: number of HBM channels (bounded by GPU-thread contention with compute) and channel buffer size (too small → bandwidth loss; too large → congestion spreading under RoCE's coarser flow control). Meta experimentally calibrated both across job sizes + collective types, and prioritized CTS packets at switches to prevent their delay starving the pipeline. (Source text; patterns/collective-library-transport-codesign.)
  9. "We have not encountered a scenario … where production AI training traffic causes the CTSW to send PFCs to RTSWs persistently." Meta's four-year finding: even with DCQCN off and RTSW→CTSW PFCs occasionally fired, the deep-buffer CTSW + NCCL-gated admission prevents congestion from becoming persistent. Explicit caveat: solution "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios" — they encourage further research. (Source text.)
  10. Scheduler is topology-aware to reduce cross-Zone traffic. Because ATSW cross-Zone links are oversubscribed, the training-job scheduler computes a minimum-cut partition of training nodes into Zones, and learns each GPU's logical position to recommend rank assignments — so that collective-communication neighbours land in the same Zone when possible. This is a cluster-level expression of concepts/collective-communication-topology-awareness. (Source text; patterns/minimum-cut-training-job-placement.)
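The entropy problem in takeaway 4 and the E-ECMP fix in takeaway 6 can be sketched numerically. A minimal Python sketch, assuming a generic hash — `sha256` stands in for the switch ASIC's hash function, and the addresses, ports, and QP count are illustrative, not Meta's values (RoCEv2 does use UDP destination port 4791):

```python
import hashlib

def ecmp_path(fields, n_paths):
    """Pick an uplink by hashing header fields (toy stand-in for a switch ASIC hash)."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return digest[0] % n_paths

N_PATHS = 8

# Plain 5-tuple ECMP: a training job has few, long-lived RoCE flows
# (low entropy), so only a handful of distinct tuples exist to hash.
flows = [("10.0.0.1", "10.0.1.1", sport, 4791, "udp") for sport in (49152, 49153)]
plain = {ecmp_path(f, N_PATHS) for f in flows}

# E-ECMP + QP scaling: the collective library round-robins messages across
# many queue pairs, and the switch (via its UDF capability) adds the
# destination QP field to the hash tuple, multiplying the hashable entropy.
QPS = range(16)
eecmp = {ecmp_path(f + (qp,), N_PATHS) for f in flows for qp in QPS}

print(f"uplinks used, 5-tuple ECMP: {len(plain)}/{N_PATHS}")
print(f"uplinks used, E-ECMP + QPs: {len(eecmp)}/{N_PATHS}")
```

With only two flows, plain ECMP can never use more than two uplinks regardless of hash quality; the 32 distinct (flow, QP) tuples give the hash enough entropy to spread load across the fabric — which is the whole point of QP scaling.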
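Takeaway 8's receiver-driven admission can be modelled as a credit scheme. A hypothetical Python sketch — class names, the channel count, and the buffer size are invented for illustration; real NCCL does this over CUDA and IB-verbs internals not shown here:

```python
from collections import deque

class Receiver:
    """Receiver side of the CTS handshake: it owns a fixed pool of channel
    buffers and only grants a clear-to-send when one is free, so in-flight
    traffic is bounded by channels * buffer_size."""
    def __init__(self, channels=8, buffer_size=512 * 1024):
        self.free = deque(range(channels))   # free channel-buffer slots
        self.buffer_size = buffer_size

    def grant_cts(self):
        # A CTS packet carries the slot and its size/memory information.
        return (self.free.popleft(), self.buffer_size) if self.free else None

    def complete(self, slot):
        # GPU has drained the buffer; the slot becomes grantable again.
        self.free.append(slot)

class Sender:
    """The sender may only RDMA_WRITE after receiving a CTS."""
    def __init__(self, rx):
        self.rx = rx

    def try_send(self, nbytes):
        grant = self.rx.grant_cts()
        if grant is None:
            return False                     # no CTS: data waits at the source
        slot, size = grant                   # write at most one buffer's worth
        return True

rx = Receiver(channels=2)
tx = Sender(rx)
sent = [tx.try_send(1 << 20) for _ in range(3)]
rx.complete(0)                               # buffer drained, credit returned
resumed = tx.try_send(1 << 20)
print(sent, resumed)                         # [True, True, False] True
```

This is why the tuning levers are what they are: more channels or bigger buffers raise the in-flight bound (risking congestion spreading), fewer or smaller ones throttle bandwidth — and a delayed CTS stalls the sender entirely, which is why Meta prioritises CTS packets at switches.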
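Takeaways 3 and 10 describe minimum-cut job placement across Zones; since Meta's scheduler details are undisclosed, the idea can be illustrated with a brute-force toy for a small ring collective (the function names and the 8-rank example are assumptions, not Meta's implementation):

```python
from itertools import combinations

def cross_zone_traffic(zone_a_ranks, edges):
    """Count communication edges that cross the Zone boundary."""
    a = set(zone_a_ranks)
    return sum(1 for u, v in edges if (u in a) != (v in a))

def min_cut_split(n_ranks, edges, zone_cap):
    """Brute-force the rank subset for Zone A that minimises cross-Zone
    edges — a toy stand-in for the scheduler's minimum-cut computation."""
    best = None
    for zone_a in combinations(range(n_ranks), zone_cap):
        cut = cross_zone_traffic(zone_a, edges)
        if best is None or cut < best[0]:
            best = (cut, set(zone_a))
    return best

# A ring AllReduce over 8 ranks: each rank talks only to its neighbour.
ring = [(i, (i + 1) % 8) for i in range(8)]
cut, zone_a = min_cut_split(8, ring, zone_cap=4)
print(cut, sorted(zone_a))   # 2 [0, 1, 2, 3]
```

A contiguous half of the ring is optimal: only the two ring links at the partition boundary cross the oversubscribed ATSW layer, while a naive interleaved split would push every link cross-Zone. Rank assignment (placing collective neighbours in the same Zone) is the second half of the same optimisation.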

Systems / hardware extracted

Concepts extracted

New wiki pages:

Existing pages reinforced:

Patterns extracted

New wiki pages:

Existing pages reinforced:

  • patterns/build-both-fabric-alternatives — this paper is the operational-evidence side of that 2024-06-12 architectural bet. RoCE reached parity via explicit routing + CC co-design; InfiniBand via native adaptive routing.

Operational / architectural numbers

| Datum | Value | Notes |
| --- | --- | --- |
| Years of production RoCE | 4+ | at time of writing |
| Initial topology | Star with central Ethernet switch, non-routable RoCEv1 | deprecated |
| Current topology (AI Zone) | Two-stage Clos: RTSW leaf + CTSW spine | non-blocking inside Zone |
| Spine (CTSW) | Modular, deep-buffered, ports statically partitioned | |
| Intra-rack cabling | Copper DAC | RTSW ↔ GPU |
| Spine uplinks | Single-mode fiber + 400G pluggable transceivers | |
| Multi-Zone layer | ATSW (Aggregator Training Switch) | oversubscribed by design, ECMP across |
| RTSW uplink mitigation | 2× bandwidth (1:2 under-subscription) | short-term fix for path-pinning fragmentation |
| Path-pinning failure penalty | >30% training-performance degradation | from uneven uplink utilisation |
| E-ECMP + QP scaling gain | Up to +40% on AllReduce | over baseline ECMP |
| QP scaling strategy choice | Round-robin one message per QP | chosen over split-across-QPs for NCCL message sizes |
| DCQCN status at 400G | Off | for >1 year |
| Congestion-control substrate | PFC only + NCCL receiver-driven admission | |
| Persistent PFC events observed | None | in ~4 years |
| CTS packet priority | High-priority queuing at switches | prevents notification bottleneck |
| Flagship workload | Llama 3.1 405B training | first-paragraph call-out |

Caveats

  • DCQCN-off is Meta-specific. Meta explicitly says the solution "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios." Storage-RDMA workloads, smaller CTSW buffers, or different GPU:network ratios may still need DCQCN.
  • QP-scaling factor is workload-dependent. Meta names this as "long-term operational complexity" — each new collective / job shape may need retuning.
  • Minimum-cut scheduling + rank-assignment details are referenced but not disclosed; the paper promises more in the full ACM version.
  • Absolute goodput, MFU, tail-latency, and PFC-pause-duration numbers are not disclosed in the blog-post summary; for those see the full ACM paper linked at the bottom.
  • The fabric targets training specifically. Inference-serving RDMA needs are likely different (e.g. KV-cache transfer traffic shapes; see concepts/rdma-kv-transfer for Cloudflare's side of that problem).
