# RoCE (RDMA over Converged Ethernet)
RoCE is the standard for running RDMA over an Ethernet fabric — the same RDMA semantics as InfiniBand (kernel bypass, zero-copy, reliable or unreliable connected/datagram transports), but carried over the Ethernet networks that hyperscalers already operate at scale. There are two variants: RoCEv1, which is link-local (raw Ethernet frames, not routable), and RoCEv2, which encapsulates the InfiniBand transport in UDP/IP, making it routable and the production-dominant form.
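The v1/v2 split comes down to encapsulation. A minimal sketch of the two header stacks (the EtherType and UDP port are the IANA/IBTA-assigned values; the helper function itself is purely illustrative):

```python
# RoCEv1 carries the InfiniBand payload directly in an Ethernet frame
# (EtherType 0x8915), so it cannot cross an IP router. RoCEv2 wraps the
# same IB transport headers in UDP/IP (UDP destination port 4791), which
# is what makes it routable across an IP fabric.

ROCE_V1_ETHERTYPE = 0x8915
ROCE_V2_UDP_DPORT = 4791

def header_stack(variant: str) -> list[str]:
    """Return the encapsulation layers for a RoCE variant, outermost first."""
    if variant == "v1":
        return ["Ethernet (EtherType 0x8915)", "IB GRH", "IB BTH", "payload"]
    if variant == "v2":
        return ["Ethernet", "IP", "UDP (dport 4791)", "IB BTH", "payload"]
    raise ValueError(f"unknown RoCE variant: {variant}")

# RoCEv2 is routable because the IB transport rides above IP; v1 is not:
assert "IP" in header_stack("v2")
assert "IP" not in header_stack("v1")
```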
In the context of this wiki, RoCE shows up as the Ethernet-native alternative to InfiniBand for large AI training clusters: a choice driven by operational ecosystem more than by raw capability, though ECMP and congestion control on Ethernet must be tuned to reach performance parity on GenAI workloads.
## Seen in (wiki)
- Meta 24K-GPU GenAI cluster (2024). Meta had four years of RoCE production experience (up to 4K GPUs) and made it the fabric for one of two 24K-GPU H100 clusters — optimised for fast build time, and the one on which the largest Llama 3 model was trained. Meta tuned the RoCE cluster to equivalent performance with the InfiniBand sibling on GenAI workloads via a three-part optimisation (parallelism-axis → topology-layer mapping, topology-aware collectives, fat-flow load balancing). (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale; see systems/meta-genai-cluster-roce)
- Meta SIGCOMM 2024 RoCE paper (2024-08-05). Full engineering deep-dive on Meta's RoCE fabric for training at scale, supporting Llama 3.1 405B. Describes the two-stage Clos AI Zone topology (RTSW leaf + CTSW spine + optional ATSW aggregator), the routing evolution (baseline ECMP → concepts/path-pinning → E-ECMP + QP scaling with +40% on AllReduce), and the surprising congestion-control posture: DCQCN off at 400G for a year, PFC-only + NCCL-level receiver-driven admission instead. (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
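The "DCQCN off, receiver-driven admission instead" posture in the SIGCOMM paper is essentially credit-based flow control at the collective-library layer: the sender injects nothing until the receiver grants a clear-to-send (CTS), which bounds in-flight data without switch-level congestion control. A toy model of the idea (class and parameter names are invented for illustration; this is not NCCL's code):

```python
from collections import deque

# Hypothetical sketch of receiver-driven traffic admission: the receiver
# grants CTS credits sized to the buffering it can absorb, and the sender
# never has more unacknowledged chunks in flight than it holds credits for.

class Receiver:
    def __init__(self, max_in_flight: int):
        self.credits = max_in_flight   # CTS grants it is willing to issue
        self.received = []

    def grant_cts(self) -> bool:
        if self.credits > 0:
            self.credits -= 1
            return True
        return False

    def deliver(self, chunk):
        self.received.append(chunk)
        self.credits += 1              # buffer drained: re-issue a credit

def send_all(chunks, rx, window_limit):
    network = deque()                  # chunks currently in flight
    pending = list(chunks)
    in_flight_peak = 0
    while pending or network:
        # The sender injects only when the receiver has granted a CTS.
        while pending and rx.grant_cts():
            network.append(pending.pop(0))
        in_flight_peak = max(in_flight_peak, len(network))
        if network:
            rx.deliver(network.popleft())
    # In-flight data never exceeds what the receiver agreed to absorb.
    assert in_flight_peak <= window_limit
    return in_flight_peak

rx = Receiver(max_in_flight=4)
peak = send_all(range(100), rx, window_limit=4)
assert rx.received == list(range(100))
```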
## RoCE vs InfiniBand — the tradeoff Meta explicitly framed
| | RoCE | InfiniBand |
|---|---|---|
| Transport | Ethernet | HPC-native |
| Operational tooling at hyperscale | Familiar (reused Ethernet practice) | Specialised |
| Build-speed advantage | Yes (Meta's 2024 framing) | No |
| Full-bisection-bandwidth framing | Had to be designed in | Native strength |
| Load balancing for fat flows | Requires explicit routing / LB work | Adaptive routing is built-in |
| Collective offload | NIC-dependent | Usually richer |
| Meta's 2024 decision | Build one 24K-GPU cluster | Build a second 24K-GPU cluster |
See patterns/build-both-fabric-alternatives for the architectural pattern this tradeoff motivates.
## Why fat flows are the key RoCE failure mode for training
Default Ethernet ECMP hashing assumes flows are many and short-lived; hashing the five-tuple (source/destination IP, source/destination port, protocol) is enough to spread traffic evenly. LLM training produces the opposite: a small number of very large, long-lived tensor transfers, each of which hashes to a single path and saturates it while other paths sit idle. Meta names this explicitly — see concepts/fat-flow-load-balancing. Meta's Networking @Scale 2023 talk goes deeper.
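The failure mode is easy to demonstrate with a toy hash model (the `ecmp_path` helper is a stand-in for a switch's ECMP hash, not any vendor's actual function; IPs and port counts are invented):

```python
# Toy illustration of why five-tuple ECMP hashing fails for training
# traffic: many short flows spread evenly, but a long-lived fat flow is
# one five-tuple, so every byte of it follows a single path.
import hashlib
from collections import Counter

NUM_PATHS = 8  # equal-cost uplinks out of the leaf switch

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Pick an uplink by hashing the five-tuple, as default ECMP would."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PATHS

# Web-style traffic: thousands of short flows with varying source ports
# spread across every available path.
web = Counter(ecmp_path("10.0.0.1", "10.0.1.1", 1024 + i, 443)
              for i in range(4000))
assert len(web) == NUM_PATHS  # all paths carry some traffic

# A training job may put one enormous tensor transfer on a single
# five-tuple: no matter how many packets it sends, they all hash alike.
fat = {ecmp_path("10.0.0.1", "10.0.1.1", 50000, 4791) for _ in range(1000)}
assert len(fat) == 1  # the whole flow pins to one link; the rest idle
```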
## Stub
More content to add as more sources (AWS EFA, Microsoft, Broadcom AI-Ethernet, Ultra Ethernet Consortium) come into the wiki. For now: the canonical wiki references are Meta's 2024-06-12 post and 2024-08-05 SIGCOMM paper.
## Canonical Meta engineering stack (post SIGCOMM 2024)
The 2024-08-05 paper crystallises the end-state stack Meta runs on 400G RoCE training clusters:
| Layer | Choice | Notes |
|---|---|---|
| Topology | Two-stage Clos AI Zone | RTSW leaf + CTSW spine; non-blocking inside Zone |
| Multi-Zone | ATSW aggregator, oversubscribed | patterns/minimum-cut-training-job-placement handles it |
| Physical / FE-BE | Dedicated backend network | Per patterns/dedicated-backend-training-fabric |
| Routing | E-ECMP + QP scaling | +40% AllReduce over baseline ECMP |
| Transport CC | DCQCN OFF | Firmware issues at 400G, redundant given admission control |
| Link CC | PFC | Sufficient with deep-buffer CTSW |
| Admission | NCCL CTS handshake | CTS high-priority-queued at switches |
| Scheduler | Topology-aware minimum-cut | Reduces cross-Zone traffic |
This is a canonical instance of patterns/collective-library-transport-codesign.
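The QP-scaling row in the routing layer can be sketched with the same kind of toy hash model: splitting one logical transfer across several RDMA queue pairs gives the switch several distinct headers to hash, so a single fat flow can occupy several uplinks. Port numbers and the hash function here are illustrative, not Meta's implementation:

```python
# Hypothetical sketch of the QP-scaling idea behind E-ECMP: each queue
# pair carries a distinct UDP source port, so the ECMP hash sees several
# flows where a single-QP transfer would present only one.
import hashlib

NUM_PATHS = 8  # equal-cost uplinks

def path_for(src_port: int) -> int:
    """Stand-in for the switch's ECMP hash over the flow's headers."""
    digest = hashlib.md5(str(src_port).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PATHS

def paths_used(num_qps: int, base_port: int = 49152) -> set[int]:
    """Split one transfer across num_qps queue pairs, one port per QP."""
    return {path_for(base_port + qp) for qp in range(num_qps)}

# One QP behaves like one fat flow: a single path. Scaling to 16 QPs
# spreads the same transfer over multiple uplinks (how many depends on
# hash collisions, which is why E-ECMP tuning still matters).
assert len(paths_used(1)) == 1
assert len(paths_used(16)) > 1
```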
## Related
- systems/infiniband — the HPC-native alternative.
- systems/ai-zone — Meta's two-stage Clos topology template for RoCE training.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — Meta's paired deployments.
- concepts/fat-flow-load-balancing — the core problem RoCE-for-AI must solve.
- concepts/ecmp-equal-cost-multipath / concepts/enhanced-ecmp-qp-scaling / concepts/path-pinning — the routing-side evolution.
- concepts/dcqcn / concepts/priority-flow-control / concepts/receiver-driven-traffic-admission — the congestion-control stack.
- concepts/rdma-queue-pair — the RDMA abstraction hashed on by E-ECMP.
- concepts/backend-frontend-network-separation — the physical-fabric split.
- concepts/collective-communication-topology-awareness — the other stack-level optimisation.
- patterns/build-both-fabric-alternatives / patterns/dedicated-backend-training-fabric / patterns/collective-library-transport-codesign.