
RoCE (RDMA over Converged Ethernet)

RoCE is the standard for running RDMA over an Ethernet fabric: the same RDMA semantics as InfiniBand (kernel bypass, zero-copy, reliable and unreliable connected/datagram transports), carried over the Ethernet networks that hyperscalers already operate at scale. There are two variants: RoCEv1, which is link-local (no IP header, not routable), and RoCEv2, which encapsulates RDMA in UDP/IP, is routable, and dominates in production.
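RoCEv2's routability comes from wrapping the InfiniBand transport headers in UDP/IP. A minimal sketch of the framing makes the variant split concrete; this is an illustrative simplification (flag bits zeroed, default partition key), not a full header builder:

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def bth(opcode: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified 12-byte IB Base Transport Header (BTH):
    opcode | SE/M/Pad/TVer | P_Key | rsvd+DestQP | AckReq+PSN.
    Flag bits are zeroed and the default partition key is used."""
    return struct.pack(
        ">BBHII",
        opcode,
        0,                   # solicited-event / migreq / pad / version bits
        0xFFFF,              # default partition key
        dest_qp & 0xFFFFFF,  # 24-bit destination queue pair number
        psn & 0xFFFFFF,      # 24-bit packet sequence number
    )

# RoCEv1 frames this BTH directly in Ethernet (link-local, not routable);
# RoCEv2 carries it as Ethernet / IP / UDP(dst=4791) / BTH, so it routes
# like any other IP traffic.
hdr = bth(opcode=0x04, dest_qp=0x0012, psn=1)  # 0x04 = RC SEND-only opcode
print(len(hdr))  # 12
```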

In the context of this wiki, RoCE shows up as the Ethernet-native alternative to InfiniBand for large AI training clusters: a choice driven by the operational ecosystem more than by raw capability, since ECMP and congestion control on Ethernet must be tuned to reach parity with InfiniBand on GenAI workloads.

RoCE vs InfiniBand — the tradeoff Meta explicitly framed

| | RoCE | InfiniBand |
|---|---|---|
| Transport | Ethernet | HPC-native |
| Operational tooling at hyperscale | Familiar (reused Ethernet practice) | Specialised |
| Build-speed advantage | Yes (Meta's 2024 framing) | No |
| Full-bisection-bandwidth framing | Had to be designed in | Native strength |
| Load balancing for fat flows | Requires explicit routing / LB work | Adaptive routing is built in |
| Collective offload | NIC-dependent | Usually richer |
| Meta's 2024 decision | Build one 24K-GPU cluster | Build a second 24K-GPU cluster |

See patterns/build-both-fabric-alternatives for the architectural pattern this tradeoff motivates.

Why fat flows are the key RoCE failure mode for training

Default Ethernet ECMP hashing assumes flows are many and short-lived, so hashing the 5-tuple (source/destination IP, source/destination port, protocol) is enough to spread traffic across equal-cost paths. LLM training produces the opposite: a small number of very large, long-lived tensor transfers, which hash onto a single path and saturate it while the other paths sit idle. Meta names this failure mode explicitly; see concepts/fat-flow-load-balancing, and Meta's Networking @Scale 2023 talk for a deeper treatment.
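A toy simulation makes the hashing pathology visible. The path count, addresses, and ports below are invented for illustration, and the hash merely stands in for a switch's ECMP function:

```python
import hashlib
from collections import Counter

NUM_PATHS = 8  # hypothetical number of equal-cost uplinks

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Deterministically map a flow's tuple to one uplink, as ECMP does."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|udp".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_PATHS

# Many short flows with varying source ports: load spreads across uplinks.
short_flows = Counter(
    ecmp_path("10.0.0.1", "10.0.1.1", 40000 + i, 4791) for i in range(1000)
)

# One fat flow has one fixed tuple: every packet lands on the same uplink,
# no matter how many bytes the tensor transfer moves.
fat_flow = Counter(
    ecmp_path("10.0.0.1", "10.0.1.1", 50000, 4791) for _ in range(1000)
)

print(len(short_flows), len(fat_flow))  # uplinks used: many vs. exactly 1
```

The hash is deterministic per tuple, so adding bandwidth or paths does not help a single fat flow; only changing what the switch hashes on (or giving one transfer several tuples) does.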

Stub

More content to add as more sources (AWS EFA, Microsoft, Broadcom AI-Ethernet, Ultra Ethernet Consortium) come into the wiki. For now: the canonical wiki references are Meta's 2024-06-12 post and 2024-08-05 SIGCOMM paper.

Canonical Meta engineering stack (post SIGCOMM 2024)

The 2024-08-05 paper crystallises the end-state stack Meta runs on 400G RoCE training clusters:

| Layer | Choice | Notes |
|---|---|---|
| Topology | Two-stage Clos AI Zone | RTSW leaf + CTSW spine; non-blocking inside a Zone |
| Multi-Zone | ATSW aggregator, oversubscribed | patterns/minimum-cut-training-job-placement handles it |
| Physical / FE-BE | Dedicated backend network | Per patterns/dedicated-backend-training-fabric |
| Routing | E-ECMP + QP scaling | +40% AllReduce over baseline ECMP |
| Transport CC | DCQCN off | Firmware issues at 400G; redundant given admission control |
| Link CC | PFC | Sufficient with deep-buffer CTSW |
| Admission | NCCL CTS handshake | CTS packets queued at high priority at switches |
| Scheduler | Topology-aware minimum-cut | Reduces cross-Zone traffic |

This is a canonical instance of patterns/collective-library-transport-codesign.
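The "E-ECMP + QP scaling" row works because splitting one logical transfer across several queue pairs gives the switch several distinct headers to hash. A toy sketch, with invented addresses and a hash standing in for the switch's ECMP function, and assuming each QP presents a different UDP source port (one common way implementations vary the tuple):

```python
import hashlib

NUM_PATHS = 8  # hypothetical number of equal-cost uplinks

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Deterministically map a flow tuple to one uplink, as ECMP does."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_PATHS

# One QP = one tuple = one path, however large the transfer.
single_qp_paths = {ecmp_path("10.0.0.1", "10.0.1.1", 50000, 4791)}

# QP scaling: the same transfer split across 16 QPs, each with its own
# source port, so the fat flow is spread over multiple uplinks.
scaled_paths = {
    ecmp_path("10.0.0.1", "10.0.1.1", 50000 + qp, 4791) for qp in range(16)
}

print(len(single_qp_paths), len(scaled_paths))
```

This is the path-diversification half of the story; per the paper, Meta pairs it with enhanced ECMP hashing on the QP number to reach the reported +40% AllReduce gain over baseline ECMP.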
