# RoCE (RDMA over Converged Ethernet)
RoCE is the standard for running RDMA over an Ethernet fabric — the same RDMA semantics as InfiniBand (kernel bypass, zero-copy, reliable or unreliable connected/datagram transports), but carried over the Ethernet networks that hyperscalers already operate at scale. There are two variants: RoCEv1, which is link-local (raw Ethernet frames, not routable), and RoCEv2, which encapsulates the InfiniBand transport in UDP/IP, making it routable and the production-dominant form.
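The v1/v2 split comes down to encapsulation. A minimal sketch of the two header stacks (the EtherType and UDP port are the IANA/IBTA-assigned values; the helper function itself is purely illustrative):

```python
# RoCEv1 carries the InfiniBand payload directly in an Ethernet frame
# (EtherType 0x8915), so it cannot cross an IP router. RoCEv2 wraps the
# same IB transport headers in UDP/IP (UDP destination port 4791), which
# is what makes it routable across an IP fabric.

ROCE_V1_ETHERTYPE = 0x8915
ROCE_V2_UDP_DPORT = 4791

def header_stack(variant: str) -> list[str]:
    """Return the encapsulation layers for a RoCE variant, outermost first."""
    if variant == "v1":
        return ["Ethernet (EtherType 0x8915)", "IB GRH", "IB BTH", "payload"]
    if variant == "v2":
        return ["Ethernet", "IP", "UDP (dport 4791)", "IB BTH", "payload"]
    raise ValueError(f"unknown RoCE variant: {variant}")

# RoCEv2 is routable because the IB transport rides above IP; v1 is not:
assert "IP" in header_stack("v2")
assert "IP" not in header_stack("v1")
```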
In the context of this wiki, RoCE shows up as the Ethernet-native alternative to InfiniBand for large AI training clusters: a choice driven by operational ecosystem more than by raw capability, though ECMP and congestion control on Ethernet must be tuned to reach performance parity on GenAI workloads.
## Seen in (wiki)
- Meta 24K-GPU GenAI cluster (2024). Meta had four years of RoCE production experience (up to 4K GPUs) and made it the fabric for one of two 24K-GPU H100 clusters — optimised for fast build time, and the one on which the largest Llama 3 model was trained. Meta tuned the RoCE cluster to equivalent performance with the InfiniBand sibling on GenAI workloads via a three-part optimisation (parallelism-axis → topology-layer mapping, topology-aware collectives, fat-flow load balancing). (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale; see systems/meta-genai-cluster-roce)
- Meta SIGCOMM 2024 RoCE paper (2024-08-05). Full engineering deep-dive on Meta's RoCE fabric for training at scale, supporting Llama 3.1 405B. Describes the two-stage Clos AI Zone topology (RTSW leaf + CTSW spine + optional ATSW aggregator), the routing evolution (baseline ECMP → concepts/path-pinning → E-ECMP + QP scaling with +40% on AllReduce), and the surprising congestion-control posture: DCQCN off at 400G for a year, PFC-only + NCCL-level receiver-driven admission instead. (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
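The "DCQCN off, receiver-driven admission instead" posture in the SIGCOMM paper is essentially credit-based flow control at the collective-library layer: the sender injects nothing until the receiver grants a clear-to-send (CTS), which bounds in-flight data without switch-level congestion control. A toy model of the idea (class and parameter names are invented for illustration; this is not NCCL's code):

```python
from collections import deque

# Hypothetical sketch of receiver-driven traffic admission: the receiver
# grants CTS credits sized to the buffering it can absorb, and the sender
# never has more unacknowledged chunks in flight than it holds credits for.

class Receiver:
    def __init__(self, max_in_flight: int):
        self.credits = max_in_flight   # CTS grants it is willing to issue
        self.received = []

    def grant_cts(self) -> bool:
        if self.credits > 0:
            self.credits -= 1
            return True
        return False

    def deliver(self, chunk):
        self.received.append(chunk)
        self.credits += 1              # buffer drained: re-issue a credit

def send_all(chunks, rx, window_limit):
    network = deque()                  # chunks currently in flight
    pending = list(chunks)
    in_flight_peak = 0
    while pending or network:
        # The sender injects only when the receiver has granted a CTS.
        while pending and rx.grant_cts():
            network.append(pending.pop(0))
        in_flight_peak = max(in_flight_peak, len(network))
        if network:
            rx.deliver(network.popleft())
    # In-flight data never exceeds what the receiver agreed to absorb.
    assert in_flight_peak <= window_limit
    return in_flight_peak

rx = Receiver(max_in_flight=4)
peak = send_all(range(100), rx, window_limit=4)
assert rx.received == list(range(100))
```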
## RoCE vs InfiniBand — the tradeoff Meta explicitly framed
| | RoCE | InfiniBand |
|---|---|---|
| Transport | Ethernet | HPC-native |
| Operational tooling at hyperscale | Familiar (reused Ethernet practice) | Specialised |
| Build-speed advantage | Yes (Meta's 2024 framing) | No |
| Full-bisection-bandwidth framing | Had to be designed in | Native strength |
| Load balancing for fat flows | Requires explicit routing / LB work | Adaptive routing is built-in |
| Collective offload | NIC-dependent | Usually richer |
| Meta's 2024 decision | Build one 24K-GPU cluster | Build a second 24K-GPU cluster |
See patterns/build-both-fabric-alternatives for the architectural pattern this tradeoff motivates.
## Why fat flows are the key RoCE failure mode for training
Default Ethernet ECMP hashing assumes flows are many and short-lived; hashing the five-tuple (source/destination IP, source/destination port, protocol) is enough to spread traffic evenly. LLM training produces the opposite: a small number of very large, long-lived tensor transfers, each of which hashes to a single path and saturates it while other paths sit idle. Meta names this explicitly — see concepts/fat-flow-load-balancing. Meta's Networking @Scale 2023 talk goes deeper.
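The failure mode is easy to demonstrate with a toy hash model (the `ecmp_path` helper is a stand-in for a switch's ECMP hash, not any vendor's actual function; IPs and port counts are invented):

```python
# Toy illustration of why five-tuple ECMP hashing fails for training
# traffic: many short flows spread evenly, but a long-lived fat flow is
# one five-tuple, so every byte of it follows a single path.
import hashlib
from collections import Counter

NUM_PATHS = 8  # equal-cost uplinks out of the leaf switch

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Pick an uplink by hashing the five-tuple, as default ECMP would."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_PATHS

# Web-style traffic: thousands of short flows with varying source ports
# spread across every available path.
web = Counter(ecmp_path("10.0.0.1", "10.0.1.1", 1024 + i, 443)
              for i in range(4000))
assert len(web) == NUM_PATHS  # all paths carry some traffic

# A training job may put one enormous tensor transfer on a single
# five-tuple: no matter how many packets it sends, they all hash alike.
fat = {ecmp_path("10.0.0.1", "10.0.1.1", 50000, 4791) for _ in range(1000)}
assert len(fat) == 1  # the whole flow pins to one link; the rest idle
```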
## Stub
More content to add as more sources (AWS EFA, Microsoft, Broadcom AI-Ethernet, Ultra Ethernet Consortium) come into the wiki. For now: the canonical wiki references are Meta's 2024-06-12 post and 2024-08-05 SIGCOMM paper.
## Canonical Meta engineering stack (post SIGCOMM 2024)
The 2024-08-05 paper crystallises the end-state stack Meta runs on 400G RoCE training clusters:
| Layer | Choice | Notes |
|---|---|---|
| Topology | Two-stage Clos AI Zone | RTSW leaf + CTSW spine; non-blocking inside Zone |
| Multi-Zone | ATSW aggregator, oversubscribed | patterns/minimum-cut-training-job-placement handles it |
| Physical / FE-BE | Dedicated backend network | Per patterns/dedicated-backend-training-fabric |
| Routing | E-ECMP + QP scaling | +40% AllReduce over baseline ECMP |
| Transport CC | DCQCN OFF | Firmware issues at 400G, redundant given admission control |
| Link CC | PFC | Sufficient with deep-buffer CTSW |
| Admission | NCCL CTS handshake | CTS high-priority-queued at switches |
| Scheduler | Topology-aware minimum-cut | Reduces cross-Zone traffic |
This is a canonical instance of patterns/collective-library-transport-codesign.
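The QP-scaling row in the routing layer can be sketched with the same kind of toy hash model: splitting one logical transfer across several RDMA queue pairs gives the switch several distinct headers to hash, so a single fat flow can occupy several uplinks. Port numbers and the hash function here are illustrative, not Meta's implementation:

```python
# Hypothetical sketch of the QP-scaling idea behind E-ECMP: each queue
# pair carries a distinct UDP source port, so the ECMP hash sees several
# flows where a single-QP transfer would present only one.
import hashlib

NUM_PATHS = 8  # equal-cost uplinks

def path_for(src_port: int) -> int:
    """Stand-in for the switch's ECMP hash over the flow's headers."""
    digest = hashlib.md5(str(src_port).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PATHS

def paths_used(num_qps: int, base_port: int = 49152) -> set[int]:
    """Split one transfer across num_qps queue pairs, one port per QP."""
    return {path_for(base_port + qp) for qp in range(num_qps)}

# One QP behaves like one fat flow: a single path. Scaling to 16 QPs
# spreads the same transfer over multiple uplinks (how many depends on
# hash collisions, which is why E-ECMP tuning still matters).
assert len(paths_used(1)) == 1
assert len(paths_used(16)) > 1
```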
## Related
- systems/infiniband — the HPC-native alternative.
- systems/ai-zone — Meta's two-stage Clos topology template for RoCE training.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — Meta's paired deployments.
- concepts/fat-flow-load-balancing — the core problem RoCE-for-AI must solve.
- concepts/ecmp-equal-cost-multipath / concepts/enhanced-ecmp-qp-scaling / concepts/path-pinning — the routing-side evolution.
- concepts/dcqcn / concepts/priority-flow-control / concepts/receiver-driven-traffic-admission — the congestion-control stack.
- concepts/rdma-queue-pair — the RDMA abstraction hashed on by E-ECMP.
- concepts/backend-frontend-network-separation — the physical-fabric split.
- concepts/collective-communication-topology-awareness — the other stack-level optimisation.
- patterns/build-both-fabric-alternatives / patterns/dedicated-backend-training-fabric / patterns/collective-library-transport-codesign.