PATTERN
Dedicated backend training fabric¶
Context¶
Hyperscale AI training clusters have qualitatively different network requirements from the rest of the data center:
- Traffic shape — a few long-lived elephant flows vs. many short flows.
- Loss tolerance — lossless required (RDMA); standard DC is lossy.
- Bandwidth — full-bisection, non-blocking per GPU; standard DC accepts oversubscription.
- Failure impact — one bad link stalls thousands of GPUs; standard DC services degrade gracefully.
- Evolution cadence — tuning/retuning per-GPU-generation; standard DC evolves on multi-year cycles.
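The bandwidth contrast in the list above can be made concrete with a toy oversubscription calculation. Port counts and speeds here are illustrative, not Meta's actual design:

```python
# Illustrative sketch: leaf-switch oversubscription ratio.
# All port counts and speeds are hypothetical.

def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of downlink (host-facing) to uplink (spine-facing) capacity.
    1.0 means non-blocking; above 1.0 the fabric is oversubscribed."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# A BE training leaf: each GPU's 400G NIC matched 1:1 by spine uplinks.
be = oversubscription(down_ports=16, down_gbps=400, up_ports=16, up_gbps=400)

# A typical FE leaf might accept 3:1 oversubscription to save spine ports.
fe = oversubscription(down_ports=48, down_gbps=100, up_ports=8, up_gbps=200)

print(be)  # 1.0 -> non-blocking
print(fe)  # 3.0 -> oversubscribed
```

The point is that the BE fabric pays for non-blocking (1:1) capacity per GPU because collective operations run at the speed of the slowest link, while the FE tolerates oversubscription because its many short flows rarely peak simultaneously.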
Trying to serve all of those requirements in a single shared fabric forces compromises that hurt both sides.
Pattern¶
Build the training fabric as a physically separate, dedicated network — parallel to but architecturally distinct from the rest of the data center. Each training rack is wired to two networks:
- Frontend (FE) — standard DC hierarchy for ingestion / checkpoints / logging.
- Backend (BE) — a specialised, non-blocking RDMA fabric (RoCE or InfiniBand) dedicated to training collective traffic.
The BE fabric evolves, operates, and scales on its own schedule. See concepts/backend-frontend-network-separation for the concept page; the canonical instance is Meta's AI Zone (leaf-spine two-stage Clos).
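The host-side view of the split can be sketched as a mapping from traffic class to NIC. The class names and interface names below are illustrative assumptions, not Meta's actual configuration:

```python
# Sketch of the FE/BE split at the host: which traffic class uses which NIC.
# Traffic-class and interface names are hypothetical.

TRAFFIC_TO_FABRIC = {
    "data_ingestion": "FE",   # reading training samples from storage
    "checkpointing":  "FE",   # writing model state back out
    "logging":        "FE",   # metrics and telemetry
    "allreduce":      "BE",   # gradient collectives over RDMA
    "alltoall":       "BE",   # expert-parallel shuffles over RDMA
}

def nic_for(traffic_class):
    """Route FE traffic to the standard Ethernet NIC, BE traffic to the RDMA NIC."""
    fabric = TRAFFIC_TO_FABRIC[traffic_class]
    return "eth0 (standard DC hierarchy)" if fabric == "FE" else "rdma0 (non-blocking Clos)"

print(nic_for("allreduce"))      # rdma0 (non-blocking Clos)
print(nic_for("checkpointing"))  # eth0 (standard DC hierarchy)
```

This is why each rack needs dual wiring: checkpoint and ingestion traffic never competes with collective traffic for BE bandwidth, and a BE incident leaves FE paths (and therefore checkpoint recovery) intact.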
Meta's 2024-08-05 framing¶
"We built a dedicated backend network specifically for distributed training. This allowed us to evolve, operate, and scale independently from the rest of the data center network." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
This is the architectural motivation in one sentence — the key word is "independently."
Design choices the pattern enables¶
Because the BE fabric is its own thing, Meta can make decisions that would be disruptive if made on a shared fabric:
- Custom switch firmware — user-defined-field (UDF) hashing so ECMP can key on the RoCE queue pair (E-ECMP+QP), not just the generic 5-tuple.
- Non-standard routing — concepts/path-pinning first, then E-ECMP; each could be trialled without affecting other services.
- DCQCN off at 400G — an unusual choice that would be untenable on a shared lossless fabric.
- Deep-buffer spine switches — expensive per port, justified only by training workloads.
- Early 400G-fiber spine — deployed before the broader data center transitioned.
- Dedicated failure-domain isolation — a BE congestion incident doesn't cascade to storage or services.
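The E-ECMP+QP choice above addresses a load-balancing problem specific to elephant RDMA flows. A minimal sketch of the idea (hash fields and uplink counts are illustrative; real switches hash in hardware, not with SHA-256):

```python
import hashlib

# Sketch of why the BE fabric needs custom hashing (E-ECMP+QP).
# Standard ECMP hashes the 5-tuple, so a single elephant RDMA flow is
# pinned to one uplink. Hashing the RoCE queue pair (QP) as well spreads
# one flow's QPs across uplinks. Fields and counts are hypothetical.

def pick_uplink(fields, n_uplinks):
    """Deterministically map header fields to one of n_uplinks paths."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_uplinks

flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")  # RoCEv2 5-tuple

# Plain ECMP: every packet of this flow hashes to the same uplink.
plain = {pick_uplink(flow, 8) for _ in range(16)}

# E-ECMP+QP: splitting the flow across 16 queue pairs hits multiple uplinks.
eecmp = {pick_uplink(flow + (qp,), 8) for qp in range(16)}

print(len(plain), len(eecmp))  # plain is always 1; eecmp is almost surely > 1
```

With only a handful of flows, plain ECMP collisions can leave some uplinks idle while others saturate; including the QP in the hash key restores entropy without requiring many flows. This is the kind of firmware-level change that is easy on a dedicated fabric and disruptive on a shared one.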
When to use¶
- Training cluster scale ≥ a rack or two — below that, a shared fabric is cheaper.
- GPU interconnect is a first-order constraint on training throughput — i.e. the fabric is on the critical path.
- RDMA is used for collective communication. TCP-only workloads don't need the BE/FE split.
- Operational independence is valuable — the team running training can evolve the fabric without coordinating with the generic-DC networking team.
When not to use¶
- Small-scale clusters. The dual-NIC cost + separate operations aren't amortised.
- Training workload without collective-comm intensity. Inference serving often doesn't need BE-class isolation.
- Colocated multi-tenant where training is one workload among many — economics may favour a single tuned fabric over two.
Variants¶
- RoCE backend (Meta, Microsoft Azure) — Ethernet-native, operational familiarity.
- InfiniBand backend (Meta, NVIDIA DGX SuperPOD, CoreWeave) — HPC-native, adaptive routing, richer collective offload.
- Both (Meta 24K × 2) — see patterns/build-both-fabric-alternatives for the meta-pattern of paired builds.
- Proprietary (Google TPU pods, AWS EFA) — special cases with tighter host/NIC co-design.
Relationship to adjacent patterns¶
- patterns/build-both-fabric-alternatives — when you've already committed to the BE split, the next question is which BE technology. The meta-pattern handles that.
- patterns/collective-library-transport-codesign — once you have a dedicated BE, you can co-design the library (NCCL) and the transport (switch QoS + admission). Much harder on a shared fabric.
Wiki instances¶
- Meta AI Zone + systems/meta-genai-cluster-roce|24K-GPU RoCE GenAI cluster (2024). Canonical wiki reference. BE = RoCE two-stage Clos. (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale, sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale.)
- Meta systems/meta-genai-cluster-infiniband|24K-GPU InfiniBand GenAI cluster (2024). Sibling BE on InfiniBand.
- (Implicit) other hyperscaler training deployments follow the same pattern but aren't yet ingested with this level of detail.
Related¶
- concepts/backend-frontend-network-separation — the concept side.
- concepts/fat-flow-load-balancing / concepts/ecmp-equal-cost-multipath — BE-specific problems.
- systems/ai-zone — Meta's BE template.
- systems/roce-rdma-over-converged-ethernet / systems/infiniband — BE fabric options.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the two Meta BE deployments.
- patterns/build-both-fabric-alternatives / patterns/collective-library-transport-codesign — related patterns.
- companies/meta.