
PATTERN Cited by 2 sources

Dedicated backend training fabric

Context

Hyperscale AI training clusters have qualitatively different network requirements from the rest of the data center:

  • Traffic shape — few, long-lived, elephant flows vs many short flows.
  • Loss tolerance — lossless required (RDMA); standard DC is lossy.
  • Bandwidth — full-bisection, non-blocking per GPU; standard DC accepts oversubscription.
  • Failure impact — one bad link stalls thousands of GPUs; standard DC services degrade gracefully.
  • Evolution cadence — tuning/retuning per-GPU-generation; standard DC evolves on multi-year cycles.

Trying to serve all of those requirements in a single shared fabric forces compromises that hurt both sides.
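The bandwidth gap above is easy to quantify. A minimal sketch (hypothetical cluster sizes, not Meta's figures): a non-blocking fabric must supply bisection bandwidth equal to half the cluster's total injection bandwidth, while an oversubscribed shared-DC fabric supplies a fraction of that.

```python
def bisection_bw_gbps(num_gpus: int, nic_gbps: int, oversubscription: float = 1.0) -> float:
    """Bisection bandwidth so any half of the GPUs can talk to the other half at line rate."""
    injection = num_gpus * nic_gbps          # total host injection bandwidth
    return (injection / 2) / oversubscription

non_blocking  = bisection_bw_gbps(2048, 400)       # 1:1, training-fabric requirement
shared_dc     = bisection_bw_gbps(2048, 400, 4.0)  # 4:1, typical oversubscribed DC tier
print(non_blocking, shared_dc)  # 409600.0 102400.0
```

A 2,048-GPU cluster with 400G NICs would need ~410 Tbps of bisection bandwidth to stay non-blocking, four times what a 4:1-oversubscribed tier provides. That gap is why the requirement can't simply be absorbed into the shared fabric.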

Pattern

Build the training fabric as a physically separate, dedicated network — parallel to but architecturally distinct from the rest of the data center. Each training rack is wired to two networks:

  • Frontend (FE) — standard DC hierarchy for ingestion / checkpoints / logging.
  • Backend (BE) — a specialised, non-blocking RDMA fabric (RoCE or InfiniBand) dedicated to training collective traffic.

The BE fabric evolves, operates, and scales on its own schedule. See concepts/backend-frontend-network-separation for the concept page; the canonical instance is Meta's AI Zone (leaf-spine two-stage Clos).
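The sizing of a two-stage (leaf-spine) Clos like the AI Zone follows directly from switch radix. A sketch of the port arithmetic, assuming identical switches of radix k and a 1:1 uplink:downlink split at the leaves (illustrative numbers, not Meta's actual AI Zone dimensions):

```python
def clos_capacity(radix: int) -> dict:
    """Max non-blocking two-stage Clos built from switches of the given radix."""
    down = radix // 2      # leaf ports facing hosts
    up = radix - down      # leaf ports facing spines (1:1 split, non-blocking)
    spines = up            # one uplink from each leaf to each spine
    leaves = radix         # each spine port terminates one leaf
    return {"spines": spines, "leaves": leaves, "hosts": leaves * down}

print(clos_capacity(64))  # {'spines': 32, 'leaves': 64, 'hosts': 2048}
```

With 64-port switches, the topology tops out at 2,048 host ports; growing past that requires a bigger radix or a third stage, which is one reason the BE fabric needs to scale on its own schedule.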

Meta's 2024-08-05 framing

"We built a dedicated backend network specifically for distributed training. This allowed us to evolve, operate, and scale independently from the rest of the data center network." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

This is the architectural motivation in one sentence — the key word is "independently."

Design choices the pattern enables

Because the BE fabric is its own thing, Meta can make decisions that would be disruptive if made on a shared fabric:

  1. Custom switch firmware — UDF hashing so E-ECMP can include the RoCE QP field, rather than a generic build.
  2. Non-standard routing — first concepts/path-pinning, then E-ECMP — each trialled without affecting other services.
  3. DCQCN off at 400G — an unusual choice that would be untenable on a shared lossless fabric.
  4. Deep-buffer spine switches — expensive per port; justified only for training workloads.
  5. Early 400G-fiber spine — deployed before the wider data center transitioned.
  6. Dedicated failure-domain isolation — a BE congestion incident doesn't cascade to storage or services.
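Choice 1 is worth unpacking: with standard 5-tuple ECMP, all RDMA queue pairs between one host pair share a hash input and collapse onto one path; folding the QP number into the hash spreads them out. A toy sketch — the hash function and field values are illustrative only, real switches use UDF matching in firmware:

```python
def ecmp_port(fields, num_paths=16):
    """Pick an equal-cost path from the given header fields (toy multiplicative hash)."""
    h = 0
    for f in fields:
        h = (h * 31 + f) & 0xFFFFFFFF
    return h % num_paths

# src IP, dst IP, protocol (UDP), src port, dst port (RoCEv2 uses UDP 4791)
five_tuple = (0x0A000001, 0x0A000101, 17, 4791, 4791)

paths_plain = {ecmp_port(five_tuple) for _ in range(8)}           # fixed 5-tuple: one path
paths_eecmp = {ecmp_port(five_tuple + (qp,)) for qp in range(8)}  # QP joins the hash input
print(len(paths_plain), len(paths_eecmp))  # 1 8
```

Eight queue pairs that would polarise onto a single link under 5-tuple hashing fan out across eight paths once the QP field participates — the effect Meta's E-ECMP firmware change targets.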

When to use

  • Training cluster scale ≥ a rack or two — below that, a shared fabric is cheaper.
  • GPU interconnect is a first-order constraint on training throughput — i.e. the fabric is on the critical path.
  • RDMA is used for collective communication. TCP-only workloads don't need the BE/FE split.
  • Operational independence is valuable: the team running training wants to evolve the fabric without coordinating with the generic-DC networking team.

When not to use

  • Small-scale clusters. The dual-NIC cost + separate operations aren't amortised.
  • Training workloads without collective-comm intensity — inference serving often doesn't need BE-class isolation.
  • Colocated multi-tenant where training is one workload among many — economics may favour a single tuned fabric over two.

Variants

  • RoCE backend (Meta, Microsoft Azure) — Ethernet-native, operational familiarity.
  • InfiniBand backend (Meta, NVIDIA DGX SuperPOD, CoreWeave) — HPC-native, adaptive routing, richer collective offload.
  • Both (Meta 24K × 2) — see patterns/build-both-fabric-alternatives for the meta-pattern of paired builds.
  • Proprietary (Google TPU pods, AWS EFA) — special cases with tighter host/NIC co-design.

Relationship to adjacent patterns

Wiki instances
