
PATTERN Cited by 2 sources

Dedicated backend training fabric

Context

Hyperscale AI training clusters have qualitatively different network requirements from the rest of the data center:

  • Traffic shape — few, long-lived, elephant flows vs many short flows.
  • Loss tolerance — lossless required (RDMA); standard DC is lossy.
  • Bandwidth — full-bisection, non-blocking per GPU; standard DC accepts oversubscription.
  • Failure impact — one bad link stalls thousands of GPUs; standard DC services degrade gracefully.
  • Evolution cadence — tuning/retuning per-GPU-generation; standard DC evolves on multi-year cycles.

Trying to serve all of those requirements in a single shared fabric forces compromises that hurt both sides.
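The bandwidth gap above is easy to quantify. A minimal sketch (hypothetical cluster sizes, not Meta's figures): a non-blocking fabric must supply bisection bandwidth equal to half the cluster's total injection bandwidth, while an oversubscribed shared-DC fabric supplies a fraction of that.

```python
def bisection_bw_gbps(num_gpus: int, nic_gbps: int, oversubscription: float = 1.0) -> float:
    """Bisection bandwidth so any half of the GPUs can talk to the other half at line rate."""
    injection = num_gpus * nic_gbps          # total host injection bandwidth
    return (injection / 2) / oversubscription

non_blocking  = bisection_bw_gbps(2048, 400)       # 1:1, training-fabric requirement
shared_dc     = bisection_bw_gbps(2048, 400, 4.0)  # 4:1, typical oversubscribed DC tier
print(non_blocking, shared_dc)  # 409600.0 102400.0
```

A 2,048-GPU cluster with 400G NICs would need ~410 Tbps of bisection bandwidth to stay non-blocking, four times what a 4:1-oversubscribed tier provides. That gap is why the requirement can't simply be absorbed into the shared fabric.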

Pattern

Build the training fabric as a physically separate, dedicated network — parallel to but architecturally distinct from the rest of the data center. Each training rack is wired to two networks:

  • Frontend (FE) — standard DC hierarchy for ingestion / checkpoints / logging.
  • Backend (BE) — a specialised, non-blocking RDMA fabric (RoCE or InfiniBand) dedicated to training collective traffic.

The BE fabric evolves, operates, and scales on its own schedule. See concepts/backend-frontend-network-separation for the concept page; the canonical instance is Meta's AI Zone (leaf-spine two-stage Clos).
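The sizing of a two-stage (leaf-spine) Clos like the AI Zone follows directly from switch radix. A sketch of the port arithmetic, assuming identical switches of radix k and a 1:1 uplink:downlink split at the leaves (illustrative numbers, not Meta's actual AI Zone dimensions):

```python
def clos_capacity(radix: int) -> dict:
    """Max non-blocking two-stage Clos built from switches of the given radix."""
    down = radix // 2      # leaf ports facing hosts
    up = radix - down      # leaf ports facing spines (1:1 split, non-blocking)
    spines = up            # one uplink from each leaf to each spine
    leaves = radix         # each spine port terminates one leaf
    return {"spines": spines, "leaves": leaves, "hosts": leaves * down}

print(clos_capacity(64))  # {'spines': 32, 'leaves': 64, 'hosts': 2048}
```

With 64-port switches, the topology tops out at 2,048 host ports; growing past that requires a bigger radix or a third stage, which is one reason the BE fabric needs to scale on its own schedule.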

Meta's 2024-08-05 framing

"We built a dedicated backend network specifically for distributed training. This allowed us to evolve, operate, and scale independently from the rest of the data center network." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

This is the architectural motivation in one sentence — the key word is "independently."

Design choices the pattern enables

Because the BE fabric is its own thing, Meta can make decisions that would be disruptive if made on a shared fabric:

  1. Custom switch firmware — UDF hashing so E-ECMP can include the RoCE QP field, rather than a generic build.
  2. Non-standard routing — first concepts/path-pinning, then E-ECMP — each trialled without affecting other services.
  3. DCQCN off at 400G — an unusual choice that would be untenable on a shared lossless fabric.
  4. Deep-buffer spine switches — expensive per port; justified only for training workloads.
  5. Early 400G-fiber spine — deployed before the wider data center transitioned.
  6. Dedicated failure-domain isolation — a BE congestion incident doesn't cascade to storage or services.
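Choice 1 is worth unpacking: with standard 5-tuple ECMP, all RDMA queue pairs between one host pair share a hash input and collapse onto one path; folding the QP number into the hash spreads them out. A toy sketch — the hash function and field values are illustrative only, real switches use UDF matching in firmware:

```python
def ecmp_port(fields, num_paths=16):
    """Pick an equal-cost path from the given header fields (toy multiplicative hash)."""
    h = 0
    for f in fields:
        h = (h * 31 + f) & 0xFFFFFFFF
    return h % num_paths

# src IP, dst IP, protocol (UDP), src port, dst port (RoCEv2 uses UDP 4791)
five_tuple = (0x0A000001, 0x0A000101, 17, 4791, 4791)

paths_plain = {ecmp_port(five_tuple) for _ in range(8)}           # fixed 5-tuple: one path
paths_eecmp = {ecmp_port(five_tuple + (qp,)) for qp in range(8)}  # QP joins the hash input
print(len(paths_plain), len(paths_eecmp))  # 1 8
```

Eight queue pairs that would polarise onto a single link under 5-tuple hashing fan out across eight paths once the QP field participates — the effect Meta's E-ECMP firmware change targets.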

When to use

  • Training cluster scale ≥ a rack or two — below that, a shared fabric is cheaper.
  • GPU interconnect is a first-order constraint on training throughput — i.e. the fabric is on the critical path.
  • RDMA is used for collective communication. TCP-only workloads don't need the BE/FE split.
  • Operational independence is valuable: the team running training wants to evolve the fabric without coordinating with the generic-DC networking team.

When not to use

  • Small-scale clusters. The dual-NIC cost + separate operations aren't amortised.
  • Training workloads without collective-comm intensity — inference serving often doesn't need BE-class isolation.
  • Colocated multi-tenant where training is one workload among many — economics may favour a single tuned fabric over two.

Variants

  • RoCE backend (Meta, Microsoft Azure) — Ethernet-native, operational familiarity.
  • InfiniBand backend (Meta, NVIDIA DGX SuperPOD, CoreWeave) — HPC-native, adaptive routing, richer collective offload.
  • Both (Meta 24K × 2) — see patterns/build-both-fabric-alternatives for the meta-pattern of paired builds.
  • Proprietary (Google TPU pods, AWS EFA) — special cases with tighter host/NIC co-design.

Relationship to adjacent patterns

Wiki instances
