CONCEPT

Backend/frontend network separation

Definition

In large GPU training clusters, backend/frontend (BE/FE) network separation is the practice of connecting each training rack to two distinct physical networks with different design goals:

  • Frontend (FE) network — standard data-center hierarchy (rack switches → fabric switches → higher). Carries data-ingestion, checkpointing, and logging traffic between GPUs and the storage warehouse. Shares topology and tooling with the rest of the DC.
  • Backend (BE) network — a specialised, dedicated fabric (typically RoCE or InfiniBand) connecting all RDMA NICs in a non-blocking architecture. Carries training collective traffic (AllReduce, AllGather, ReduceScatter, etc.) with high bandwidth, low latency, and lossless transport.

The two networks carry qualitatively different traffic and stress the fabric on different axes, so they're engineered and operated as separate systems.
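To make "qualitatively different" concrete, here is a sketch of the per-GPU traffic volume of a bandwidth-optimal ring AllReduce, the kind of collective the BE fabric carries. The model size, GPU count, and dtype below are illustrative assumptions, not figures from the source:

```python
def ring_allreduce_bytes_per_gpu(n_gpus: int, grad_bytes: float) -> float:
    """Bytes each GPU sends (and receives) in a bandwidth-optimal ring
    AllReduce: a reduce-scatter phase plus an all-gather phase, each
    moving (n-1)/n of the gradient buffer."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

# Hypothetical workload: 70B-parameter model, fp16 gradients, 1024 GPUs.
grad_bytes = 70e9 * 2  # ~140 GB of gradients per step
per_gpu = ring_allreduce_bytes_per_gpu(1024, grad_bytes)
print(f"{per_gpu / 1e9:.1f} GB sent per GPU per step")  # ≈ 279.7 GB
```

A few long-lived flows of this size every training step is what the table below calls "elephant" traffic; it looks nothing like the many short flows of an ordinary DC workload.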

Meta's framing (2024-08-05 SIGCOMM)

"The training cluster relies on two independent networks: the frontend (FE) network for tasks such as data ingestion, checkpointing, and logging, and the backend (BE) network for training, as depicted below."

"The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location. This backend fabric utilizes the RoCEv2 protocol."

(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why separate at all

The BE fabric's performance envelope is inherently different from the FE's:

| Axis | Frontend | Backend |
|---|---|---|
| Traffic shape | Many flows, typical DC distribution | Few, long-lived elephant flows |
| Loss tolerance | TCP; loss is tolerable | Lossless required (RoCE/RDMA) |
| Congestion control | Standard ECN / TCP CC | DCQCN or PFC-based |
| Load balancing | ECMP hashing works well | Few fat flows defeat ECMP |
| Buffering | Shallow buffers suffice | Deep buffers on the spine |
| Evolution cadence | Slow; changes affect all services | Fast; retuned per hardware generation |

Forcing both sets of requirements onto a single network means compromising both. Physical separation lets each evolve on its own schedule: the BE can try turning DCQCN off, change ECMP hash configuration, deploy a new topology, or upgrade link speeds without touching the FE — and vice versa.
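The ECMP failure mode in the table above is easy to demonstrate: static per-flow hashing pins each flow to one uplink for its lifetime, so a handful of elephant flows frequently collide on the same link while other links sit idle. A minimal simulation, using a seeded random mapping as a stand-in for a switch's 5-tuple hash (the mapping, flow count, and link count are assumptions for illustration):

```python
import random
from collections import Counter

def ecmp_assign(flows, n_links, seed=0):
    """Static ECMP: each flow's hash picks one uplink for its entire
    lifetime, regardless of flow size. A stable random mapping stands
    in for the switch's 5-tuple hash."""
    rnd = random.Random(seed)
    return [rnd.randrange(n_links) for _ in flows]

# 8 elephant flows onto 8 uplinks: collisions are the common case, so
# some links carry 2-3x their fair share while others carry nothing.
load = Counter(ecmp_assign(range(8), 8))
print(sorted(load.values(), reverse=True))
```

With 8 flows and 8 links, a perfectly even spread happens in only 8!/8^8 ≈ 0.24% of hash outcomes, which is why BE fabrics need flow-aware or finer-grained load balancing instead of plain ECMP.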

Operational consequences

  • Independent tooling. The BE fabric can use custom switch firmware, routing schemes, and telemetry stacks that the FE doesn't need.
  • Independent failure domain. Storage-warehouse congestion doesn't slow training; training AllReduce bursts don't stall checkpoint writes.
  • Independent capacity planning. The FE scales with storage-warehouse fan-out; the BE scales with GPU count and collective-communication bandwidth.
  • Cost is explicit. Training racks pay for two NIC ports and two cable runs. For the GPU-dominated BOM this is a rounding error.
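A back-of-envelope check of the "rounding error" claim, using purely hypothetical list prices (none of these figures appear in the source):

```python
# All prices are hypothetical placeholders, not from the source.
gpu_rack_bom = 8 * 30_000             # e.g. 8 accelerators at $30k each
dual_network_extra = 2 * (1_500 + 300)  # 2 NIC ports + 2 cable runs per host
print(f"{dual_network_extra / gpu_rack_bom:.1%} of rack BOM")  # → 1.5%
```

Even with generous per-port prices, the second network is low single digits of a GPU-dominated bill of materials.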

The FE/BE split is Meta's concrete instance of the broader architectural pattern patterns/dedicated-backend-training-fabric.

Scale context

Meta's training rack setup (per the 2024-08-05 SIGCOMM paper):

  • FE connection: rack → RSW (rack switch) → FSW (fabric switch) → higher hierarchy (standard DC Clos)
  • BE connection: rack → RTSW (rack training switch) → CTSW (cluster training switch) → (ATSW, aggregator training switch) (dedicated RoCE fabric)

Ingress bandwidth on the FE's RSW is sized so that storage-warehouse reads don't starve training; the BE spine is non-blocking.
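That sizing rule can be expressed as a simple capacity check. The host counts, per-host read rates, and uplink speeds below are hypothetical; the source gives no concrete numbers:

```python
def fe_ingress_ok(n_hosts: int, read_gbps_per_host: float,
                  rsw_uplink_gbps: float, oversub: float = 1.0) -> bool:
    """Check that worst-case storage-warehouse reads through the rack
    switch (RSW) fit within its uplink capacity at a chosen
    oversubscription ratio. All inputs are hypothetical."""
    demand = n_hosts * read_gbps_per_host
    return demand <= rsw_uplink_gbps * oversub

# e.g. 16 hosts each pulling checkpoints/data at 25 Gbps through
# 4x100G RSW uplinks: 400 Gbps demand vs 400 Gbps capacity.
print(fe_ingress_ok(16, 25, 400))  # → True
```

The BE needs no such check: its spine is non-blocking by construction, so any GPU pair gets full bandwidth regardless of placement.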
