CONCEPT

Backend/frontend network separation

Definition

In large GPU training clusters, backend/frontend (BE/FE) network separation is the practice of connecting each training rack to two distinct physical networks with different design goals:

  • Frontend (FE) network — standard data-center hierarchy (rack switches → fabric switches → higher). Carries data-ingestion, checkpointing, and logging traffic between GPUs and the storage warehouse. Shares topology and tooling with the rest of the DC.
  • Backend (BE) network — a specialised, dedicated fabric (typically RoCE or InfiniBand) connecting all RDMA NICs in a non-blocking architecture. Carries training collective traffic (AllReduce, AllGather, ReduceScatter, etc.) with high bandwidth, low latency, and lossless transport.

The two networks carry qualitatively different traffic and stress the fabric on different axes, so they're engineered and operated as separate systems.
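To make "qualitatively different" concrete, here is a sketch of the per-GPU traffic volume of a bandwidth-optimal ring AllReduce, the kind of collective the BE fabric carries. The model size, GPU count, and dtype below are illustrative assumptions, not figures from the source:

```python
def ring_allreduce_bytes_per_gpu(n_gpus: int, grad_bytes: float) -> float:
    """Bytes each GPU sends (and receives) in a bandwidth-optimal ring
    AllReduce: a reduce-scatter phase plus an all-gather phase, each
    moving (n-1)/n of the gradient buffer."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

# Hypothetical workload: 70B-parameter model, fp16 gradients, 1024 GPUs.
grad_bytes = 70e9 * 2  # ~140 GB of gradients per step
per_gpu = ring_allreduce_bytes_per_gpu(1024, grad_bytes)
print(f"{per_gpu / 1e9:.1f} GB sent per GPU per step")  # ≈ 279.7 GB
```

A few long-lived flows of this size every training step is what the table below calls "elephant" traffic; it looks nothing like the many short flows of an ordinary DC workload.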

Meta's framing (2024-08-05 SIGCOMM)

"The training cluster relies on two independent networks: the frontend (FE) network for tasks such as data ingestion, checkpointing, and logging, and the backend (BE) network for training, as depicted below."

"The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location. This backend fabric utilizes the RoCEv2 protocol."

(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why separate at all

The BE fabric's performance envelope is inherently different from the FE's:

| Axis | Frontend | Backend |
|---|---|---|
| Traffic shape | Many flows, typical DC distribution | Few, long-lived elephant flows |
| Loss tolerance | TCP; loss is tolerable | Lossless required (RoCE/RDMA) |
| Congestion control | Standard ECN / TCP CC | DCQCN or PFC-based |
| Load balancing | ECMP hashing works well | Few fat flows defeat ECMP |
| Buffering | Shallow buffers suffice | Deep buffers on the spine |
| Evolution cadence | Slow; changes affect all services | Fast; retuned per hardware generation |

Forcing both sets of requirements onto a single network means compromising both. Physical separation lets each evolve on its own schedule: the BE can try turning DCQCN off, change ECMP hash configuration, deploy a new topology, or upgrade link speeds without touching the FE — and vice versa.
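The ECMP failure mode in the table above is easy to demonstrate: static per-flow hashing pins each flow to one uplink for its lifetime, so a handful of elephant flows frequently collide on the same link while other links sit idle. A minimal simulation, using a seeded random mapping as a stand-in for a switch's 5-tuple hash (the mapping, flow count, and link count are assumptions for illustration):

```python
import random
from collections import Counter

def ecmp_assign(flows, n_links, seed=0):
    """Static ECMP: each flow's hash picks one uplink for its entire
    lifetime, regardless of flow size. A stable random mapping stands
    in for the switch's 5-tuple hash."""
    rnd = random.Random(seed)
    return [rnd.randrange(n_links) for _ in flows]

# 8 elephant flows onto 8 uplinks: collisions are the common case, so
# some links carry 2-3x their fair share while others carry nothing.
load = Counter(ecmp_assign(range(8), 8))
print(sorted(load.values(), reverse=True))
```

With 8 flows and 8 links, a perfectly even spread happens in only 8!/8^8 ≈ 0.24% of hash outcomes, which is why BE fabrics need flow-aware or finer-grained load balancing instead of plain ECMP.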

Operational consequences

  • Independent tooling. The BE fabric can use custom switch firmware, routing schemes, and telemetry stacks that the FE doesn't need.
  • Independent failure domain. Storage-warehouse congestion doesn't slow training; training AllReduce bursts don't stall checkpoint writes.
  • Independent capacity planning. The FE scales with storage-warehouse fan-out; the BE scales with GPU count and collective-communication bandwidth.
  • Cost is explicit. Training racks pay for two NIC ports and two cable runs. For the GPU-dominated BOM this is a rounding error.
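A back-of-envelope check of the "rounding error" claim, using purely hypothetical list prices (none of these figures appear in the source):

```python
# All prices are hypothetical placeholders, not from the source.
gpu_rack_bom = 8 * 30_000             # e.g. 8 accelerators at $30k each
dual_network_extra = 2 * (1_500 + 300)  # 2 NIC ports + 2 cable runs per host
print(f"{dual_network_extra / gpu_rack_bom:.1%} of rack BOM")  # → 1.5%
```

Even with generous per-port prices, the second network is low single digits of a GPU-dominated bill of materials.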

The FE/BE split is Meta's concrete instance of the broader architectural pattern patterns/dedicated-backend-training-fabric.

Scale context

Meta's training rack setup (per the 2024-08-05 SIGCOMM paper):

  • FE connection: rack → RSW (rack switch) → FSW (fabric switch) → higher hierarchy (standard DC Clos)
  • BE connection: rack → RTSW (rack training switch) → CTSW (cluster training switch) → (ATSW, aggregator training switch) (dedicated RoCE fabric)

Ingress bandwidth on the FE's RSW is sized so that storage-warehouse reads don't starve training; the BE spine is non-blocking.
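That sizing rule can be expressed as a simple capacity check. The host counts, per-host read rates, and uplink speeds below are hypothetical; the source gives no concrete numbers:

```python
def fe_ingress_ok(n_hosts: int, read_gbps_per_host: float,
                  rsw_uplink_gbps: float, oversub: float = 1.0) -> bool:
    """Check that worst-case storage-warehouse reads through the rack
    switch (RSW) fit within its uplink capacity at a chosen
    oversubscription ratio. All inputs are hypothetical."""
    demand = n_hosts * read_gbps_per_host
    return demand <= rsw_uplink_gbps * oversub

# e.g. 16 hosts each pulling checkpoints/data at 25 Gbps through
# 4x100G RSW uplinks: 400 Gbps demand vs 400 Gbps capacity.
print(fe_ingress_ok(16, 25, 400))  # → True
```

The BE needs no such check: its spine is non-blocking by construction, so any GPU pair gets full bandwidth regardless of placement.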
