Backend/frontend network separation¶
Definition¶
In large GPU training clusters, backend/frontend (BE/FE) network separation is the practice of connecting each training rack to two distinct physical networks with different design goals:
- Frontend (FE) network — standard data-center hierarchy (rack switches → fabric switches → higher). Carries data-ingestion, checkpointing, and logging traffic between GPUs and the storage warehouse. Shares topology and tooling with the rest of the DC.
- Backend (BE) network — a specialised, dedicated fabric (typically RoCE or InfiniBand) connecting all RDMA NICs in a non-blocking architecture. Carries training collective traffic (AllReduce, AllGather, Reduce-Scatter, etc.) with high bandwidth, low latency, and lossless transport.
The two networks carry qualitatively different traffic and stress the fabric on different axes, so they're engineered and operated as separate systems.
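To make the collective-traffic load concrete: a ring AllReduce of an S-byte gradient buffer across N GPUs puts roughly 2(N−1)/N × S bytes on the wire per GPU per iteration. A minimal sketch (the buffer size and GPU count below are illustrative, not figures from the source):

```python
def ring_allreduce_bytes(buffer_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends for one ring AllReduce: (n-1) reduce-scatter
    steps plus (n-1) all-gather steps, each moving buffer_bytes / n."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

# Illustrative: a 10 GiB gradient buffer across 1024 GPUs
per_gpu_bytes = ring_allreduce_bytes(10 * 2**30, 1024)  # ~20 GiB per GPU per step
```

Repeated every training step, this is the few-flows, elephant-fat traffic shape the BE fabric is built around.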
Meta's framing (2024-08-05 SIGCOMM)¶
"The training cluster relies on two independent networks: the frontend (FE) network for tasks such as data ingestion, checkpointing, and logging, and the backend (BE) network for training, as depicted below."
"The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location. This backend fabric utilizes the RoCEv2 protocol."
(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
Why separate at all¶
The BE fabric's performance envelope is inherently different from the FE's:
| Axis | Frontend | Backend |
|---|---|---|
| Traffic shape | Many flows, typical DC distribution | Few long-lived elephant flows |
| Loss tolerance | TCP — lossy is fine | Lossless required (RoCE/RDMA) |
| Congestion control | Standard ECN / TCP CC | DCQCN or PFC-based |
| Load balancing | ECMP hashing just works | Few fat flows defeat ECMP hashing |
| Buffering | Shallow OK | Deep buffers on spine |
| Evolution cadence | Slow — affects all services | Fast — per-generation retuning |
Forcing both sets of requirements onto a single network means compromising both. Physical separation lets each evolve on its own schedule: the BE can try turning DCQCN off, change ECMP hash configuration, deploy a new topology, or upgrade link speeds without touching the FE — and vice versa.
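The ECMP failure mode in the table can be demonstrated with a toy simulation: hashing a handful of elephant flows onto equal-cost links routinely overloads one link while leaving others idle, whereas the same aggregate traffic split into thousands of small flows balances almost perfectly. A sketch, with uniform random assignment standing in for 5-tuple ECMP hashing (all numbers illustrative):

```python
import random

def avg_worst_link_load(n_flows: int, n_links: int, flow_size: float,
                        trials: int = 1000, seed: int = 0) -> float:
    """Mean load on the most-loaded link when flows are hashed
    uniformly at random onto equal-cost links."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        loads = [0.0] * n_links
        for _ in range(n_flows):
            loads[rng.randrange(n_links)] += flow_size
        total += max(loads)
    return total / trials

# Same aggregate traffic, different granularity (ideal per-link load: 1.0)
elephants = avg_worst_link_load(8, 8, 1.0)       # few fat flows: hot link well above 1.0
mice = avg_worst_link_load(8000, 8, 1.0 / 1000)  # many small flows: close to 1.0
```

With flow counts comparable to the link count, some links reliably carry double or triple the ideal load, which is exactly why the BE needs load-balancing beyond plain per-flow hashing.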
Operational consequences¶
- Independent tooling. The BE fabric can use custom switch firmware, routing schemes, and telemetry stacks that the FE doesn't need.
- Independent failure domain. Storage-warehouse congestion doesn't slow training; training AllReduce bursts don't stall checkpoint writes.
- Independent capacity planning. The FE scales with storage-warehouse fan-out; the BE scales with GPU count and collective-communication bandwidth.
- Cost is explicit. Training racks pay for two NIC ports and two cable runs. For the GPU-dominated BOM this is a rounding error.
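A back-of-envelope for the capacity-planning bullet: BE injection bandwidth grows linearly with GPU count, so the non-blocking fabric must grow with it. A sketch assuming 400 Gbps per RDMA NIC (the NIC speed is an assumption for illustration, not a figure from the source):

```python
def be_injection_tbps(n_gpus: int, nic_gbps: float = 400.0) -> float:
    """Aggregate RDMA-NIC injection bandwidth (Tbit/s) that a
    non-blocking BE fabric must carry; linear in GPU count."""
    return n_gpus * nic_gbps / 1000.0

# Illustrative: 16,384 GPUs at an assumed 400 Gbps each
demand = be_injection_tbps(16_384)  # 6553.6 Tbit/s of injection bandwidth
```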
The FE/BE split is the concrete instance at Meta of the broader architectural pattern patterns/dedicated-backend-training-fabric.
Relationship to adjacent concepts¶
- concepts/control-plane-data-plane-separation — FE/BE is a data-plane/data-plane split (both carry data), not control/data. The motivation is different (different traffic shapes, not different operational-risk profiles), but the architectural shape is similar.
- systems/ai-zone — what Meta calls the BE side specifically; the FE side stays on the generic DC hierarchy.
- systems/infiniband vs systems/roce-rdma-over-converged-ethernet — both are BE-fabric candidates; the FE network is always Ethernet.
Scale context¶
Meta's training rack setup (per the 2024-08-05 SIGCOMM paper):
- FE connection: rack → RSW (rack switch) → FSW (fabric switch) → higher tiers (standard DC Clos)
- BE connection: rack → RTSW (rack training switch) → CTSW (cluster training switch) → ATSW (aggregator training switch, when spanning clusters), forming the dedicated RoCE fabric
Ingress bandwidth on the FE's RSW is sized so that storage-warehouse reads don't starve training; the BE spine is non-blocking.
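The FE-sizing point reduces to simple arithmetic: sustained checkpoint-write bandwidth is checkpoint size divided by the acceptable write window. A sketch in which every input (model size, bytes per parameter, write window) is an assumed illustration, not a figure from the paper:

```python
def checkpoint_write_gbps(params_billions: float, bytes_per_param: float,
                          window_s: float) -> float:
    """Sustained FE bandwidth (Gbit/s) needed to flush one full
    checkpoint to the storage warehouse within window_s seconds."""
    return params_billions * 1e9 * bytes_per_param * 8 / window_s / 1e9

# Illustrative: 70B parameters, 2 bytes each, written out in 60 s
needed = checkpoint_write_gbps(70, 2, 60)
```

Scaling the window down (or the model up) pushes this demand toward the RSW's ingress capacity, which is why FE sizing is driven by storage-warehouse fan-out rather than collective traffic.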
Seen in¶
- sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — canonical wiki reference; Meta opens its topology section with this split.
Related¶
- concepts/control-plane-data-plane-separation — adjacent architectural split.
- concepts/fat-flow-load-balancing — the problem that motivates the BE's specialised design.
- concepts/ecmp-equal-cost-multipath — the primitive whose BE-specific failure modes drove the split.
- systems/ai-zone — Meta's BE fabric template.
- systems/roce-rdma-over-converged-ethernet / systems/infiniband — the BE fabric options.
- systems/meta-genai-cluster-roce — the deployment that instantiates this.
- patterns/dedicated-backend-training-fabric — the pattern the concept supports.