AI Zone (Meta two-stage Clos training fabric)¶
AI Zone is Meta's name for the two-stage Clos topology that forms the backend training fabric of an AI cluster — a non-blocking spine+leaf design sized to host as many GPUs as a single fabric generation can reach, with an explicit aggregator layer (ATSW) above it for stitching multiple Zones into a data-center-scale fabric when a single Zone is too small. Introduced as an evolution from Meta's earlier star-topology RoCEv1 deployments, the AI Zone is the canonical unit of RoCE training capacity at Meta — the topology on which the 24K-GPU RoCE GenAI cluster and Llama 3.1 405B training ride.
Anatomy¶
| Layer | Device | Role | Physical media |
|---|---|---|---|
| Leaf | RTSW — Rack Training Switch | Connects all GPUs in a rack (scale-up) | Copper DAC |
| Spine | CTSW — Cluster Training Switch | Connects all racks in the Zone (scale-out) | Single-mode fiber + 400G pluggables |
| Aggregator (optional) | ATSW — Aggregator Training Switch | Stitches multiple Zones in one DC building | Inter-Zone, oversubscribed |
Key fabric properties:
- Non-blocking inside the Zone. Any-to-any GPU throughput within one Zone is full bisection bandwidth.
- CTSW has deep buffers, statically partitioned per port. Absorbs PFC pauses and bursty collective traffic without transport-level congestion control in the stable case.
- Cross-AI-Zone links are deliberately oversubscribed. To keep them under-utilised, the training-job scheduler is made topology-aware (see patterns/minimum-cut-training-job-placement).
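The non-blocking property has a simple sizing consequence: each RTSW must dedicate as many ports to CTSW uplinks as it gives to GPU downlinks, and the CTSW radix then caps how many racks (and hence GPUs) one Zone can hold. A minimal sketch of that arithmetic, with illustrative radix figures that are assumptions rather than Meta's actual hardware specs:

```python
def max_gpus(leaf_radix: int, spine_radix: int) -> int:
    """Max GPUs in a non-blocking two-stage Clos.

    Non-blocking means each leaf (RTSW) splits its ports 1:1 between
    GPU downlinks and spine (CTSW) uplinks. Each leaf connects to every
    spine, so the spine radix caps the number of leaves (racks).
    """
    downlinks_per_leaf = leaf_radix // 2   # one port per GPU NIC
    uplinks_per_leaf = leaf_radix - downlinks_per_leaf
    num_spines = uplinks_per_leaf          # one CTSW per uplink stripe
    max_leaves = spine_radix               # each spine port feeds one leaf
    return max_leaves * downlinks_per_leaf

# Illustrative radices only: a 64-port leaf and 128-port spine give
# 128 racks x 32 GPUs = 4096 GPUs per Zone.
print(max_gpus(64, 128))  # 4096
```

With these (assumed) radices the formula lands on the ~4K-GPU Zone scale the paper describes; growing the Zone means a bigger spine radix or a third switching stage, which is where the ATSW layer comes in.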
Why this shape¶
Meta iterated into the AI Zone topology from a simpler baseline:
- Initial star. First RoCE GPU clusters used "a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol." The limits were obvious: bounded GPU scale, and no switch redundancy — the central switch was a single point of failure.
- Two-stage Clos (the AI Zone). Leaf+spine gives scale-out + switch redundancy. Non-blocking inside the Zone means collective operations aren't bottlenecked on oversubscribed links.
- ATSW for LLM jobs. "Emerging AI advancements, such as LLMs like Llama, demand a GPU scale larger than what a single AI zone provides. To accommodate this, we designed an aggregator training switch (ATSW) layer that connects the CTSWs in a data center building, expanding the RoCE domain beyond a single AI Zone."
"The AI Zones are designed to support a large number of interconnected GPUs in a non-blocking manner." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
Relationship to frontend network¶
A training rack is physically connected to two networks (see concepts/backend-frontend-network-separation):
- Frontend (FE): traditional RSW → FSW → higher hierarchy, hosts storage-warehouse traffic for ingestion, checkpointing, logging.
- Backend (BE) = AI Zone: the two-stage Clos described above, carries the training collective traffic over RoCEv2.
This split lets the BE evolve independently (new topology generations, new congestion-control schemes, DCQCN on/off — see concepts/dcqcn) without disturbing the FE. It's a canonical instance of patterns/dedicated-backend-training-fabric.
How the scheduler cooperates¶
At the ATSW layer, Meta faces a choice: over-provision the inter-Zone links (expensive) or keep them oversubscribed and reduce the traffic that crosses them. Meta picked the latter and pushed the optimisation into the scheduler:
"The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. … To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a 'minimum cut' when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time."
This is documented on its own page: patterns/minimum-cut-training-job-placement.
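The intuition behind the minimum-cut placement can be shown with a toy model: for a ring collective, assigning consecutive ranks to the same Zone means the ring crosses the oversubscribed Zone boundary only a handful of times, while a strided assignment makes every hop cross it. This is a deliberately simplified sketch — the real scheduler solves a richer, topology-aware placement problem:

```python
def ring_cross_zone_edges(assignment: list[int]) -> int:
    """Count ring-AllReduce edges whose endpoints sit in different Zones.

    assignment[i] is the Zone index hosting rank i; each rank talks to
    its successor (and rank n-1 wraps to rank 0).
    """
    n = len(assignment)
    return sum(assignment[i] != assignment[(i + 1) % n] for i in range(n))

n_ranks, n_zones = 8, 2
# Contiguous ("minimum cut"-style) placement: [0,0,0,0,1,1,1,1]
contiguous = [i * n_zones // n_ranks for i in range(n_ranks)]
# Naive strided placement: [0,1,0,1,0,1,0,1]
strided = [i % n_zones for i in range(n_ranks)]

print(ring_cross_zone_edges(contiguous))  # 2 hops cross the Zone boundary
print(ring_cross_zone_edges(strided))     # 8: every hop crosses Zones
```

The gap (2 vs. 8 boundary crossings on an 8-rank ring) is exactly the cross-Zone traffic reduction the scheduler is chasing, scaled down to toy size.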
Routing inside the Zone¶
Inside an AI Zone, load balancing between RTSW and CTSW uses ECMP. The routing evolution is the paper's second main topic:
- Baseline ECMP — failed under AI training's low-entropy / elephant-flow traffic.
- Path-pinning — Meta's first custom scheme; deterministic per-RTSW-downlink-slice assignment. Broke under partial rack allocation and network failures (>30% performance degradation).
- 2× RTSW uplink overprovisioning (1:2 under-subscription) — short-term bandaid to mask path-pinning fragmentation.
- Queue Pair scaling — hashing on the RoCE QP field, with NCCL spreading messages across multiple QPs; +40% AllReduce throughput vs baseline ECMP.
See concepts/fat-flow-load-balancing for why ECMP fails on training traffic in the first place.
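The mechanics of why QP scaling helps can be sketched in a few lines: ECMP picks an uplink by hashing packet header fields, so one elephant flow on a single QP pins all its bytes to one uplink, while spreading the same bytes over many QPs gives the hash the entropy it needs to use many uplinks. The hash function and field choices below are illustrative, not the switch ASIC's actual implementation:

```python
import hashlib

def ecmp_uplink(src: str, dst: str, qp: int, n_uplinks: int = 16) -> int:
    """Illustrative ECMP: hash (src, dst, QP) onto one of n_uplinks."""
    key = f"{src}-{dst}-{qp}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_uplinks

src, dst = "10.0.0.1", "10.0.1.1"

# One fat flow, one QP: every packet hashes to the same uplink.
single_qp = {ecmp_uplink(src, dst, qp=7) for _ in range(1000)}
# Same endpoints, traffic spread across 64 QPs: many distinct uplinks.
many_qps = {ecmp_uplink(src, dst, qp) for qp in range(64)}

print(len(single_qp))  # 1: one elephant flow pins one uplink
print(len(many_qps))   # many distinct uplinks in use
```

This is the low-entropy problem in miniature: training traffic has few (src, dst) pairs, so only a per-message field like the QP can restore hash diversity.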
Scale context (2024)¶
- Up to 4K GPUs per Zone historically (prior to the 2024 scale step).
- 24K GPUs on the Meta GenAI cluster via multiple Zones + ATSW (see systems/meta-genai-cluster-roce).
- Meta describes "numerous clusters, each accommodating thousands of GPUs" — the AI Zone template is reused across the fleet, not bespoke per cluster.
Seen in¶
- sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — canonical wiki reference; the paper's topology section introduces the Zone vocabulary.
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — the earlier overview post hints at the topology without naming it.
Related¶
- systems/roce-rdma-over-converged-ethernet — the fabric technology the Zone is built on.
- systems/meta-genai-cluster-roce — the 24K-GPU deployment that uses this template at multi-Zone scale.
- systems/infiniband — the fabric this design competes with; IB's adaptive routing + HCA-level CC make fewer of these decisions customer-facing.
- concepts/ecmp-equal-cost-multipath / concepts/fat-flow-load-balancing — the routing problems the Zone must solve.
- concepts/backend-frontend-network-separation — what the Zone is the BE of.
- concepts/collective-communication-topology-awareness — algorithm-side counterpart to the Zone's topological structure.
- patterns/dedicated-backend-training-fabric — the architectural pattern the AI Zone expresses.
- patterns/minimum-cut-training-job-placement — the scheduler-side optimisation for cross-Zone oversubscription.
- companies/meta.