

AI Zone (Meta two-stage Clos training fabric)

AI Zone is Meta's name for the two-stage Clos topology that forms the backend training fabric of an AI cluster — a non-blocking spine+leaf design sized to host as many GPUs as a single fabric generation can reach, with an explicit aggregator layer (ATSW) above it for stitching multiple Zones into a data-center-scale fabric when a single Zone is too small. Introduced as an evolution of Meta's earlier star-topology RoCEv1 deployments, the AI Zone is the canonical unit of RoCE training capacity at Meta — the topology on which the 24K-GPU RoCE GenAI cluster and Llama 3.1 405B training ride.

Anatomy

| Layer | Device | Role | Physical media |
| --- | --- | --- | --- |
| Leaf | RTSW — Rack Training Switch | Connects all GPUs in a rack (scale-up) | Copper DAC |
| Spine | CTSW — Cluster Training Switch | Connects all racks in the Zone (scale-out) | Single-mode fiber + 400G pluggables |
| Aggregator (optional) | ATSW — Aggregator Training Switch | Stitches multiple Zones in one DC building | Inter-Zone, oversubscribed |

Key fabric properties:

  • Non-blocking inside the Zone. Any-to-any GPU throughput within one Zone is full bisection bandwidth.
  • CTSW has deep buffers, statically partitioned per port. Absorbs PFC pauses and bursty collective traffic without transport-level congestion control in the stable case.
  • Cross-AI-Zone links are deliberately oversubscribed. To keep them under-utilised, the training-job scheduler is made topology-aware (see patterns/minimum-cut-training-job-placement).
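The non-blocking property above can be sketched as a sizing check: a leaf is non-blocking when its aggregate uplink bandwidth matches its GPU-facing downlink bandwidth, and the spine radix bounds how many racks one Zone can host. The port counts below are illustrative assumptions, not Meta's actual values.

```python
# Minimal sketch of the non-blocking condition for a two-stage Clos.
# All port counts are illustrative, and every port is assumed to run at
# the same speed (e.g. 400G), so bandwidth comparisons reduce to port counts.

def zone_capacity(leaf_down: int, leaf_up: int, spine_radix: int):
    """Size a leaf/spine (RTSW/CTSW) Zone.

    Non-blocking requires each leaf's uplink capacity to be at least its
    GPU-facing downlink capacity, so any-to-any traffic inside the Zone
    sees full bisection bandwidth.
    """
    non_blocking = leaf_up >= leaf_down  # equal per-port speed assumed
    # With one uplink per spine per leaf, the Zone needs `leaf_up` spines,
    # and each spine port terminates one leaf, bounding the rack count.
    max_leaves = spine_radix
    return max_leaves * leaf_down, non_blocking

gpus, nb = zone_capacity(leaf_down=16, leaf_up=16, spine_radix=64)
print(gpus, nb)  # 1024 True — 64 racks of 16 GPUs, non-blocking
```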

Why this shape

Meta iterated into the AI Zone topology from a simpler baseline:

  1. Initial star. First RoCE GPU clusters used "a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol." The limits were obvious: bounded GPU scale, and no switch redundancy — the central switch was a single point of failure.
  2. Two-stage Clos (the AI Zone). Leaf+spine gives scale-out + switch redundancy. Non-blocking inside the Zone means collective operations aren't bottlenecked on oversubscribed links.
  3. ATSW for LLM jobs. "Emerging AI advancements, such as LLMs like Llama, demand a GPU scale larger than what a single AI zone provides. To accommodate this, we designed an aggregator training switch (ATSW) layer that connects the CTSWs in a data center building, expanding the RoCE domain beyond a single AI Zone."

"The AI Zones are designed to support a large number of interconnected GPUs in a non-blocking manner." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Relationship to frontend network

A training rack is physically connected to two networks (see concepts/backend-frontend-network-separation):

  • Frontend (FE): traditional RSW → FSW → higher hierarchy, hosts storage-warehouse traffic for ingestion, checkpointing, logging.
  • Backend (BE) = AI Zone: the two-stage Clos described above, carries the training collective traffic over RoCEv2.

This split lets the BE evolve independently (new topology generations, new congestion-control schemes, DCQCN on/off — see concepts/dcqcn) without disturbing the FE. It's a canonical instance of patterns/dedicated-backend-training-fabric.

How the scheduler cooperates

At the ATSW layer, Meta faces a choice: over-provision the inter-Zone links (expensive) or keep them oversubscribed and reduce the traffic that crosses them. Meta picked the latter and pushed the optimisation into the scheduler:

"The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. … To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a 'minimum cut' when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time."

This is documented on its own page: patterns/minimum-cut-training-job-placement.
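The quoted "minimum cut" idea can be illustrated with a toy partitioner: given pairwise traffic between ranks, choose which ranks land in each Zone so that the bytes crossing the oversubscribed ATSW links are minimised. This brute-force search is a hypothetical sketch for intuition only — Meta's scheduler does topology-aware rank assignment, not exhaustive search.

```python
from itertools import combinations

def min_cut_placement(traffic, zone_size):
    """Toy min-cut split of ranks across two Zones.

    traffic: dict mapping (i, j) rank pairs (i < j) to bytes exchanged.
    Returns (cross-zone bytes, set of ranks placed in Zone A).
    """
    ranks = sorted({r for pair in traffic for r in pair})
    best_cross, best_a = float("inf"), None
    for zone_a in combinations(ranks, zone_size):
        a = set(zone_a)
        # Count only traffic whose endpoints fall in different Zones.
        cross = sum(v for (i, j), v in traffic.items() if (i in a) != (j in a))
        if cross < best_cross:
            best_cross, best_a = cross, a
    return best_cross, best_a

# Ring-style collective traffic among 4 ranks: neighbours exchange heavily.
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (0, 3): 10, (0, 2): 1, (1, 3): 1}
cross, zone_a = min_cut_placement(traffic, zone_size=2)
print(cross, zone_a)  # 22 {0, 1} — keeping ring neighbours 0,1 together
```

The point of the sketch: a placement that keeps heavily communicating neighbours in the same Zone cuts the cross-Zone load, which is exactly what shrinks collective completion time when the inter-Zone links are oversubscribed.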

Routing inside the Zone

Inside an AI Zone, load balancing between RTSW and CTSW uses ECMP. The routing evolution is the paper's second main topic:

  1. Baseline ECMP — failed under AI training's low-entropy / elephant-flow traffic.
  2. Path-pinning — Meta's first custom scheme; deterministic per-RTSW-downlink-slice assignment. Broke under partial rack allocation and network failures (>30% performance degradation).
  3. 2× RTSW uplink overprovisioning (1:2 under-subscription) — short-term bandaid to mask path-pinning fragmentation.
  4. Queue Pair scaling — hashing on the RoCE QP field while NCCL spreads messages across multiple QPs. +40% AllReduce vs baseline ECMP.
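The entropy problem behind steps 1 and 4 can be sketched with a toy ECMP hash (illustrative, not the switch ASIC's actual function): a single RoCE flow has one UDP 5-tuple, so plain ECMP pins the whole elephant flow to one uplink, while folding the QP number into the hash and spreading a message over several QPs spreads it over paths.

```python
import hashlib

def ecmp_path(fields: tuple, num_paths: int) -> int:
    """Toy stand-in for a switch's ECMP hash: fields -> uplink index."""
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# One RoCEv2 flow: fixed 5-tuple (RoCEv2 uses UDP destination port 4791).
five_tuple = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")

# Baseline ECMP: every packet of the flow hashes to the same uplink.
baseline = {ecmp_path(five_tuple, num_paths=8) for _ in range(1000)}

# QP scaling: the same transfer split across 16 QPs, QP folded into the hash,
# lands on many uplinks instead of one.
scaled = {ecmp_path(five_tuple + (qp,), num_paths=8) for qp in range(16)}
print(len(baseline), len(scaled))  # baseline uses 1 path; scaled uses several
```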

See concepts/fat-flow-load-balancing for why ECMP fails on training traffic in the first place.

Scale context (2024)

  • Up to 4K GPUs per Zone historically (prior to the 2024 scale step).
  • 24K GPUs on the Meta GenAI cluster via multiple Zones + ATSW (see systems/meta-genai-cluster-roce).
  • Meta describes "numerous clusters, each accommodating thousands of GPUs" — the AI Zone template is reused across the fleet, not bespoke per cluster.
