

AI Zone (Meta two-stage Clos training fabric)

AI Zone is Meta's name for the two-stage Clos topology that forms the backend training fabric of an AI cluster — a non-blocking spine+leaf design sized to host as many GPUs as a single fabric generation can reach, with an explicit aggregator layer (ATSW) above it for stitching multiple Zones into a data-center-scale fabric when a single Zone is too small. Introduced as an evolution of Meta's earlier star-topology RoCEv1 deployments, the AI Zone is the canonical unit of RoCE training capacity at Meta — the topology on which the 24K-GPU RoCE GenAI cluster and Llama 3.1 405B training ride.

Anatomy

| Layer | Device | Role | Physical media |
| --- | --- | --- | --- |
| Leaf | RTSW — Rack Training Switch | Connects all GPUs in a rack (scale-up) | Copper DAC |
| Spine | CTSW — Cluster Training Switch | Connects all racks in the Zone (scale-out) | Single-mode fiber + 400G pluggables |
| Aggregator (optional) | ATSW — Aggregator Training Switch | Stitches multiple Zones in one DC building | Inter-Zone, oversubscribed |

Key fabric properties:

  • Non-blocking inside the Zone. Any-to-any GPU throughput within one Zone is full bisection bandwidth.
  • CTSW has deep buffers, statically partitioned per port. Absorbs PFC pauses and bursty collective traffic without transport-level congestion control in the stable case.
  • Cross-AI-Zone links are deliberately oversubscribed. To keep them under-utilised, the training-job scheduler is made topology-aware (see patterns/minimum-cut-training-job-placement).
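The non-blocking property above can be sketched as a sizing check: a leaf is non-blocking when its aggregate uplink bandwidth matches its GPU-facing downlink bandwidth, and the spine radix bounds how many racks one Zone can host. The port counts below are illustrative assumptions, not Meta's actual values.

```python
# Minimal sketch of the non-blocking condition for a two-stage Clos.
# All port counts are illustrative, and every port is assumed to run at
# the same speed (e.g. 400G), so bandwidth comparisons reduce to port counts.

def zone_capacity(leaf_down: int, leaf_up: int, spine_radix: int):
    """Size a leaf/spine (RTSW/CTSW) Zone.

    Non-blocking requires each leaf's uplink capacity to be at least its
    GPU-facing downlink capacity, so any-to-any traffic inside the Zone
    sees full bisection bandwidth.
    """
    non_blocking = leaf_up >= leaf_down  # equal per-port speed assumed
    # With one uplink per spine per leaf, the Zone needs `leaf_up` spines,
    # and each spine port terminates one leaf, bounding the rack count.
    max_leaves = spine_radix
    return max_leaves * leaf_down, non_blocking

gpus, nb = zone_capacity(leaf_down=16, leaf_up=16, spine_radix=64)
print(gpus, nb)  # 1024 True — 64 racks of 16 GPUs, non-blocking
```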

Why this shape

Meta iterated into the AI Zone topology from a simpler baseline:

  1. Initial star. First RoCE GPU clusters used "a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol." The limits were obvious: bounded GPU scale, and no switch redundancy — the central switch was a single point of failure.
  2. Two-stage Clos (the AI Zone). Leaf+spine gives scale-out + switch redundancy. Non-blocking inside the Zone means collective operations aren't bottlenecked on oversubscribed links.
  3. ATSW for LLM jobs. "Emerging AI advancements, such as LLMs like Llama, demand a GPU scale larger than what a single AI zone provides. To accommodate this, we designed an aggregator training switch (ATSW) layer that connects the CTSWs in a data center building, expanding the RoCE domain beyond a single AI Zone."

"The AI Zones are designed to support a large number of interconnected GPUs in a non-blocking manner." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Relationship to frontend network

A training rack is physically connected to two networks (see concepts/backend-frontend-network-separation):

  • Frontend (FE): traditional RSW → FSW → higher hierarchy, hosts storage-warehouse traffic for ingestion, checkpointing, logging.
  • Backend (BE) = AI Zone: the two-stage Clos described above, carries the training collective traffic over RoCEv2.

This split lets the BE evolve independently (new topology generations, new congestion-control schemes, DCQCN on/off — see concepts/dcqcn) without disturbing the FE. It's a canonical instance of patterns/dedicated-backend-training-fabric.

How the scheduler cooperates

At the ATSW layer, Meta faces a choice: over-provision the inter-Zone links (expensive) or keep them oversubscribed and reduce the traffic that crosses them. Meta picked the latter and pushed the optimisation into the scheduler:

"The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. … To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a 'minimum cut' when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time."

This is documented on its own page: patterns/minimum-cut-training-job-placement.
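The quoted "minimum cut" idea can be illustrated with a toy partitioner: given pairwise traffic between ranks, choose which ranks land in each Zone so that the bytes crossing the oversubscribed ATSW links are minimised. This brute-force search is a hypothetical sketch for intuition only — Meta's scheduler does topology-aware rank assignment, not exhaustive search.

```python
from itertools import combinations

def min_cut_placement(traffic, zone_size):
    """Toy min-cut split of ranks across two Zones.

    traffic: dict mapping (i, j) rank pairs (i < j) to bytes exchanged.
    Returns (cross-zone bytes, set of ranks placed in Zone A).
    """
    ranks = sorted({r for pair in traffic for r in pair})
    best_cross, best_a = float("inf"), None
    for zone_a in combinations(ranks, zone_size):
        a = set(zone_a)
        # Count only traffic whose endpoints fall in different Zones.
        cross = sum(v for (i, j), v in traffic.items() if (i in a) != (j in a))
        if cross < best_cross:
            best_cross, best_a = cross, a
    return best_cross, best_a

# Ring-style collective traffic among 4 ranks: neighbours exchange heavily.
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (0, 3): 10, (0, 2): 1, (1, 3): 1}
cross, zone_a = min_cut_placement(traffic, zone_size=2)
print(cross, zone_a)  # 22 {0, 1} — keeping ring neighbours 0,1 together
```

The point of the sketch: a placement that keeps heavily communicating neighbours in the same Zone cuts the cross-Zone load, which is exactly what shrinks collective completion time when the inter-Zone links are oversubscribed.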

Routing inside the Zone

Inside an AI Zone, load balancing between RTSW and CTSW uses ECMP. The routing evolution is the paper's second main topic:

  1. Baseline ECMP — failed under AI training's low-entropy / elephant-flow traffic.
  2. Path-pinning — Meta's first custom scheme; deterministic per-RTSW-downlink-slice assignment. Broke under partial rack allocation and network failures (>30% performance degradation).
  3. 2× RTSW uplink overprovisioning (1:2 under-subscription) — short-term bandaid to mask path-pinning fragmentation.
  4. Queue Pair scaling — hashing on the RoCE QP field while NCCL spreads messages across multiple QPs. +40% AllReduce vs baseline ECMP.
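The entropy problem behind steps 1 and 4 can be sketched with a toy ECMP hash (illustrative, not the switch ASIC's actual function): a single RoCE flow has one UDP 5-tuple, so plain ECMP pins the whole elephant flow to one uplink, while folding the QP number into the hash and spreading a message over several QPs spreads it over paths.

```python
import hashlib

def ecmp_path(fields: tuple, num_paths: int) -> int:
    """Toy stand-in for a switch's ECMP hash: fields -> uplink index."""
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# One RoCEv2 flow: fixed 5-tuple (RoCEv2 uses UDP destination port 4791).
five_tuple = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")

# Baseline ECMP: every packet of the flow hashes to the same uplink.
baseline = {ecmp_path(five_tuple, num_paths=8) for _ in range(1000)}

# QP scaling: the same transfer split across 16 QPs, QP folded into the hash,
# lands on many uplinks instead of one.
scaled = {ecmp_path(five_tuple + (qp,), num_paths=8) for qp in range(16)}
print(len(baseline), len(scaled))  # baseline uses 1 path; scaled uses several
```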

See concepts/fat-flow-load-balancing for why ECMP fails on training traffic in the first place.

Scale context (2024)

  • Up to 4K GPUs per Zone historically (prior to the 2024 scale step).
  • 24K GPUs on the Meta GenAI cluster via multiple Zones + ATSW (see systems/meta-genai-cluster-roce).
  • Meta describes "numerous clusters, each accommodating thousands of GPUs" — the AI Zone template is reused across the fleet, not bespoke per cluster.
