
PATTERN

Minimum-cut training job placement

Context

Training fabrics for LLM-scale jobs often exceed the size of a single non-blocking AI Zone / pod. The fix is an aggregator layer (ATSW in Meta's terms) that stitches Zones into a data-center-scale fabric — but that layer is typically oversubscribed because making it non-blocking is prohibitively expensive.

Oversubscribed links under training workloads are dangerous: a single fat flow crossing them saturates the inter-Zone bandwidth budget. If a training job lands randomly across Zones, the cross-Zone traffic is dominated by luck.

Pattern

Make the training-job scheduler topology-aware. When a job is too large for a single Zone, the scheduler computes a minimum-cut partition of the job's ranks into Zones — minimising the number of collective-communication edges that cross the oversubscribed aggregator layer — then recommends a rank assignment that lands communicating peers in the same Zone when possible.

Two mechanics:

  1. Minimum-cut partitioning — model the job's communication graph (parallelism axes, collective peers); partition it into Zone-sized subsets to minimise cross-Zone edges.
  2. Topology-aware rank assignment — learn each GPU's logical position (Zone / rack / slot) and assign MPI/NCCL ranks so that adjacent-rank collectives (e.g. ring neighbours) fall within the same fabric region.
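The first mechanic can be sketched in a few lines. This is a brute-force illustration (real schedulers would use a heuristic partitioner such as Kernighan-Lin); the ring-all-reduce job and Zone size are hypothetical:

```python
from itertools import combinations

def min_cut_partition(edges, ranks, zone_size):
    """Brute-force the bipartition of `ranks` into two Zone-sized halves
    that minimises the number of communication edges crossing the cut.
    Exponential, but fine for illustration on tiny graphs."""
    best_cut, best_zone_a = None, None
    for zone_a in combinations(ranks, zone_size):
        a = set(zone_a)
        cut = sum(1 for u, v in edges if (u in a) != (v in a))
        if best_cut is None or cut < best_cut:
            best_cut, best_zone_a = cut, a
    return best_zone_a, best_cut

# Hypothetical 8-rank job whose collective is a ring all-reduce:
# rank i talks to rank (i + 1) mod 8.
ring = [(i, (i + 1) % 8) for i in range(8)]
zone_a, cut = min_cut_partition(ring, range(8), 4)
# A contiguous split ({0,1,2,3} vs {4,5,6,7}) cuts only 2 ring edges.
```

A random 4/4 split of the same ring can cut up to 8 edges, which is the gap the scheduler is closing.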

Meta's 2024-08-05 framing

"Note, the cross-AI Zone connectivity is oversubscribed by design, with network traffic balanced using ECMP. To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a 'minimum cut' when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time. The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why this is load-bearing

Without topology awareness, the bandwidth of the oversubscribed cross-Zone links sets the per-step time floor for any job that spans Zones. A poorly-partitioned job can spend an order of magnitude more time on cross-Zone traffic than a well-partitioned one, even on the same fabric.

With topology awareness, the scheduler aligns the parallelism structure with the network structure:

  • Tensor-parallel all-reduces (latency-sensitive, small messages) stay within a Zone.
  • Data-parallel all-reduces (larger, less frequent) can cross Zones if needed.
  • Pipeline-parallel point-to-point handoffs are low-volume and can cross Zones cheaply.
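The asymmetry above is why placement matters: the same cut count costs very different amounts of bandwidth depending on which axis crosses it. A sketch with assumed per-edge message volumes (illustrative numbers, not Meta's):

```python
# Illustrative per-edge, per-step message volumes in MB — assumptions
# for the sketch, not measured figures.
TRAFFIC_MB = {"tensor": 400, "data": 100, "pipeline": 10}

def cross_zone_mb(edges_by_axis, zone_of):
    """Per-step traffic (MB) on edges whose endpoints sit in different Zones."""
    total = 0
    for axis, edges in edges_by_axis.items():
        crossing = sum(1 for u, v in edges if zone_of[u] != zone_of[v])
        total += crossing * TRAFFIC_MB[axis]
    return total

# 4 ranks: tensor-parallel pairs (0,1), (2,3); data-parallel pairs (0,2), (1,3).
edges_by_axis = {"tensor": [(0, 1), (2, 3)], "data": [(0, 2), (1, 3)]}

aligned    = {0: "a", 1: "a", 2: "b", 3: "b"}  # TP pairs share a Zone
misaligned = {0: "a", 1: "b", 2: "a", 3: "b"}  # TP pairs split across Zones

cross_zone_mb(edges_by_axis, aligned)     # only DP crosses: 2 * 100 = 200 MB
cross_zone_mb(edges_by_axis, misaligned)  # TP crosses: 2 * 400 = 800 MB
```

Both placements cut exactly two edges; the aligned one cuts the cheap (data-parallel) edges and pays 4x less cross-Zone traffic.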

This is the scheduler-side expression of concepts/collective-communication-topology-awareness — the algorithm-side is choosing the right collective algorithm, the scheduler-side is choosing the right physical placement.

Design ingredients

  • A graph model of the job's communication. The training framework must expose it (e.g. the parallelism plan from Megatron-LM / FSDP).
  • A graph model of the fabric. Zone / rack / slot tree.
  • A minimum-cut solver. Standard combinatorial optimisation; the job graph is small (thousands of ranks).
  • Rank-assignment API. The scheduling interface must let the scheduler dictate the MPI rank → physical GPU mapping, not just a GPU count.
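Putting the last two ingredients together, the rank-assignment step can be sketched as follows. The `fabric` dict and GPU names are hypothetical; the point is that each min-cut part occupies one Zone and ranks within a part stay contiguous, so adjacent-rank collectives stay in-Zone:

```python
def assign_ranks(partition, fabric):
    """Map MPI/NCCL ranks to physical GPUs so each min-cut part lands in
    one Zone. `partition` is a list of rank lists (one per Zone-sized part);
    `fabric` maps Zone name -> list of free GPU ids in that Zone."""
    rank_to_gpu = {}
    for part, (zone, gpus) in zip(partition, fabric.items()):
        for rank, gpu in zip(sorted(part), gpus):
            rank_to_gpu[rank] = (zone, gpu)
    return rank_to_gpu

# Hypothetical two-Zone fabric and the partition from the min-cut step.
fabric = {"zone-a": ["gpu0", "gpu1"], "zone-b": ["gpu2", "gpu3"]}
mapping = assign_ranks([[0, 1], [2, 3]], fabric)
# ranks 0-1 land in zone-a, ranks 2-3 in zone-b
```

A production version would also order GPUs within a Zone by rack/slot so that NVLink- or rail-local peers end up adjacent, but the Zone-level mapping is the part that avoids the oversubscribed layer.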

When to use

  • Hierarchical fabric with oversubscribed layers. If the fabric is non-blocking end-to-end, there's no placement gain. The pattern pays off when cross-region links are scarce.
  • Job size > one region. Small jobs fit inside a Zone; placement is irrelevant.
  • Parallelism plan is known at schedule time. The scheduler needs to see the communication shape.
  • The scheduler already owns placement. If rank assignment is done in user land (e.g. mpirun -hostlist), the scheduler can't help.

When not to use

  • Uniform non-blocking fabric. Topology awareness buys nothing.
  • Workloads with irregular, hard-to-predict communication. Min-cut placement assumes the communication graph is stable at schedule time; reinforcement-learning-style workloads may violate this.
  • Very small jobs. The planning overhead exceeds the benefit.

Relationship to adjacent ideas

  • concepts/collective-communication-topology-awareness — this pattern is the scheduler-side expression; the concept covers both the algorithm-side (collective choice) and the physical-side (placement).
  • concepts/ecmp-equal-cost-multipath — min-cut placement reduces the amount of traffic that hits oversubscribed ECMP'd links. It's not a replacement for good routing, it's a reducer of the load that routing has to handle.
  • patterns/dedicated-backend-training-fabric — the enclosing architectural choice that creates the Zone hierarchy in the first place.
  • HPC job placement — same idea has existed in HPC workload managers for decades (SLURM's --switches option); training-ML is rediscovering it at larger scale.
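For the SLURM precedent, the `--switches=count[@max-time]` option expresses the same placement goal declaratively; a sketch (node count and wait time are illustrative):

```shell
# Ask SLURM to place the job under at most one leaf switch,
# waiting up to 60 minutes before accepting a worse placement.
sbatch --switches=1@60 --nodes=16 train.sh
```

The difference at LLM scale is that the "switch" constraint becomes a min-cut over a multi-axis communication graph rather than a single locality knob.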

Wiki instances
