Minimum-cut training job placement¶
Context¶
Training fabrics for LLM-scale jobs often exceed the size of a single non-blocking AI Zone / pod. The fix is an aggregator layer (ATSW in Meta's terms) that stitches Zones into a data-center-scale fabric — but that layer is typically oversubscribed because making it non-blocking is prohibitively expensive.
Oversubscribed links under training workloads are dangerous: a single fat flow crossing them saturates the inter-Zone bandwidth budget. If a training job lands randomly across Zones, the cross-Zone traffic is dominated by luck.
Pattern¶
Make the training-job scheduler topology-aware. When a job is too large for a single Zone, the scheduler computes a minimum-cut partition of the job's ranks into Zones — minimising the number of collective-communication edges that cross the oversubscribed aggregator layer — then recommends a rank assignment that lands communicating peers in the same Zone when possible.
Two mechanics:
- Minimum-cut partitioning — model the job's communication graph (parallelism axes, collective peers); partition it into Zone-sized subsets to minimise cross-Zone edges.
- Topology-aware rank assignment — learn each GPU's logical position (Zone / rack / slot) and assign MPI/NCCL ranks so that adjacent-rank collectives (e.g. ring neighbours) fall within the same fabric region.
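The first mechanic can be sketched as a small greedy refinement: start from a contiguous rank-to-Zone assignment and keep swapping rank pairs across Zones while any swap strictly reduces the cross-Zone edge weight. This is a toy stand-in for whatever solver a production scheduler uses (the source doesn't describe Meta's algorithm); the communication graph, weights, and Zone size below are made up for illustration.

```python
from itertools import combinations

def cut_size(graph, assign):
    # total weight of edges whose endpoints land in different Zones
    return sum(w for (u, v), w in graph.items() if assign[u] != assign[v])

def min_cut_partition(graph, n_ranks, zone_size):
    """Greedy min-cut refinement: start from a contiguous assignment,
    then keep swapping rank pairs (preserving Zone sizes) while any
    swap strictly lowers the cross-Zone cut."""
    assign = {r: r // zone_size for r in range(n_ranks)}
    improved = True
    while improved:
        improved = False
        for u, v in combinations(range(n_ranks), 2):
            if assign[u] == assign[v]:
                continue
            before = cut_size(graph, assign)
            assign[u], assign[v] = assign[v], assign[u]
            if cut_size(graph, assign) < before:
                improved = True
            else:
                assign[u], assign[v] = assign[v], assign[u]  # revert
    return assign

# Hypothetical 8-rank job, Zone size 4: two tensor-parallel groups
# {0,1,4,5} and {2,3,6,7} with heavy all-reduce traffic (weight 10),
# plus light data-parallel edges (weight 1) between the groups.
graph = {(0, 1): 10, (0, 4): 10, (0, 5): 10, (1, 4): 10, (1, 5): 10, (4, 5): 10,
         (2, 3): 10, (2, 6): 10, (2, 7): 10, (3, 6): 10, (3, 7): 10, (6, 7): 10,
         (0, 2): 1, (1, 3): 1, (4, 6): 1, (5, 7): 1}
assign = min_cut_partition(graph, 8, 4)
```

The contiguous starting assignment cuts the heavy tensor-parallel edges; the refinement ends with each tensor-parallel group colocated in one Zone, leaving only the four light data-parallel edges crossing the aggregator layer.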
Meta's 2024-08-05 framing¶
"Note, the cross-AI Zone connectivity is oversubscribed by design, with network traffic balanced using ECMP. To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a 'minimum cut' when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time. The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
Why this is load-bearing¶
Without topology awareness, the time to push traffic over the oversubscribed cross-Zone links sets the per-step latency floor for any job that spans Zones. A poorly-partitioned job can spend an order of magnitude more time on cross-Zone traffic than a well-partitioned one, even on the same fabric.
With topology awareness, the scheduler aligns the parallelism structure with the network structure:
- Tensor-parallel all-reduces (latency-sensitive, small messages) stay within a Zone.
- Data-parallel all-reduces (larger, less frequent) can cross Zones if needed.
- Pipeline-parallel point-to-point handoffs are low-volume and can cross Zones cheaply.
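A back-of-envelope calculation shows why the tensor-parallel axis must stay in-Zone: a ring all-reduce of an S-byte payload across p peers moves roughly 2(p-1)/p × S bytes per rank, and TP all-reduces fire per layer per microbatch while the DP all-reduce fires once per step. All the concrete numbers below (payload sizes, layer and microbatch counts, group sizes) are illustrative, not from the source.

```python
def ring_allreduce_bytes(payload_bytes, peers):
    # bytes each rank sends for one ring all-reduce: 2*(p-1)/p * S
    return 2 * (peers - 1) / peers * payload_bytes

# Illustrative numbers only: 16 MiB activations all-reduced across a
# TP group of 8, for 80 layers x 32 microbatches per step...
tp_bytes_per_step = ring_allreduce_bytes(16 * 2**20, 8) * 80 * 32
# ...versus one 10 GiB gradient all-reduce across 16 DP replicas.
dp_bytes_per_step = ring_allreduce_bytes(10 * 2**30, 16)
```

Despite its small individual messages, the aggregate TP volume per step dwarfs the DP volume here, which is why placement must keep TP groups inside a Zone and let only the infrequent DP and PP traffic cross the oversubscribed layer.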
This is the scheduler-side expression of concepts/collective-communication-topology-awareness — the algorithm-side is choosing the right collective algorithm, the scheduler-side is choosing the right physical placement.
Design ingredients¶
- A graph model of the job's communication. The framework must expose it (e.g. the parallelism plan from Megatron-LM / FSDP).
- A graph model of the fabric: the Zone / rack / slot tree.
- A minimum-cut solver. Standard combinatorial optimisation; the job graph is small (thousands of ranks).
- A rank-assignment API. The scheduler must be able to dictate the MPI rank → physical GPU mapping, not just request a GPU count.
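The last ingredient — turning a partition into a concrete recommendation — can be sketched as follows. All names and the Zone/GPU inventory format are hypothetical; the point is only that the scheduler's output is a rank → physical GPU mapping, not a GPU count.

```python
def place_ranks(zone_of_rank, gpus_by_zone):
    """Turn a min-cut partition (rank -> Zone) plus a physical
    inventory (Zone -> free GPU ids) into a rank -> GPU placement.
    Ranks are walked in order, so consecutive ranks that the min-cut
    put in the same Zone land on GPUs in that Zone, keeping ring
    neighbours inside one fabric region."""
    free = {zone: list(gpus) for zone, gpus in gpus_by_zone.items()}
    return {rank: free[zone_of_rank[rank]].pop(0)
            for rank in sorted(zone_of_rank)}

# Hypothetical 4-rank job: the min-cut put ranks 0,1 in Zone "A"
# and ranks 2,3 in Zone "B".
placement = place_ranks(
    {0: "A", 1: "A", 2: "B", 3: "B"},
    {"A": ["A/rack0/gpu0", "A/rack0/gpu1"],
     "B": ["B/rack3/gpu0", "B/rack3/gpu1"]},
)
```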
When to use¶
- Hierarchical fabric with oversubscribed layers. If the fabric is non-blocking end-to-end, there's no placement gain. The pattern pays off when cross-region links are scarce.
- Job size > one region. Small jobs fit inside a Zone; placement is irrelevant.
- Parallelism plan is known at schedule time. The scheduler needs to see the communication shape.
- The scheduler already owns placement. If rank assignment is done in user land (e.g. `mpirun -hostlist`), the scheduler can't help.
When not to use¶
- Uniform non-blocking fabric. Topology awareness buys nothing.
- Workloads with irregular, hard-to-predict communication. The min-cut assumes the graph is stable; reinforcement-learning-style workloads may violate this.
- Very small jobs. The planning overhead exceeds the benefit.
Relationship to adjacent ideas¶
- concepts/collective-communication-topology-awareness — this pattern is the scheduler-side expression; the concept covers both the algorithm-side (collective choice) and the physical-side (placement).
- concepts/ecmp-equal-cost-multipath — min-cut placement reduces the amount of traffic that hits oversubscribed ECMP'd links. It's not a replacement for good routing, it's a reducer of the load that routing has to handle.
- patterns/dedicated-backend-training-fabric — the enclosing architectural choice that creates the Zone hierarchy in the first place.
- HPC job placement — the same idea has existed in HPC workload managers for decades (SLURM's `--switches` option); training-ML is rediscovering it at larger scale.
Wiki instances¶
- Meta ATSW cross-AI-Zone scheduling (2024). Canonical wiki reference. (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale.)