
PATTERN

Minimum-cut training job placement

Context

Training fabrics for LLM-scale jobs often exceed the size of a single non-blocking AI Zone / pod. The fix is an aggregator layer (ATSW in Meta's terms) that stitches Zones into a data-center-scale fabric — but that layer is typically oversubscribed because making it non-blocking is prohibitively expensive.

Oversubscribed links under training workloads are dangerous: a single fat flow crossing them saturates the inter-Zone bandwidth budget. If a training job lands randomly across Zones, the cross-Zone traffic is dominated by luck.

Pattern

Make the training-job scheduler topology-aware. When a job is too large for a single Zone, the scheduler computes a minimum-cut partition of the job's ranks into Zones — minimising the number of collective-communication edges that cross the oversubscribed aggregator layer — then recommends a rank assignment that lands communicating peers in the same Zone when possible.

Two mechanics:

  1. Minimum-cut partitioning — model the job's communication graph (parallelism axes, collective peers); partition it into Zone-sized subsets to minimise cross-Zone edges.
  2. Topology-aware rank assignment — learn each GPU's logical position (Zone / rack / slot) and assign MPI/NCCL ranks so that adjacent-rank collectives (e.g. ring neighbours) fall within the same fabric region.
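The first mechanic can be sketched in a few lines. This is a brute-force illustration (real schedulers would use a heuristic partitioner such as Kernighan-Lin); the ring-all-reduce job and Zone size are hypothetical:

```python
from itertools import combinations

def min_cut_partition(edges, ranks, zone_size):
    """Brute-force the bipartition of `ranks` into two Zone-sized halves
    that minimises the number of communication edges crossing the cut.
    Exponential, but fine for illustration on tiny graphs."""
    best_cut, best_zone_a = None, None
    for zone_a in combinations(ranks, zone_size):
        a = set(zone_a)
        cut = sum(1 for u, v in edges if (u in a) != (v in a))
        if best_cut is None or cut < best_cut:
            best_cut, best_zone_a = cut, a
    return best_zone_a, best_cut

# Hypothetical 8-rank job whose collective is a ring all-reduce:
# rank i talks to rank (i + 1) mod 8.
ring = [(i, (i + 1) % 8) for i in range(8)]
zone_a, cut = min_cut_partition(ring, range(8), 4)
# A contiguous split ({0,1,2,3} vs {4,5,6,7}) cuts only 2 ring edges.
```

A random 4/4 split of the same ring can cut up to 8 edges, which is the gap the scheduler is closing.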

Meta's 2024-08-05 framing

"Note, the cross-AI Zone connectivity is oversubscribed by design, with network traffic balanced using ECMP. To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a 'minimum cut' when dividing the training nodes into different AI Zones, reducing the cross-AI Zone traffic and thus collective completion time. The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why this is load-bearing

Without topology awareness, the bandwidth of the oversubscribed cross-Zone links sets the per-step time floor for any job that spans Zones. A poorly-partitioned job can spend an order of magnitude more time on cross-Zone traffic than a well-partitioned one, even on the same fabric.

With topology awareness, the scheduler aligns the parallelism structure with the network structure:

  • Tensor-parallel all-reduces (latency-sensitive, small messages) stay within a Zone.
  • Data-parallel all-reduces (larger, less frequent) can cross Zones if needed.
  • Pipeline-parallel point-to-point handoffs are low-volume and can cross Zones cheaply.
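The asymmetry above is why placement matters: the same cut count costs very different amounts of bandwidth depending on which axis crosses it. A sketch with assumed per-edge message volumes (illustrative numbers, not Meta's):

```python
# Illustrative per-edge, per-step message volumes in MB — assumptions
# for the sketch, not measured figures.
TRAFFIC_MB = {"tensor": 400, "data": 100, "pipeline": 10}

def cross_zone_mb(edges_by_axis, zone_of):
    """Per-step traffic (MB) on edges whose endpoints sit in different Zones."""
    total = 0
    for axis, edges in edges_by_axis.items():
        crossing = sum(1 for u, v in edges if zone_of[u] != zone_of[v])
        total += crossing * TRAFFIC_MB[axis]
    return total

# 4 ranks: tensor-parallel pairs (0,1), (2,3); data-parallel pairs (0,2), (1,3).
edges_by_axis = {"tensor": [(0, 1), (2, 3)], "data": [(0, 2), (1, 3)]}

aligned    = {0: "a", 1: "a", 2: "b", 3: "b"}  # TP pairs share a Zone
misaligned = {0: "a", 1: "b", 2: "a", 3: "b"}  # TP pairs split across Zones

cross_zone_mb(edges_by_axis, aligned)     # only DP crosses: 2 * 100 = 200 MB
cross_zone_mb(edges_by_axis, misaligned)  # TP crosses: 2 * 400 = 800 MB
```

Both placements cut exactly two edges; the aligned one cuts the cheap (data-parallel) edges and pays 4x less cross-Zone traffic.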

This is the scheduler-side expression of concepts/collective-communication-topology-awareness — the algorithm-side is choosing the right collective algorithm, the scheduler-side is choosing the right physical placement.

Design ingredients

  • A graph model of the job's communication. The training framework must expose it (e.g. the parallelism plan from Megatron-LM / FSDP).
  • A graph model of the fabric. Zone / rack / slot tree.
  • A minimum-cut solver. Standard combinatorial optimisation; the job graph is small (thousands of ranks).
  • Rank-assignment API. The scheduling interface must let the scheduler dictate the MPI rank → physical GPU mapping, not just a GPU count.
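Putting the last two ingredients together, the rank-assignment step can be sketched as follows. The `fabric` dict and GPU names are hypothetical; the point is that each min-cut part occupies one Zone and ranks within a part stay contiguous, so adjacent-rank collectives stay in-Zone:

```python
def assign_ranks(partition, fabric):
    """Map MPI/NCCL ranks to physical GPUs so each min-cut part lands in
    one Zone. `partition` is a list of rank lists (one per Zone-sized part);
    `fabric` maps Zone name -> list of free GPU ids in that Zone."""
    rank_to_gpu = {}
    for part, (zone, gpus) in zip(partition, fabric.items()):
        for rank, gpu in zip(sorted(part), gpus):
            rank_to_gpu[rank] = (zone, gpu)
    return rank_to_gpu

# Hypothetical two-Zone fabric and the partition from the min-cut step.
fabric = {"zone-a": ["gpu0", "gpu1"], "zone-b": ["gpu2", "gpu3"]}
mapping = assign_ranks([[0, 1], [2, 3]], fabric)
# ranks 0-1 land in zone-a, ranks 2-3 in zone-b
```

A production version would also order GPUs within a Zone by rack/slot so that NVLink- or rail-local peers end up adjacent, but the Zone-level mapping is the part that avoids the oversubscribed layer.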

When to use

  • Hierarchical fabric with oversubscribed layers. If the fabric is non-blocking end-to-end, there's no placement gain. The pattern pays off when cross-region links are scarce.
  • Job size > one region. Small jobs fit inside a Zone; placement is irrelevant.
  • Parallelism plan is known at schedule time. The scheduler needs to see the communication shape.
  • The scheduler already owns placement. If rank assignment is done in user land (e.g. mpirun -hostlist), the scheduler can't help.

When not to use

  • Uniform non-blocking fabric. Topology awareness buys nothing.
  • Workloads with irregular, hard-to-predict communication. Min-cut placement assumes the communication graph is stable at schedule time; reinforcement-learning-style workloads may violate this.
  • Very small jobs. The planning overhead exceeds the benefit.

Relationship to adjacent ideas

  • concepts/collective-communication-topology-awareness — this pattern is the scheduler-side expression; the concept covers both the algorithm-side (collective choice) and the physical-side (placement).
  • concepts/ecmp-equal-cost-multipath — min-cut placement reduces the amount of traffic that hits oversubscribed ECMP'd links. It's not a replacement for good routing, it's a reducer of the load that routing has to handle.
  • patterns/dedicated-backend-training-fabric — the enclosing architectural choice that creates the Zone hierarchy in the first place.
  • HPC job placement — same idea has existed in HPC workload managers for decades (SLURM's --switches option); training-ML is rediscovering it at larger scale.
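For the SLURM precedent, the `--switches=count[@max-time]` option expresses the same placement goal declaratively; a sketch (node count and wait time are illustrative):

```shell
# Ask SLURM to place the job under at most one leaf switch,
# waiting up to 60 minutes before accepting a worse placement.
sbatch --switches=1@60 --nodes=16 train.sh
```

The difference at LLM scale is that the "switch" constraint becomes a min-cut over a multi-axis communication graph rather than a single locality knob.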

Wiki instances
