
Collective-communication topology awareness

Definition

Collective-communication topology awareness is the practice of choosing collective-operation algorithms (all-reduce, all-gather, reduce-scatter, broadcast) whose communication graph matches the physical topology of the underlying network fabric, rather than using a default algorithm (typically ring all-reduce) that assumes a flat network.

The canonical default is the ring all-reduce (as in NCCL's ring algorithm): every GPU communicates only with its two ring neighbours, making two passes around the ring (a reduce-scatter followed by an all-gather). It is bandwidth-optimal under idealised conditions but latency-sensitive: each of its 2(N − 1) steps waits on the previous one.

The canonical alternatives at scale are recursive doubling and recursive halving: algorithms whose communication graph is a tree, hypercube, or butterfly rather than a ring. They trade higher per-step bandwidth for far fewer latency-paying steps as cluster size grows.

Meta's framing (2024-06)

Meta makes this optimisation one of three it applied to bring both its 24K-GPU RoCE and InfiniBand clusters to equivalent GenAI-workload performance:

"We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive. We do this by changing the default implementation of collectives with custom algorithms such as recursive doubling or halving instead of conventional algorithms like rings." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

The operative word is latency-sensitive. A ring all-reduce across 24K GPUs takes 2 × (24K − 1) ≈ 49K steps, and each step pays a network hop's latency. A recursive-halving/doubling all-reduce takes ~2 × log₂(24K) ≈ 30 steps. For bandwidth-bound messages the ring is fine; for latency-sensitive collectives, the algorithmic choice dominates.
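The step counts above can be checked directly (a quick sketch; "24K" is taken to mean exactly 24 × 1024 = 24,576 GPUs, which is an assumption):

```python
import math

N = 24_576  # "24K" GPUs, assumed to mean exactly 24 * 1024

ring_steps = 2 * (N - 1)      # reduce-scatter pass + all-gather pass around the ring
rhd_steps = 2 * math.log2(N)  # recursive halving + recursive doubling

print(f"ring all-reduce:  {ring_steps} latency-paying steps")
print(f"halving/doubling: {rhd_steps:.1f} latency-paying steps")
```

At 5 µs per step, ~49K steps is a quarter of a second of pure latency per collective, which is where the "less latency-sensitive" motivation comes from.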

The three paradigmatic collective algorithms

| Algorithm | Steps | Data sent per node | Good for |
|---|---|---|---|
| Ring all-reduce | 2(N−1) = O(N) | 2(N−1)/N × message size | Bandwidth-bound messages, flat fabric |
| Recursive halving + doubling (Rabenseifner) | 2 log₂ N = O(log N) | 2(N−1)/N × message size | Latency-sensitive collectives, hierarchical fabric |
| Tree broadcast / reduce | O(log N) | ≈ message size | Broadcast / reduce collectives, hierarchical fabric |

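The O(log N) structure is easy to see in a simulation of recursive doubling (a minimal sketch, assuming a power-of-two rank count and the standard hypercube partner schedule i XOR 2^d; real implementations overlap these exchanges on actual network links):

```python
def recursive_doubling_allreduce(vectors):
    """All-reduce by recursive doubling: in step d, each rank i
    exchanges its full partial sum with partner i XOR 2^d and adds it.
    After log2(N) steps every rank holds the global sum."""
    n = len(vectors)
    assert n > 0 and n & (n - 1) == 0, "power-of-two rank count assumed"
    data = [list(v) for v in vectors]
    steps = 0
    d = 1
    while d < n:
        # every rank pairs with the rank differing in bit d of its index
        data = [[a + b for a, b in zip(data[i], data[i ^ d])] for i in range(n)]
        d *= 2
        steps += 1
    return data, steps

# 8 ranks, 4-element gradients: every rank gets the global sum in 3 steps
vecs = [[i, 2 * i, 3 * i, 4 * i] for i in range(8)]
out, steps = recursive_doubling_allreduce(vecs)
```

Note each step moves the full (partial-sum) vector, which is the bandwidth-for-latency trade: Rabenseifner's halving+doubling variant halves the exchanged payload each step to recover bandwidth-optimality while keeping the log-step count.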
Real-world implementations often compose: a recursive-doubling phase for inter-rack collectives, a ring phase for intra-rack where bandwidth dominates.
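That composition can be sketched as a two-level all-reduce (hypothetical helper, not Meta's code; a local reduce stands in for the bandwidth-heavy intra-rack phase, recursive doubling for the latency-sensitive inter-rack phase, and power-of-two rack counts are assumed):

```python
def hierarchical_allreduce(racks):
    """racks: list of racks, each a list of per-GPU gradient vectors.
    Phase 1: intra-rack reduce to a leader (bandwidth-rich local links).
    Phase 2: recursive doubling across rack leaders (few latency-paying hops).
    Phase 3: intra-rack broadcast of the global sum."""
    # phase 1: each rack leader accumulates the rack-local elementwise sum
    leaders = [[sum(col) for col in zip(*rack)] for rack in racks]
    # phase 2: recursive doubling across leaders (power-of-two rack count assumed)
    n = len(leaders)
    d = 1
    while d < n:
        leaders = [[a + b for a, b in zip(leaders[i], leaders[i ^ d])]
                   for i in range(n)]
        d *= 2
    # phase 3: every GPU in every rack receives the global sum
    return [[list(leaders[r]) for _ in rack] for r, rack in enumerate(racks)]

# 4 racks x 2 GPUs, 2-element vectors; global sum of 1..8 is 36 per element
racks = [[[g, g] for g in (2 * r + 1, 2 * r + 2)] for r in range(4)]
result = hierarchical_allreduce(racks)
```

The inter-rack phase runs over log₂(racks) steps regardless of rack size, which is exactly the layer where hop latency is highest.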

Why this goes with concepts/3d-parallelism

3D-parallel training has multiple collective types running concurrently:

  • DP all-reduce — once per step, large message — ring-friendly.
  • TP per-layer all-reduce — many per step, smaller messages — latency-sensitive.
  • PP activation handoffs — point-to-point, not collective.
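The latency/bandwidth split in that list can be made concrete with the standard alpha-beta cost model (α = per-step latency, β = per-byte transfer time; the α and β values below are illustrative assumptions, not measurements, and real TP traffic usually stays intra-node):

```python
import math

def ring_cost(n, msg_bytes, alpha, beta):
    """Alpha-beta cost of ring all-reduce: 2(N-1) latency terms,
    2(N-1)/N of the message sent per node in total."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * msg_bytes * beta

def rhd_cost(n, msg_bytes, alpha, beta):
    """Recursive halving + doubling (Rabenseifner): only 2 log2(N)
    latency terms, same total bytes per node as the ring."""
    return 2 * math.log2(n) * alpha + 2 * (n - 1) / n * msg_bytes * beta

ALPHA = 5e-6     # 5 us per step (illustrative assumption)
BETA = 1 / 50e9  # ~50 GB/s effective link bandwidth (illustrative assumption)
N = 24_576

dp_ring, dp_rhd = ring_cost(N, 1 << 30, ALPHA, BETA), rhd_cost(N, 1 << 30, ALPHA, BETA)
tp_ring, tp_rhd = ring_cost(N, 1 << 20, ALPHA, BETA), rhd_cost(N, 1 << 20, ALPHA, BETA)
print(f"1 GiB DP all-reduce: ring {dp_ring:.3f}s vs halving/doubling {dp_rhd:.3f}s")
print(f"1 MiB TP all-reduce: ring {tp_ring:.3f}s vs halving/doubling {tp_rhd:.6f}s")
```

Under these assumptions the bandwidth term dominates the large DP message, so the ring's latency penalty is a modest overhead, while the small message is almost pure latency and the log-step algorithm wins by orders of magnitude.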

Meta's first optimisation (parallelism-axis → topology-layer mapping) and this second optimisation are two sides of the same coin: figure out which collective pattern each parallelism axis produces, route that pattern to the topology layer it fits, and pick the algorithm that fits the resulting latency-bandwidth regime.
