CONCEPT Cited by 2 sources
Collective-communication topology awareness¶
Definition¶
Collective-communication topology awareness is the practice of choosing collective-operation algorithms (all-reduce, all-gather, reduce-scatter, broadcast) whose communication graph matches the physical topology of the underlying network fabric, rather than using a default algorithm (typically ring all-reduce) that assumes a flat network.
The canonical default is ring all-reduce (as in NCCL's ring algorithm): each GPU exchanges chunks only with its two ring neighbours, making two passes around the ring — a reduce-scatter followed by an all-gather. It's bandwidth-optimal under idealised conditions but latency-sensitive: the step count grows linearly with GPU count, and every step waits on the previous one.
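Those two passes can be sketched as a toy in-memory simulation (the `ring_allreduce` function and its chunk layout are illustrative, not NCCL's implementation):

```python
# Toy simulation of ring all-reduce over N GPUs, each holding N chunks.
# Pass 1 (reduce-scatter): after N-1 steps, rank r owns the full sum of
# chunk (r+1) mod N. Pass 2 (all-gather): the reduced chunks circulate
# once more so every rank holds every summed chunk.

def ring_allreduce(buffers):
    n = len(buffers)                      # buffers[rank][chunk]
    chunks = [list(b) for b in buffers]
    # Pass 1: at step s, rank r sends chunk (r - s) mod n to its right
    # neighbour, which accumulates it. Snapshot sends first to model
    # all ranks exchanging simultaneously.
    for step in range(n - 1):
        sent = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for r, idx, val in sent:
            chunks[(r + 1) % n][idx] += val
    # Pass 2: at step s, rank r forwards the already-reduced chunk
    # (r + 1 - s) mod n; the receiver just overwrites.
    for step in range(n - 1):
        sent = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for r, idx, val in sent:
            chunks[(r + 1) % n][idx] = val
    return chunks

print(ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# → [[6, 6, 6], [6, 6, 6], [6, 6, 6]]
```

Each pass is N−1 steps, hence the 2(N−1)-step latency term that makes the ring painful at large N.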
The canonical alternatives at scale are recursive doubling and recursive halving — algorithms whose communication graph is a tree / hypercube / butterfly rather than a ring, so the step count grows logarithmically rather than linearly. They trade higher per-step bandwidth for much lower latency as cluster size grows.
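The hypercube structure is easy to see in a sketch: at step s, rank r exchanges with partner r XOR 2^s, so log₂(N) steps cover all ranks (a toy simulation assuming a power-of-two rank count, not a real NCCL/MPI implementation):

```python
# Toy simulation of recursive-doubling all-reduce. At step s, rank r
# exchanges its full buffer with partner r XOR 2**s and both sides add;
# after log2(N) steps, every rank holds the global sum.

def recursive_doubling_allreduce(values):
    n = len(values)
    assert n & (n - 1) == 0, "sketch assumes a power-of-two rank count"
    vals = list(values)
    step = 1
    while step < n:
        # Snapshot-based update models all pairs exchanging simultaneously.
        vals = [vals[r] + vals[r ^ step] for r in range(n)]
        step <<= 1
    return vals  # every rank now holds the same reduced value

print(recursive_doubling_allreduce([1, 2, 3, 4]))  # → [10, 10, 10, 10]
```

Note the bandwidth trade: each of the log₂(N) steps moves the full message, versus the ring's small per-step chunks — which is why this wins for latency-sensitive collectives, not huge gradient buffers.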
Meta's framing (2024-06)¶
Meta makes this optimisation one of the three it applied to bring both its 24K-GPU RoCE and InfiniBand clusters to equivalent GenAI-workload performance:
"We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive. We do this by changing the default implementation of collectives with custom algorithms such as recursive doubling or halving instead of conventional algorithms like rings." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)
The operative word is latency-sensitive. A ring all-reduce across 24K GPUs takes 2 × (24,576 − 1) ≈ 49K serial steps — each step pays network latency. A recursive-halving/doubling all-reduce takes ~2 × ⌈log₂ 24,576⌉ = 30 steps. For bandwidth-bound messages, the ring is fine; for latency-sensitive collectives, the algorithmic choice dominates.
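The arithmetic is worth making concrete (the per-step latency below is an assumed illustrative number, not a Meta figure):

```python
import math

# Step counts for ring vs recursive-halving/doubling all-reduce at 24K GPUs,
# and the pure-latency cost each implies at an assumed 5 us per step.
N = 24 * 1024           # 24K GPUs
per_step_us = 5.0       # assumed per-step network latency, microseconds

ring_steps = 2 * (N - 1)                 # reduce-scatter + all-gather passes
rhd_steps = 2 * math.ceil(math.log2(N))  # recursive halving + doubling

print(ring_steps, rhd_steps)  # → 49150 30
print(f"ring:  ~{ring_steps * per_step_us / 1e6:.2f} s of pure latency")
print(f"r-h/d: ~{rhd_steps * per_step_us / 1e3:.2f} ms of pure latency")
```

Roughly a quarter-second of unavoidable serialised latency per ring collective versus a fraction of a millisecond — before any bandwidth term enters the picture.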
The three paradigmatic collective algorithms¶
| Algorithm | Steps (latency term) | Data sent per GPU (bandwidth term) | Good for |
|---|---|---|---|
| Ring all-reduce | 2(N − 1) | 2(N − 1)/N × message size | Bandwidth-bound messages, flat fabric |
| Recursive halving + doubling (Rabenseifner) | 2 log₂ N | 2(N − 1)/N × message size | Small-to-medium messages, hierarchical fabric |
| Tree broadcast / reduce | log₂ N | ≈ message size | Broadcast / reduce collectives, hierarchical fabric |
Real-world implementations often compose: a recursive-doubling phase for inter-rack collectives, a ring phase for intra-rack where bandwidth dominates.
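That composition can be sketched as a two-level all-reduce (hypothetical structure and names; real implementations pipeline these phases and use reduce-scatter rather than a leader-side reduce):

```python
# Toy two-level all-reduce: a bandwidth-friendly reduction inside each rack,
# a latency-friendly recursive-doubling exchange between rack "leaders",
# then an intra-rack broadcast of the global result.

def hierarchical_allreduce(values, rack_size):
    racks = [values[i:i + rack_size] for i in range(0, len(values), rack_size)]
    n_racks = len(racks)
    assert n_racks & (n_racks - 1) == 0, "sketch assumes power-of-two rack count"
    # Phase 1: intra-rack reduce (a ring reduce-scatter in practice).
    partials = [sum(rack) for rack in racks]
    # Phase 2: inter-rack recursive doubling among the rack leaders.
    step = 1
    while step < n_racks:
        partials = [partials[r] + partials[r ^ step] for r in range(n_racks)]
        step <<= 1
    # Phase 3: intra-rack broadcast of the global sum.
    return [partials[i // rack_size] for i in range(len(values))]

print(hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], rack_size=4))
# → [36, 36, 36, 36, 36, 36, 36, 36]
```

The point of the split: the O(N) phases stay inside the rack where hops are cheap, and only O(log N) steps cross the expensive inter-rack layer.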
Why this goes with concepts/3d-parallelism¶
3D-parallel training has multiple collective types running concurrently:
- DP all-reduce — once per step, large message — ring-friendly.
- TP per-layer all-reduce — many per step, smaller messages — latency-sensitive.
- PP activation handoffs — point-to-point, not collective.
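One way to see why the DP and TP collectives land in different regimes is a simple alpha-beta cost model — latency per step plus bytes on the wire over link bandwidth (all constants below are assumed for illustration; this compares ring against plain recursive doubling, which moves the full message each step):

```python
import math

# Alpha-beta cost sketch: time ≈ steps * alpha + bytes_on_wire / beta.
# Ring moves ~2x the message in small chunks over 2(N-1) steps; plain
# recursive doubling moves the full message in each of log2(N) steps.

def ring_time(n, msg_bytes, alpha, beta):
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * msg_bytes / beta

def recursive_doubling_time(n, msg_bytes, alpha, beta):
    steps = math.ceil(math.log2(n))
    return steps * alpha + steps * msg_bytes / beta

alpha = 5e-6   # assumed 5 us per step
beta = 50e9    # assumed 50 GB/s per link
n = 1024

for label, m in [("DP gradient all-reduce (1 GiB)", 1 << 30),
                 ("TP per-layer all-reduce (1 MiB)", 1 << 20)]:
    print(f"{label}: ring {ring_time(n, m, alpha, beta):.4f} s, "
          f"rec-doubling {recursive_doubling_time(n, m, alpha, beta):.4f} s")
```

Under these assumed constants the ring wins the large DP message (bandwidth term dominates) while recursive doubling wins the small TP message (latency term dominates) — the regime split the axis-to-algorithm mapping exploits.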
Meta's first optimisation (parallelism-axis → topology-layer mapping) and this second optimisation are two sides of the same coin: figure out which collective pattern each parallelism axis produces, route that pattern to the topology layer it fits, and pick the algorithm that fits the resulting latency-bandwidth regime.
Seen in¶
- sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale — canonical wiki reference; Meta names recursive doubling/halving as the custom-algorithm replacement for ring, applied to both 24K-GPU H100 clusters.
- sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — the SIGCOMM 2024 companion; extends topology-awareness to the scheduler layer (minimum-cut partitioning for cross-AI-Zone placement, rank assignment by logical topology position). See patterns/minimum-cut-training-job-placement.
Related¶
- concepts/3d-parallelism — the parallelism structure whose collectives this concept optimises.
- concepts/data-parallelism / concepts/pipeline-parallelism.
- concepts/fat-flow-load-balancing — the sibling Meta optimisation on the routing side.
- systems/infiniband / systems/roce-rdma-over-converged-ethernet — the fabrics whose topologies this optimisation is matched to.
- systems/ai-zone — Meta's topology template that topology-aware placement operates over.
- systems/meta-genai-cluster-roce / systems/meta-genai-cluster-infiniband — the deployments where Meta applied it.
- patterns/minimum-cut-training-job-placement — the scheduler-side expression of topology awareness.