
Collective-communication topology awareness

Definition

Collective-communication topology awareness is the practice of choosing collective-operation algorithms (all-reduce, all-gather, reduce-scatter, broadcast) whose communication graph matches the physical topology of the underlying network fabric, rather than using a default algorithm (typically ring all-reduce) that assumes a flat network.

The canonical default is the ring all-reduce (as in NCCL's ring algorithm): every GPU communicates only with its two ring neighbours, making two passes around the ring (a reduce-scatter followed by an all-gather). It is bandwidth-optimal under idealised conditions but latency-sensitive: each of its 2(N − 1) steps waits on the previous one.

The canonical alternatives at scale are recursive doubling and recursive halving: algorithms whose communication graph is a tree, hypercube, or butterfly rather than a ring. They trade higher per-step bandwidth for far fewer latency-paying steps as cluster size grows.

Meta's framing (2024-06)

Meta makes this optimisation one of three it applied to bring both its 24K-GPU RoCE and InfiniBand clusters to equivalent GenAI-workload performance:

"We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive. We do this by changing the default implementation of collectives with custom algorithms such as recursive doubling or halving instead of conventional algorithms like rings." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

The operative word is latency-sensitive. A ring all-reduce across 24K GPUs takes 2 × (24K − 1) ≈ 49K steps, and each step pays a network hop's latency. A recursive-halving/doubling all-reduce takes ~2 × log₂(24K) ≈ 30 steps. For bandwidth-bound messages the ring is fine; for latency-sensitive collectives, the algorithmic choice dominates.
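The step counts above can be checked directly (a quick sketch; "24K" is taken to mean exactly 24 × 1024 = 24,576 GPUs, which is an assumption):

```python
import math

N = 24_576  # "24K" GPUs, assumed to mean exactly 24 * 1024

ring_steps = 2 * (N - 1)      # reduce-scatter pass + all-gather pass around the ring
rhd_steps = 2 * math.log2(N)  # recursive halving + recursive doubling

print(f"ring all-reduce:  {ring_steps} latency-paying steps")
print(f"halving/doubling: {rhd_steps:.1f} latency-paying steps")
```

At 5 µs per step, ~49K steps is a quarter of a second of pure latency per collective, which is where the "less latency-sensitive" motivation comes from.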

The three paradigmatic collective algorithms

| Algorithm | Steps | Data sent per node | Good for |
|---|---|---|---|
| Ring all-reduce | 2(N−1) = O(N) | 2(N−1)/N × message size | Bandwidth-bound messages, flat fabric |
| Recursive halving + doubling (Rabenseifner) | 2 log₂ N = O(log N) | 2(N−1)/N × message size | Latency-sensitive collectives, hierarchical fabric |
| Tree broadcast / reduce | O(log N) | ≈ message size | Broadcast / reduce collectives, hierarchical fabric |

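The O(log N) structure is easy to see in a simulation of recursive doubling (a minimal sketch, assuming a power-of-two rank count and the standard hypercube partner schedule i XOR 2^d; real implementations overlap these exchanges on actual network links):

```python
def recursive_doubling_allreduce(vectors):
    """All-reduce by recursive doubling: in step d, each rank i
    exchanges its full partial sum with partner i XOR 2^d and adds it.
    After log2(N) steps every rank holds the global sum."""
    n = len(vectors)
    assert n > 0 and n & (n - 1) == 0, "power-of-two rank count assumed"
    data = [list(v) for v in vectors]
    steps = 0
    d = 1
    while d < n:
        # every rank pairs with the rank differing in bit d of its index
        data = [[a + b for a, b in zip(data[i], data[i ^ d])] for i in range(n)]
        d *= 2
        steps += 1
    return data, steps

# 8 ranks, 4-element gradients: every rank gets the global sum in 3 steps
vecs = [[i, 2 * i, 3 * i, 4 * i] for i in range(8)]
out, steps = recursive_doubling_allreduce(vecs)
```

Note each step moves the full (partial-sum) vector, which is the bandwidth-for-latency trade: Rabenseifner's halving+doubling variant halves the exchanged payload each step to recover bandwidth-optimality while keeping the log-step count.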
Real-world implementations often compose: a recursive-doubling phase for inter-rack collectives, a ring phase for intra-rack where bandwidth dominates.
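That composition can be sketched as a two-level all-reduce (hypothetical helper, not Meta's code; a local reduce stands in for the bandwidth-heavy intra-rack phase, recursive doubling for the latency-sensitive inter-rack phase, and power-of-two rack counts are assumed):

```python
def hierarchical_allreduce(racks):
    """racks: list of racks, each a list of per-GPU gradient vectors.
    Phase 1: intra-rack reduce to a leader (bandwidth-rich local links).
    Phase 2: recursive doubling across rack leaders (few latency-paying hops).
    Phase 3: intra-rack broadcast of the global sum."""
    # phase 1: each rack leader accumulates the rack-local elementwise sum
    leaders = [[sum(col) for col in zip(*rack)] for rack in racks]
    # phase 2: recursive doubling across leaders (power-of-two rack count assumed)
    n = len(leaders)
    d = 1
    while d < n:
        leaders = [[a + b for a, b in zip(leaders[i], leaders[i ^ d])]
                   for i in range(n)]
        d *= 2
    # phase 3: every GPU in every rack receives the global sum
    return [[list(leaders[r]) for _ in rack] for r, rack in enumerate(racks)]

# 4 racks x 2 GPUs, 2-element vectors; global sum of 1..8 is 36 per element
racks = [[[g, g] for g in (2 * r + 1, 2 * r + 2)] for r in range(4)]
result = hierarchical_allreduce(racks)
```

The inter-rack phase runs over log₂(racks) steps regardless of rack size, which is exactly the layer where hop latency is highest.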

Why this goes with concepts/3d-parallelism

3D-parallel training has multiple collective types running concurrently:

  • DP all-reduce — once per step, large message — ring-friendly.
  • TP per-layer all-reduce — many per step, smaller messages — latency-sensitive.
  • PP activation handoffs — point-to-point, not collective.
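The latency/bandwidth split in that list can be made concrete with the standard alpha-beta cost model (α = per-step latency, β = per-byte transfer time; the α and β values below are illustrative assumptions, not measurements, and real TP traffic usually stays intra-node):

```python
import math

def ring_cost(n, msg_bytes, alpha, beta):
    """Alpha-beta cost of ring all-reduce: 2(N-1) latency terms,
    2(N-1)/N of the message sent per node in total."""
    return 2 * (n - 1) * alpha + 2 * (n - 1) / n * msg_bytes * beta

def rhd_cost(n, msg_bytes, alpha, beta):
    """Recursive halving + doubling (Rabenseifner): only 2 log2(N)
    latency terms, same total bytes per node as the ring."""
    return 2 * math.log2(n) * alpha + 2 * (n - 1) / n * msg_bytes * beta

ALPHA = 5e-6     # 5 us per step (illustrative assumption)
BETA = 1 / 50e9  # ~50 GB/s effective link bandwidth (illustrative assumption)
N = 24_576

dp_ring, dp_rhd = ring_cost(N, 1 << 30, ALPHA, BETA), rhd_cost(N, 1 << 30, ALPHA, BETA)
tp_ring, tp_rhd = ring_cost(N, 1 << 20, ALPHA, BETA), rhd_cost(N, 1 << 20, ALPHA, BETA)
print(f"1 GiB DP all-reduce: ring {dp_ring:.3f}s vs halving/doubling {dp_rhd:.3f}s")
print(f"1 MiB TP all-reduce: ring {tp_ring:.3f}s vs halving/doubling {tp_rhd:.6f}s")
```

Under these assumptions the bandwidth term dominates the large DP message, so the ring's latency penalty is a modest overhead, while the small message is almost pure latency and the log-step algorithm wins by orders of magnitude.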

Meta's first optimisation (parallelism-axis → topology-layer mapping) and this second optimisation are two sides of the same coin: figure out which collective pattern each parallelism axis produces, route that pattern to the topology layer it fits, and pick the algorithm that fits the resulting latency-bandwidth regime.
