

Collective-library / transport co-design

Context

In hyperscale AI training, the collective-communication library (NCCL / RCCL) and the network transport (RoCE / InfiniBand) are traditionally treated as separate layers: the library issues generic sends / receives, the transport handles the network.

This decomposition breaks down at the scale of 400G RoCE training fabrics:

  • Routing — a hash-based ECMP primitive can't spread the library's few long-lived flows without cooperation.
  • Congestion control — DCQCN has workload-specific failure modes; firmware-only congestion control isn't enough.
  • Admission — the fabric has no principled way to bound in-flight data without knowing the application's communication shape.

Each layer has information the other needs. Solving any of the above in one layer alone requires either heroic over-provisioning or pessimistic tuning.

Pattern

Design the collective library and the transport as a single system. Each layer contributes specific capabilities; no single layer is expected to be complete on its own.

Concretely, Meta's realisation of the pattern (2024-08-05 SIGCOMM paper):

| Layer | Contribution |
| --- | --- |
| Switch routing | E-ECMP — hash on RoCE QP field via UDF |
| Switch QoS | High-priority queue for CTS packets |
| Switch link layer | PFC only; no DCQCN |
| Deep-buffer spine | Absorbs PFC bursts without persistent propagation |
| NCCL (library) | Multi-QP message spreading (round-robin) for hash entropy |
| NCCL (library) | Receiver-driven CTS admission as the primary CC |
| Scheduler | Topology-aware rank assignment (minimum-cut Zone partitioning) |

Each of these was adopted in response to a real failure mode, and each depends on the others to work.
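The routing and spreading rows hinge on one interaction: hash-based ECMP only balances traffic if it sees enough distinct flows, and the library controls how many flows exist. A toy simulation makes this concrete (illustrative only — `ecmp_link` stands in for a real switch hash, and including the QP number in the hash input assumes the UDF configuration from the table):

```python
import hashlib

def ecmp_link(flow_tuple, n_links):
    # Stand-in for hash-based ECMP: the switch hashes header fields
    # (here, a tuple including the QP number) to pick an uplink.
    h = hashlib.md5(repr(flow_tuple).encode()).digest()
    return int.from_bytes(h[:4], "big") % n_links

def spread(message_bytes, src, dst, n_qps, n_links):
    # Round-robin the message across n_qps queue pairs. Because the
    # (hypothetical) switch hash covers the QP field, each QP is a
    # distinct flow and can land on a different link.
    chunk = message_bytes // n_qps
    load = [0] * n_links
    for qp in range(n_qps):
        load[ecmp_link((src, dst, qp), n_links)] += chunk
    return load

# One long-lived flow per peer: the whole message rides a single link.
print(spread(1 << 20, "rank0", "rank1", n_qps=1, n_links=8))
# Spread across 16 QPs: the load typically fans out over several links.
print(spread(1 << 20, "rank0", "rank1", n_qps=16, n_links=8))
```

The point is not the hash function but the division of labour: the switch supplies the hash entropy, the library supplies the flows to hash.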

Meta's 2024-08-05 framing

"To mitigate the congestion for 400G and beyond, we co-designed the collective library and RoCE transport to enforce receiver-driven traffic admission for better performance."

And the explicit acknowledgement that the result is a single system, not two:

"Our current solution depends on careful coordination between the collective communication library and the network."

(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why decomposition fails

Each layer alone has a hard ceiling:

  • Transport-only CC — DCQCN relies on sender-NIC state machines that Meta found unreliable at 400G; limited firmware visibility and firmware bugs introduced regressions. Without library awareness, the transport has no knowledge of the application's communication shape.
  • Library-only CC — NCCL can implement CTS handshakes, but if the fabric doesn't prioritise CTS packets, they queue behind bulk data and the pipeline stalls.
  • Switch-only load balancing — enriching the hash input without library cooperation doesn't help when the library creates only one QP per peer: there is just one flow to spread.
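The receiver-driven admission at issue in the first two bullets can be sketched as a credit scheme — a hypothetical minimal model, not NCCL's actual CTS implementation. The receiver grants a clear-to-send only when its buffers can absorb the data, which bounds in-flight bytes with no sender-side DCQCN state at all:

```python
from collections import deque

class ReceiverAdmission:
    """Credit-based sketch of receiver-driven traffic admission.
    Hypothetical model: names and structure are illustrative."""

    def __init__(self, buffer_bytes):
        self.credit = buffer_bytes   # bytes the fabric may carry toward us
        self.pending = deque()       # senders waiting for a CTS grant

    def request(self, sender, size):
        # Sender asks permission before transmitting a chunk.
        self.pending.append((sender, size))
        return self._grant()

    def _grant(self):
        # Issue CTS grants in order while buffer credit remains.
        granted = []
        while self.pending and self.pending[0][1] <= self.credit:
            sender, size = self.pending.popleft()
            self.credit -= size              # bytes now admitted in flight
            granted.append((sender, size))   # CTS packets back to senders
        return granted

    def delivered(self, size):
        # Data landed and the buffer drained: replenish credit, regrant.
        self.credit += size
        return self._grant()

rx = ReceiverAdmission(buffer_bytes=4)
print(rx.request("s0", 3))   # fits in credit: granted immediately
print(rx.request("s1", 3))   # would exceed the buffer: held back
print(rx.delivered(3))       # s0's data landed, so s1 is now granted
```

This also shows why the CTS priority queue in the switch matters: if the grant messages themselves sit behind bulk data, every sender in `pending` stalls.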

When to use

  • You control both sides. Library source (NCCL / custom fork) + switch config + NIC firmware. If any of those is opaque vendor territory, some of the mechanisms aren't available.
  • Training workload is the dominant use of the fabric. Co-design optimises for one workload shape; mixing storage or serving onto the same fabric loses the advantage.
  • Scale > a few thousand GPUs. At smaller scale the tuning overhead isn't worth it.
  • Organisation tolerates per-workload tuning. E-ECMP QP count, NCCL channel count and buffer size, CTS priority class — none of these are one-and-done.
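As a concrete illustration of what per-workload tuning looks like on the library side, recent NCCL releases expose knobs along these lines. The values below are placeholders, not Meta's settings, and the switch-side pieces (E-ECMP UDF configuration, the CTS priority class) live in switch config rather than environment variables:

```shell
# Illustrative values only — tuned per workload, not one-and-done.
export NCCL_IB_QPS_PER_CONNECTION=4   # QPs per peer: feeds ECMP hash entropy
export NCCL_MIN_NCHANNELS=16          # channel count for the collective
export NCCL_BUFFSIZE=8388608          # per-channel buffer size in bytes
```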

When not to use

  • Multi-tenant RDMA workloads. Different tenants' libraries disagree on admission discipline; co-design assumes a coherent application-side regime.
  • Small-scale / early deployment. Off-the-shelf NCCL + DCQCN reaches acceptable performance up to moderate scale. Co-design is a late-stage optimisation.
  • No deep-buffer spine available. Meta's PFC-only posture depends on it; without it, turning off DCQCN is risky.

Generalisation beyond RoCE/NCCL

The same shape applies wherever a library-level abstraction sits atop a performance-critical transport:

  • Storage RDMA — file system (e.g. Lustre, BeeGFS) co-designed with fabric.
  • Serving-side KV-cache transfer — see concepts/rdma-kv-transfer (Mooncake Transfer Engine in Cloudflare Workers AI).
  • HPC MPI libraries — MPICH / Open MPI tuning on InfiniBand.

In each case the same rule holds: don't solve at one layer what can only be solved across two.
