

Collective-library / transport co-design

Context

In hyperscale AI training, the collective-communication library (NCCL / RCCL) and the network transport (RoCE / InfiniBand) are traditionally treated as separate layers: the library issues generic sends / receives, the transport handles the network.

This decomposition breaks down at the scale of 400G RoCE training fabrics:

  • Routing — a hash-based ECMP primitive can't spread the library's few long-lived flows without cooperation.
  • Congestion control — DCQCN has workload-specific failure modes; firmware-only congestion control isn't enough.
  • Admission — the fabric has no principled way to bound in-flight data without knowing the application's communication shape.

Each layer has information the other needs. Solving any of the above in one layer alone requires either heroic over-provisioning or pessimistic tuning.

Pattern

Design the collective library and the transport as a single system. Each layer contributes specific capabilities; no single layer is expected to be complete on its own.

Concretely, Meta's realisation of the pattern (2024-08-05 SIGCOMM paper):

| Layer | Contribution |
| --- | --- |
| Switch routing | E-ECMP — hash on RoCE QP field via UDF |
| Switch QoS | High-priority queue for CTS packets |
| Switch link layer | PFC only; no DCQCN |
| Deep-buffer spine | Absorbs PFC bursts without persistent propagation |
| NCCL (library) | Multi-QP message spreading (round-robin) for hash entropy |
| NCCL (library) | Receiver-driven CTS admission as the primary CC |
| Scheduler | Topology-aware rank assignment (minimum-cut Zone partitioning) |

Each of these was adopted in response to a real failure mode, and each depends on the others to work.
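The routing and spreading rows hinge on one interaction: hash-based ECMP only balances traffic if it sees enough distinct flows, and the library controls how many flows exist. A toy simulation makes this concrete (illustrative only — `ecmp_link` stands in for a real switch hash, and including the QP number in the hash input assumes the UDF configuration from the table):

```python
import hashlib

def ecmp_link(flow_tuple, n_links):
    # Stand-in for hash-based ECMP: the switch hashes header fields
    # (here, a tuple including the QP number) to pick an uplink.
    h = hashlib.md5(repr(flow_tuple).encode()).digest()
    return int.from_bytes(h[:4], "big") % n_links

def spread(message_bytes, src, dst, n_qps, n_links):
    # Round-robin the message across n_qps queue pairs. Because the
    # (hypothetical) switch hash covers the QP field, each QP is a
    # distinct flow and can land on a different link.
    chunk = message_bytes // n_qps
    load = [0] * n_links
    for qp in range(n_qps):
        load[ecmp_link((src, dst, qp), n_links)] += chunk
    return load

# One long-lived flow per peer: the whole message rides a single link.
print(spread(1 << 20, "rank0", "rank1", n_qps=1, n_links=8))
# Spread across 16 QPs: the load typically fans out over several links.
print(spread(1 << 20, "rank0", "rank1", n_qps=16, n_links=8))
```

The point is not the hash function but the division of labour: the switch supplies the hash entropy, the library supplies the flows to hash.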

Meta's 2024-08-05 framing

"To mitigate the congestion for 400G and beyond, we co-designed the collective library and RoCE transport to enforce receiver-driven traffic admission for better performance."

And the explicit acknowledgement that the result is a single system, not two:

"Our current solution depends on careful coordination between the collective communication library and the network."

(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why decomposition fails

Each layer alone has a hard ceiling:

  • Transport-only CC — DCQCN relies on sender-NIC state machines that Meta found unreliable at 400G; limited firmware visibility and firmware bugs introduced regressions. Without library awareness, the transport has no knowledge of the application's communication shape.
  • Library-only CC — NCCL can implement CTS handshakes, but if the fabric doesn't prioritise CTS packets, they queue behind bulk data and the pipeline stalls.
  • Switch-only load balancing — enriching the hash input without library cooperation doesn't help when the library creates only one QP per peer: there is just one flow to spread.
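The receiver-driven admission at issue in the first two bullets can be sketched as a credit scheme — a hypothetical minimal model, not NCCL's actual CTS implementation. The receiver grants a clear-to-send only when its buffers can absorb the data, which bounds in-flight bytes with no sender-side DCQCN state at all:

```python
from collections import deque

class ReceiverAdmission:
    """Credit-based sketch of receiver-driven traffic admission.
    Hypothetical model: names and structure are illustrative."""

    def __init__(self, buffer_bytes):
        self.credit = buffer_bytes   # bytes the fabric may carry toward us
        self.pending = deque()       # senders waiting for a CTS grant

    def request(self, sender, size):
        # Sender asks permission before transmitting a chunk.
        self.pending.append((sender, size))
        return self._grant()

    def _grant(self):
        # Issue CTS grants in order while buffer credit remains.
        granted = []
        while self.pending and self.pending[0][1] <= self.credit:
            sender, size = self.pending.popleft()
            self.credit -= size              # bytes now admitted in flight
            granted.append((sender, size))   # CTS packets back to senders
        return granted

    def delivered(self, size):
        # Data landed and the buffer drained: replenish credit, regrant.
        self.credit += size
        return self._grant()

rx = ReceiverAdmission(buffer_bytes=4)
print(rx.request("s0", 3))   # fits in credit: granted immediately
print(rx.request("s1", 3))   # would exceed the buffer: held back
print(rx.delivered(3))       # s0's data landed, so s1 is now granted
```

This also shows why the CTS priority queue in the switch matters: if the grant messages themselves sit behind bulk data, every sender in `pending` stalls.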

When to use

  • You control both sides. Library source (NCCL / custom fork) + switch config + NIC firmware. If any of those is opaque vendor territory, some of the mechanisms aren't available.
  • Training workload is the dominant use of the fabric. Co-design optimises for one workload shape; mixing storage or serving onto the same fabric loses the advantage.
  • Scale > a few thousand GPUs. At smaller scale the tuning overhead isn't worth it.
  • Organisation tolerates per-workload tuning. E-ECMP QP count, NCCL channel count and buffer size, CTS priority class — none of these are one-and-done.
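As a concrete illustration of what per-workload tuning looks like on the library side, recent NCCL releases expose knobs along these lines. The values below are placeholders, not Meta's settings, and the switch-side pieces (E-ECMP UDF configuration, the CTS priority class) live in switch config rather than environment variables:

```shell
# Illustrative values only — tuned per workload, not one-and-done.
export NCCL_IB_QPS_PER_CONNECTION=4   # QPs per peer: feeds ECMP hash entropy
export NCCL_MIN_NCHANNELS=16          # channel count for the collective
export NCCL_BUFFSIZE=8388608          # per-channel buffer size in bytes
```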

When not to use

  • Multi-tenant RDMA workloads. Different tenants' libraries disagree on admission discipline; co-design assumes a coherent application-side regime.
  • Small-scale / early deployment. Off-the-shelf NCCL + DCQCN reaches acceptable performance up to moderate scale. Co-design is a late-stage optimisation.
  • No deep-buffer spine available. Meta's PFC-only posture depends on it; without it, turning off DCQCN is risky.

Generalisation beyond RoCE/NCCL

The same shape applies wherever a library-level abstraction sits atop a performance-critical transport:

  • Storage RDMA — file system (e.g. Lustre, BeeGFS) co-designed with fabric.
  • Serving-side KV-cache transfer — see concepts/rdma-kv-transfer (Mooncake Transfer Engine in Cloudflare Workers AI).
  • HPC MPI libraries — MPICH / Open MPI tuning on InfiniBand.

In each case the same rule holds: don't solve at one layer what can only be solved across two.
