Collective-library / transport co-design¶
Context¶
In hyperscale AI training, the collective-communication library (NCCL / RCCL) and the network transport (RoCE / InfiniBand) are traditionally treated as separate layers: the library issues generic sends / receives, the transport handles the network.
This decomposition breaks down at the scale of 400G RoCE training fabrics:
- Routing — hash-based ECMP can't evenly spread the library's few long-lived, high-bandwidth flows without cooperation from the library.
- Congestion control — DCQCN has workload-specific failure modes; firmware-only CC isn't enough.
- Admission — the fabric has no principled way to bound in-flight data without knowing the application's communication shape.
Each layer has information the other needs. Solving any of the above in one layer alone requires either heroic over-provisioning or pessimistic tuning.
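The routing point can be made concrete with a small simulation: when a fabric hashes only a handful of fat flows onto its uplinks, collisions are common and one link ends up carrying several flows' worth of traffic; splitting the same traffic across many thinner flows (more hash entropy) evens the load. This is an illustrative toy model, not Meta's measurement — the link count, flow counts, and the use of a random hash are all assumptions.

```python
import random
from collections import Counter

def ecmp_load(num_flows: int, num_links: int, seed: int = 0) -> int:
    """Hash each flow onto one of num_links uplinks; return the max link load.

    ECMP's per-flow hash is modelled as a random but fixed choice per flow;
    each flow carries one unit of bandwidth."""
    rng = random.Random(seed)
    loads = Counter(rng.randrange(num_links) for _ in range(num_flows))
    return max(loads.values())

links = 8
# Few fat flows: one flow per peer, as a library with one QP per peer creates.
fat = ecmp_load(num_flows=8, num_links=links)
# Same total traffic split 16 ways per peer: 128 thinner flows.
thin = ecmp_load(num_flows=128, num_links=links) / 16  # normalise to fat-flow units
print(f"max link load, 8 fat flows:    {fat} units")
print(f"max link load, 128 thin flows: {thin:.2f} units")
```

Averaged over many hash seeds, the thin-flow maximum sits close to the ideal 1 unit per link while the fat-flow maximum is routinely 2-3x that — which is why the pattern needs both the switch (hash on the QP field) and the library (create many QPs) to act.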
Pattern¶
Design the collective library and the transport as a single system. Each layer contributes specific capabilities; no single layer is expected to be complete on its own.
Concretely, Meta's realisation of the pattern (2024-08-05 SIGCOMM paper):
| Layer | Contribution |
|---|---|
| Switch routing | E-ECMP — hash on RoCE QP field via UDF |
| Switch QoS | High-priority queue for CTS packets |
| Switch link layer | PFC only; no DCQCN |
| Deep-buffer spine | Absorbs PFC bursts without persistent propagation |
| NCCL (library) | Multi-QP message spreading (round-robin) for hash entropy |
| NCCL (library) | Receiver-driven CTS admission as the primary CC |
| Scheduler | Topology-aware rank assignment (minimum-cut Zone partitioning) |
Each of these was adopted in response to a real failure mode, and each depends on the others to work.
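The library half of the E-ECMP row can be sketched as a round-robin splitter: a large collective message is chunked and the chunks are posted across several queue pairs to the same peer, so the switch's QP-field hash sees many flows instead of one. `QueuePair`, `post_send`, and the chunk size are illustrative stand-ins, not NCCL's internal API.

```python
from dataclasses import dataclass, field

@dataclass
class QueuePair:
    """Stand-in for an RDMA queue pair to one peer (not NCCL's real API)."""
    qp_num: int
    posted: list = field(default_factory=list)

    def post_send(self, chunk: bytes) -> None:
        self.posted.append(chunk)

def spread_message(message: bytes, qps: list, chunk_size: int) -> None:
    """Round-robin a message's chunks across all QPs to one peer.

    Because E-ECMP hashes on the QP field, each QP can take a different
    path: one fat flow becomes len(qps) thinner ones."""
    for i in range(0, len(message), chunk_size):
        qp = qps[(i // chunk_size) % len(qps)]
        qp.post_send(message[i:i + chunk_size])

qps = [QueuePair(qp_num=n) for n in range(4)]
spread_message(b"x" * 4096, qps, chunk_size=256)
print([len(qp.posted) for qp in qps])  # 16 chunks round-robined over 4 QPs
```

The QP count and chunk size are exactly the kind of per-workload knobs the "When to use" section warns about: too few QPs and the hash entropy vanishes, too many and per-QP overhead grows.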
Meta's 2024-08-05 framing¶
"To mitigate the congestion for 400G and beyond, we co-designed the collective library and RoCE transport to enforce receiver-driven traffic admission for better performance."
And the explicit acknowledgement that the result is a single system, not two:
"Our current solution depends on careful coordination between the collective communication library and the network."
(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
Why decomposition fails¶
Each layer alone has a hard ceiling:
- Transport-only CC — DCQCN relies on sender-NIC state machines that Meta found unreliable at 400G; firmware visibility/bugs introduced regressions. No library awareness ⇒ no application-shape knowledge.
- Library-only CC — NCCL can implement CTS handshakes, but if the fabric doesn't prioritise CTS, they queue behind data and the pipeline stalls.
- Switch-only load balancing — widening the switch's hash inputs (e.g. hashing on the QP field) doesn't help when the library creates only one QP per peer: there is no entropy for the hash to exploit.
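The receiver-driven admission in the library-only bullet can be sketched as a credit loop: the receiver grants clear-to-send (CTS) credits only while it has free buffer, so in-flight data is bounded by the receive window regardless of how fast the sender could push. A hedged toy model under assumed names (`Receiver`, `Sender`, the 256 KiB window), not NCCL's implementation — and it silently assumes CTS messages arrive promptly, which is exactly why the fabric must prioritise them.

```python
class Receiver:
    """Grants CTS credits up to its free buffer, bounding in-flight data."""
    def __init__(self, buffer_bytes: int):
        self.free = buffer_bytes

    def grant_cts(self, requested: int) -> int:
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def deliver(self, nbytes: int) -> None:
        self.free += nbytes  # buffer drained to GPU memory; credit returns

class Sender:
    def __init__(self, payload: int):
        self.remaining = payload
        self.in_flight = 0

    def send_step(self, rx: Receiver, want: int) -> None:
        granted = rx.grant_cts(min(want, self.remaining))
        self.remaining -= granted
        self.in_flight += granted

WINDOW = 256 * 1024
rx, tx = Receiver(WINDOW), Sender(1024 * 1024)  # 1 MiB against a 256 KiB window
max_in_flight = 0
while tx.remaining or tx.in_flight:
    tx.send_step(rx, want=64 * 1024)  # grants 0 when the window is full
    max_in_flight = max(max_in_flight, tx.in_flight)
    if tx.in_flight >= WINDOW or tx.remaining == 0:
        rx.deliver(tx.in_flight)  # data lands; credits return to the receiver
        tx.in_flight = 0
print(f"max in-flight: {max_in_flight} bytes (window: {WINDOW} bytes)")
```

The invariant `free + in_flight == WINDOW` is the admission guarantee: the sender can never have more data in the fabric than the receiver has buffer to land it, which is what lets Meta drop DCQCN at the transport layer.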
When to use¶
- You control both sides. Library source (NCCL / custom fork) + switch config + NIC firmware. If any of those is opaque vendor territory, some of the mechanisms aren't available.
- Training workload is the dominant use of the fabric. Co-design optimises for one workload shape; mixing storage or serving onto the same fabric loses the advantage.
- Scale > a few thousand GPUs. At smaller scale the tuning overhead isn't worth it.
- Organisation tolerates per-workload tuning. E-ECMP QP count, NCCL channel count and buffer size, CTS priority class — none of these are one-and-done.
When not to use¶
- Multi-tenant RDMA workloads. Different tenants' libraries disagree on admission discipline; co-design assumes a coherent application-side regime.
- Small-scale / early deployment. Off-the-shelf NCCL + DCQCN reaches acceptable performance up to moderate scale. Co-design is a late-stage optimisation.
- No deep-buffer spine available. Meta's PFC-only posture depends on it; without it, turning off DCQCN is risky.
Generalisation beyond RoCE/NCCL¶
The same shape applies wherever a library-level abstraction sits atop a performance-critical transport:
- Storage RDMA — file system (e.g. Lustre, BeeGFS) co-designed with fabric.
- Serving-side KV-cache transfer — see concepts/rdma-kv-transfer (Mooncake Transfer Engine in Cloudflare Workers AI).
- HPC MPI libraries — MPICH / Open MPI tuning on InfiniBand.
In each case the same rule holds: don't solve at one layer what can only be solved across two.
Wiki instances¶
- Meta 400G RoCE training fabric (2024). Canonical wiki reference — the paper is explicitly framed as library+transport co-design. (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale.)
Related¶
- concepts/receiver-driven-traffic-admission — the CC substitute that requires co-design.
- concepts/enhanced-ecmp-qp-scaling — the routing substitute that requires co-design.
- concepts/dcqcn / concepts/priority-flow-control — the transport-layer primitives in play.
- concepts/fat-flow-load-balancing — the problem class that exposes single-layer limits.
- concepts/collective-communication-topology-awareness — the sibling pattern on the algorithm side.
- systems/ai-zone / systems/roce-rdma-over-converged-ethernet — the fabric on which Meta runs this.
- patterns/dedicated-backend-training-fabric — the enclosing architectural choice that makes co-design feasible.
- companies/meta.