# Receiver-driven traffic admission

## Definition
Receiver-driven traffic admission is a congestion-control discipline in which the receiver, not the sender, decides when traffic may enter the network. The sender is blocked until the receiver explicitly grants a transmission by sending a clear-to-send (CTS) packet containing the acceptable message size and the target memory location. It is the opposite of the sender-driven, send-anytime-and-handle-congestion-later model of traditional TCP (and of RDMA by default).
In the GPU-to-GPU collective-communication context, receiver-driven admission is implemented by the collective library (NCCL) using its own control packets, and the underlying transport (RoCE) is reduced to lossless transport + link-level flow control — no transport-layer congestion-control state machine required.
## Meta's NCCL mechanism (2024-08-05 SIGCOMM)
The 2024-08-05 post describes the concrete dataflow:
- The sender GPU's compute threads copy data from the compute buffer to an available channel buffer in HBM.
- The sender's CPU proxy thread waits — it cannot issue an RDMA write yet.
- Separately, the receiver issues a CTS packet to the sender carrying the size and memory-location information the write should target.
- The sender's CPU proxy now posts an RDMA write.
- On arrival, the receiver's GPU threads copy channel buffer → destination compute buffer.
- Both CPU proxies recycle the channel buffer; the receiver sends the next CTS once ready.
"The sender CPU proxy thread can only post an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which includes the size and memory information." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
The number of channels times the channel buffer size is the total in-flight traffic budget per GPU pair. Receivers gate issuance of CTS by channel-buffer availability, so the fabric never sees more in-flight data than the receivers can absorb.
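The handshake and the channel-buffer budget can be sketched as a toy model. This is not NCCL's implementation; names, sizes, and the queue-based plumbing are all illustrative, but the invariant is the one described above: the sender cannot post a write without a CTS, and the receiver only issues a CTS when a buffer slot is free.

```python
import threading, queue

CHANNELS = 2   # illustrative channel count (a real NCCL job tunes this)
BUF_SIZE = 4   # illustrative channel-buffer capacity, in messages

cts_queue = queue.Queue()    # CTS grants flowing receiver -> sender
data_queue = queue.Queue()   # "RDMA writes" flowing sender -> receiver

def receiver(n_messages):
    free_slots = CHANNELS * BUF_SIZE   # total in-flight budget per GPU pair
    granted = delivered = 0
    while delivered < n_messages:
        # Issue a CTS only while a channel-buffer slot is free.
        while free_slots > 0 and granted < n_messages:
            cts_queue.put(("cts", granted))  # carries size/location in the real protocol
            free_slots -= 1
            granted += 1
        data_queue.get()        # an "RDMA write" lands in the channel buffer
        delivered += 1
        free_slots += 1         # buffer recycled; the next CTS may go out

def sender(n_messages):
    for i in range(n_messages):
        cts_queue.get()              # blocked until the receiver grants admission
        data_queue.put(("data", i))  # only now is the "RDMA write" posted

N = 20
rx = threading.Thread(target=receiver, args=(N,))
tx = threading.Thread(target=sender, args=(N,))
rx.start(); tx.start(); rx.join(); tx.join()
print(f"delivered {N} messages with at most {CHANNELS * BUF_SIZE} in flight")
```

By construction, the fabric (here, `data_queue`) can never hold more than `CHANNELS * BUF_SIZE` unacknowledged messages, which is exactly the admission property the paragraph above describes.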
## Why this replaces transport-layer congestion control
The elegance of the scheme: at steady state, receiver-driven admission self-regulates.
- When the receiver is fast → CTS issued quickly → pipeline stays full → bandwidth is utilised.
- When the receiver is slow → CTS delayed → sender waits → fabric sees no overload buildup.
- When the fabric is congested → ACKs delayed → channel buffer not recycled promptly → CTS delayed → sender backs off.
This is concepts/backpressure expressed at the collective-library layer rather than the transport layer. It removes the need for per-sender rate-control state machines like DCQCN — the library's natural handshake is the rate control.
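The second bullet (slow receiver → delayed CTS → sender backs off) can be demonstrated with the same toy model. Delays and counts below are made up; the point is that the sender's rate tracks the receiver's recycle rate with no explicit rate-control state:

```python
import threading, queue, time

def run(receiver_delay):
    """Sender throughput when the receiver recycles channel buffers slowly."""
    cts, data = queue.Queue(), queue.Queue()
    budget, n = 4, 40               # illustrative in-flight budget and message count
    def rx():
        granted = min(budget, n)
        for _ in range(granted):
            cts.put(1)              # initial grants, one per free buffer slot
        for _ in range(n):
            data.get()
            time.sleep(receiver_delay)   # slow receiver delays buffer recycling...
            if granted < n:
                cts.put(1)               # ...which delays the next CTS
                granted += 1
    def tx():
        for _ in range(n):
            cts.get()               # admission gate: no CTS, no send
            data.put(1)
    threads = [threading.Thread(target=rx), threading.Thread(target=tx)]
    t0 = time.time()
    for t in threads: t.start()
    for t in threads: t.join()
    return n / (time.time() - t0)   # messages per second

fast = run(0.0)
slow = run(0.01)
print(f"fast receiver: {fast:.0f} msg/s; slow receiver: {slow:.0f} msg/s")
```

The sender never measures congestion or runs a rate controller; its throughput drops purely because grants arrive later.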
## Tuning challenges
Meta calls out two tuning knobs that are non-trivial:
- Number of channels. More channels means more pipelining, but channels contend with compute kernels for GPU thread resources, so you can't just crank the count up.
- Channel buffer size. Too small → bandwidth underutilisation (pipeline stalls waiting for recycle). Too large → congestion can spread across the fabric before admission control reacts, "due to RoCE's more coarse-grained flow control and possible end-host slowness."
Meta's approach: experimentally calibrate both across training job sizes and collective types. The tuning is workload-specific.
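For intuition on the "too small" failure mode: to keep the link full, the aggregate channel-buffer budget must cover roughly the bandwidth-delay product of the path (including CTS turnaround). A back-of-envelope sketch with assumed numbers, not Meta's:

```python
# All numbers are assumptions for illustration, not Meta's configuration.
link_gbps = 400    # NIC line rate in Gbit/s
rtt_us = 20        # round-trip time including CTS turnaround, in microseconds
channels = 16      # channels used by the collective

bdp_bytes = link_gbps * 1e9 / 8 * rtt_us * 1e-6   # bandwidth-delay product
per_channel = bdp_bytes / channels                 # rough per-channel floor
print(f"BDP = {bdp_bytes / 1e6:.1f} MB -> roughly {per_channel / 1024:.0f} KiB per channel buffer")
```

Below this floor the pipeline stalls waiting for recycles; well above it, the extra in-flight allowance is exactly the congestion slack the quote above warns about.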
## The critical-path optimisation: CTS prioritisation
A subtle but load-bearing detail: CTS packets themselves are small control packets that can get queued behind data packets, delaying admission and starving the pipeline. Meta's fix is to prioritise CTS at switches:
"We implemented high priority queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth starvation."
This is a classic pattern: when the control signal of an admission scheme shares the data plane with the traffic it rate-limits, the signal itself must be prioritised over that traffic, or the scheme starves the very pipeline it governs.
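One common way such prioritisation is plumbed end-to-end is DSCP marking of control packets, which switch QoS policy can map to a strict-priority queue. The sketch below is hypothetical host-side marking for illustration; the source describes the switch-side queuing, not this exact mechanism:

```python
import socket

# Hypothetical: mark control packets with a DSCP code point that switch QoS
# policy maps to a strict-priority queue. CS6 (48) is conventionally reserved
# for network control traffic.
DSCP_CS6 = 48
TOS = DSCP_CS6 << 2   # DSCP occupies the upper six bits of the IPv4 TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
print("TOS byte on control socket:", sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))
sock.close()
```

Whatever the marking mechanism, the switch-side half is the part Meta calls out: without a dedicated high-priority class, CTS packets queue behind the bulk data they are supposed to admit.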
## Why this feels like library–transport co-design
Receiver-driven admission requires:
- The collective library emits CTS handshakes as part of the wire protocol (NCCL side).
- The fabric provides a priority class for CTS packets (switch QoS config).
- The fabric buffers enough in-flight data for admission control to be meaningful (deep-buffer CTSW).
- The link layer is lossless, so CTS and data both arrive reliably (PFC).
No single layer is sufficient alone. This is the canonical instance of patterns/collective-library-transport-codesign — the library and the fabric are designed as one system, with roles for each.
## When this doesn't generalise
Meta explicitly cautions that the DCQCN-off + receiver-driven-admission combination "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios." Workloads where the scheme may break:
- Storage RDMA — no library-level admission counterpart.
- Multi-tenant workloads — many mutually-uncoordinated senders; no collective admission graph.
- GPU ≫ network — if receivers can't issue CTS fast enough to keep the NIC fed, you'd rather saturate at the transport layer.
## Seen in
- sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — canonical wiki reference; Meta details the NCCL CTS handshake and its use as admission control, plus the CTS-prioritisation detail.
## Related
- concepts/backpressure — the umbrella abstraction.
- concepts/dcqcn — the transport-layer CC Meta substitutes away with this scheme.
- concepts/priority-flow-control — the link-layer primitive retained.
- concepts/rdma-queue-pair — the RDMA abstraction over which CTS and data flow.
- systems/roce-rdma-over-converged-ethernet / systems/ai-zone — the fabric and topology.
- patterns/collective-library-transport-codesign — the enclosing pattern.