

Receiver-driven traffic admission

Definition

Receiver-driven traffic admission is a congestion-control discipline in which the receiver, not the sender, decides when traffic may enter the network. The sender is blocked until the receiver explicitly grants a transmission by sending a clear-to-send (CTS) packet containing the acceptable message size and the target memory location. This inverts the sender-driven "send anytime, handle congestion later" model of traditional TCP (and of default RDMA).

In the GPU-to-GPU collective-communication context, receiver-driven admission is implemented by the collective library (NCCL) using its own control packets, and the underlying transport (RoCE) is reduced to lossless transport + link-level flow control — no transport-layer congestion-control state machine required.

Meta's NCCL mechanism (2024-08-05 SIGCOMM)

The 2024-08-05 post describes the concrete dataflow:

  1. The sender GPU's compute threads copy data from the compute buffer to an available channel buffer in HBM.
  2. The sender's CPU proxy thread waits — it cannot issue an RDMA write yet.
  3. Separately, the receiver issues a CTS packet to the sender carrying the size and memory-location information the write should target.
  4. The sender's CPU proxy now posts an RDMA write.
  5. On arrival, the receiver's GPU threads copy channel buffer → destination compute buffer.
  6. Both CPU proxies recycle the channel buffer; the receiver sends the next CTS once ready.

"The sender CPU proxy thread can only post an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which includes the size and memory information." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
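The six steps can be sketched as two threads exchanging grants over queues. This is a toy model, not NCCL's implementation: the names, the queue-based "RDMA write", and the slot counts are all invented for illustration.

```python
import queue
import threading

NUM_MESSAGES = 4
CHANNEL_BUFFER_SLOTS = 2     # hypothetical in-flight budget for this GPU pair

cts_queue = queue.Queue()    # receiver -> sender: clear-to-send grants
data_queue = queue.Queue()   # sender -> receiver: stands in for the RDMA write
received = []

def receiver():
    # Steps 3/6: issue one CTS per free channel-buffer slot, recycle on arrival.
    for _ in range(CHANNEL_BUFFER_SLOTS):
        cts_queue.put({"size": 1 << 20, "addr": 0x0})   # size + memory location
    for _ in range(NUM_MESSAGES):
        msg = data_queue.get()                          # step 5: data lands
        received.append(msg)
        cts_queue.put({"size": 1 << 20, "addr": 0x0})   # step 6: recycle -> next CTS

def sender():
    for i in range(NUM_MESSAGES):
        cts_queue.get()                 # step 2: proxy blocks until a CTS arrives
        data_queue.put(("chunk", i))    # step 4: post the "RDMA write"

r = threading.Thread(target=receiver)
s = threading.Thread(target=sender)
r.start(); s.start()
s.join(); r.join()
```

The sender never has more than `CHANNEL_BUFFER_SLOTS` messages outstanding, because each send consumes a grant that only the receiver can replenish.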

The number of channels times the channel buffer size is the total in-flight traffic budget per GPU pair. Receivers gate issuance of CTS by channel-buffer availability, so the fabric never sees more in-flight data than the receivers can absorb.
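The budget is a simple product; a hypothetical configuration (these are not Meta's production values) makes the gating concrete:

```python
# In-flight budget per GPU pair = number of channels x channel buffer size.
# Both values below are hypothetical, for illustration only.
num_channels = 16
channel_buffer_bytes = 4 << 20           # 4 MiB per channel buffer

in_flight_budget = num_channels * channel_buffer_bytes
assert in_flight_budget == 64 << 20      # at most 64 MiB in flight per pair
```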

Why this replaces transport-layer congestion control

The elegance of the scheme: at steady state, receiver-driven admission self-regulates.

  • When the receiver is fast → CTS issued quickly → pipeline stays full → bandwidth is utilised.
  • When the receiver is slow → CTS delayed → sender waits → fabric sees no overload buildup.
  • When the fabric is congested → ACKs delayed → channel buffer not recycled promptly → CTS delayed → sender backs off.

This is concepts/backpressure expressed at the collective-library layer rather than the transport layer. It removes the need for per-sender rate-control state machines like DCQCN — the library's natural handshake is the rate control.
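The self-clocking above reduces to window/turnaround arithmetic: the in-flight budget acts as a window, and the CTS turnaround time acts as the effective RTT. A sketch with hypothetical numbers:

```python
# Hypothetical: 64 MiB in-flight budget (channels x channel buffer size).
budget_bytes = 64 << 20

def achievable_rate(cts_turnaround_ms):
    # At most `budget_bytes` can be in flight per CTS cycle, so the sender's
    # rate is clocked by how quickly CTS grants come back.
    return budget_bytes / cts_turnaround_ms   # bytes per millisecond

fast = achievable_rate(1)    # receiver keeps up: CTS returns in 1 ms
slow = achievable_rate(10)   # slow receiver or congested fabric: CTS delayed
assert slow == fast / 10     # delayed CTS -> sender backs off proportionally
```

No explicit rate controller appears anywhere; the back-off falls out of the grant loop.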

Tuning challenges

Meta calls out two tuning knobs that are non-trivial:

  1. Number of channels. More channels = more pipelining, but they contend with compute operations for GPU thread resources. You can't just crank it.
  2. Channel buffer size. Too small → bandwidth underutilisation (pipeline stalls waiting for recycle). Too large → congestion can spread across the fabric before admission control reacts, "due to RoCE's more coarse-grained flow control and possible end-host slowness."

Meta's approach: experimentally calibrate both across training job sizes and collective types. The tuning is workload-specific.
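Knob 2 can be framed as a bandwidth-delay-product check: to keep the pipeline full, the total in-flight budget must cover link bandwidth times CTS round-trip time. A sketch with hypothetical numbers (400 Gb/s link, 20 µs round trip — not Meta's figures):

```python
# Bandwidth-delay product: bytes that must be in flight for full utilisation.
link_bps = 400e9                         # hypothetical 400 Gb/s link
rtt_s = 20e-6                            # hypothetical 20 us CTS round trip
bdp_bytes = link_bps / 8 * rtt_s         # 1e6 bytes must be in flight

def utilisation(num_channels, channel_buffer_bytes):
    budget = num_channels * channel_buffer_bytes
    return min(1.0, budget / bdp_bytes)

assert utilisation(16, 512 << 10) == 1.0   # 8 MiB budget: pipeline stays full
assert utilisation(2, 128 << 10) < 1.0     # 256 KiB budget: stalls on recycle
```

The "too large" failure mode is not visible in this arithmetic — oversized buffers admit traffic the fabric can't drain, which is exactly the coarse-grained-flow-control hazard the quote describes.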

The critical-path optimisation: CTS prioritisation

A subtle but load-bearing detail: CTS packets themselves are small control packets that can get queued behind data packets, delaying admission and starving the pipeline. Meta's fix is to prioritise CTS at switches:

"We implemented high priority queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth starvation."

This is a classic move for control-over-data schemes: the control signal is made preemptive over the very data it is supposed to rate-limit, so congestion can never delay the mechanism that relieves it.
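A minimal strict-priority model shows why this works; the class and packet shapes are invented for illustration, not switch firmware.

```python
from collections import deque

class StrictPriorityPort:
    """Toy egress port: the CTS class is always drained before bulk data."""

    def __init__(self):
        self.cts = deque()    # high-priority control class
        self.data = deque()   # bulk data class

    def enqueue(self, pkt):
        (self.cts if pkt["type"] == "CTS" else self.data).append(pkt)

    def dequeue(self):
        # A burst of data packets cannot delay the admission signal.
        if self.cts:
            return self.cts.popleft()
        return self.data.popleft() if self.data else None

port = StrictPriorityPort()
port.enqueue({"type": "DATA", "id": 1})
port.enqueue({"type": "DATA", "id": 2})
port.enqueue({"type": "CTS", "id": 3})
first = port.dequeue()
assert first["type"] == "CTS"   # CTS jumps the data backlog
```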

Why this feels like library–transport co-design

Receiver-driven admission requires:

  • Collective library emits CTS handshakes as part of the wire protocol (NCCL side).
  • Fabric provides a priority class for CTS packets (switch QoS config).
  • Switch buffers deep enough for admission control to be meaningful (deep-buffer CTSW).
  • Lossless link layer so CTS and data both arrive reliably (PFC).

No single layer is sufficient alone. This is the canonical instance of patterns/collective-library-transport-codesign — the library and the fabric are designed as one system, with roles for each.

When this doesn't generalise

Meta explicitly cautions that the DCQCN-off + receiver-driven-admission combination "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios." Workloads where the scheme may break:

  • Storage RDMA — no library-level admission counterpart.
  • Multi-tenant workloads — many mutually-uncoordinated senders; no collective admission graph.
  • Network ≫ GPU — if receivers can't issue CTS fast enough to keep the NIC fed, you'd rather saturate at the transport layer.
