

Receiver-driven traffic admission

Definition

Receiver-driven traffic admission is a congestion-control discipline in which the receiver, not the sender, decides when traffic may enter the network. The sender is blocked until the receiver explicitly grants a transmission by sending a clear-to-send (CTS) packet containing the acceptable message size and the target memory location. This inverts the sender-driven "send anytime, handle congestion later" model of traditional TCP (and of default RDMA).

In the GPU-to-GPU collective-communication context, receiver-driven admission is implemented by the collective library (NCCL) using its own control packets, and the underlying transport (RoCE) is reduced to lossless transport + link-level flow control — no transport-layer congestion-control state machine required.

Meta's NCCL mechanism (2024-08-05 SIGCOMM)

The 2024-08-05 post describes the concrete dataflow:

  1. The sender GPU's compute threads copy data from the compute buffer to an available channel buffer in HBM.
  2. The sender's CPU proxy thread waits — it cannot issue an RDMA write yet.
  3. Separately, the receiver issues a CTS packet to the sender carrying the size and memory-location information the write should target.
  4. The sender's CPU proxy now posts an RDMA write.
  5. On arrival, the receiver's GPU threads copy channel buffer → destination compute buffer.
  6. Both CPU proxies recycle the channel buffer; the receiver sends the next CTS once ready.

"The sender CPU proxy thread can only post an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which includes the size and memory information." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
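The six steps can be sketched as two threads exchanging grants over queues. This is a toy model, not NCCL's implementation: the names, the queue-based "RDMA write", and the slot counts are all invented for illustration.

```python
import queue
import threading

NUM_MESSAGES = 4
CHANNEL_BUFFER_SLOTS = 2     # hypothetical in-flight budget for this GPU pair

cts_queue = queue.Queue()    # receiver -> sender: clear-to-send grants
data_queue = queue.Queue()   # sender -> receiver: stands in for the RDMA write
received = []

def receiver():
    # Steps 3/6: issue one CTS per free channel-buffer slot, recycle on arrival.
    for _ in range(CHANNEL_BUFFER_SLOTS):
        cts_queue.put({"size": 1 << 20, "addr": 0x0})   # size + memory location
    for _ in range(NUM_MESSAGES):
        msg = data_queue.get()                          # step 5: data lands
        received.append(msg)
        cts_queue.put({"size": 1 << 20, "addr": 0x0})   # step 6: recycle -> next CTS

def sender():
    for i in range(NUM_MESSAGES):
        cts_queue.get()                 # step 2: proxy blocks until a CTS arrives
        data_queue.put(("chunk", i))    # step 4: post the "RDMA write"

r = threading.Thread(target=receiver)
s = threading.Thread(target=sender)
r.start(); s.start()
s.join(); r.join()
```

The sender never has more than `CHANNEL_BUFFER_SLOTS` messages outstanding, because each send consumes a grant that only the receiver can replenish.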

The number of channels times the channel buffer size is the total in-flight traffic budget per GPU pair. Receivers gate issuance of CTS by channel-buffer availability, so the fabric never sees more in-flight data than the receivers can absorb.
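The budget is a simple product; a hypothetical configuration (these are not Meta's production values) makes the gating concrete:

```python
# In-flight budget per GPU pair = number of channels x channel buffer size.
# Both values below are hypothetical, for illustration only.
num_channels = 16
channel_buffer_bytes = 4 << 20           # 4 MiB per channel buffer

in_flight_budget = num_channels * channel_buffer_bytes
assert in_flight_budget == 64 << 20      # at most 64 MiB in flight per pair
```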

Why this replaces transport-layer congestion control

The elegance of the scheme: at steady state, receiver-driven admission self-regulates.

  • When the receiver is fast → CTS issued quickly → pipeline stays full → bandwidth is utilised.
  • When the receiver is slow → CTS delayed → sender waits → fabric sees no overload buildup.
  • When the fabric is congested → ACKs delayed → channel buffer not recycled promptly → CTS delayed → sender backs off.

This is concepts/backpressure expressed at the collective-library layer rather than the transport layer. It removes the need for per-sender rate-control state machines like DCQCN — the library's natural handshake is the rate control.
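The self-clocking above reduces to window/turnaround arithmetic: the in-flight budget acts as a window, and the CTS turnaround time acts as the effective RTT. A sketch with hypothetical numbers:

```python
# Hypothetical: 64 MiB in-flight budget (channels x channel buffer size).
budget_bytes = 64 << 20

def achievable_rate(cts_turnaround_ms):
    # At most `budget_bytes` can be in flight per CTS cycle, so the sender's
    # rate is clocked by how quickly CTS grants come back.
    return budget_bytes / cts_turnaround_ms   # bytes per millisecond

fast = achievable_rate(1)    # receiver keeps up: CTS returns in 1 ms
slow = achievable_rate(10)   # slow receiver or congested fabric: CTS delayed
assert slow == fast / 10     # delayed CTS -> sender backs off proportionally
```

No explicit rate controller appears anywhere; the back-off falls out of the grant loop.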

Tuning challenges

Meta calls out two tuning knobs that are non-trivial:

  1. Number of channels. More channels = more pipelining, but they contend with compute operations for GPU thread resources. You can't just crank it.
  2. Channel buffer size. Too small → bandwidth underutilisation (pipeline stalls waiting for recycle). Too large → congestion can spread across the fabric before admission control reacts, "due to RoCE's more coarse-grained flow control and possible end-host slowness."

Meta's approach: experimentally calibrate both across training job sizes and collective types. The tuning is workload-specific.
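Knob 2 can be framed as a bandwidth-delay-product check: to keep the pipeline full, the total in-flight budget must cover link bandwidth times CTS round-trip time. A sketch with hypothetical numbers (400 Gb/s link, 20 µs round trip — not Meta's figures):

```python
# Bandwidth-delay product: bytes that must be in flight for full utilisation.
link_bps = 400e9                         # hypothetical 400 Gb/s link
rtt_s = 20e-6                            # hypothetical 20 us CTS round trip
bdp_bytes = link_bps / 8 * rtt_s         # 1e6 bytes must be in flight

def utilisation(num_channels, channel_buffer_bytes):
    budget = num_channels * channel_buffer_bytes
    return min(1.0, budget / bdp_bytes)

assert utilisation(16, 512 << 10) == 1.0   # 8 MiB budget: pipeline stays full
assert utilisation(2, 128 << 10) < 1.0     # 256 KiB budget: stalls on recycle
```

The "too large" failure mode is not visible in this arithmetic — oversized buffers admit traffic the fabric can't drain, which is exactly the coarse-grained-flow-control hazard the quote describes.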

The critical-path optimisation: CTS prioritisation

A subtle but load-bearing detail: CTS packets themselves are small control packets that can get queued behind data packets, delaying admission and starving the pipeline. Meta's fix is to prioritise CTS at switches:

"We implemented high priority queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth starvation."

This is a classic move for control-over-data schemes: the control signal is made preemptive over the very data it is supposed to rate-limit, so congestion can never delay the mechanism that relieves it.
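A minimal strict-priority model shows why this works; the class and packet shapes are invented for illustration, not switch firmware.

```python
from collections import deque

class StrictPriorityPort:
    """Toy egress port: the CTS class is always drained before bulk data."""

    def __init__(self):
        self.cts = deque()    # high-priority control class
        self.data = deque()   # bulk data class

    def enqueue(self, pkt):
        (self.cts if pkt["type"] == "CTS" else self.data).append(pkt)

    def dequeue(self):
        # A burst of data packets cannot delay the admission signal.
        if self.cts:
            return self.cts.popleft()
        return self.data.popleft() if self.data else None

port = StrictPriorityPort()
port.enqueue({"type": "DATA", "id": 1})
port.enqueue({"type": "DATA", "id": 2})
port.enqueue({"type": "CTS", "id": 3})
first = port.dequeue()
assert first["type"] == "CTS"   # CTS jumps the data backlog
```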

Why this feels like library–transport co-design

Receiver-driven admission requires:

  • Collective library emits CTS handshakes as part of the wire protocol (NCCL side).
  • Fabric provides a priority class for CTS packets (switch QoS config).
  • Switch buffers deep enough for admission control to be meaningful (deep-buffer CTSW).
  • Lossless link layer so CTS and data both arrive reliably (PFC).

No single layer is sufficient alone. This is the canonical instance of patterns/collective-library-transport-codesign — the library and the fabric are designed as one system, with roles for each.

When this doesn't generalise

Meta explicitly cautions that the DCQCN-off + receiver-driven-admission combination "may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios." Workloads where the scheme may break:

  • Storage RDMA — no library-level admission counterpart.
  • Multi-tenant workloads — many mutually-uncoordinated senders; no collective admission graph.
  • Network ≫ GPU — if receivers can't issue CTS fast enough to keep the NIC fed, you'd rather saturate at the transport layer.
