
DCQCN (Data Center Quantized Congestion Notification)

Definition

DCQCN is the de facto standard end-to-end congestion-control algorithm for RoCE networks. It combines ECN (Explicit Congestion Notification) marking at switches with a rate-control loop at sending NICs:

  1. Switch marks packets with ECN when queue depth crosses a threshold.
  2. Receiver NIC converts ECN-marked packets into CNPs (Congestion Notification Packets) sent back to the sender.
  3. Sender NIC runs a rate-adjustment state machine — quickly back off on CNPs, gradually probe up when none arrive.
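The three-step loop above can be sketched as a toy sender-side state machine. This is an illustrative simplification of the DCQCN rate-control scheme (multiplicative decrease on CNP arrival, fast recovery and additive increase when CNPs stop); the class name, constants, and timer model are assumptions, not NIC firmware behaviour.

```python
class DcqcnSender:
    """Toy DCQCN-style rate controller: cut rate on CNP, probe up otherwise.

    Tracks a current rate (rc), a remembered target rate (rt), and a
    congestion estimate alpha in [0, 1], per the DCQCN scheme.
    Constants are illustrative assumptions.
    """

    def __init__(self, line_rate_gbps=400.0, g=1 / 16, rai_gbps=5.0):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate used for recovery
        self.alpha = 1.0           # congestion estimate
        self.g = g                 # EWMA gain for alpha updates
        self.rai = rai_gbps        # additive-increase step

    def on_cnp(self):
        # Receiver saw ECN-marked packets: back off multiplicatively,
        # scaled by how much congestion we believe exists (alpha).
        self.rt = self.rc
        self.rc *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer_no_cnp(self):
        # A timer period elapsed with no CNP: decay the congestion
        # estimate and probe upward again.
        self.alpha = (1 - self.g) * self.alpha
        if self.rc < self.rt:
            self.rc = (self.rc + self.rt) / 2   # fast recovery toward rt
        else:
            self.rt += self.rai                  # additive increase
            self.rc = (self.rc + self.rt) / 2

s = DcqcnSender()
s.on_cnp()                 # 400 Gbps -> 200 Gbps with alpha = 1
for _ in range(5):
    s.on_timer_no_cnp()    # rc climbs back: 300, 350, 375, 387.5, 393.75
```

The key property is the one DCQCN was designed for: backoff is fast (one CNP halves the rate when alpha is high) while recovery is gradual, so many senders sharing a congested queue converge without sustained queue buildup.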

DCQCN was designed for storage-RDMA workloads (many smaller connections, shorter messages, mixed read/write) and became the industry gold standard for RoCE congestion control in data centers. Running RoCE without DCQCN has historically been unusual.

Meta's counterintuitive result: DCQCN off at 400G

Meta's 2024-08-05 SIGCOMM paper reports that when moving from 200G to 400G RoCE deployments, they turned DCQCN off and kept it off for over a year on the production training fabric. Four pieces of evidence for why:

  1. Default DCQCN settings degraded performance: the 400G deployment inherited ECN thresholds that had been tuned for 200G and no longer fit.
  2. Firmware-side DCQCN changed, introducing bugs and "reduced visibility with problems relating to correct CNP counting."
  3. PFC alone kept the fabric stable. "We have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives."
  4. The deep-buffer CTSW absorbs what PFC can't. Meta explicitly notes: "Despite turning off DCQCN and multiple instances of RTSW sending PFC to a deep-buffer CTSW, we have not encountered a scenario over the last four years where production AI training traffic causes the CTSW to send PFCs to RTSWs persistently."

(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why training workloads can live without DCQCN

Meta's hypothesis is that training traffic is self-regulating at the application layer in ways storage RDMA is not:

  • NCCL uses receiver-driven admission — the sender can only transmit after receiving a CTS packet from the receiver. This naturally bounds in-flight traffic.
  • Collective cadence is predictable and bursty rather than continuous. The fabric's deep buffers + PFC can absorb the bursts without persistent buildup.
  • The number of distinct senders is small (bounded by GPU fleet, not tenant count) — so the ensemble behaviour is less adversarial than multi-tenant storage.
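The first bullet, receiver-driven admission, is the load-bearing one. A toy model of CTS gating shows why it bounds in-flight traffic regardless of how fast the sender's GPU produces data; the class names, chunk size, and outstanding-grant cap here are illustrative assumptions, not NCCL's actual API or parameters.

```python
# Toy model of receiver-driven admission: the sender transmits a chunk
# only after the receiver grants a CTS, so in-flight bytes are capped at
# MAX_OUTSTANDING * CHUNK. Sizes and names are illustrative assumptions.
from collections import deque

CHUNK = 512 * 1024      # bytes per granted chunk (assumed)
MAX_OUTSTANDING = 8     # receiver grants at most this many chunks at once

class Receiver:
    def __init__(self):
        self.outstanding = 0

    def grant_cts(self):
        # Issue a clear-to-send only while under the in-flight cap.
        if self.outstanding < MAX_OUTSTANDING:
            self.outstanding += 1
            return True
        return False

    def on_chunk_arrived(self):
        self.outstanding -= 1  # frees a grant for the next chunk

class Sender:
    def __init__(self, total_bytes):
        self.remaining = total_bytes
        self.in_flight = deque()

    def try_send(self, rx):
        # No CTS, no transmission: admission is controlled by the
        # receiver, not by the sender's own pacing.
        if self.remaining > 0 and rx.grant_cts():
            size = min(CHUNK, self.remaining)
            self.remaining -= size
            self.in_flight.append(size)
            return size
        return 0

rx, tx = Receiver(), Sender(total_bytes=100 * CHUNK)
sent = sum(tx.try_send(rx) for _ in range(20))  # only the first 8 attempts get a CTS
```

Because the cap holds per GPU pair no matter how eagerly the sender tries, the fabric sees a bounded burst it can absorb in buffers, which is the mechanism Meta credits for surviving without transport-level rate control.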

Explicit caveat: this is Meta-specific

Meta does not generalise the DCQCN-off finding:

"Our current solution depends on careful coordination between the collective communication library and the network. It may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios. We encourage the research community to put more focus on this topic."

Workloads where DCQCN probably should stay on:

  • Multi-tenant RDMA storage. No library-level admission control.
  • GPU:network ratios very different from Meta's. When GPU can saturate network much faster than receivers can gate, library admission isn't enough.
  • Shallower-buffer switches. Deep-buffer CTSW is a material assumption.

The substitution

Meta's RoCE congestion-control stack, end state:

Layer            Mechanism                             Role
Link             PFC                                   Lossless; backpressure on switch queue fill
Transport        None (DCQCN OFF)                      No rate-control state machine in firmware
Library (NCCL)   Receiver-driven admission + CTS       Bounds in-flight volume per GPU pair
Switch QoS       High-priority queue for CTS packets   Keeps the admission signal from being starved
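The Switch QoS row deserves a note: if CTS packets queued behind bulk data, the admission loop itself would stall under exactly the congestion it is meant to relieve. A strict-priority dequeue model shows the effect; the two-class split and names are illustrative assumptions, not Meta's switch configuration.

```python
# Sketch of strict-priority scheduling for CTS packets: the high-priority
# class is always served before bulk data, so admission signals are never
# stuck behind a full data queue. Class names are illustrative assumptions.
import heapq

class StrictPriorityPort:
    HIGH, LOW = 0, 1  # lower number is dequeued first

    def __init__(self):
        self._q = []
        self._seq = 0  # FIFO tie-break within a priority class

    def enqueue(self, prio, pkt):
        heapq.heappush(self._q, (prio, self._seq, pkt))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._q)[2] if self._q else None

port = StrictPriorityPort()
for i in range(3):
    port.enqueue(StrictPriorityPort.LOW, f"data-{i}")  # bulk traffic backlog
port.enqueue(StrictPriorityPort.HIGH, "CTS")           # arrives last...
first = port.dequeue()                                 # ...but leaves first
```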

This is a strong instance of patterns/collective-library-transport-codesign — neither PFC nor NCCL admission would be sufficient alone.
