CONCEPT Cited by 1 source
Priority Flow Control (PFC)
Definition
Priority Flow Control (PFC), defined in IEEE 802.1Qbb, is a link-level flow-control mechanism that lets a switch tell its upstream neighbour to pause traffic on a per-priority basis (up to 8 priorities). When a switch's ingress queue for priority P is about to overflow, it sends a PFC PAUSE frame for priority P to its link partner; the partner stops transmitting on that priority until the pause timer expires or a later PFC frame with a zero timer resumes it.
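As an illustration of what such a PAUSE frame carries, here is a minimal sketch that packs one in Python. The destination address, EtherType, opcode and field order follow the standard MAC Control layout for PFC. The function name `build_pfc_frame` and its argument shape are hypothetical, chosen only for this example, and the frame check sequence is left out because the NIC normally appends it.

```python
import struct
from typing import Mapping

# Standard constants for PFC frames (MAC Control family).
PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
MIN_FRAME_LEN = 60  # minimum Ethernet frame length, excluding the 4-byte FCS


def build_pfc_frame(src_mac: bytes, pause_quanta: Mapping[int, int]) -> bytes:
    """Pack a PFC frame (hypothetical helper for illustration).

    pause_quanta maps a priority (0-7) to a pause time in quanta (0-65535).
    A timer of 0 for an enabled priority means "resume now".
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")

    enable_vector = 0      # bit i set => the timer for priority i is valid
    timers = [0] * 8       # one 16-bit timer per priority
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be in 0..7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("quanta must fit in 16 bits")
        enable_vector |= 1 << priority
        timers[priority] = quanta

    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    # Pad with zeros up to the minimum Ethernet frame size.
    return (header + payload).ljust(MIN_FRAME_LEN, b"\x00")


# Pause priority 3 for the maximum time; all other priorities are untouched.
xoff = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
# Resume priority 3 immediately by sending a zero timer.
xon = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0})
```

The per-priority enable vector is what distinguishes PFC from the older whole-link PAUSE frame: only the priorities whose bit is set are affected.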
Properties:
- Lossless. The whole point — enable protocols that assume no packet loss (RDMA) on Ethernet.
- Per-priority. One priority class can pause without starving the other seven.
- Hop-by-hop, not end-to-end. PFC affects only the immediate upstream neighbour. Propagation across multiple hops happens indirectly as upstream switches themselves fill up and emit their own pauses.
- No rate control. PFC is an on/off signal with a pause timer counted in quanta of 512 bit times (at most tens of microseconds per frame at 400G); it doesn't adjust a sender's rate.
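To put the on/off behaviour in concrete terms, the sketch below converts a PFC pause timer into wall-clock time. It relies on two facts from the standard: the timer is a 16-bit count, and one quantum is the time needed to transmit 512 bits at the link speed. The function name is hypothetical.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum = 512 bit times
MAX_PAUSE_QUANTA = 0xFFFF  # the timer field is 16 bits wide


def pause_duration_seconds(quanta: int, link_bps: float) -> float:
    """Wall-clock length of a pause of `quanta` quanta on a link of `link_bps` bits/s."""
    if not 0 <= quanta <= MAX_PAUSE_QUANTA:
        raise ValueError("quanta must fit in 16 bits")
    return quanta * PAUSE_QUANTUM_BITS / link_bps


# Longest pause a single frame can request at two link speeds.
max_pause_400g = pause_duration_seconds(MAX_PAUSE_QUANTA, 400e9)
max_pause_10g = pause_duration_seconds(MAX_PAUSE_QUANTA, 10e9)
```

The longest single pause works out to roughly 84 microseconds at 400 Gb/s and about 3.4 milliseconds at 10 Gb/s. A congested switch keeps a priority paused for longer by sending further PAUSE frames before the timer runs out.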
Why PFC exists for RoCE
RDMA was designed on InfiniBand, which is natively lossless. RoCE runs RDMA over Ethernet — which is lossy by default. For RDMA semantics (especially hardware-offloaded collectives, zero-copy) to survive the port, the Ethernet link must also be lossless. PFC is the standard mechanism that makes it so.
Without PFC, packet loss on RoCE would force retransmissions that most RDMA NICs handle inefficiently or not at all: with "go-back-N" semantics on reliable-connected QPs, one lost packet triggers retransmission of every packet sent after it, so throughput would collapse under congestion.
Known failure modes
Running a lossless fabric has its own pathologies, which is why PFC alone is usually not considered sufficient, and why DCQCN was invented as an end-to-end layer on top:
- Head-of-line blocking. A pause on priority P stops every flow in that class, so it can starve unrelated flows queued behind the paused frames at a sender.
- PFC storms / deadlocks. Cyclic pause propagation can freeze parts of the fabric (well-documented failure class in production RoCE deployments).
- PFC pause amplification. A single slow receiver can cause pause propagation all the way back to the sender across many hops.
- Coarse-grained. PFC is an on/off switch, not a rate adjustment, so fine-grained control isn't possible without an additional end-to-end layer.
The classical fix is DCQCN. Meta's 2024-08-05 SIGCOMM finding is that, in its specific training-workload regime, PFC combined with library-level admission is enough without DCQCN.
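The deadlock failure mode listed above comes down to a cycle in "who is paused waiting on whom". A minimal sketch of checking for that condition follows, assuming a hypothetical `waits_on` mapping in which each key is a switch queue and each value lists the downstream queues whose pauses can block it. This is plain depth-first cycle detection for illustration, not a tool from the source.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def find_pause_cycle(
    waits_on: Dict[Hashable, Sequence[Hashable]]
) -> Optional[List[Hashable]]:
    """Return one cycle of pause dependencies as a list of nodes, or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[Hashable, int] = {}
    path: List[Hashable] = []  # nodes on the current DFS path

    def visit(node: Hashable) -> Optional[List[Hashable]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in waits_on.get(node, ()):
            nxt_state = state.get(nxt, UNSEEN)
            if nxt_state == IN_PROGRESS:
                # Back edge: nxt is already on the path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if nxt_state == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for start in list(waits_on):
        if state.get(start, UNSEEN) == UNSEEN:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# Strictly "up then down" dependencies: no cycle, so no PFC deadlock possible here.
tree = {"leaf1": ["spine1"], "leaf2": ["spine1"], "spine1": []}
# A loop of three queues each blocked on the next: a deadlock candidate.
loop = {"A": ["B"], "B": ["C"], "C": ["A"]}
```

Calling `find_pause_cycle(tree)` returns `None`, while `find_pause_cycle(loop)` returns `["A", "B", "C", "A"]`. In a real fabric the dependency graph would have to be derived from routing and buffer configuration, which this sketch does not attempt.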
Meta's surprising retention of PFC-only
Meta disabled DCQCN when moving to 400G but kept PFC on, and reports in the 2024-08-05 SIGCOMM paper:
"At this time, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and lack of persistent congestion for training collectives."
And on whether PFC storms materialised:
"Despite turning off DCQCN and multiple instances of RTSW sending PFC to a deep-buffer CTSW, we have not encountered a scenario over the last four years where production AI training traffic causes the CTSW to send PFCs to RTSWs persistently."
Two pieces make this work:
- Deep-buffer CTSW spine. Absorbs transient PFC-driven backpressure without propagating it upstream.
- NCCL receiver-driven admission. Bounds in-flight traffic at the library layer, so the fabric never sees sustained overload.
(Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
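To see why bounding in-flight traffic at the receiver keeps the fabric out of sustained overload, consider the toy incast model below. It is a deliberately simplified sketch and not NCCL's actual protocol: the function, the per-tick model and the `window` parameter (the number of chunks the receiver has granted but not yet drained) are all assumptions made for illustration.

```python
from typing import Optional


def peak_queue_depth(
    num_senders: int,
    ticks: int,
    drain_per_tick: int,
    window: Optional[int] = None,
) -> int:
    """Peak occupancy of one receiver-side queue in a toy incast.

    Each tick, every sender offers one chunk. With window=None nothing limits
    the senders; otherwise the receiver admits only enough chunks to keep at
    most `window` chunks queued.
    """
    queue = 0
    peak = 0
    for _ in range(ticks):
        offered = num_senders
        if window is not None:
            # Receiver-driven admission: grant only the remaining window.
            offered = min(offered, max(0, window - queue))
        queue += offered
        peak = max(peak, queue)
        queue -= min(queue, drain_per_tick)  # receiver drains what it can
    return peak


# 8 senders converge on a receiver that drains 2 chunks per tick.
unbounded = peak_queue_depth(num_senders=8, ticks=100, drain_per_tick=2)
bounded = peak_queue_depth(num_senders=8, ticks=100, drain_per_tick=2, window=4)
```

In this model the unbounded queue peaks at 602 chunks after 100 ticks and keeps growing, which in a real fabric would mean continuous PFC backpressure. With a window of 4 the peak stays at 4 regardless of run length, so any pauses remain transient and a deep buffer can absorb them.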
When PFC alone is likely not enough
- Multi-tenant RDMA workloads (no library admission).
- Shallow-buffer switches — backpressure propagates far and fast.
- High-radix fan-in topologies where cycle formation risk is high.
- Mixed storage + compute on the same fabric priorities.
Seen in
- sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — canonical wiki reference; Meta describes >1 year of stable PFC-only operation on the 400G RoCE training fabric.
Related
- concepts/dcqcn — the end-to-end congestion control PFC is usually paired with; Meta disabled it.
- concepts/receiver-driven-traffic-admission — the library-level substitute Meta uses instead.
- concepts/fat-flow-load-balancing — the routing-side first-order problem in the same fabric.
- systems/roce-rdma-over-converged-ethernet — the fabric PFC enables.
- systems/ai-zone — the topology in which Meta runs PFC-only.
- patterns/collective-library-transport-codesign.