
RDMA Queue Pair

Definition

In RDMA (both InfiniBand and RoCE), a Queue Pair (QP) is the fundamental transport-endpoint abstraction: a paired Send Queue + Receive Queue on each side of a connection, through which applications post WRITE, READ, SEND, and RECV work requests. Every RDMA packet carries a destination QP number in its transport header, so the receiving NIC can dispatch packets to the right endpoint.
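The dispatch-by-QP-number idea can be sketched in a few lines. This is an illustrative model only, not the ibverbs API: the names `QueuePair`, `qp_table`, and `nic_dispatch` are hypothetical, standing in for the NIC's internal QP state table.

```python
from collections import deque

class QueuePair:
    """A transport endpoint: one send queue plus one receive queue."""
    def __init__(self, qp_num):
        self.qp_num = qp_num
        self.send_queue = deque()   # posted SEND/WRITE/READ work requests
        self.recv_queue = deque()   # posted RECV work requests

# The NIC keeps a table mapping QP number -> endpoint state.
qp_table = {qpn: QueuePair(qpn) for qpn in (101, 102)}

def nic_dispatch(packet):
    """Route an inbound packet to the QP named in its transport header."""
    qp = qp_table[packet["dest_qp"]]
    qp.recv_queue.append(packet["payload"])
    return qp.qp_num

nic_dispatch({"dest_qp": 102, "payload": b"chunk-0"})  # lands on QP 102
```

Because every packet names its destination QP explicitly, the receiving NIC never has to infer the endpoint from flow state, which is also why the QP number is available to switches as a hashable field.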

From a protocol standpoint a QP is roughly analogous to a TCP connection; from a fabric standpoint it's a hashable identifier — which is what makes it relevant to load balancing.

Why this matters for AI training fabrics

Default Ethernet ECMP hashes the 5-tuple (source/destination IP, source/destination UDP port, protocol). AI training workloads present very few distinct 5-tuples because each NCCL process uses a small, stable set of connections to its collective peers. Few distinct tuples ⇒ few distinct hashes ⇒ a handful of paths carry nearly all the traffic. This is the fat-flow problem.
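The fat-flow effect is easy to see in simulation. The sketch below is illustrative, with made-up addresses and `zlib.crc32` standing in for a switch ASIC's hash function; real switches use hardware hashes, but the entropy argument is the same.

```python
import zlib

NUM_PATHS = 16  # equal-cost paths available between two switch tiers

def ecmp_path(five_tuple):
    # crc32 stands in for the ASIC hash; only the key's entropy matters here.
    return zlib.crc32(repr(five_tuple).encode()) % NUM_PATHS

# A training rank reuses a small, stable set of connections to its peers
# (4791 is the RoCEv2 UDP destination port):
flows = [("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP"),
         ("10.0.0.1", "10.0.2.1", 4791, 4791, "UDP"),
         ("10.0.0.1", "10.0.3.1", 4791, 4791, "UDP")]

paths_used = {ecmp_path(f) for f in flows}
# Three flows can occupy at most 3 of the 16 paths; the other links sit idle
# while those few paths carry fat flows.
```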

The QP field is extra entropy that's natively part of every RoCE packet:

  • Two RoCE packets on the same 5-tuple but targeting different QPs are different transport endpoints — they can be legitimately sent down different paths.
  • Switches can include the QP field in the ECMP hash if their ASIC supports UDF (User-Defined Field) hashing.

Meta exploits this as E-ECMP with QP scaling: switches hash on the QP field, and the collective library (NCCL) is modified to spread each hierarchical-collective message across multiple QPs, multiplying the effective flow count.
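A rough sketch of why QP scaling helps: folding the QP number into the hash key turns one 5-tuple into many hashable flows. This assumes a UDF-capable ASIC; `zlib.crc32` is again a stand-in for the hardware hash, and all values are illustrative.

```python
import zlib

NUM_PATHS = 16

def eecmp_path(five_tuple, qp_num):
    # E-ECMP-style key: standard 5-tuple extended with the QP field (UDF).
    return zlib.crc32((repr(five_tuple) + str(qp_num)).encode()) % NUM_PATHS

tup = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")

# One 5-tuple, one QP: every packet hashes to a single pinned path.
single = {eecmp_path(tup, 100)}

# Same 5-tuple spread over 8 QPs: up to 8 distinct paths become usable,
# though the hash may still collide some of them (it stays probabilistic).
scaled = {eecmp_path(tup, qpn) for qpn in range(100, 108)}
```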

Two scaling strategies Meta evaluated

Meta's 2024-08-05 SIGCOMM post evaluates two ways to use multiple QPs:

  1. Split each message across multiple QPs. Smaller per-QP messages; many ACKs on the fabric. More wire overhead per unit work.
  2. Round-robin: send each whole message to a different QP. One message per QP; per-QP size unchanged.

For NCCL message sizes seen in production, strategy 2 (round-robin) won. Meta reports up to 40% improvement on AllReduce over baseline ECMP with this combination.

"The first involved splitting each message meant to be posted over a single QP, instead onto multiple QPs resulting in multiple flows. But it also produced smaller message sizes on fabric as well as multiple ACKs. The second approach involved posting each message to a different queue, in a round-robin fashion. For the NIC message sizes demonstrated in our production with NCCL, we observed the latter to be performing well." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
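The two strategies can be contrasted in a small sketch. QP count, message count, and sizes below are illustrative, not Meta's production values.

```python
NUM_QPS = 4
messages = [b"A" * 1024, b"B" * 1024, b"C" * 1024]

# Strategy 1: split each message across all QPs -> smaller per-QP chunks,
# and each chunk elicits its own ACK on the fabric.
def split_across_qps(msg, num_qps=NUM_QPS):
    chunk = len(msg) // num_qps
    return [(qp, msg[qp * chunk:(qp + 1) * chunk]) for qp in range(num_qps)]

# Strategy 2 (the one that won for production NCCL message sizes):
# round-robin whole messages over QPs -> per-message wire size unchanged.
def round_robin(msgs, num_qps=NUM_QPS):
    return [(i % num_qps, msg) for i, msg in enumerate(msgs)]

split = split_across_qps(messages[0])   # 4 posts of 256 bytes each
rr = round_robin(messages)              # 3 posts of 1024 bytes each
```

Both multiply the flow count, but only strategy 1 shrinks the on-wire message size and inflates the ACK count, which is why round-robin came out ahead at the message sizes NCCL actually produces.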

Tradeoffs / caveats

  • Still probabilistic. Adding QP to the hash raises entropy but doesn't guarantee an optimal assignment — pathological hash collisions are still possible.
  • Per-workload tuning. The right number of QPs depends on message sizes, collective types, and topology. Meta explicitly calls this out as "long-term operational complexity."
  • Requires UDF-capable switches. Legacy switches can't hash on packet fields outside the standard 5-tuple.
  • Requires collective-library cooperation. QP scaling is done in NCCL-level code, not transparently at the NIC — it's a co-design between library and fabric.
