CONCEPT Cited by 1 source
Enhanced ECMP with Queue-Pair scaling
Definition
Enhanced ECMP (E-ECMP) with Queue-Pair (QP) scaling is a co-design of switch routing and the collective communication library that improves load balance on an Ethernet (RoCE) training fabric while keeping ECMP. It has two parts:
- Switch side (E-ECMP): Configure switches to include the RoCE packet's destination Queue Pair field in the ECMP hash, on top of the standard 5-tuple. Switch ASICs expose this via UDF (User-Defined Field) hashing.
- Collective-library side (QP scaling): Change the collective communication library (NCCL) to spread each hierarchical-collective message across multiple QPs — increasing the number of distinct hash inputs the fabric sees from a single logical flow.
Together they raise effective flow entropy enough that ECMP's hash-based path selection actually spreads traffic across the available paths.
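The combined effect can be illustrated with a small simulation. This is a minimal sketch, not Meta's implementation: it assumes a hypothetical switch with 8 equal-cost uplinks, uses SHA-256 as a stand-in for the vendor-specific ASIC hash, and assumes the UDP source port is the same for every QP of a peer pair. The helper names (`ecmp_link`, `link_loads`, `imbalance`), addresses, and QP numbers are made up for illustration.

```python
import hashlib
from collections import Counter

N_UPLINKS = 8          # assumed number of equal-cost uplinks
ROCE_UDP_PORT = 4791   # RoCEv2 well-known UDP destination port

def ecmp_link(src_ip, dst_ip, src_port, dst_qp, use_dst_qp):
    """Pick an uplink by hashing the 5-tuple, optionally plus the destination QP."""
    key = [src_ip, dst_ip, "udp", src_port, ROCE_UDP_PORT]
    if use_dst_qp:               # E-ECMP: UDF hashing adds the dst QP field
        key.append(dst_qp)
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_UPLINKS

def link_loads(n_pairs, qps_per_pair, use_dst_qp, msgs_per_pair=64):
    """Count messages per uplink when each peer pair round-robins over its QPs."""
    loads = Counter()
    for pair in range(n_pairs):
        src, dst = f"10.0.0.{pair}", f"10.0.1.{pair}"
        for msg in range(msgs_per_pair):
            dst_qp = 1000 * pair + (msg % qps_per_pair)  # hypothetical QP numbers
            loads[ecmp_link(src, dst, 49152, dst_qp, use_dst_qp)] += 1
    return loads

def imbalance(loads):
    """Busiest link's load divided by the ideal even share (1.0 is perfect)."""
    return max(loads.values()) / (sum(loads.values()) / N_UPLINKS)

baseline = link_loads(n_pairs=8, qps_per_pair=1, use_dst_qp=False)
eecmp_qp = link_loads(n_pairs=8, qps_per_pair=16, use_dst_qp=True)
print("baseline  links used:", len(baseline), "imbalance:", imbalance(baseline))
print("E-ECMP+QP links used:", len(eecmp_qp), "imbalance:", imbalance(eecmp_qp))
```

In the baseline run every peer pair lands all of its messages on one uplink, so any two pairs that collide double that link's load. With the destination QP in the hash and 16 QPs per pair, each pair's traffic is scattered over many more hash buckets.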
Meta's framing (2024-08-05 SIGCOMM)
Meta introduces the technique to fix the low-entropy failure mode of baseline ECMP on AI training traffic:
"We configured switches to perform Enhanced ECMP (E-ECMP) to additionally hash on the destination QP field of a RoCE packet using the UDF capability of the switch ASIC. This increased entropy and, compared to baseline ECMP without QP scaling, we observed that E-ECMP along with QP scaling showed performance improvement of up to 40% for the AllReduce collective." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)
Two QP-scaling strategies (Meta evaluated both)
| Strategy | Mechanism | Wire effect | Meta's verdict |
|---|---|---|---|
| Split | Break one message across multiple QPs | Smaller messages + more ACKs per unit work | Extra fabric overhead |
| Round-robin | Send each whole message to a different QP | One message per QP, unchanged size | Picked — best for NCCL message sizes seen in production |
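The difference between the two strategies can be sketched as a scheduling decision over a list of message sizes. The functions `split_across_qps` and `round_robin_qps` below are hypothetical illustrations, not NCCL APIs, and the message sizes are arbitrary.

```python
from itertools import cycle

def split_across_qps(messages, n_qps):
    """Split strategy: cut every message into up to n_qps pieces, one per QP."""
    sends = []  # entries are (qp_index, message_id, bytes_on_wire)
    for msg_id, size in enumerate(messages):
        base, extra = divmod(size, n_qps)
        for qp in range(n_qps):
            piece = base + (1 if qp < extra else 0)
            if piece:
                sends.append((qp, msg_id, piece))
    return sends

def round_robin_qps(messages, n_qps):
    """Round-robin strategy: each whole message goes to the next QP in turn."""
    qp_iter = cycle(range(n_qps))
    return [(next(qp_iter), msg_id, size) for msg_id, size in enumerate(messages)]

messages = [1 << 20] * 8          # eight 1 MiB messages (illustrative sizes)
split = split_across_qps(messages, n_qps=4)
rr = round_robin_qps(messages, n_qps=4)

print(len(split), max(s[2] for s in split))   # → 32 262144
print(len(rr), max(s[2] for s in rr))         # → 8 1048576
```

Both schedules touch all four QPs and move the same number of bytes, but the split schedule issues four times as many, smaller sends, which is consistent with the extra overhead noted in the table.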
Measured gain
Up to a 40% performance improvement on the AllReduce collective compared to baseline ECMP without QP scaling. Measured on Meta's production RoCE training fabric.
Why this is a co-design, not a pure-switch fix
E-ECMP alone (hashing on one extra field) doesn't help if the collective library still uses a single QP per peer: the destination QP value is then constant for that peer, so it adds no entropy and every packet still hashes to the same path. The entropy comes from NCCL creating and rotating through multiple QPs while the switch hashes on the QP field. Neither the switch change nor the library change is sufficient alone. This is a canonical instance of patterns/collective-library-transport-codesign.
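The co-design argument can be checked with a two-by-two ablation for a single peer pair. This is a sketch under stated assumptions: SHA-256 stands in for the ASIC hash, 8 equal-cost paths are assumed, and the 5-tuple is assumed identical for every QP between the two peers (if a NIC varied the UDP source port per QP, the 5-tuple-only rows would behave differently). The names `path_for` and `distinct_paths` are hypothetical.

```python
import hashlib

N_PATHS = 8  # assumed number of equal-cost paths

def path_for(five_tuple, dst_qp, hash_dst_qp):
    """Stand-in ECMP hash; SHA-256 replaces the vendor-specific ASIC hash."""
    key = five_tuple + ((dst_qp,) if hash_dst_qp else ())
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_PATHS

def distinct_paths(n_qps, hash_dst_qp):
    """Number of paths one peer pair can reach with n_qps queue pairs."""
    # Assumption: the 5-tuple is identical for every QP of this peer pair.
    five_tuple = ("10.0.0.1", "10.0.1.1", "udp", 49152, 4791)
    return len({path_for(five_tuple, 100 + qp, hash_dst_qp) for qp in range(n_qps)})

for hash_dst_qp in (False, True):
    for n_qps in (1, 16):
        print(f"E-ECMP={hash_dst_qp!s:5} QPs={n_qps:2} -> "
              f"{distinct_paths(n_qps, hash_dst_qp)} path(s)")
```

Three of the four combinations pin the peer pair to a single path; only the combination of QP-aware hashing and multiple QPs reaches more than one.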
Tradeoffs and operational tax
- Still probabilistic. Hashing spreads flows statistically, so optimal placement is not guaranteed and some collisions remain.
- Per-workload tuning. The optimal QP count depends on message size, collective type, and topology. Meta explicitly calls this "long-term operational complexity."
- Switch-ASIC requirement. UDF hashing isn't universal; switches whose ASICs lack it cannot run E-ECMP.
- NCCL cooperation required. Collective library has to be built/tuned to spread across the configured QP count.
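The "still probabilistic" point can be made concrete with a balls-in-bins estimate. This assumes, as an idealization, that the hash places each of $k$ sub-flows independently and uniformly on one of $n$ equal-cost paths; real ASIC hashes and real traffic only approximate this.

$$
P(\text{no two sub-flows share a path}) \;=\; \prod_{i=0}^{k-1} \frac{n-i}{n} \;=\; \frac{n!}{(n-k)!\,n^{k}}, \qquad k \le n
$$

For example, with $n = 8$ paths and $k = 4$ sub-flows, $P = \frac{8 \cdot 7 \cdot 6 \cdot 5}{8^{4}} = \frac{1680}{4096} \approx 0.41$, so under this model roughly 59% of placements have at least one collision even though there are twice as many paths as sub-flows.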
When a deterministic scheme might be better
Meta tried a deterministic scheme (concepts/path-pinning) before landing on E-ECMP + QP scaling. Path-pinning is ideal when rack assignment is clean and the network is healthy; neither holds durably at 24K-GPU scale. E-ECMP + QP scaling wins because it degrades gracefully under failures and fragmentation: a worse average, but fewer pathological cliffs.
Where this sits in the routing evolution
| Stage | Approach | What it addresses | Cost or residual weakness |
|---|---|---|---|
| v0 | Baseline ECMP (5-tuple hash) | — | — |
| v1 | concepts/path-pinning | ECMP entropy; worked if placement was clean | >30% degradation under fragmentation |
| v1.5 | 2× RTSW uplink provisioning | Path-pinning fragmentation | 2× capital |
| v2 | E-ECMP + QP scaling | Entropy via QP field + NCCL multi-QP messaging | Per-workload QP tuning |
Seen in
- sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale — canonical wiki reference; the routing-evolution section culminates in this scheme with +40% AllReduce result.
Related
- concepts/ecmp-equal-cost-multipath — the baseline this augments.
- concepts/rdma-queue-pair — the entropy source.
- concepts/fat-flow-load-balancing — the umbrella problem.
- concepts/path-pinning — the approach Meta tried before this.
- concepts/collective-communication-topology-awareness — algorithm-side counterpart to this routing-side optimisation.
- systems/roce-rdma-over-converged-ethernet / systems/ai-zone — the fabric and topology this runs on.
- patterns/collective-library-transport-codesign — the enclosing pattern.