
CONCEPT Cited by 1 source

ECMP (Equal-Cost Multipath)

Definition

ECMP (Equal-Cost Multipath) is the standard Ethernet/IP load-balancing primitive for Clos / leaf-spine fabrics. When multiple next-hops have the same routing cost to reach a destination, a switch chooses one by hashing a small set of packet-header fields, typically the 5-tuple (src_ip, dst_ip, src_port, dst_port, proto), and taking the result modulo the number of available paths.
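
The selection rule in the definition can be sketched in a few lines of Python. This is an illustration only: real switches compute a hardware hash (often CRC-based) over the raw header bits, while `FiveTuple` and `ecmp_pick` here are hypothetical names and SHA-256 is just a convenient deterministic stand-in.

```python
import hashlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


def ecmp_pick(flow: FiveTuple, num_paths: int) -> int:
    """Return the index of the equal-cost next-hop chosen for this flow."""
    # Serialize the header fields; a switch would hash the raw bits instead.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for the switch hash
    return int.from_bytes(digest[:8], "big") % num_paths


# Hypothetical RoCEv2 flow: UDP (proto 17) to destination port 4791.
flow = FiveTuple("10.0.0.1", "10.0.1.7", 49152, 4791, 17)
path = ecmp_pick(flow, num_paths=8)
```

Because the function depends only on the header fields, every packet of `flow` maps to the same index without any per-flow table.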

Properties that matter in practice:

  • Per-flow, not per-packet. All packets with the same 5-tuple take the same path. This preserves in-order delivery for TCP/RoCE without reorder handling.
  • Stateless at the switch. No per-flow table; the hash is recomputed on each packet.
  • Statistical uniformity under many small flows. The design assumes that concurrent flows far outnumber paths and are individually small, so per-path load evens out by the law of large numbers.
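
To make the uniformity property concrete, the sketch below counts how many flows land on each path and reports the busiest path relative to a perfectly even share. It assumes equal-sized flows, synthetic source addresses, and SHA-256 as a stand-in for the switch hash; `pick_path` and `imbalance` are hypothetical helpers.

```python
import hashlib
from collections import Counter


def pick_path(flow_key: str, num_paths: int) -> int:
    digest = hashlib.sha256(flow_key.encode()).digest()  # stand-in hash
    return int.from_bytes(digest[:8], "big") % num_paths


def imbalance(num_flows: int, num_paths: int) -> float:
    """Busiest path's flow count divided by the ideal even share (1.0 = perfect)."""
    loads = Counter()
    for i in range(num_flows):
        # Synthetic flows that differ only in source address (an assumption).
        src_ip = f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}"
        loads[pick_path(f"{src_ip}|10.1.0.1|49152|4791|17", num_paths)] += 1
    return max(loads.values()) / (num_flows / num_paths)


many_flows = imbalance(50_000, 8)  # flows far outnumber paths
few_flows = imbalance(8, 8)        # as many flows as paths
```

With tens of thousands of flows the ratio stays close to 1.0. With only as many flows as paths it is 1.0 only if every flow happens to land on a distinct path, which is unlikely.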

Why it fails for AI training

AI training traffic breaks the many-small-flows assumption behind ECMP in three ways, as Meta's 2024-08-05 SIGCOMM paper enumerates:

  1. Low entropy — the NCCL process has a small, stable set of peer connections per rank. Few 5-tuples ⇒ few distinct hashes ⇒ pathological collisions are likely, not exceptional.
  2. Elephant flows — each connection, when active, saturates the NIC line rate. Two colliding flows halve each other's throughput.
  3. Burstiness — flows are on/off at millisecond granularity, so hash-based pinning has no chance to average out.
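
A rough way to quantify the first two points is a birthday-problem calculation. The sketch assumes each flow is hashed independently and uniformly, that every path has the same capacity as the NIC line rate, and that colliding flows share a path fairly; `collision_probability` and `per_flow_gbps` are hypothetical helpers, not taken from the paper.

```python
from collections import Counter
from math import perm


def collision_probability(num_flows: int, num_paths: int) -> float:
    """Probability that at least two flows hash to the same path."""
    if num_flows > num_paths:
        return 1.0  # pigeonhole: a collision is unavoidable
    # 1 - (number of collision-free placements) / (number of all placements)
    return 1.0 - perm(num_paths, num_flows) / num_paths ** num_flows


def per_flow_gbps(placement: list[int], link_gbps: float) -> list[float]:
    """Throughput per flow when flows on the same path split it evenly."""
    flows_on_path = Counter(placement)
    return [link_gbps / flows_on_path[path] for path in placement]


p = collision_probability(num_flows=4, num_paths=8)
rates = per_flow_gbps([0, 0, 1], link_gbps=400.0)
```

Under these assumptions, just 4 line-rate flows over 8 paths collide with probability of about 0.59, and the two flows sharing path 0 in the second call each get 200 Gbps while the third keeps 400 Gbps.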

"We initially considered the widely adopted ECMP, which places flows randomly based on the hashes on the five-tuple. … However, and as expected, ECMP rendered poor performance for the training workload due to the low flow entropy." (Source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

This is the canonical starting point of the concepts/fat-flow-load-balancing problem.

Meta's evolution past baseline ECMP

| Stage | Scheme | Outcome |
|---|---|---|
| v0 | Baseline ECMP | Poor: flow pinning from low entropy |
| v1 | Path-pinning (per-destination-slice) | >30% perf degradation under fragmented rack allocation + link failures |
| v1.5 | 2× RTSW uplink overprovisioning | Mitigates v1 but 2× capital cost |
| v2 | E-ECMP with QP scaling | |

Each stage trades off a different axis: path-pinning removes hash randomness but assumes clean job placement; overprovisioning is brute-force; E-ECMP+QP adds controlled entropy back via the RDMA QP field.
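
The idea of adding entropy through the QP field can be illustrated by extending the hash input with a queue-pair number. This is a minimal sketch under assumptions, not Meta's implementation: the QP numbers, the connection tuple, and the helpers `hash_to_path` and `paths_used` are hypothetical, and SHA-256 stands in for the switch hash.

```python
import hashlib


def hash_to_path(fields: tuple, num_paths: int) -> int:
    key = "|".join(str(f) for f in fields).encode()
    digest = hashlib.sha256(key).digest()  # stand-in for the switch hash
    return int.from_bytes(digest[:8], "big") % num_paths


def paths_used(five_tuple: tuple, qp_numbers: list[int], num_paths: int) -> set[int]:
    """Paths touched when the hash input is the 5-tuple plus a QP number."""
    return {hash_to_path(five_tuple + (qp,), num_paths) for qp in qp_numbers}


# Hypothetical RoCEv2 connection (UDP, destination port 4791).
conn = ("10.0.0.1", "10.0.1.7", 49152, 4791, 17)

single_qp = paths_used(conn, [0x11], num_paths=8)
many_qps = paths_used(conn, list(range(0x11, 0x21)), num_paths=8)
```

With one QP the connection is pinned to a single path, exactly as with baseline ECMP. Spreading the same traffic over 16 QPs lets the hash place it on several paths while each QP still sees in-order delivery.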

Alternatives / adjacent techniques

  • Adaptive routing — per-packet path decisions using congestion telemetry. Native on InfiniBand; some Ethernet vendors (Broadcom Scale Out, Cisco DLB) offer Ethernet equivalents.
  • Packet spraying: per-packet random placement; requires reorder-tolerant receivers.
  • Flowlet-based ECMP — detect inactivity gaps; re-hash each flowlet. Helps for bursty flows with natural idle periods.
  • E-ECMP with QP field — Meta's approach.
  • Explicit routing / TE — application-level path selection (SDN).
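
As an illustration of the flowlet-based variant listed above, the sketch below starts a new flowlet whenever a flow has been idle for longer than a configurable gap and includes the flowlet counter in the hash input. `FlowletRouter`, its fields, and the gap value are hypothetical. Unlike baseline ECMP, this approach needs per-flow state.

```python
import hashlib
from dataclasses import dataclass, field


def hash_to_path(key: str, num_paths: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()  # stand-in for the switch hash
    return int.from_bytes(digest[:8], "big") % num_paths


@dataclass
class FlowletRouter:
    num_paths: int
    gap_s: float  # idle time after which a new flowlet starts
    _state: dict = field(default_factory=dict)  # flow -> (last_seen_s, flowlet_id)

    def route(self, flow: str, now_s: float) -> tuple[int, int]:
        """Return (flowlet_id, path) for a packet of `flow` arriving at `now_s`."""
        last_seen, flowlet_id = self._state.get(flow, (now_s, 0))
        if now_s - last_seen > self.gap_s:
            flowlet_id += 1  # long enough idle gap: safe to re-hash
        self._state[flow] = (now_s, flowlet_id)
        return flowlet_id, hash_to_path(f"{flow}|{flowlet_id}", self.num_paths)


router = FlowletRouter(num_paths=8, gap_s=0.001)
first = router.route("10.0.0.1|10.0.1.7|49152|4791|17", now_s=0.0)
second = router.route("10.0.0.1|10.0.1.7|49152|4791|17", now_s=0.0001)
after_gap = router.route("10.0.0.1|10.0.1.7|49152|4791|17", now_s=0.01)
```

Packets inside a burst keep the same flowlet id and therefore the same path. The packet after the idle gap gets a new id and is hashed again, though it may still land on the same path by chance.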

Seen in
