
Path pinning (RoCE routing)

Definition

Path pinning is a routing scheme where specific packets are deterministically assigned to specific parallel paths through a multi-path fabric — the opposite of hash-based ECMP. The assignment function is typically some deterministic attribute of the destination: IP range, port number, application-defined slice index, or — in Meta's case — the destination "slice" (the index of the target rack switch's downlink).

The appeal for AI training is obvious: if you know how flows should distribute across paths (one flow per path, load balanced by construction), why let a hash randomise it?
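The contrast can be made concrete with a minimal sketch. Everything here is illustrative, not Meta's implementation: `pin_path` stands in for the slice-indexed assignment, `ecmp_path` for a baseline 5-tuple hash, and 4791 is used as the destination port only because it is the standard RoCEv2 UDP port.

```python
import zlib

N_PATHS = 4  # parallel spine paths (assumed topology parameter)

def pin_path(dst_slice: int) -> int:
    # Path pinning: a deterministic function of the destination slice.
    # One slice -> one path, load balanced by construction.
    return dst_slice % N_PATHS

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    # Baseline ECMP: a hash of the flow tuple, pseudo-random per flow.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % N_PATHS

# Four slices, one flow each: pinning spreads them perfectly...
print([pin_path(s) for s in range(4)])  # -> [0, 1, 2, 3]
# ...while the hash may collide several flows onto one path.
print([ecmp_path("10.0.0.1", f"10.0.1.{s}", 4242, 4791) for s in range(4)])
```

With few large flows, as in training traffic, the hash's collisions are exactly the ECMP failure path pinning was designed to avoid.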

Meta's deployment and failure (2024-08-05)

Meta deployed path pinning in the early years of their RoCE fabric as a first response to baseline-ECMP's failure on training traffic:

"Alternatively, we designed and deployed a path-pinning scheme in the initial years of our deployment. This scheme routed packets to specific paths based on the destination 'slice' (the index of the RTSW downlink). This worked well if each rack was fully assigned to the same job and there was no failure in the network."

The scheme is elegant when both preconditions hold. They do not durably hold at scale:

Failure mode 1: Partial rack allocation

Meta's training-job scheduler does not always place a whole rack into the same job — often only one of a rack's hosts uses the uplink for a given job. That means only a subset of RTSW downlink indices is active, so the path-pinning scheme routes the active flows onto a subset of CTSW paths, while other paths sit idle.

"The rack can be partially allocated to a job, with only one of the two hosts in the rack using the uplink bandwidth. This fragmented job placement caused uneven traffic distribution and congestion on the uplinks of the particular RTSW and degraded the training performance up to more than 30%."
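Failure mode 1 falls out of the arithmetic directly. A sketch under assumed parameters (4 paths, slice-modulo pinning; the slice numbering is hypothetical):

```python
N_PATHS = 4  # assumed number of parallel uplink paths

def pin_path(dst_slice: int) -> int:
    # Slice-indexed path pinning, as in the deterministic scheme.
    return dst_slice % N_PATHS

def path_load(active_slices) -> list:
    # Count flows per path when only some slices carry traffic.
    load = [0] * N_PATHS
    for s in active_slices:
        load[pin_path(s)] += 1
    return load

# Full rack allocation: all 8 slices active -> even load by construction.
print(path_load(range(8)))  # -> [2, 2, 2, 2]
# Fragmented placement: only slices 0 and 4 active -> everything on path 0.
print(path_load([0, 4]))    # -> [2, 0, 0, 0]
```

The scheme's "balance" is a property of the full set of slices being active; remove slices and the balance goes with them.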

Failure mode 2: Network failures re-shuffle flows

A link or CTSW failure forces the affected flows to be reassigned. Because path pinning is deterministic per slice and has no inherent load-sharing across the surviving paths, the reassignment falls back to ECMP, which collides the rerouted flows with existing pinned flows.

"Network failures on a uplink or a CTSW caused the affected flows to be unevenly reassigned to other CTSWs by ECMP. Those reassigned flows collided with other existing flows and slowed down the whole training job."
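Failure mode 2 can be sketched the same way: a path fails, its pinned flows fall back to an ECMP-style hash over the survivors, and they pile onto paths that already carry pinned traffic. Parameters and the fallback hash are illustrative, not Meta's switch behavior.

```python
import zlib

N_PATHS = 4  # assumed parallel paths

def pin_path(dst_slice: int) -> int:
    # Deterministic slice-indexed pinning.
    return dst_slice % N_PATHS

def reroute(dst_slice: int, alive: list) -> int:
    # ECMP-style fallback: hash the flow over the surviving paths.
    return alive[zlib.crc32(str(dst_slice).encode()) % len(alive)]

flows = list(range(8))  # one flow per slice, evenly pinned: 2 per path
failed = 2
alive = [p for p in range(N_PATHS) if p != failed]

load = [0] * N_PATHS
for s in flows:
    p = pin_path(s)
    if p == failed:
        p = reroute(s, alive)  # pinning no longer applies to these flows
    load[p] += 1
# Path 2 carries nothing; its two flows collide with flows already
# pinned to the survivors, so some surviving path now carries 3+.
print(load)
```

The surviving paths were sized for their own pinned flows; the rerouted flows arrive on top of that, which is the collision the source describes.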

Short-term bandaid: 2× overprovisioning

Meta mitigated the immediate pain by doubling RTSW uplink bandwidth, running the uplinks at 1:2 under-subscription relative to downlinks. This masks the imbalance but costs 2× network capacity; the source names it a "short-term mitigation."

(All quotes source: sources/2024-08-05-meta-a-roce-network-for-distributed-ai-training-at-scale)

Why Meta moved past it

The underlying lesson: deterministic routing schemes have brittle preconditions that real production systems violate regularly. Clean per-rack job placement is a scheduler invariant that breaks under fragmentation and fleet churn; "no network failures" is trivially false at scale.

Meta's subsequent move to E-ECMP with QP scaling is a return to hash-based routing, but with enough controlled entropy (hashing on the QP field, plus NCCL's multi-QP messaging) that the hash actually distributes well. The hash-based scheme degrades gracefully when its assumptions are violated, whereas path pinning collapses.
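The E-ECMP idea can be sketched as follows, under assumptions: the switch's path hash folds in the RoCE queue-pair (QP) number, and the collective library opens several QPs per logical transfer, so one heavy flow spreads over several paths. The hash function and field layout here are illustrative, not the switch ASIC's.

```python
import zlib

N_PATHS = 4  # assumed parallel paths

def eecmp_path(src_ip: str, dst_ip: str, qp: int) -> int:
    # E-ECMP-style selection: fold the QP number into the hash key so
    # distinct QPs of the same src/dst pair can land on distinct paths.
    key = f"{src_ip}|{dst_ip}|{qp}".encode()
    return zlib.crc32(key) % N_PATHS

# One logical transfer split over 16 QPs (multi-QP messaging): its
# traffic spreads across several paths instead of one hash bucket.
paths = sorted({eecmp_path("10.0.0.1", "10.0.1.1", qp) for qp in range(16)})
print(paths)
```

The entropy is controlled: the number of QPs is a tunable chosen by the host stack, not an accident of flow arrival, which is what makes the distribution reliable where baseline ECMP's was not.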

Broader takeaway for sysdesign

Path pinning is an instance of a recurring sysdesign tension: deterministic schemes optimise for the common case but have pathological failure modes, while probabilistic schemes have worse average behaviour but graceful degradation. When production preconditions can't be enforced (fragmentation, failures, job-shape variation), probabilistic usually wins — even when the deterministic optimum looks better on paper.
