
Fat-flow load balancing

Definition

Fat flows are network flows characterised by large transfer size and long lifetime — the opposite of the many-short-connections pattern that internet-style traffic typically exhibits. Fat-flow load balancing is the family of techniques for distributing such flows across the multiple parallel paths of a network fabric so that no single path is saturated while others stand idle.

The problem arises because the default load-balancing primitive on Ethernet — ECMP (Equal-Cost Multipath) with 5-tuple hashing — was designed for exactly the opposite traffic shape. ECMP chooses the next-hop for a packet by hashing (src_ip, dst_ip, src_port, dst_port, proto). For many short flows, this distributes traffic roughly uniformly across paths. For a handful of long-lived fat flows, the hash is deterministic per flow, so each flow pins to a single path — and since there are few flows, the distribution is dominated by luck.
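
The pinning effect is easy to see in a toy simulation (hypothetical addresses; `ecmp_path` is a stand-in for a switch's header hash, not any real switch's algorithm):

```python
import hashlib
import random

N_PATHS = 8

def ecmp_path(flow, n_paths):
    """Hash the 5-tuple and pick a next-hop index, as ECMP does."""
    key = "|".join(str(f) for f in flow).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

random.seed(0)

def rand_flow():
    # 4791 is the RoCEv2 UDP destination port; other fields are made up.
    return (f"10.0.0.{random.randrange(256)}", f"10.0.1.{random.randrange(256)}",
            random.randrange(1024, 65536), 4791, "udp")

# Many short flows: load per path comes out roughly uniform.
short = [0] * N_PATHS
for _ in range(10_000):
    short[ecmp_path(rand_flow(), N_PATHS)] += 1

# A handful of fat flows: each one pins to a single path, and with only
# 8 flows over 8 paths, hash collisions typically leave some paths idle.
fat = [0] * N_PATHS
for _ in range(8):
    fat[ecmp_path(rand_flow(), N_PATHS)] += 1

print("short-flow load:", short)
print("fat-flow load:  ", fat)
```

The per-flow determinism is the point: short flows average out over the hash, fat flows do not.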

Meta's framing (2024-06)

Meta names fat flows as a first-order problem in GenAI training fabrics:

"Just like ranking jobs, GenAI jobs produce additional fat flows that make it hard to distribute traffic across all possible network paths. This required us to further invest in network load balancing and routing to achieve an optimal distribution of traffic across network resources." (Source: sources/2024-06-12-meta-how-meta-trains-large-language-models-at-scale)

This is explicitly called out as one of three stack-level optimisations that made Meta's 24K-GPU RoCE cluster performance-competitive with the InfiniBand sibling. The other two: parallelism-axis → topology-layer mapping, and topology-aware collectives.

Meta's Networking @Scale 2023 talk is the referenced deeper dive on RoCE-specific load balancing techniques.

Why training produces fat flows

  • Data-parallel gradient all-reduce transfers the full parameter tensor set once per step. For a 70B model in bf16 that is a ~140 GB tensor; a ring all-reduce moves ~2 × 140 GB × (DP - 1) / DP per rank per step.
  • Pipeline-parallel activation handoffs between stages are smaller per message but long-lived connections.
  • Tensor-parallel per-layer all-reduces are many but individually smaller — these are not the fat flows; the topology-aware collectives concept handles them.
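
The gradient all-reduce arithmetic above can be checked with a quick sketch, assuming the standard ring all-reduce (reduce-scatter plus all-gather, each moving (DP - 1)/DP of the tensor per rank):

```python
# Back-of-envelope for the data-parallel gradient all-reduce volume.
params = 70e9                             # 70B-parameter model
bytes_per_param = 2                       # bf16
tensor_bytes = params * bytes_per_param   # ~140 GB of gradients

def ring_allreduce_per_gpu_bytes(size, dp):
    # Each rank sends (and receives) 2*(dp-1)/dp of the tensor:
    # (dp-1)/dp in the reduce-scatter phase + (dp-1)/dp in the all-gather.
    return 2 * (dp - 1) / dp * size

for dp in (8, 64, 512):
    gb = ring_allreduce_per_gpu_bytes(tensor_bytes, dp) / 1e9
    print(f"DP={dp:>3}: ~{gb:.0f} GB sent per rank per step")
```

The volume saturates near 2× the tensor size as DP grows, and it recurs every step for the job's lifetime — which is exactly what makes these flows "fat".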

The pathological mismatch is: a handful of very large, persistent connections land on a fabric whose primary load-balancing mechanism hashes all of them to the same path.

Mitigation classes

  • Adaptive routing (native to InfiniBand; on Ethernet via vendor-specific features) — per-packet path decisions using congestion telemetry.
  • Packet spraying — deliberately send packets of a single flow across multiple paths; requires reorder-tolerant receivers.
  • Flowlet-based ECMP — detect inactivity gaps in a flow, re-hash on each "flowlet" so long flows don't pin permanently.
  • Explicit routing / traffic engineering — application-level awareness of paths (SDN-style), place specific flows on specific paths.
  • Priority-flow-control (PFC) + ECN tuning — doesn't redistribute flows but prevents head-of-line blocking from cascading.
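
Of these, flowlet-based ECMP is the easiest to sketch. A minimal model (the 500 µs gap and the CRC hash are illustrative assumptions, not any vendor's values):

```python
import zlib

class FlowletEcmp:
    """Sketch of flowlet switching: if a flow pauses for longer than the
    flowlet gap, its next burst is re-hashed and may move to a new path."""

    def __init__(self, n_paths, gap_s=0.0005):
        self.n_paths = n_paths
        self.gap_s = gap_s     # inactivity gap (assumed 500 us here)
        self.state = {}        # flow -> (last_seen, path, flowlet_id)

    def route(self, flow, now):
        last_seen, path, flowlet_id = self.state.get(flow, (-1e9, None, -1))
        if now - last_seen > self.gap_s:
            # New flowlet: salt the hash with the flowlet id, so the path
            # can change between bursts while packets inside a burst stay
            # on one path (and therefore in order).
            flowlet_id += 1
            salt = f"{flow}|{flowlet_id}".encode()
            path = zlib.crc32(salt) % self.n_paths
        self.state[flow] = (now, path, flowlet_id)
        return path

lb = FlowletEcmp(n_paths=8)
flow = ("10.0.0.1", "10.0.1.2", 49152, 4791, "udp")
p1 = lb.route(flow, now=0.0)
p2 = lb.route(flow, now=0.0001)   # inside the gap: same path, order kept
p3 = lb.route(flow, now=0.01)     # after a pause: eligible to re-hash
print(p1, p2, p3)
```

The caveat for training traffic: a fat flow that never pauses never produces a flowlet boundary, which is why the list above also includes schemes that add entropy deliberately.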

Meta does not disclose which of these it uses; the 2024-06-12 post cites the Networking @Scale 2023 video for the details.

Why InfiniBand has it comparatively easier

InfiniBand's adaptive routing is a first-class protocol feature, and its fabric-wide congestion awareness is built into the HCAs (Host Channel Adapters). Ethernet/RoCE has to reach parity via a mix of vendor features, deployment discipline, and sometimes custom fabric software. This is part of why Meta's 24K-GPU RoCE cluster required explicit investment in load balancing to match its InfiniBand sibling.

Seen in

Meta's full routing evolution (2024-08-05 SIGCOMM)

The second Meta paper adds the concrete timeline of what was tried and why each step was insufficient:

Stage  Scheme                                                     Outcome
v0     Baseline ECMP                                              "Poor performance … due to low flow entropy"
v1     Path-pinning (per-RTSW-downlink-slice)                     >30% training degradation under partial rack allocation + failures
v1.5   2× RTSW uplink over-provisioning (1:2 under-subscription)  Masks v1's failure but 2× capital cost — explicitly a bandaid
v2     E-ECMP + QP scaling (round-robin messages across QPs)      +40% on AllReduce vs baseline
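
The v2 idea can be sketched as follows: split one collective message round-robin across several queue pairs, each carrying a different header field the switch hash can see, so one logical flow becomes many hashable flows. Everything here is illustrative (the toy hash, per-QP ports, and chunk size are assumptions, not Meta's configuration):

```python
from collections import Counter

N_PATHS = 8

def ecmp_path(udp_src_port):
    # Toy stand-in for the switch's header hash; the point is only that
    # E-ECMP hashes a field (here the UDP source port) that varies per QP.
    return udp_src_port % N_PATHS

def qp_scaled_send(message_bytes, n_qps, chunk=1 << 20):
    """Round-robin fixed-size chunks of one collective message across
    n_qps queue pairs; return bytes landing on each fabric path."""
    qp_ports = [49152 + i for i in range(n_qps)]   # hypothetical per-QP ports
    load = Counter()
    for i in range(0, message_bytes, chunk):
        qp = (i // chunk) % n_qps                  # round-robin the chunks
        size = min(chunk, message_bytes - i)
        load[ecmp_path(qp_ports[qp])] += size
    return load

one_qp = qp_scaled_send(512 << 20, n_qps=1)    # whole flow pins to one path
many_qp = qp_scaled_send(512 << 20, n_qps=16)  # extra entropy spreads it out
print(len(one_qp), "path(s) used vs", len(many_qp))
```

Unlike the deterministic v1 pinning, this degrades gracefully: losing a path or a rack slice just removes one hash bucket rather than invalidating a precomputed placement.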

The lesson: deterministic schemes break under failures + fragmentation; controlled-entropy hash schemes degrade more gracefully. See concepts/enhanced-ecmp-qp-scaling for the landing configuration.
