
CONCEPT

Feedback-control load balancing

Feedback-control load balancing names the class of LB strategies that close a control loop around each backend's observed load: the controller watches per-node utilization, compares against a setpoint (usually fleet-average utilization), and adjusts that node's routing weight to drive it toward the setpoint. PID is the canonical realization; proportional-only variants and ad-hoc "nudge up or down by ε" controllers are simpler cousins.
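
As a minimal sketch of the canonical realization (gains, class name, and the weight floor here are illustrative assumptions, not from any specific production system), a per-backend PID controller nudging routing weight toward the fleet-average setpoint:

```python
# Sketch only: a per-backend PID weight controller. Gains and names are
# assumptions for illustration; real deployments tune these empirically.

class PidWeightController:
    """Drives one backend's observed utilization toward the setpoint
    (typically fleet-average utilization) by adjusting its routing weight."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # accumulated error (lost on restart!)
        self.prev_error = 0.0

    def update(self, weight, utilization, setpoint):
        # Error is positive when the node is under-loaded relative to the
        # setpoint, so the controller raises its weight (and vice versa).
        error = setpoint - utilization
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        delta = (self.kp * error
                 + self.ki * self.integral
                 + self.kd * derivative)
        return max(0.0, weight + delta)  # weights never go negative
```

A proportional-only variant is this class with ki = kd = 0 — the "simpler cousin" mentioned above.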

Distinguish from open-loop LB algorithms — round-robin, random, subset selection, patterns/power-of-two-choices — which make routing decisions without modeling the effect of their own past decisions on subsequent load. Open-loop algorithms are reactive (P2C sees the current load); feedback-controlled ones are convergent (PID drives utilization toward target).

Why you need it

Round-robin distributes requests evenly. That only distributes load evenly when:

  1. Every request costs the same on every node, AND
  2. Every node has the same capacity.

Both assumptions fail at scale:

  • Heterogeneous hardware. A fleet accumulated over years spans multiple generations of CPUs, NICs, SSDs, and (increasingly) GPUs — request cost per node varies materially. Dropbox cites its sixth-generation hardware post as direct motivation.
  • Heterogeneous request cost. Real traffic has tail-cost requests: a handful of queries are orders of magnitude more expensive than the median.
  • Heterogeneous steady-state load. Some nodes run colocated background work, or serve different shard keys, or are in post-restart warm-up states.

Under any of these, equal request count → unequal CPU. Provisioning to the max-utilized node means the mean-utilized node runs under-loaded. That's the cost problem feedback control solves.

Measuring success

The correct metric for load balancing is max(utilization) / avg(utilization) across the backend fleet:

  • Service owners provision to the max (so nobody tips over).
  • The lever for cost is the mean.
  • Driving the ratio toward 1 directly translates to fleet-size reduction.
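
The metric itself is a one-liner; a quick sketch on a hypothetical per-node CPU snapshot (the numbers are made up):

```python
# Sketch: the balance metric on per-node utilization samples.
def balance_ratio(utilizations):
    """max(utilization) / avg(utilization); 1.0 means perfectly balanced."""
    return max(utilizations) / (sum(utilizations) / len(utilizations))

cpu = [0.42, 0.55, 0.71, 0.48]  # hypothetical per-node CPU
ratio = balance_ratio(cpu)      # ≈ 1.31: fleet is sized for the 0.71 node
```

If you must provision for the hottest node, every point of this ratio above 1.0 is capacity the mean node never uses.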

Contrast with p99 or avg latency, which measure user-experience rather than balancer performance. A round-robin fleet can have acceptable p99 and terrible max/avg (because you over-provisioned for the hotspot).

Observed impact numbers (from Robinhood):

Cluster             max/avg before PID   max/avg after PID   Fleet saving
Envoy proxy         1.26                 1.01                ~20%
Database frontend   1.40                 1.05                ~25%

Quantile spread (p5/avg/p95/max on per-node CPU) visibly collapses onto a single line after enablement.

Components of a feedback-control LB system

  1. Load-report channel — each backend periodically reports its utilization (CPU, in-flight-request count, queue depth, composite signal) to the control plane.
  2. Control plane — one controller instance per backend (or one controller with per-backend state); consumes load reports; computes weight adjustments.
  3. Weight-publication channel — updated weights reach clients. xDS / EDS is the common wire protocol. Robinhood writes weights into ZooKeeper/etcd; Envoy and gRPC clients consume via EDS.
  4. Weight-consuming LB algorithm — clients do weighted round-robin (or equivalent) using the published weights on per-request routing decisions (see concepts/layer-7-load-balancing).
  5. Setpoint choice — usually fleet-average utilization, because the goal is balance, not any specific absolute level.
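
One tick of the loop these components describe could be sketched as follows — a proportional-only simplification with hypothetical dict shapes; real deployments would publish the result via ZooKeeper/etcd and xDS EDS rather than return it:

```python
# Sketch of one control-plane tick (proportional-only, shapes assumed).
def control_tick(reports, weights, gain=0.5):
    """reports: {node: utilization} from the load-report channel.
    weights: {node: current routing weight}.
    Returns new weights, to be published to clients for weighted RR."""
    # Setpoint choice: fleet-average utilization, because the goal is
    # balance, not any specific absolute level.
    setpoint = sum(reports.values()) / len(reports)
    return {
        node: max(0.0, weights[node] + gain * (setpoint - util))
        for node, util in reports.items()
    }
```

Over-loaded nodes drift below their old weight, under-loaded nodes above it, and the fleet converges toward max/avg ≈ 1.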

Failure modes

  • Feedback lag. If reports arrive too slowly or request latency is too long (minutes), the controller can't observe its own effect within a single control window → oscillation or non-convergence. systems/dropbox-robinhood calls this out explicitly: "services with high latency requests should be asynchronous."
  • Missing / delayed reports. Individual missed reports → freeze that node's weight. Many missed → the average setpoint itself is unreliable → freeze all updates. Robinhood's threshold: >15% missing → skip the weight-update phase entirely.
  • Process-variable blind spots. A single utilization metric can miss pathological modes. The canonical failure: I/O-degraded node + CPU-only feedback = dead spiral. Node's I/O is stuck → CPU stays low → controller chasing average CPU raises its weight → more stuck requests → more weight → spiral. Mitigation: multi-metric feedback, e.g. max(CPU, in-flight-requests).
  • Cold-start: a new node has 0 utilization. Default weight would oscillate under feedback; instead give low starting weight and let the controller ramp up (patterns/slow-start-ramp-up layered on feedback).
  • Startup / state restoration: controllers carry integral accumulators and previous-error state. Restart loses it; read last published weights back as a best-effort restore.
  • Gain tuning: too-high gains → oscillation; too-low → slow convergence. Ziegler-Nichols or empirical tuning in a canary environment.
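
Two of the mitigations above can be sketched directly — thresholds, field names, and the in-flight normalization are assumptions for illustration:

```python
# Sketch: blind-spot and missing-report guards (names/thresholds assumed).

def safe_signal(report):
    """Multi-metric process variable: take the worse of CPU and normalized
    in-flight requests, so an I/O-stuck, low-CPU node still reads as
    'loaded' instead of attracting more weight."""
    return max(report["cpu"], report["inflight"] / report["inflight_cap"])

def should_update(reports, fleet_size, missing_threshold=0.15):
    """If too many nodes failed to report, the fleet-average setpoint is
    itself unreliable -> freeze all weight updates for this window."""
    missing_fraction = 1 - len(reports) / fleet_size
    return missing_fraction <= missing_threshold
```

An individually missed report should instead freeze just that node's weight; this guard handles the fleet-wide case.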

Relationship to other LB strategies

Strategy                            Reacts to current load   Converges on setpoint   Handles heterogeneity
Round-robin                         ✗                        ✗                       ✗
Random                              ✗                        ✗                       ✗
Weighted round-robin (static)       ✗                        ✗                       Partial (manual)
patterns/power-of-two-choices       ✓ (current load)         ✗                       Partial
Load-header client-side selection   ✓ (stale by N ms)        ✗                       Partial
Feedback control (PID)              ✓ + own past effect      ✓                       ✓

P2C and feedback control are complementary, not alternative: P2C is a stateless per-request algorithm good at smoothing instantaneous imbalance; feedback control is a stateful across-requests algorithm good at driving structural imbalance to zero. Running P2C inside feedback-driven weights is a plausible composition, though Dropbox's post uses weighted-RR directly on PID weights.

When not to use it

  • Low-traffic services: too few requests per control window → insufficient feedback signal.
  • Very high-latency services: can't observe effect within a control window.
  • Symmetric fleets where requests are uniform and nodes are identical: round-robin is already optimal; feedback adds complexity for no gain.
  • When the metric you can measure doesn't reflect real load. A misleading feedback signal is worse than no feedback.

Seen in

  • sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood's 2023 iteration is the canonical production realization: PID controller per node, setpoint = fleet-average CPU (or in-flight requests, or max(CPU, in-flight)), output = endpoint weight delta, consumed by Envoy/gRPC via xDS EDS. max/avg CPU ratio dropped from 1.26→1.01 and 1.4→1.05 on the two biggest clusters.