
CONCEPT

Feedback-control load balancing

Feedback-control load balancing names the class of LB strategies that close a control loop around each backend's observed load: the controller watches per-node utilization, compares against a setpoint (usually fleet-average utilization), and adjusts that node's routing weight to drive it toward the setpoint. PID is the canonical realization; proportional-only variants and ad-hoc "nudge up or down by ε" controllers are simpler cousins.
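
As a minimal sketch of the canonical realization (gains, class name, and the weight floor here are illustrative assumptions, not from any specific production system), a per-backend PID controller nudging routing weight toward the fleet-average setpoint:

```python
# Sketch only: a per-backend PID weight controller. Gains and names are
# assumptions for illustration; real deployments tune these empirically.

class PidWeightController:
    """Drives one backend's observed utilization toward the setpoint
    (typically fleet-average utilization) by adjusting its routing weight."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # accumulated error (lost on restart!)
        self.prev_error = 0.0

    def update(self, weight, utilization, setpoint):
        # Error is positive when the node is under-loaded relative to the
        # setpoint, so the controller raises its weight (and vice versa).
        error = setpoint - utilization
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        delta = (self.kp * error
                 + self.ki * self.integral
                 + self.kd * derivative)
        return max(0.0, weight + delta)  # weights never go negative
```

A proportional-only variant is this class with ki = kd = 0 — the "simpler cousin" mentioned above.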

Distinguish from open-loop LB algorithms — round-robin, random, subset selection, patterns/power-of-two-choices — which make routing decisions without modeling the effect of their own past decisions on subsequent load. Open-loop algorithms are reactive (P2C sees the current load); feedback-controlled ones are convergent (PID drives utilization toward target).

Why you need it

Round-robin distributes requests evenly. That only distributes load evenly when:

  1. Every request costs the same on every node, AND
  2. Every node has the same capacity.

Both assumptions fail at scale:

  • Heterogeneous hardware. A fleet accumulated over years spans multiple generations of CPUs, NICs, SSDs, and (increasingly) GPUs — request cost per node varies materially. Dropbox cites its sixth-generation hardware post as direct motivation.
  • Heterogeneous request cost. Real traffic has tail-cost requests: a handful of queries are orders of magnitude more expensive than the median.
  • Heterogeneous steady-state load. Some nodes run colocated background work, or serve different shard keys, or are in post-restart warm-up states.

Under any of these, equal request count → unequal CPU. Provisioning to the max-utilized node means the mean-utilized node runs under-loaded. That's the cost problem feedback control solves.

Measuring success

The correct metric for load balancing is max(utilization) / avg(utilization) across the backend fleet:

  • Service owners provision to the max (so nobody tips over).
  • The lever for cost is the mean.
  • Driving the ratio toward 1 directly translates to fleet-size reduction.
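
The metric itself is a one-liner; a quick sketch on a hypothetical per-node CPU snapshot (the numbers are made up):

```python
# Sketch: the balance metric on per-node utilization samples.
def balance_ratio(utilizations):
    """max(utilization) / avg(utilization); 1.0 means perfectly balanced."""
    return max(utilizations) / (sum(utilizations) / len(utilizations))

cpu = [0.42, 0.55, 0.71, 0.48]  # hypothetical per-node CPU
ratio = balance_ratio(cpu)      # ≈ 1.31: fleet is sized for the 0.71 node
```

If you must provision for the hottest node, every point of this ratio above 1.0 is capacity the mean node never uses.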

Contrast with p99 or avg latency, which measure user-experience rather than balancer performance. A round-robin fleet can have acceptable p99 and terrible max/avg (because you over-provisioned for the hotspot).

Observed impact numbers (from Robinhood):

Cluster             max/avg before PID   max/avg after PID   Fleet saving
Envoy proxy         1.26                 1.01                ~20%
Database frontend   1.40                 1.05                ~25%

Quantile spread (p5/avg/p95/max on per-node CPU) visibly collapses onto a single line after enablement.

Components of a feedback-control LB system

  1. Load-report channel — each backend periodically reports its utilization (CPU, in-flight-request count, queue depth, composite signal) to the control plane.
  2. Control plane — one controller instance per backend (or one controller with per-backend state); consumes load reports; computes weight adjustments.
  3. Weight-publication channel — updated weights reach clients. xDS / EDS is the common wire protocol. Robinhood writes weights into ZooKeeper/etcd; Envoy and gRPC clients consume via EDS.
  4. Weight-consuming LB algorithm — clients do weighted round-robin (or equivalent) using the published weights on per-request routing decisions (see concepts/layer-7-load-balancing).
  5. Setpoint choice — usually fleet-average utilization, because the goal is balance, not any specific absolute level.
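
One tick of the loop these components describe could be sketched as follows — a proportional-only simplification with hypothetical dict shapes; real deployments would publish the result via ZooKeeper/etcd and xDS EDS rather than return it:

```python
# Sketch of one control-plane tick (proportional-only, shapes assumed).
def control_tick(reports, weights, gain=0.5):
    """reports: {node: utilization} from the load-report channel.
    weights: {node: current routing weight}.
    Returns new weights, to be published to clients for weighted RR."""
    # Setpoint choice: fleet-average utilization, because the goal is
    # balance, not any specific absolute level.
    setpoint = sum(reports.values()) / len(reports)
    return {
        node: max(0.0, weights[node] + gain * (setpoint - util))
        for node, util in reports.items()
    }
```

Over-loaded nodes drift below their old weight, under-loaded nodes above it, and the fleet converges toward max/avg ≈ 1.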

Failure modes

  • Feedback lag. If reports arrive too slowly or request latency is too long (minutes), the controller can't observe its own effect within a single control window → oscillation or non-convergence. systems/dropbox-robinhood calls this out explicitly: "services with high latency requests should be asynchronous."
  • Missing / delayed reports. Individual missed reports → freeze that node's weight. Many missed → the average setpoint itself is unreliable → freeze all updates. Robinhood's threshold: >15% missing → skip the weight-update phase entirely.
  • Process-variable blind spots. A single utilization metric can miss pathological modes. The canonical failure: I/O-degraded node + CPU-only feedback = dead spiral. Node's I/O is stuck → CPU stays low → controller chasing average CPU raises its weight → more stuck requests → more weight → spiral. Mitigation: multi-metric feedback, e.g. max(CPU, in-flight-requests).
  • Cold-start: a new node has 0 utilization. Default weight would oscillate under feedback; instead give low starting weight and let the controller ramp up (patterns/slow-start-ramp-up layered on feedback).
  • Startup / state restoration: controllers carry integral accumulators and previous-error state. Restart loses it; read last published weights back as a best-effort restore.
  • Gain tuning: too-high gains → oscillation; too-low → slow convergence. Ziegler-Nichols or empirical tuning in a canary environment.
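
Two of the mitigations above can be sketched directly — thresholds, field names, and the in-flight normalization are assumptions for illustration:

```python
# Sketch: blind-spot and missing-report guards (names/thresholds assumed).

def safe_signal(report):
    """Multi-metric process variable: take the worse of CPU and normalized
    in-flight requests, so an I/O-stuck, low-CPU node still reads as
    'loaded' instead of attracting more weight."""
    return max(report["cpu"], report["inflight"] / report["inflight_cap"])

def should_update(reports, fleet_size, missing_threshold=0.15):
    """If too many nodes failed to report, the fleet-average setpoint is
    itself unreliable -> freeze all weight updates for this window."""
    missing_fraction = 1 - len(reports) / fleet_size
    return missing_fraction <= missing_threshold
```

An individually missed report should instead freeze just that node's weight; this guard handles the fleet-wide case.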

Relationship to other LB strategies

Strategy                            Reacts to current load   Converges on setpoint   Handles heterogeneity
Round-robin                         ✗                        ✗                       ✗
Random                              ✗                        ✗                       ✗
Weighted round-robin (static)       ✗                        ✗                       Partial (manual)
patterns/power-of-two-choices       ✓ (current load)         ✗                       Partial
Load-header client-side selection   ✓ (stale by N ms)        ✗                       Partial
Feedback control (PID)              ✓ + own past effect      ✓                       ✓

P2C and feedback control are complementary, not alternative: P2C is a stateless per-request algorithm good at smoothing instantaneous imbalance; feedback control is a stateful across-requests algorithm good at driving structural imbalance to zero. Running P2C inside feedback-driven weights is a plausible composition, though Dropbox's post uses weighted-RR directly on PID weights.

When not to use it

  • Low-traffic services: too few requests per control window → insufficient feedback signal.
  • Very high-latency services: can't observe effect within a control window.
  • Symmetric fleets where requests are uniform and nodes are identical: round-robin is already optimal; feedback adds complexity for no gain.
  • When the metric you can measure doesn't reflect real load. A misleading feedback signal is worse than no feedback.

Seen in

  • sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood's 2023 iteration is the canonical production realization: PID controller per node, setpoint = fleet-average CPU (or in-flight requests, or max(CPU, in-flight)), output = endpoint weight delta, consumed by Envoy/gRPC via xDS EDS. max/avg CPU ratio dropped from 1.26→1.01 and 1.4→1.05 on the two biggest clusters.