Feedback-control load balancing¶
Feedback-control load balancing names the class of LB strategies that close a control loop around each backend's observed load: the controller watches per-node utilization, compares against a setpoint (usually fleet-average utilization), and adjusts that node's routing weight to drive it toward the setpoint. PID is the canonical realization; proportional-only variants and ad-hoc "nudge up or down by ε" controllers are simpler cousins.
Distinguish from open-loop LB algorithms — round-robin, random, subset selection, patterns/power-of-two-choices — which make routing decisions without modeling the effect of their own past decisions on subsequent load. At best, open-loop algorithms are reactive (P2C sees current load at decision time); feedback-controlled ones are convergent (PID drives utilization toward the setpoint).
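The loop described above can be made concrete as a small per-node controller. A hedged sketch, assuming a PI(D) realization; the class name, gains, and clamp bounds are illustrative, not taken from any cited source:

```python
class WeightController:
    """Drives one backend's routing weight so its utilization tracks a setpoint.

    Illustrative sketch: gains and weight bounds are made-up defaults.
    """

    def __init__(self, kp=0.5, ki=0.1, kd=0.0, min_weight=1.0, max_weight=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_weight, self.max_weight = min_weight, max_weight
        self.integral = 0.0      # accumulated error (lost on restart; see failure modes)
        self.prev_error = 0.0

    def update(self, weight, utilization, setpoint):
        # error > 0 means the node is under-loaded relative to the setpoint
        # (usually fleet-average utilization), so its weight should rise.
        error = setpoint - utilization
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        delta = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp so a runaway error can't zero out or explode the weight.
        return max(self.min_weight, min(self.max_weight, weight + delta))
```

With proportional-plus-integral gains, repeated calls drive a node whose utilization scales with its weight toward the setpoint; the "nudge up or down by ε" cousins mentioned above amount to replacing the PID terms with a fixed step.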
Why you need it¶
Round-robin distributes requests evenly. That only distributes load evenly when:
- Every request costs the same on every node, AND
- Every node has the same capacity.
Both assumptions fail at scale:
- Heterogeneous hardware. A fleet accumulated over years spans multiple generations of CPUs, NICs, SSDs, and (increasingly) GPUs — request cost per node varies materially. Dropbox's post cites its sixth-generation hardware as direct motivation.
- Heterogeneous request cost. Real traffic has tail-cost requests: a handful of queries are orders of magnitude more expensive than the median.
- Heterogeneous steady-state load. Some nodes run colocated background work, or serve different shard keys, or are in post-restart warm-up states.
Under any of these, equal request count → unequal CPU. Provisioning to the max-utilized node means the mean-utilized node runs under-loaded. That's the cost problem feedback control solves.
Measuring success¶
The correct metric for load balancing is max(utilization) / avg(utilization) across the backend fleet:
- Service owners provision to the max (so nobody tips over).
- The lever for cost is the mean.
- Driving the ratio toward 1 directly translates to fleet-size reduction.
Contrast with p99 or average latency, which measure user experience rather than balancer performance. A round-robin fleet can have acceptable p99 and terrible max/avg (because you over-provisioned for the hotspot).
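The metric is trivial to compute from per-node utilization samples; a minimal sketch, with made-up numbers for a hotspotted fleet versus a balanced one:

```python
def max_over_avg(utilizations):
    """max(utilization) / avg(utilization); 1.0 is a perfectly balanced fleet."""
    return max(utilizations) / (sum(utilizations) / len(utilizations))

# Illustrative per-node CPU samples, not real data:
hot = [0.45, 0.47, 0.50, 0.90]   # must provision for 0.90 while the mean is 0.58
flat = [0.58, 0.57, 0.59, 0.58]  # same mean, ratio near 1
```

Both fleets carry the same total load; only the hot one forces you to size every node for the 0.90 outlier.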
Observed impact numbers (from Robinhood):
| Cluster | max/avg CPU before | max/avg CPU after | Fleet saving |
|---|---|---|---|
| Envoy proxy | 1.26 | 1.01 | ~20% |
| Database frontend | 1.4 | 1.05 | ~25% |
Quantile spread (p5/avg/p95/max on per-node CPU) visibly collapses onto a single line after enablement.
Components of a feedback-control LB system¶
- Load-report channel — each backend periodically reports its utilization (CPU, in-flight-request count, queue depth, composite signal) to the control plane.
- Control plane — one controller instance per backend (or one controller with per-backend state); consumes load reports; computes weight adjustments.
- Weight-publication channel — updated weights reach clients. xDS / EDS is the common wire protocol. Robinhood writes weights into ZooKeeper/etcd; Envoy and gRPC clients consume via EDS.
- Weight-consuming LB algorithm — clients apply the published weights via weighted round-robin (or equivalent) on per-request routing decisions (concepts/layer-7-load-balancing).
- Setpoint choice — usually fleet-average utilization, because the goal is balance, not any specific absolute level.
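One control-plane iteration wires these components together. A hedged sketch using the "nudge proportionally to the error" simplification rather than full PID; `publish` stands in for the weight-publication channel (xDS/EDS, ZooKeeper, etc.) and every name here is illustrative:

```python
def proportional_update(weight, utilization, setpoint, gain=20.0):
    # Simplest closed-loop rule: proportional-only adjustment, floored at 1
    # so no node is ever fully drained by the controller alone.
    return max(1.0, weight + gain * (setpoint - utilization))

def control_iteration(reports, weights, publish, update=proportional_update):
    """One control-plane pass.

    reports: {node: utilization} from the load-report channel.
    weights: {node: current weight}.
    publish: callable that pushes the new weight map to clients.
    """
    # Setpoint = fleet-average utilization: the goal is balance,
    # not any specific absolute level.
    setpoint = sum(reports.values()) / len(reports)
    new_weights = {node: update(weights[node], util, setpoint)
                   for node, util in reports.items()}
    publish(new_weights)
    return new_weights
```

Each iteration shifts weight from over-average nodes to under-average ones; a production controller would swap `proportional_update` for per-node PID state.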
Failure modes¶
- Feedback lag. If reports arrive too slowly or request latency is too long (minutes), the controller can't observe its own effect within a single control window → oscillation or non-convergence. systems/dropbox-robinhood calls this out explicitly: "services with high latency requests should be asynchronous."
- Missing / delayed reports. Individual missed reports → freeze that node's weight. Many missed → the average setpoint itself is unreliable → freeze all updates. Robinhood's threshold: >15% missing → skip the weight-update phase entirely.
- Process-variable blind spots. A single utilization metric can miss pathological modes. The canonical failure: I/O-degraded node + CPU-only feedback = death spiral. The node's I/O is stuck → CPU stays low → the controller, chasing average CPU, raises its weight → more stuck requests → more weight → spiral. Mitigation: multi-metric feedback, e.g. max(CPU, in-flight-requests).
- Cold start. A new node has 0 utilization. A default weight would oscillate under feedback; instead give it a low starting weight and let the controller ramp it up (patterns/slow-start-ramp-up layered on feedback).
- Startup / state restoration: controllers carry integral accumulators and previous-error state. Restart loses it; read last published weights back as a best-effort restore.
- Gain tuning: too-high gains → oscillation; too-low → slow convergence. Ziegler-Nichols or empirical tuning in a canary environment.
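Two of the guards above are simple enough to sketch directly. The 15% threshold is Robinhood's; the report shape and function names are illustrative assumptions:

```python
MISSING_THRESHOLD = 0.15  # Robinhood's cutoff: >15% missing → skip updates

def load_signal(report):
    # Multi-metric feedback: an I/O-stuck node shows low CPU but a high
    # in-flight count, so take the max of the two normalized signals
    # instead of trusting CPU alone.
    return max(report["cpu"], report["inflight_ratio"])

def should_update(reports, expected_nodes):
    # If too many reports are missing, the fleet-average setpoint itself
    # is unreliable: freeze all weight updates for this control window.
    missing = sum(1 for n in expected_nodes if n not in reports)
    return missing / len(expected_nodes) <= MISSING_THRESHOLD
```

An individual missing report would instead freeze only that node's weight; the fleet-wide skip kicks in once the setpoint can no longer be trusted.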
Relationship to other LB strategies¶
| Strategy | Reacts to current load | Converges on setpoint | Handles heterogeneity |
|---|---|---|---|
| Round-robin | ✗ | ✗ | ✗ |
| Random | ✗ | ✗ | ✗ |
| Weighted round-robin (static) | ✗ | ✗ | Partial (manual) |
| patterns/power-of-two-choices | ✓ (current load) | ✗ | Partial |
| Load-header client-side selection | ✓ (stale by N ms) | ✗ | Partial |
| Feedback control (PID) | ✓ + own past effect | ✓ | ✓ |
P2C and feedback control are complementary, not alternative: P2C is a stateless per-request algorithm good at smoothing instantaneous imbalance; feedback control is a stateful across-requests algorithm good at driving structural imbalance to zero. Running P2C inside feedback-driven weights is a plausible composition, though Dropbox's post uses weighted-RR directly on PID weights.
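That composition — P2C running inside feedback-driven weights — can be sketched in a few lines. This is an assumption-laden illustration of the idea, not anything from the Dropbox post (which uses weighted-RR directly):

```python
import random

def pick_backend(weights, inflight, rng=random):
    """weights: {node: controller-published weight}; inflight: {node: count}."""
    nodes = list(weights)
    # Structural imbalance: sample two candidates with probability
    # proportional to their feedback-driven weights.
    a, b = rng.choices(nodes, weights=[weights[n] for n in nodes], k=2)
    # Instantaneous imbalance: P2C picks the less-loaded of the two.
    return a if inflight[a] <= inflight[b] else b
```

The weights handle the slow, structural component; the in-flight comparison smooths request-to-request noise the controller is too slow to see.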
When not to use it¶
- Low-traffic services: too few requests per control window → insufficient feedback signal.
- Very high-latency services: can't observe effect within a control window.
- Symmetric fleets where requests are uniform and nodes are identical: round-robin is already optimal; feedback adds complexity for no gain.
- When the metric you can measure doesn't reflect real load. A misleading feedback signal is worse than no feedback.
Seen in¶
- sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood's 2023 iteration is the canonical production realization: PID controller per node, setpoint = fleet-average CPU (or in-flight requests, or max(CPU, in-flight)), output = endpoint weight delta, consumed by Envoy/gRPC via xDS EDS. max/avg CPU ratio dropped from 1.26→1.01 and 1.4→1.05 on the two biggest clusters.
Related¶
- concepts/pid-controller — the underlying control-theory primitive
- systems/dropbox-robinhood — the canonical production case
- concepts/client-side-load-balancing — the usual data-plane shape
- concepts/layer-7-load-balancing — the usual wire substrate
- patterns/power-of-two-choices — the dominant open-loop alternative