

PID controller (feedback control)

A PID controller (proportional–integral–derivative) is a 90-year-old control-theory primitive that drives a process variable toward a setpoint by computing a correction from three terms over the tracked error:

  • P — proportional to the current error.
  • I — proportional to the integral of error over time (kills steady-state offset).
  • D — proportional to the derivative of error (damps oscillation, anticipates overshoot).

output = K_p · e(t) + K_i · ∫e(t)dt + K_d · de(t)/dt, where e(t) = setpoint − measured(t).
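In discrete time the equation reduces to a few lines of state per control tick. A minimal sketch in Python (the `PID` class name, the gains, and the toy plant are all illustrative, not taken from the source):

```python
class PID:
    """Minimal discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0      # accumulates ∫ e dt
        self.prev_error = None   # for the finite-difference derivative

    def update(self, measured, dt):
        error = self.setpoint - measured                      # e(t)
        self.integral += error * dt                           # rectangle-rule integral
        derivative = 0.0 if self.prev_error is None \
            else (error - self.prev_error) / dt               # de/dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Drive a toy plant (process variable moves by half the controller output
# each tick) toward a setpoint of 70; pv converges on the setpoint.
pid = PID(kp=0.5, ki=0.1, kd=0.05, setpoint=70.0)
pv = 20.0
for _ in range(200):
    pv += 0.5 * pid.update(pv, dt=1.0)
```

The toy plant stands in for whatever the actuator moves (queue depth, utilization); the structure of the loop is the same regardless of plant.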

Canonical industrial control applications (thermostats, cruise control, chemical-plant flow control) are all direct uses of PID.

In distributed-systems contexts

PID is an increasingly common building block for closed-loop infrastructure when a scalar process variable (latency, utilization, queue depth, error rate, replica count) needs to track a scalar setpoint. Uses include:

  • Load balancing — per-node endpoint weight is the actuator; node utilization is the process variable; fleet-average utilization is the setpoint. See concepts/feedback-control-load-balancing and systems/dropbox-robinhood.
  • Autoscaling — replica count or pod count as actuator; request rate, CPU utilization, or queue depth as process variable; target utilization as setpoint. Kubernetes HPA is effectively a P-controller (no I, no D) with extra bounds.
  • Concurrency control — Netflix's concurrency-limits / adaptive concurrency libraries use PI / gradient-descent-with-damping analogs to drive in-flight-request count against an observed-latency setpoint.
  • Backpressure / rate-limiting — token-bucket refill rate adjusted to hold a downstream-queue-depth signal at a setpoint.
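The last bullet can be made concrete with a PI update (D dropped, as is common in noisy settings) nudging a token-bucket refill rate toward a queue-depth setpoint. Function name, gains, and bounds below are hypothetical illustrations, not any library's API:

```python
def adjust_refill_rate(rate, queue_depth, target_depth, integral,
                       kp=0.01, ki=0.001, min_rate=1.0, max_rate=1000.0):
    """PI adjustment of a token-bucket refill rate (illustrative gains).

    Queue above target -> negative error -> rate comes down; below
    target -> rate goes up. Output is clamped to the actuator's bounds.
    """
    error = target_depth - queue_depth
    integral += error
    new_rate = rate + kp * error + ki * integral
    return max(min_rate, min(max_rate, new_rate)), integral


# Downstream queue persistently above target: refill rate decreases.
rate, integral = 100.0, 0.0
for depth in [500, 480, 460, 450, 440]:
    rate, integral = adjust_refill_rate(rate, depth,
                                        target_depth=200, integral=integral)
```

Note the clamp on the output: without it, a long overload would push the rate arbitrarily low, which is the actuator-saturation problem discussed under Limitations.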

Why it works in infrastructure

PID is a good fit when:

  1. The process variable is a scalar with a clear "too high / too low" interpretation.
  2. The actuator effect is monotone — increasing the actuator predictably moves the process variable in one direction.
  3. Feedback latency is shorter than the control window — i.e. you can observe the effect of a correction before the next decision.

The Dropbox Robinhood post calls out the third condition explicitly: services with very low traffic or very high-latency requests (minutes) don't satisfy the feedback-latency precondition, and PID "won't be as effective." Their recommendation: those services should be asynchronous in the first place.

Corner cases that show up in practice

  • Startup / state restoration: PID carries state (integral accumulator, previous error). Restart drops state. Standard fixes: restore from a durable store, or wait for a warm-up window before re-enabling control. systems/dropbox-robinhood restores per-node weights (which reflect prior controller state) from the ZK/etcd routing database on LBS restart.
  • Cold-start actuator: a new node / new resource starts at "0 utilization." Setting it straight to the fleet-average weight would cause oscillation; the standard fix is a low initial weight plus a ramp under PID. Equivalent to patterns/slow-start-ramp-up.
  • Missing measurements: if the process variable isn't observable, the correct action is "hold previous output, don't guess." If many measurements are missing, the setpoint calculation itself becomes unreliable — freeze the controller entirely. Robinhood freezes weight updates when >15% of service nodes' load reports are missing.
  • Single-metric blind spots: a single process variable can miss pathological modes — Robinhood's classic example: CPU as the only signal fails for I/O-degraded nodes (CPU low → PID raises weight → spiral). Mitigation is either a multi-metric process variable (max(CPU, in-flight)) or a separate veto / safety layer outside the PID loop.
  • Tuning: the K_p / K_i / K_d gains are notoriously tricky to set. Ziegler–Nichols is the classical recipe; in infrastructure contexts, iterative tuning in a canary cluster is more common. Gains set too high cause oscillation; set too low, slow convergence.
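The missing-measurement corner cases above amount to a guard wrapped around the controller rather than logic inside it. A sketch under assumed names (the function and its parameters are invented for illustration; the 0.15 threshold mirrors Robinhood's >15% freeze rule from the source):

```python
def control_step(pid_update, measured, prev_output,
                 missing_fraction, freeze_threshold=0.15):
    """Guarded controller step (hypothetical API).

    - Fleet-wide reports too sparse: the setpoint itself is unreliable,
      so freeze the whole controller and keep the previous output.
    - This node's measurement missing: hold previous output, don't guess.
    - Otherwise: run the normal PID update.
    """
    if missing_fraction > freeze_threshold:
        return prev_output
    if measured is None:
        return prev_output
    return pid_update(measured)


# Usage with a trivial P-only update: measurement missing -> output held.
out = control_step(lambda m: 0.5 * (70.0 - m), measured=None,
                   prev_output=3.0, missing_fraction=0.05)
```

Keeping the guard outside the PID loop also leaves room for the separate veto / safety layer mentioned in the single-metric bullet.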

Why PID over simpler alternatives

Against open-loop strategies (round-robin, random, patterns/power-of-two-choices):

| Strategy | Reacts to observed effect? | Converges on setpoint? | Handles heterogeneity? |
|---|---|---|---|
| Round-robin | No | No | No |
| Random | No | No | No |
| Power of two | Partial (sees current load, not its own effect) | No | Partial — reactive, not convergent |
| PID (feedback) | Yes | Yes | Yes |

PID is strictly more powerful when you need convergence on a target, at the cost of needing a scalar setpoint, observable feedback, and gain tuning.

Limitations

  • Scalar setpoint only. Multi-dimensional optimization (e.g. latency and cost and availability simultaneously) requires composition of controllers or a step up to MPC (model predictive control).
  • No model of the system. PID treats the plant as a black box. For known-structured problems, MPC or explicit solvers can beat PID.
  • Actuator saturation. If the actuator has bounds (weight ∈ [0, max]), the integral term can "wind up" accumulating error that can't be corrected. Anti-windup mechanisms (clamping, back-calculation) are a standard mitigation.
  • Measurement noise → differentiator noise. The D term amplifies high-frequency noise in the process variable. Filtered derivative or dropped D term is common in practice.
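The last two limitations have standard code-level mitigations: clamp the integral accumulator (anti-windup) and low-pass-filter the derivative. A sketch with illustrative constants (the class name, `i_limit`, and `alpha` are assumptions, not from the source):

```python
class RobustPID:
    """PID with integral clamping and a filtered D term (illustrative)."""

    def __init__(self, kp, ki, kd, setpoint, i_limit=50.0, alpha=0.2):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.i_limit = i_limit        # anti-windup: bound on the accumulator
        self.alpha = alpha            # low-pass factor for the derivative
        self.integral = 0.0
        self.prev_error = 0.0
        self.d_filtered = 0.0

    def update(self, measured, dt=1.0):
        error = self.setpoint - measured
        # Anti-windup by clamping: accumulator can never exceed ±i_limit,
        # so a saturated actuator doesn't bank uncorrectable error.
        self.integral = max(-self.i_limit,
                            min(self.i_limit, self.integral + error * dt))
        # Filtered derivative: exponential smoothing damps the
        # high-frequency measurement noise the raw D term would amplify.
        raw_d = (error - self.prev_error) / dt
        self.d_filtered += self.alpha * (raw_d - self.d_filtered)
        self.prev_error = error
        return (self.kp * error + self.ki * self.integral
                + self.kd * self.d_filtered)


# Usage: a constant large error saturates the accumulator at i_limit
# instead of growing without bound.
pid = RobustPID(kp=0.2, ki=0.05, kd=0.1, setpoint=100.0)
for _ in range(1000):
    last = pid.update(0.0)
```

Back-calculation (feeding the clamped/unclamped output difference back into the integrator) is the other common anti-windup scheme; clamping is the simpler of the two.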

Seen in

  • sources/2024-10-28-dropbox-robinhood-in-house-load-balancing: Robinhood's 2023 iteration runs one PID controller per node per service, setpoint = fleet-average utilization, output = endpoint weight delta. Reported impact: max/avg CPU ratio dropped from 1.26→1.01 and 1.4→1.05 on two large clusters, enabling ~25% fleet-size reduction on some of the largest services. The post is also a compact catalogue of PID-in-infra corner cases (startup restore, cold-start ramp-up, missing-measurement freeze, multi-metric safety).