
Robinhood (Dropbox in-house load balancing service)

Robinhood is Dropbox's in-house load balancing service for internal traffic. It was deployed in 2020 to unify routing across Dropbox services and datacenters, and rebuilt in 2023 around PID-controller-driven feedback-control load balancing. It serves hundreds of thousands of hosts across multiple global datacenters. The headline outcome of the rebuild: a ~25% fleet-size reduction on some of Dropbox's largest services, and the elimination of a class of reliability incidents caused by over-utilized processes.

Architecture

Three main components per datacenter plus one ancillary service:

Load Balancing Service (LBS) — the control plane.

  • Collects per-node utilization reports per service.
  • Runs one PID controller per node; setpoint = average utilization across that service's nodes.
  • PID output = delta to the node's endpoint weight; weights are normalized across all endpoints of the service.
  • Sharded by service via Dropbox's in-house shard manager — each service has a single primary LBS worker at any time; services share no state, so LBS scales horizontally with service count × fleet size.
  • Utilization metric: CPU by default; switchable to in-flight requests for non-CPU-bound services; max(CPU, in-flight) as a veto-pattern safety against I/O-degraded nodes spiralling under CPU-only feedback.
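The per-node control loop can be sketched as follows. This is a minimal illustration of the mechanism described above (one PID controller per node, setpoint = the service's average utilization, output = a weight delta followed by normalization); the class/function names and the gain values are assumptions, not Dropbox's implementation.

```python
class PidWeightController:
    """One controller per node. Gains are illustrative, not Dropbox's."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def delta(self, node_util, avg_util):
        # Positive error = node is below the service average: send it more.
        error = avg_util - node_util
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def pid_step(weights, controllers, utils):
    """One control tick: apply each node's PID delta, then normalize."""
    avg = sum(utils.values()) / len(utils)
    raw = {node: max(1e-6, w + controllers[node].delta(utils[node], avg))
           for node, w in weights.items()}
    total = sum(raw.values())
    return {node: w / total for node, w in raw.items()}
```

One tick shifts weight from a hot node toward a cold one; repeated ticks converge all nodes toward the average-utilization setpoint.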

Proxy tier — TLS-fanout reduction + service-to-shard routing.

  • All nodes send load reports to the proxy (not directly to LBS).
  • Proxy forwards each report to the LBS shard that owns the service.
  • Reduces LBS connection count from O(nodes × services) to O(services); shields LBS from TLS-handshake storms during restart/push.
  • Horizontally scalable within the DC; this pattern is reused across Dropbox infra.

Routing database (ZooKeeper / etcd) — the data plane.

  • Stores (service, endpoint) → {host, ip, weight, …}.
  • Eventually consistent — sufficient for service discovery.
  • Watch-based push to clients; matches xDS streaming semantics.
  • During LB-strategy migration, different database entries hold weights for different strategies (see "LB strategy migration" below).

Config Aggregator — ancillary service.

  • Problem: a single Robinhood mega-config owned by the LB team coupled every team's changes together; rollback required LB-team involvement.
  • Solution: shard the mega-config into one per-service config file owned by each team. The aggregator watches every per-service config and reconstructs the mega-config in real time for LBS to consume.
  • Per-service rollback is decoupled from the global push; the LB team stops being on the critical path.
  • Tombstone feature: deleting a service's Robinhood entry marks it tombstoned for several days before actual removal — solves a race between Robinhood-config removal and Envoy configs still referencing the service (different push cadences).
  • Generalizes as patterns/per-service-config-aggregator.
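The aggregator's merge step with tombstoning can be sketched like this. Everything concrete here (function names, the config shape, the exact tombstone window) is an illustrative assumption; the post only says deletions are held for "several days".

```python
import time

TOMBSTONE_TTL = 3 * 24 * 3600  # "several days" in the post; exact value assumed


def aggregate(per_service_configs, tombstones, now=None):
    """Rebuild the mega-config from per-service files.

    per_service_configs: {service: config dict, or None if the file was deleted}
    tombstones: {service: deletion timestamp}, mutated in place across calls.
    """
    now = time.time() if now is None else now
    mega = {}
    for service, cfg in per_service_configs.items():
        if cfg is None:
            # Deleted: keep a tombstone entry so Envoy configs that still
            # reference the service (slower push cadence) don't dangle.
            tombstones.setdefault(service, now)
            if now - tombstones[service] < TOMBSTONE_TTL:
                mega[service] = {"tombstoned": True}
        else:
            tombstones.pop(service, None)  # resurrection cancels the tombstone
            mega[service] = cfg
    return mega
```

A deleted service survives as a tombstoned entry until the window elapses, after which it drops out of the aggregated config entirely.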

Client integration via xDS

Dropbox extended its service-discovery system to speak Envoy's Endpoint Discovery Service (EDS) — carrying the endpoint weights from the routing database.

  • Envoy clients consume EDS natively and do weighted round-robin per request.
  • gRPC clients consume via xDS (gRFC A27). At the time of the post, upstream gRPC did not support endpoint-weight-aware weighted-RR, so Dropbox wrote a custom weighted-RR picker based on earliest-deadline-first (EDF) scheduling, which yields weighted-RR with lower variance than ticket-bucket approaches.
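An EDF-based weighted round-robin picker can be sketched as below — each endpoint's next "deadline" is one period (1/weight) after its last pick, and the earliest deadline wins. This is a generic illustration of the technique, not Dropbox's picker code.

```python
import heapq
import itertools


class EdfWrrPicker:
    """Weighted round-robin via earliest-deadline-first scheduling."""

    def __init__(self, weights):
        self._tie = itertools.count()  # stable tie-breaker for equal deadlines
        self._heap = []
        for endpoint, w in weights.items():
            # First deadline = 1/weight: heavier endpoints come due sooner.
            heapq.heappush(self._heap, (1.0 / w, next(self._tie), endpoint, w))

    def pick(self):
        deadline, _, endpoint, w = heapq.heappop(self._heap)
        # Reschedule one period (1/weight) later.
        heapq.heappush(self._heap, (deadline + 1.0 / w, next(self._tie), endpoint, w))
        return endpoint
```

Over any window of picks, each endpoint's share tracks its weight, and the deterministic schedule avoids the burstiness (variance) of randomized or ticket-bucket selection.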

One control plane, two data-plane consumers — the same architectural lever as Databricks' EDS setup; concepts/xds-protocol generalizing beyond sidecars.

Cross-DC routing (two-layer weighted round-robin)

Within a DC, PID closes the feedback loop. Between DCs, Robinhood uses a static locality config hot-reloadable at runtime:

{
  zone_1: { zone_1: 100 },                   # 100% local
  zone_2: { zone_2: 50, zone_1: 50 },        # 50/50 split
}

Clients do two layers of weighted-RR:

  1. Pick a zone by the static locality weights.
  2. Pick an endpoint within the zone by Robinhood's feedback-driven weights.

Real-time DC failover is a config edit, not a control-loop reaction — latency/RTT/failover intent aren't things Dropbox wants a feedback loop to optimize away.
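The two-layer pick can be sketched as follows, using the locality config above. The function name and the shape of the endpoint-weight map are illustrative assumptions.

```python
import random

# Mirrors the static locality config above: zone_1 stays local,
# zone_2 splits traffic 50/50 with zone_1.
LOCALITY = {
    "zone_1": {"zone_1": 100},
    "zone_2": {"zone_2": 50, "zone_1": 50},
}


def pick_endpoint(client_zone, endpoint_weights, rng=random):
    """Layer 1: weighted-RR over zones (static locality weights).
    Layer 2: weighted-RR over endpoints in the chosen zone (Robinhood weights).

    endpoint_weights: {zone: {endpoint: weight}} as read from the routing DB.
    """
    zones = LOCALITY[client_zone]
    zone = rng.choices(list(zones), weights=list(zones.values()))[0]
    eps = endpoint_weights[zone]
    return rng.choices(list(eps), weights=list(eps.values()))[0]
```

Failing a DC over is then just editing `LOCALITY` and hot-reloading it; the inner feedback-driven weights are untouched.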

PID controller corner cases (generic to feedback-control LB)

  • LBS startup: restore PID state by reading the current endpoint weights from the routing DB; delay weight updates until enough load reports arrive. Trades a short settling window for no thundering-herd on restart.
  • Cold-start node: new nodes join with 0 utilization. Setting weight proportional to capacity would oscillate under feedback; instead, initial weight is low and the PID ramps it up to the average (patterns/slow-start-ramp-up semantically layered on top of feedback control).
  • Missing load reports: skip that node's weight update (keep current weight). If > 15% of reports are missing globally for a service, skip the weight update entirely — the average-utilization setpoint would be unreliable.
  • Degraded I/O + CPU-only feedback = dead spiral: a stuck-I/O node has low CPU, so CPU-only PID raises its weight, piling on more stuck requests. Fix: max(CPU, in-flight-requests) so non-CPU-bound saturation shows up.
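The last two corner cases reduce to two small guards, sketched here. The 15% threshold is from the post; the in-flight normalization constant and all names are assumptions.

```python
MAX_MISSING_FRACTION = 0.15  # from the post: skip the update above this


def utilization(report, inflight_capacity=100):
    """Veto pattern: max(CPU, in-flight) so that a stuck-I/O node
    (low CPU, high in-flight count) still reads as saturated."""
    return max(report["cpu"], report["inflight"] / inflight_capacity)


def should_update(reports, fleet_size):
    """Skip the whole service's weight update if too many load reports are
    missing — the average-utilization setpoint would be unreliable."""
    missing = fleet_size - len(reports)
    return missing / fleet_size <= MAX_MISSING_FRACTION
```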

LB strategy migration (weighted-sum blending)

Each service can have multiple strategies configured concurrently (e.g. round-robin + PID). Each strategy writes its own endpoint weights into different routing-DB entries. Clients compute:

blended_weight = w_new × α + w_old × (1 − α)

α is controlled by a percentage feature flag. Every client sees the same blended weight for every endpoint at any instant during the migration window — no inconsistent routing, no per-client feature-flag skew. When α reaches 100%, the old strategy is retired. Generalizes as patterns/weighted-sum-strategy-migration.
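As client-side code, the blend is a one-liner applied per endpoint; a sketch with illustrative names:

```python
def blended_weight(w_new, w_old, alpha):
    """alpha in [0, 1]: 0 = old strategy only, 1 = new strategy only.
    alpha comes from a percentage feature flag shared by all clients,
    so every client computes the same blended weight at any instant."""
    return w_new * alpha + w_old * (1.0 - alpha)


def blend_endpoints(new_weights, old_weights, alpha):
    return {ep: blended_weight(new_weights[ep], old_weights[ep], alpha)
            for ep in new_weights}
```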

Measured outcome

Performance metric: maxCPU / avgCPU (service owners provision to the max-utilized node; driving the ratio to 1 frees unused capacity):

| Service | Before PID | After PID | Δ |
|---|---|---|---|
| Envoy proxy cluster | 1.26 | 1.01 | −20% |
| Database frontend cluster | 1.40 | 1.05 | −25% |

Fleet-wide: ~25% fleet-size reduction on some of Dropbox's largest services; per-node CPU quantile spread (p5/avg/p95/max) collapses to a single line after PID enablement. See concepts/feedback-control-load-balancing#Measuring success for why max/avg is the correct LB metric.
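A back-of-envelope reading of the Δ column: if owners provision the whole fleet to the hottest node while average load is unchanged, the capacity freed is just the relative improvement in the max/avg ratio. Sketch (the function name is illustrative):

```python
def capacity_saving(max_over_avg_before, max_over_avg_after):
    """Fraction of fleet freed when provisioning tracks the hottest node
    and the max/avg utilization ratio improves, average load unchanged."""
    return 1.0 - max_over_avg_after / max_over_avg_before
```

Plugging in the table's numbers: 1 − 1.01/1.26 ≈ 20% and 1 − 1.05/1.40 = 25%, matching the Δ column.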

Design principles stated

  • Default / zero config beats exposing many knobs. Most service owners just want the default.
  • Minimize client changes. Client upgrades take months (cadence from weekly to years); LBS changes roll out in minutes. Shift all algorithmic evolution into the LBS.
  • Plan migration at design time. Stated verbatim: "The amount of engineering time required for a migration should be a key metric for success."
