Robinhood (Dropbox in-house load balancing service)¶
Robinhood is Dropbox's in-house load balancing service for internal traffic. Deployed in 2020 to unify routing across Dropbox services and datacenters; rebuilt in 2023 around PID-controller-driven feedback-control load balancing. Serves hundreds of thousands of hosts across multiple global datacenters. The headline outcome of the rebuild: ~25% fleet-size reduction on some of Dropbox's largest services and removal of a class of reliability incidents driven by over-utilized processes.
Architecture¶
Three main components per datacenter plus one ancillary service:
Load Balancing Service (LBS) — the control plane.
- Collects per-node utilization reports per service.
- Runs one PID controller per node, setpoint = average utilization across that service's nodes.
- PID output = delta to the node's endpoint weight; weights are normalized across all endpoints of the service.
- Sharded by service via Dropbox's in-house shard manager — each service has a single primary LBS worker at any time; services share no state, so LBS scales horizontally with service count × fleet size.
- Utilization metric: default CPU; switchable to in-flight requests for non-CPU-bound services; max(CPU, in-flight) as a veto-pattern safety against I/O-degraded nodes spiralling under CPU-only feedback. One control iteration is sketched below.
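A minimal sketch of one LBS control iteration for a single service, under assumed names (pidState, updateWeights) and illustrative gains — the post describes the loop, not its implementation:

```go
package main

import "fmt"

// pidState holds per-node controller state; LBS runs one controller per node.
type pidState struct{ integral, prevErr float64 }

// Illustrative gains; Dropbox's actual tuning is not public.
const kP, kI, kD = 0.5, 0.1, 0.05

// updateWeights runs one control iteration. The setpoint is the average
// utilization across the service's nodes; each node's PID output is applied
// as a delta to its endpoint weight, then weights are normalized across all
// endpoints of the service. Utilization here is assumed to already be
// max(CPU, in-flight) per the veto pattern.
func updateWeights(util, weights map[string]float64, st map[string]*pidState) {
	var sum float64
	for _, u := range util {
		sum += u
	}
	setpoint := sum / float64(len(util))

	for node, u := range util {
		s := st[node]
		err := u - setpoint // hotter than average -> positive error
		s.integral += err
		deriv := err - s.prevErr
		s.prevErr = err
		// A hot node gets its weight pushed down, a cold node up.
		weights[node] -= kP*err + kI*s.integral + kD*deriv
		if weights[node] < 0 {
			weights[node] = 0
		}
	}

	var total float64
	for _, w := range weights {
		total += w
	}
	if total == 0 {
		return
	}
	for node := range weights {
		weights[node] /= total // normalize across the service's endpoints
	}
}

func main() {
	util := map[string]float64{"a": 0.9, "b": 0.5, "c": 0.4}
	weights := map[string]float64{"a": 0.34, "b": 0.33, "c": 0.33}
	st := map[string]*pidState{"a": {}, "b": {}, "c": {}}
	updateWeights(util, weights, st)
	fmt.Println(weights) // "a" sheds load; "b" and "c" pick it up
}
```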
Proxy tier — TLS-fanout reduction + service-to-shard routing.
- All nodes send load reports to the proxy (not directly to LBS).
- Proxy forwards each report to the LBS shard that owns the service.
- Reduces LBS connection count from O(nodes × services) to O(services); shields LBS from TLS-handshake storms during restart/push.
- Horizontally scalable within the DC; this pattern is reused across Dropbox infra.
Routing database (ZooKeeper / etcd) — the data plane.
- Stores (service, endpoint) → {host, ip, weight, …}.
- Eventually consistent — sufficient for service discovery.
- Watch-based push to clients; matches xDS streaming semantics.
- During LB-strategy migration, different database entries hold weights for different strategies (see "Strategy migration" below); a record-shape sketch follows this list.
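A hypothetical shape for a routing-database record, for illustration only (field names assumed, not from the post):

```go
package robinhood

// EndpointRecord is a hypothetical record stored under the
// (service, endpoint) key in ZooKeeper/etcd; clients watch these
// entries and receive pushed updates.
type EndpointRecord struct {
	Host   string  `json:"host"`
	IP     string  `json:"ip"`
	Weight float64 `json:"weight"`
	// During strategy migration, each strategy's weights live in
	// separate entries rather than overwriting this one.
}
```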
Config Aggregator — ancillary service.
- Problem: a single Robinhood mega-config owned by the LB team coupled every team's changes; rollback required LB-team involvement.
- Solution: shard the mega-config into one per-service config file owned by each team. The aggregator watches every per-service config and reconstructs the mega-config in real time for LBS to consume (aggregation sketched below).
- Per-service rollback decouples from the global push; the LB team stops being on the critical path.
- Tombstone feature: deleting a service's Robinhood entry marks it tombstoned for several days before actual removal — solves a race between Robinhood-config removal and Envoy configs still referencing the service (different push cadences).
- Generalizes as patterns/per-service-config-aggregator.
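A sketch of the aggregation step with the tombstone grace period, under assumed types and names (ServiceConfig, aggregate) — the post describes the behavior, not the implementation:

```go
package robinhood

import "time"

// ServiceConfig is the parsed form of one hypothetical per-service config file.
type ServiceConfig struct {
	Name       string
	Tombstoned *time.Time // set when the per-service file is deleted
}

// aggregate rebuilds the mega-config from per-service configs. Tombstoned
// services are kept for a grace period, so Envoy configs on a slower push
// cadence never reference a service Robinhood has already dropped.
func aggregate(perService []ServiceConfig, now time.Time, grace time.Duration) []ServiceConfig {
	out := make([]ServiceConfig, 0, len(perService))
	for _, c := range perService {
		if c.Tombstoned != nil && now.Sub(*c.Tombstoned) > grace {
			continue // grace period over: actually removed
		}
		out = append(out, c)
	}
	return out
}
```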
Client integration via xDS¶
Dropbox extended its service-discovery system to speak Envoy's Endpoint Discovery Service (EDS) — carrying the endpoint weights from the routing database.
- Envoy clients consume EDS natively and do weighted round-robin per request.
- gRPC clients consume via xDS (gRFC A27). At the time of the post, upstream gRPC did not support endpoint-weight-aware weighted-RR, so Dropbox wrote a custom weighted-RR picker based on earliest-deadline-first scheduling (EDF yields weighted-RR with lower variance than ticket-bucket approaches; sketched below).
One control plane, two data-plane consumers — the same architectural lever as Databricks EDS, and an instance of concepts/xds-protocol generalizing beyond sidecars.
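A minimal sketch of an EDF-based weighted-RR picker (names assumed; not Dropbox's or gRPC's actual picker API): each endpoint carries a virtual deadline one period (1/weight) ahead, and every pick takes the earliest deadline and reschedules it.

```go
package robinhood

import "container/heap"

// edfEntry is one endpoint in the earliest-deadline-first schedule.
type edfEntry struct {
	addr     string
	weight   float64 // endpoint weight from EDS
	deadline float64 // virtual time of next eligibility
}

type edfHeap []*edfEntry

func (h edfHeap) Len() int           { return len(h) }
func (h edfHeap) Less(i, j int) bool { return h[i].deadline < h[j].deadline }
func (h edfHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *edfHeap) Push(x any)        { *h = append(*h, x.(*edfEntry)) }
func (h *edfHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// Pick pops the endpoint with the earliest deadline and reschedules it one
// period (1/weight) later. Over time each endpoint is selected in proportion
// to its weight, with lower variance than ticket-bucket randomization.
func Pick(h *edfHeap) string {
	e := heap.Pop(h).(*edfEntry)
	e.deadline += 1 / e.weight
	heap.Push(h, e)
	return e.addr
}
```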
Cross-DC routing (two-layer weighted round-robin)¶
Within a DC, PID closes the feedback loop. Between DCs, Robinhood uses a static locality config hot-reloadable at runtime:
Clients do two layers of weighted-RR:
1. Pick a zone by the static locality weights.
2. Pick an endpoint within the zone by Robinhood's feedback-driven weights.
Real-time DC failover is a config edit, not a control-loop reaction — latency/RTT/failover intent aren't things Dropbox wants a feedback loop to optimize away.
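A sketch of the two-layer pick, with weighted random choice standing in for weighted-RR and all names assumed:

```go
package robinhood

import "math/rand"

// pickEndpoint does the two-layer selection: zone by static locality
// weights (hot-reloadable config), then endpoint by Robinhood's
// PID-driven weights for that zone.
func pickEndpoint(r *rand.Rand, zoneWeights map[string]float64,
	endpointWeights map[string]map[string]float64) string {
	zone := weightedChoice(r, zoneWeights)          // layer 1: static
	return weightedChoice(r, endpointWeights[zone]) // layer 2: feedback-driven
}

// weightedChoice picks a key with probability proportional to its weight.
func weightedChoice(r *rand.Rand, w map[string]float64) string {
	var total float64
	for _, v := range w {
		total += v
	}
	x := r.Float64() * total
	last := ""
	for k, v := range w {
		last = k
		x -= v
		if x <= 0 {
			return k
		}
	}
	return last // floating-point edge case
}
```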
PID controller corner cases (generic to feedback-control LB)¶
- LBS startup: restore PID state by reading the current endpoint weights from the routing DB; delay weight updates until enough load reports arrive. Trades a short settling window for no thundering-herd on restart.
- Cold-start node: new nodes join with 0 utilization. Setting weight proportional to capacity would oscillate under feedback; instead, initial weight is low and the PID ramps it up to the average (patterns/slow-start-ramp-up semantically layered on top of feedback control).
- Missing load reports: skip that node's weight update (keep current weight). If > 15% of reports are missing globally for a service, skip the weight update entirely — the average-utilization setpoint would be unreliable (guard sketched after this list).
- Degraded I/O + CPU-only feedback = death spiral: a stuck-I/O node has low CPU, so a CPU-only PID raises its weight, piling on more stuck requests. Fix: max(CPU, in-flight-requests), so non-CPU-bound saturation shows up.
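A sketch of the global guard (the 15% threshold is from the post; the function name is assumed):

```go
package robinhood

// shouldRunIteration skips the entire weight update for a service when too
// many load reports are missing: a node without a report simply keeps its
// current weight, but with >15% of the fleet unreported the
// average-utilization setpoint itself is unreliable.
func shouldRunIteration(reported, fleetSize int) bool {
	missing := fleetSize - reported
	return float64(missing) <= 0.15*float64(fleetSize)
}
```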
LB strategy migration (weighted-sum blending)¶
Each service can have multiple strategies configured concurrently (e.g. round-robin + PID). Each strategy writes its own endpoint weights into different routing-DB entries. Clients compute:

w_effective = (1 − α) · w_old + α · w_new
α is controlled by a percentage feature flag. Every client sees the same blended weight for every endpoint at any instant during the migration window — no inconsistent routing, no per-client feature-flag skew. When α reaches 100%, the old strategy is retired. Generalizes as patterns/weighted-sum-strategy-migration.
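A one-line sketch of the client-side blend (function name assumed):

```go
package robinhood

// blendedWeight is the weight a client actually routes on during a
// migration: a convex combination of the old and new strategies' weights,
// with alpha in [0, 1] driven by a percentage feature flag that every
// client reads identically.
func blendedWeight(alpha, oldW, newW float64) float64 {
	return (1-alpha)*oldW + alpha*newW
}
```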
Measured outcome¶
Performance metric: maxCPU / avgCPU (service owners provision to the max-utilized node; driving the ratio to 1 frees unused capacity):
| Service | Before PID | After PID | Δ (reduction) |
|---|---|---|---|
| Envoy proxy cluster | 1.26 | 1.01 | 20% |
| Database frontend cluster | 1.4 | 1.05 | 25% |
Fleet-wide: ~25% fleet-size reduction on some of Dropbox's largest services; per-node CPU quantile spread (p5/avg/p95/max) collapses to a single line after PID enablement. See concepts/feedback-control-load-balancing#Measuring success for why max/avg is the correct LB metric.
Design principles stated¶
- Default / zero config beats exposing many knobs. Most service owners just want the default.
- Minimize client changes. Client upgrades take months (release cadences range from weekly to years); LBS changes roll out in minutes. Shift all algorithmic evolution into the LBS.
- Plan migration at design time. Stated verbatim: "The amount of engineering time required for a migration should be a key metric for success."
Seen in¶
- sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — the 2024 post by Dropbox's runtime / services-platform teams; describes the PID iteration, architecture, cross-DC routing, config aggregator, migration strategy, and measured impact.
Related¶
- concepts/pid-controller — the control-theory primitive.
- concepts/feedback-control-load-balancing — the category of LB strategies Robinhood implements.
- concepts/xds-protocol — the discovery substrate Robinhood uses to feed both Envoy and gRPC clients.
- concepts/client-side-load-balancing — the deployment model.
- concepts/control-plane-data-plane-separation — LBS (decide) vs clients (deliver).
- systems/envoy — primary client-side consumer of Robinhood EDS responses.
- systems/grpc — secondary consumer via xDS (gRFC A27).
- systems/databricks-endpoint-discovery-service — independent realization of the same one-xDS-control-plane-for-both-internal-and-ingress idea.
- patterns/weighted-sum-strategy-migration — Robinhood-originated migration pattern.
- patterns/per-service-config-aggregator — Robinhood-originated config pattern.
- patterns/proxyless-service-mesh — architectural class Robinhood fits (client library + xDS, no sidecars).