DROPBOX 2024-10-28 Tier 2

What's new with Robinhood, our in-house load balancing service

Summary

Robinhood is Dropbox's in-house load balancing service for internal traffic, deployed since 2020 and rebuilt in 2023 around PID-controller-driven feedback control over an xDS / Envoy-EDS control plane. Each service gets a bank of per-node PID controllers whose setpoint is the average utilization across that service's nodes; each controller's output becomes the delta to its endpoint's weight, normalized across the service. The LBS writes weighted endpoints into a ZooKeeper/etcd routing database, and Envoy/gRPC clients do weighted round-robin per-request routing. The architecture has three parts: the Load Balancing Service (LBS), a Proxy (a fanout-reduction tier that avoids the N²-connection problem and shields the LBS from TLS fanout), and the Routing Database (ZK/etcd). A fourth, ancillary service, the Config Aggregator, shards each service's Robinhood config into its own file, lets each team push independently, and reassembles the mega-config for the LBS. Migration between LB strategies was staged via a weighted sum of per-strategy weights gated by a percentage feature flag, so every client sees the same blended endpoint weight during rollout. Reported outcomes: the max/avg CPU ratio on one Envoy cluster went from 1.26 to 1.01 (~20% reduction) and on a DB frontend cluster from 1.4 to 1.05 (25% reduction), enabling a ~25% fleet-size reduction on some of Dropbox's largest services and removing a class of over-utilized-process reliability incidents.

Key takeaways

  1. Load balancing at scale is a control problem, not a placement problem. Round-robin, subset rotation, and load-header-aware client-side selection had each been tried. None closed the gap because they're open-loop: they don't incorporate the effect of their own decisions on subsequent node utilization. PID puts a feedback loop around each node, converging its utilization on the fleet average regardless of hardware heterogeneity or request-cost skew. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  2. max/avg ratio is the right performance metric for LB, not p99 or average. Service owners provision to the maximum-utilized node, so the cost lever is maxCPU / avgCPU. A ratio → 1 means fleet capacity is actually being used. Robinhood's PID iteration drove Envoy-proxy max/avg from 1.26 to 1.01 and DB-frontend max/avg from 1.4 to 1.05. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  3. CPU alone is a pathological LB signal under I/O degradation. If a node's I/O is stuck, CPU stays low, the PID controller (blindly chasing the average CPU) increases its weight, sending more traffic — and the node spirals. Dropbox's fix: use max(CPU, in-flight-requests) as the feedback signal, so a non-CPU-bound saturation shows up. Generalizable: any single-metric feedback system needs a veto signal for pathologies the metric can't see. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  4. xDS is the LB-integration contract, not a sidecar thing. Robinhood's LBS emits endpoint weights into ZK/etcd; Envoy's EDS consumes the feed directly; gRPC clients consume via gRPC proposal A27 (xDS for gRPC global LB). One control plane feeds both — same pattern Databricks independently arrived at via their own EDS. gRPC's xDS LB doesn't support endpoint-weight weighted-RR out of the box as of the post's date; Dropbox wrote a custom weighted-RR picker based on earliest-deadline-first scheduling. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  5. A fanout-reduction proxy tier is nearly free architectural leverage. Without it, every LBS process would hold a connection to every node in the DC → TLS-handshake-storms on LBS restart + memory pressure proportional to fleet. The proxy accepts per-DC connections from nodes, forwards load reports to the LBS shard that owns the service, and scales horizontally independent of LBS shard count. Dropbox says this "TLS-fanout-reduction" proxy pattern is used across their infra. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  6. Cold-start, missed reports, and startup state are the corner cases that cost the most design time. Each PID controller's state (accumulated integral, previous error) is non-trivial to reconstruct after an LBS restart; Dropbox restores weights from the routing database and waits for load reports before updating. New nodes start at low weight and let the PID ramp them up. If >15% of load reports for a service are missing, the LBS skips weight updates entirely rather than acting on a skewed average. All three are generalizable feedback-control corner cases, not Robinhood-specific. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  7. Cross-DC routing is a different problem with a different control surface. Within a DC, Robinhood PID-balances endpoints. Between DCs, traffic splits are a statically configured locality policy with hot-reload (explicit client-dc → dest-dc → weight table), because RTT / failover intent / closest-DC semantics aren't things to optimize away via feedback. Two layers of weighted round-robin then compose: client picks a zone by static weight, then picks an endpoint within the zone by PID weight. Real-time failover between DCs is a config edit, not a control-loop reaction. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  8. Shard the config, not just the service. Robinhood's first config model was a single mega-config built and pushed by the LB team. Every breaking change required LB-team involvement to roll back. Solution: Config Aggregator splits the mega-config into one file per service owned by that service's team, aggregates them in real-time, writes back the mega-config. Rollbacks become per-service; the LB team stops being on the critical path. Also adds tombstoning (don't delete entries for days — solves race between Robinhood config removal and downstream Envoy config still referencing the service). Pattern generalizes as patterns/per-service-config-aggregator. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  9. Migrate between LB strategies via weighted-sum blending, not flag-gated cutover. Each strategy (round-robin, PID) writes its own endpoint weights into the routing DB. Clients compute blended_weight = w_new × α + w_old × (1 − α) with α controlled by a percentage feature gate. Result: every client sees the same effective weight for every endpoint during the rollout, so there's no inconsistent routing during the migration window. Pattern generalizes as patterns/weighted-sum-strategy-migration. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  10. Design client changes once, then don't touch them again. Dropbox's mobile/server clients upgrade on cadences from "weekly" to "not for years." Robinhood decided up-front to put weighted round-robin in the client and never change it; all algorithm evolution happens in the LBS. Weight changes roll out in minutes because they're LBS-side; client changes take months. Frames as an invariant: the more you can shift to the control-plane, the faster you can iterate. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  11. Configuration simplicity is a production feature. Robinhood exposes many knobs; most service owners just want a default. "A good, simple default config — or even better, zero config — can save tons of engineering time." Echoes Databricks' same retracted-metrics-based-routing → "kept P2C, stopped adding complexity" lesson. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)
  12. Migration strategy must be designed in at project start, not retrofitted. Stated verbatim: "The migration process for Robinhood was not well-designed from the very beginning, so we ended up spending much more time than expected reimplementing the process and redesigning the configuration. The amount of engineering time required for a migration should be a key metric for success." Generalizable to any fundamental-infra component: if adopters can't migrate on their own, the project is partially blocked on its adopters' backlogs. (Source: sources/2024-10-28-dropbox-robinhood-in-house-load-balancing)

Architecture

Three primary components (per data center)

Load Balancing Service (LBS) — the heart of Robinhood.
  • Receives per-node utilization reports from services.
  • For each service, a set of PID controllers (one per node); setpoint = average utilization across that service's nodes.
  • PID output = delta to the node's endpoint weight; weights are normalized across the service's endpoints.
  • Horizontally sharded by service via Dropbox's in-house shard manager: each service has exactly one primary LBS worker (concurrent write isolation), and shards share nothing, so the LBS scales with total services × fleet size.
  • Utilization metric: CPU by default; in-flight requests for services that aren't CPU-bound; max(CPU, in-flight) as the veto-pattern safety against I/O-degraded nodes spiraling under CPU-only control.
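The per-node control loop described above can be sketched in Python. The gains, the 0.01 weight floor, and the report format are illustrative assumptions, not values from the post; only the structure (per-node PID, fleet-average setpoint, max(CPU, in-flight) signal, normalization) follows the text:

```python
from dataclasses import dataclass

@dataclass
class PIDController:
    """One controller per node; its state must survive or be rebuilt on LBS restart."""
    kp: float = 0.5
    ki: float = 0.1
    kd: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float) -> float:
        # Positive error => node is below the fleet average => raise its weight.
        error = setpoint - measured
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def update_weights(weights, reports, controllers):
    """reports: node -> {'cpu': float, 'inflight': float} (normalized utilizations)."""
    # Veto signal: max(CPU, in-flight) so I/O-stalled nodes don't look idle.
    util = {n: max(r['cpu'], r['inflight']) for n, r in reports.items()}
    setpoint = sum(util.values()) / len(util)          # fleet-average utilization
    for node, u in util.items():
        delta = controllers[node].step(setpoint, u)    # PID output = weight delta
        weights[node] = max(0.01, weights[node] + delta)  # illustrative floor
    total = sum(weights.values())                      # normalize across the service
    return {n: w / total for n, w in weights.items()}
```

One step of this loop shifts weight from over-average nodes to under-average ones; repeated steps converge per-node utilization on the mean regardless of hardware heterogeneity.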

Proxy tier — routes load reports from all nodes in the DC to the LBS shard that owns their service.
  • Every node sends load reports to the proxy, not directly to the LBS.
  • The proxy forwards each report to the LBS shard owning the service (based on shard-manager assignment).
  • Reduces LBS connection count from O(nodes × services) to O(services) per LBS replica, and reduces TLS-handshake storms during LBS push/restart.
  • Horizontally scalable within the DC; the pattern is used across Dropbox infra.

Routing Database (ZooKeeper/etcd) — the data plane between LBS and clients.
  • Stores (service, endpoint) → {host, ip, weight, ...}.
  • Eventually consistent, which is sufficient for service discovery.
  • Watch-based push to all subscribed clients — matches xDS streaming semantics.
  • During migration between LB strategies, different entries hold different strategies' weights (see migration section).

Client integration via xDS

  • Dropbox extended its service-discovery system to export Envoy's Endpoint Discovery Service (EDS) responses — endpoint lists carry weights from the routing DB.
  • Envoy clients: native EDS → weighted round-robin per request.
  • gRPC clients: xDS via gRPC A27; since gRPC upstream doesn't support endpoint-weight WRR yet, Dropbox wrote a custom weighted-RR picker based on earliest-deadline-first scheduling (EDF gives weighted-RR with lower variance than ticket-bucket approaches).
  • Same control plane, two data-plane consumers — the same architectural moat as Databricks EDS and the explicit power of concepts/xds-protocol generalizing beyond sidecars.
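A minimal sketch of an EDF-based weighted-RR picker of the kind the bullet above describes (the class name and weight format are hypothetical; gRPC's actual picker API differs). Each endpoint is scheduled at virtual-time intervals of 1/weight, which interleaves picks smoothly instead of handing out a burst of tickets per endpoint:

```python
import heapq
import itertools

class EDFWeightedRR:
    """Weighted round-robin via earliest-deadline-first scheduling: an endpoint
    with weight w is picked once per 1/w units of virtual time."""
    def __init__(self, weights):
        self._counter = itertools.count()  # tie-breaker for equal deadlines
        self._heap = []
        for endpoint, w in weights.items():
            heapq.heappush(self._heap, (1.0 / w, next(self._counter), endpoint, w))

    def pick(self):
        deadline, _, endpoint, w = heapq.heappop(self._heap)
        # Re-arm: the endpoint's next deadline is one period (1/w) later.
        heapq.heappush(self._heap, (deadline + 1.0 / w, next(self._counter), endpoint, w))
        return endpoint
```

With weights {'a': 4, 'b': 1} the first five picks are a, a, a, b, a — 'a' never arrives in a burst of four followed by a gap, which is the variance advantage over ticket-bucket approaches.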

Cross-DC routing (two-layer weighted round-robin)

Within a DC, LBS-driven weights close a feedback loop. Between DCs, routing is governed by a static locality config:

{
  zone_1: { zone_1: 100 },                    # all-local
  zone_2: { zone_2: 50, zone_1: 50 },         # 50/50 split
}
  • Entries live in the service-discovery EDS response.
  • Clients (Envoy + gRPC) do two layers of weighted round-robin:
      1. Pick a zone via the static locality config.
      2. Pick an endpoint within that zone via Robinhood's feedback-driven weights.
  • Hot reload of the locality config enables real-time data-center failover as a config edit, not a control-loop reaction.
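The two-layer selection can be sketched as follows. For brevity this sketch uses random weighted sampling rather than the deterministic weighted round-robin the clients implement (equivalent in distribution); all names and data shapes are illustrative:

```python
import random

def pick_endpoint(client_zone, locality_config, zone_endpoints, rng=random):
    """Two-layer weighted selection: static cross-DC weights first, then
    PID-driven per-endpoint weights within the chosen zone.

    locality_config: client_zone -> {dest_zone: weight}   (static, hot-reloaded)
    zone_endpoints:  dest_zone   -> {endpoint: pid_weight} (from the routing DB)
    """
    # Layer 1: zone choice is governed by the static locality table.
    zones = locality_config[client_zone]
    dest = rng.choices(list(zones), weights=list(zones.values()))[0]
    # Layer 2: endpoint choice within the zone uses feedback-driven weights.
    endpoints = zone_endpoints[dest]
    return dest, rng.choices(list(endpoints), weights=list(endpoints.values()))[0]
```

Failover is then an edit to `locality_config` (hot-reloaded), while the inner weights keep adjusting via feedback — the two control surfaces compose without interfering.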

Config aggregator

  • Problem: a single mega-config for all services had every team's changes coupled. A bad push forced the LB team to own rollback, and "it's risky to roll back because we don't know what other teams changed since the last push." Each push took hours across DCs.
  • Fix: each service owns its own Robinhood config file in the codebase. A separate config-aggregator service watches all per-service configs and constructs the mega-config that LBS consumes, propagating changes in real-time.
  • Tombstone feature: when a service's config is deleted, the aggregator marks it as tombstoned rather than removing immediately; actual removal happens days later. Prevents accidental deletion and solves a race condition between Robinhood config removal and Envoy configs still referencing the service (different push cadences).
  • Periodic backups of the mega-config compensate for the fact that the underlying config-management service isn't versioned.

LB strategy migration: weighted-sum blending

  • Each service can have multiple strategies configured concurrently (e.g. round-robin + PID).
  • LBS writes each strategy's endpoint weights into different entries in the routing DB.
  • Clients compute blended_weight = w_PID × α + w_RR × (1 − α) where α is a percentage feature gate.
  • Every client sees the same blended weight for every endpoint at any moment in the rollout window → no inconsistent routing during migration.
  • When α reaches 100%, the old strategy's weights stop being read.
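The blending rule is a plain convex combination; a one-function sketch (names assumed):

```python
def blended_weight(w_new: float, w_old: float, alpha: float) -> float:
    """alpha in [0, 1] comes from a percentage feature gate. Because every
    client evaluates the same alpha against the same routing-DB entries, all
    clients see the same effective weight for an endpoint at any moment."""
    return w_new * alpha + w_old * (1.0 - alpha)
```

At α = 0 clients route purely on the old strategy's weights, at α = 1 purely on the new one's, and every intermediate setting is a consistent mixture rather than a per-client coin flip between strategies.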

LBS corner cases

  • LBS startup: restore PID-controller state by reading current endpoint weights from the routing DB; delay weight updates until enough load reports arrive.
  • Cold-start node: a fresh node joins with 0 utilization. Rather than immediately giving it a "fair" share (which, given PID feedback, would oscillate), set its initial weight low and let the PID ramp it up to average — a form of patterns/slow-start-ramp-up on top of feedback control.
  • Missing load reports: skip that node's weight update (keep current weight). If >15% of reports are missing globally for the service, skip the weight update entirely — the average setpoint would be unreliable.
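The report-driven update guard can be sketched as follows. The 15% threshold is from the post; the cold-start weight value and function shape are illustrative assumptions:

```python
MISSING_REPORT_THRESHOLD = 0.15   # from the post: >15% missing => skip update
COLD_START_WEIGHT = 0.01          # illustrative low initial weight; PID ramps it up

def plan_update(known_nodes, reports, current_weights):
    """Decide what the LBS does this cycle given which load reports arrived.
    Returns the list of nodes to PID-update, or None to skip the whole cycle."""
    missing = [n for n in known_nodes if n not in reports]
    if len(missing) > MISSING_REPORT_THRESHOLD * len(known_nodes):
        return None  # the average setpoint would be skewed; change nothing
    for node in known_nodes:
        # Fresh nodes enter at a low weight instead of a "fair" share,
        # avoiding the oscillation a sudden traffic step would cause.
        current_weights.setdefault(node, COLD_START_WEIGHT)
    # Nodes with missing reports keep their current weight this cycle.
    return [n for n in known_nodes if n in reports]
```

The same guard covers LBS restart: weights are first restored from the routing DB, and no node is updated until its reports arrive.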

Evaluated performance

Metric: maxCPU / avgCPU (service owners provision to the max-utilized node; driving this ratio to 1 means provisioning for actual mean workload).

  Service                     Before PID (max/avg)   After PID (max/avg)   Improvement
  Envoy proxy cluster         1.26                   1.01                  ~20%
  Database frontend cluster   1.4                    1.05                  25%
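The Improvement column is the relative reduction of the max/avg ratio; a quick check that the reported percentages follow from the before/after numbers:

```python
def ratio_improvement(before: float, after: float) -> float:
    """Relative reduction in the max/avg CPU ratio."""
    return (before - after) / before

# (1.26 - 1.01) / 1.26 ≈ 0.198  -> the reported ~20%
# (1.4  - 1.05) / 1.4  = 0.25   -> the reported 25%
```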

Fleet-wide outcomes:
  • ~25% fleet-size reduction for some of Dropbox's largest services.
  • Reliability improvements via fewer over-utilized processes.
  • Quantile spread (p5/avg/p95/max of per-node CPU) collapses into a single line after PID enablement.

Operational numbers

  • Hundreds of thousands of hosts across multiple DCs globally via the in-house service-discovery system.
  • Millions of clients on some services — motivates client-subset selection at discovery time.
  • 15% missing-load-reports threshold for LBS to skip weight updates.
  • Two layers of weighted round-robin (cross-DC + intra-DC) compose in the client.
  • PID per-node state is restored from routing DB on LBS restart; a short settling window is tolerated on cold start.
  • GPU-capable nodes explicitly called out as motivating the effort — heterogeneous hardware + LLM-serving workloads make uneven load much more expensive than it is on uniform-hardware fleets.

Systems / concepts / patterns extracted

Systems

  • systems/dropbox-robinhood (new). The load-balancing service this post is about.
  • systems/envoy — existing. Robinhood's EDS responses are consumed natively by Envoy clients for weighted-RR per-request routing.
  • systems/grpc — existing. Robinhood extends xDS (A27) to gRPC; a custom earliest-deadline-first weighted-RR picker fills in for gRPC upstream not supporting endpoint-weight WRR yet.
  • systems/zookeeper (minimal cross-ref) — ZK/etcd as routing DB; eventual consistency is sufficient for service discovery.
  • systems/shard-manager — existing (Meta Shard Manager page). Dropbox references their own in-house shard manager for LBS primary-worker assignment; the underlying problem (one worker per service at a time) is the same.

Concepts

Patterns

  • patterns/weighted-sum-strategy-migration (new). Blend weights from multiple LB strategies via a percentage feature flag so every client sees the same effective routing during the migration window.
  • patterns/per-service-config-aggregator (new). Shard a centralized mega-config into per-service files owned by service teams; a lightweight aggregator assembles the mega-config in real-time; rollbacks and breaking changes are per-service; central team stops being on the critical path. Tombstone feature for soft delete with later reclamation.
  • patterns/proxyless-service-mesh — existing. Robinhood fits the pattern: shared client library (Envoy/gRPC) + xDS control plane, no sidecars.
  • patterns/slow-start-ramp-up — existing. PID new-node cold-start uses it implicitly.

Caveats

  • PID feedback requires enough feedback. The post notes explicitly: "if there is little feedback — for example, in the case of a very low traffic service, or very high-latency requests measured in minutes — the load balancing won't be as effective. We argue that services with high latency requests should be asynchronous." So PID is not universal.
  • CPU-only PID misbehaves under I/O degradation. A node with degraded I/O keeps CPU low while requests pile up; CPU-only PID raises its weight → dead spiral. Dropbox's max(CPU, in-flight) is a specific mitigation, generalizable but not universal.
  • ZK/etcd is eventually consistent. Fine for service discovery but adds a staleness bound on routing correctness — same tradeoff as Databricks EDS.
  • Config-management-service is not versioned. Robinhood relies on periodic backups of the mega-config; a structural gap the aggregator works around but does not fix.
  • Tier 2 content is self-reported. Numbers (25% fleet reduction, 20–25% max/avg improvement) aren't independently audited, but graphs are consistent with expected behavior of weighted-load-balancing under heterogeneous hardware.
  • gRPC weighted-RR workaround is a snapshot in time. gRPC upstream may eventually support endpoint-weight-aware WRR natively; Dropbox's EDF-based picker is a forward-compatible stopgap.
  • Cross-DC routing is not feedback-controlled. Locality config is static (with hot reload). Automatic cross-DC rebalancing is not in scope.
  • "GPU workload motivation" is stated but not quantified. The post names AI/LLM workloads and GPU resource management as a reason the effort continues, but does not give GPU-specific numbers.
