Client-Side Load Balancing at a Million Requests Per Second
Summary
Zalando's Product Read API (PRAPI) team replaced the shared Skipper ingress load balancer on their internal fan-out path with an in-process client-side load balancer (CSLB), eliminating over a million requests per second from shared infrastructure. The article details the full journey: building hash-ring parity with Skipper's xxHash64 algorithm, implementing Kubernetes watch-based service discovery, fixing a slow deployment pipeline, rolling out with percentage-based traffic shifting, inventing N-ring fade-in to eliminate cold-cache scale-up spikes, replacing in-flight request count with occupancy (seconds-of-work-per-second via Little's Law) as the bounded-load signal, experimenting with AZ-aware routing, and hardening the fan-out path with retries, FIFO buffering, and latency-weighted routing.
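To make the occupancy signal concrete, here is a minimal Python sketch of "seconds of work per second" over a measurement window. The `Request` type, the `occupancy` function, and the traffic numbers are hypothetical and chosen for illustration; the article's actual implementation is not shown in this summary.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: float  # seconds, when the pod began working on the request
    end: float    # seconds, when the response completed


def occupancy(requests, window_start, window_end):
    """Return seconds of work per second of wall-clock time in the window."""
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window must have positive duration")
    busy = 0.0
    for request in requests:
        # Count only the part of each request that overlaps the window.
        overlap = min(request.end, window_end) - max(request.start, window_start)
        if overlap > 0:
            busy += overlap
    return busy / duration


# Hot-cache pod: 500 requests in one second, 1 ms each (hypothetical numbers).
hot = [Request(i * 0.002, i * 0.002 + 0.001) for i in range(500)]
# Cold-cache pod: 10 requests in one second, 100 ms each (hypothetical numbers).
cold = [Request(i * 0.1, i * 0.1 + 0.1) for i in range(10)]

hot_occupancy = occupancy(hot, 0.0, 1.0)    # about 0.5 despite 500 requests/s
cold_occupancy = occupancy(cold, 0.0, 1.0)  # about 1.0 with only 10 requests/s
```

By Little's Law (L = λW), the hot pod carries 500 × 0.001 = 0.5 and the cold pod 10 × 0.1 = 1.0, so the pod with 50 times less throughput is the busier one.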
Key Takeaways
- Fan-out amplifies shared-infrastructure risk: A single batch request unpacks into 100 downstream calls through Skipper; latency tracks the slowest of 100 hops, not the median. Removing Skipper from this path eliminated latency spikes that had been misattributed for years (Source: "Skipper and the Fan-Out Problem" section).
- Hash parity is the critical migration constraint: Both Skipper and the new CSLB must produce identical consistent-hash rings (xxHash64, 100 virtual nodes per endpoint) to prevent cache fragmentation during the canary period. Unit tests pin this invariant (Source: "Building the Same Hash Ring" section).
- Watch-based discovery over polling: Switched from polling Kubernetes EndpointSlice API to a watch-based informer with 2-second debounce to coalesce scaling events into single ring updates, avoiding control-plane overload (Source: "Kubernetes Discovery" section).
- Pipeline velocity enables safe experimentation: Median deployment time fell from 289 minutes to 128 minutes (worst case: 5 days → ~2 hours) via build caching, collapsing manual traffic steps, and sequenced market-group rollout. Over 100 PRs deployed in seven weeks (Source: "Fixing the Pipeline First" section).
- N-ring fade-in eliminates scale-up spikes: Each HPA scale event creates a new ring that fades in over 30 seconds on a ^2.5 power curve. Multiple concurrent scale events each get independent windows. Pods warm on exactly the traffic they'll serve at steady state (Source: "Eliminating Scale-Up Spikes" section).
- Occupancy > in-flight > throughput as a load signal: In-flight is instantaneous and local (misses hot-cache pods racing through 1ms hits). Throughput (requests/s) overstates load for fast responses. Occupancy (total_occupied_time / window_duration, i.e., Little's Law: L=λW) reveals true load and enables a looser balance factor (1.25 vs 1.10), reducing pod count 25% (Source: "Taming Pod Occupancy with Bounded Load" section).
- Composite signal with latency weighting: `effectiveLoad = max(inflight, occupancy) × min(podLatency / globalLatency, 5)`. Slow pods weigh more; stuck pods (no completed responses) get the full 5× cap immediately (Source: "Little's Law" subsection).
- Walk cap prevents ring-wide stampede: Bounded-load walk is capped at 10 hops; if no pod is under threshold within 10, route to least-loaded seen. Prevents cascading redistribution during transient network events (Source: "Capping the Walk" section).
- AZ-aware routing: promising but paused: Local-zone routing reduces inter-AZ transfer cost but fragments caches. Required a per-ring-weighted threshold for bounded load during fade-in. Paused due to edge-case bugs at the intersection of zone fade-in and N-ring scale-up (Source: "AZ-Aware Routing" section).
- Node-level freezes surfaced by owning telemetry: Adding destination pod+node to error logs revealed brief (2-3s) node-level network freezes that had been invisible for years. CSLB's latency multiplier and retry-to-different-node handle them automatically (Source: "Hardening the Fan-Out Path" section).
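The composite load signal and the capped walk from the takeaways above can be sketched together in Python. This is an illustration under assumptions, not the team's code: `PodStats`, `effective_load`, and `pick_pod` are hypothetical names, the home pod is counted as the first of the 10 candidates, and the threshold is passed in by the caller (for example the balance factor times the mean load, which is the usual bounded-load convention rather than something this summary states).

```python
from dataclasses import dataclass
from typing import Optional, Sequence

LATENCY_CAP = 5.0  # maximum latency multiplier, per the formula above
MAX_HOPS = 10      # maximum number of candidates inspected per request


@dataclass
class PodStats:
    name: str
    inflight: float
    occupancy: float
    latency: Optional[float]  # mean latency of completed responses; None if none yet


def effective_load(pod: PodStats, global_latency: float) -> float:
    """max(inflight, occupancy) scaled by a capped relative-latency multiplier."""
    if pod.latency is None:
        multiplier = LATENCY_CAP  # stuck pod: no completed responses
    elif global_latency <= 0:
        multiplier = 1.0          # assumption: no global baseline yet, stay neutral
    else:
        multiplier = min(pod.latency / global_latency, LATENCY_CAP)
    return max(pod.inflight, pod.occupancy) * multiplier


def pick_pod(ring_order: Sequence[PodStats], start: int,
             threshold: float, global_latency: float) -> PodStats:
    """Walk clockwise from the key's home pod, inspecting at most MAX_HOPS pods."""
    if not ring_order:
        raise ValueError("ring is empty")
    best, best_load = None, float("inf")
    for hop in range(min(MAX_HOPS, len(ring_order))):
        pod = ring_order[(start + hop) % len(ring_order)]
        load = effective_load(pod, global_latency)
        if load < threshold:
            return pod  # first pod under the bound wins
        if load < best_load:
            best, best_load = pod, load
    return best  # nobody under the bound: least-loaded pod seen so far
```

Because the walk stops after a fixed number of candidates, a transient event that pushes many pods over the threshold shifts each key by at most a few positions instead of sweeping traffic around the whole ring.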
Operational Numbers
| Metric | Before | After |
|---|---|---|
| Skipper fleet (PRAPI routes) | 50+ pods | 8 pods |
| Skipper daily cost | ~$450 | ~$110 |
| Pod occupancy range | 0.40–1.30 | 0.60–0.90 (then 1.0–1.5) |
| HPA threshold | 50% CPU | 65% CPU |
| Pod count reduction (occupancy) | baseline | −25% |
| Daily savings (occupancy) | — | >$1,000/day |
| Deployment median time | 289 min | 128 min |
| Deployment worst case | 4 days 21 hrs | ~2 hours |
| PRs shipped in 7 weeks | — | 100+ |
| Bounded-load walk p99 | — | 4 hops |
| N-ring fade-in window | — | 30 seconds |
| Fade-in curve | — | ^2.5 (power) |
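The fade-in window and curve in the table can be read as a weight that rises from 0 to 1 over 30 seconds. The Python sketch below shows that weight and one possible way to pick among several rings; `fade_weight` and `choose_ring` are hypothetical names, and the newest-first selection with one independent draw per ring is an assumption, since this summary does not describe how the rings are combined.

```python
FADE_SECONDS = 30.0  # fade-in window from the table
EXPONENT = 2.5       # power-curve exponent from the table


def fade_weight(elapsed: float) -> float:
    """Share of traffic a ring may claim, `elapsed` seconds after it was created."""
    progress = min(max(elapsed / FADE_SECONDS, 0.0), 1.0)
    return progress ** EXPONENT


def choose_ring(ring_created_at, now, draw):
    """Return the index of the ring to route with.

    ring_created_at: creation times in seconds, oldest first; index 0 is the
                     fully established base ring.
    draw:            callable returning a uniform value in [0, 1), for example
                     random.random. Assumption: one independent draw per ring.
    """
    # Try the newest ring first; each fading ring claims the request with
    # probability equal to its current weight, otherwise fall through.
    for index in range(len(ring_created_at) - 1, 0, -1):
        if draw() < fade_weight(now - ring_created_at[index]):
            return index
    return 0


halfway = fade_weight(15.0)  # about 0.18: the curve starts slowly
```

With an exponent of 2.5, a ring at the midpoint of its window receives roughly 18% of its eventual share, so new pods see only a trickle while their caches are coldest.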
Architecture
- Product-sets (batch component): unpacks batch → 100 parallel calls to Products pods
- Before: Product-sets → Skipper (shared ingress) → Products pods
- After: Product-sets → in-process CSLB (xxHash64 ring) → Products pods directly
- Fallback: Skipper path retained as emergency off-switch via ConfigMap
- Discovery: Kubernetes EndpointSlice watch-based informer, 2s debounce
- Hash ring: xxHash64, 100 virtual nodes per endpoint, binary search for clockwise nearest
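The ring structure in the last bullet can be sketched in a few lines of Python. Two things here are assumptions for illustration: `hash64` uses BLAKE2b truncated to 64 bits as a standard-library stand-in for xxHash64, and the virtual-node key format `f"{endpoint}-{i}"` is invented. Real parity with Skipper would require the same hash function and the same key format, neither of which this sketch reproduces.

```python
import bisect
import hashlib

VIRTUAL_NODES = 100  # virtual nodes per endpoint, as stated above


def hash64(data: str) -> int:
    """64-bit hash. Stand-in for xxHash64, which is not in the standard library."""
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    def __init__(self, endpoints):
        points = []
        for endpoint in endpoints:
            for i in range(VIRTUAL_NODES):
                # Hypothetical virtual-node key format.
                points.append((hash64(f"{endpoint}-{i}"), endpoint))
        points.sort()  # sorted positions make the ring independent of input order
        self._hashes = [position for position, _ in points]
        self._owners = [owner for _, owner in points]

    def lookup(self, key: str) -> str:
        """Return the endpoint owning the first ring position clockwise of the key."""
        if not self._hashes:
            raise LookupError("ring has no endpoints")
        index = bisect.bisect_left(self._hashes, hash64(key))  # binary search
        if index == len(self._hashes):
            index = 0  # wrap around past the highest position
        return self._owners[index]


ring = HashRing(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
owner = ring.lookup("sku-12345")  # always the same endpoint for the same key
```

Sorting the positions before lookup is what lets two independently built rings agree on every key, which is the property a parity unit test would pin.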
Caveats
- AZ-aware routing is not in production; its economics (DynamoDB read increase vs inter-AZ savings) are unproven at normal load, and it may only pay for itself during peak events like Cyber Week.
- The CSLB adds a new failure surface: stale rings if K8s API stalls, watch connections from hundreds of pods, RBAC for EndpointSlices, and on-call ownership of the load-balancer code.
- Node-level freezes are observed but root cause is unresolved (infrastructure team's domain).
- The article explicitly advises against building your own CSLB unless you're at the specific edge case of >1M rps internal fan-out on a single path.
Source
- Original: https://engineering.zalando.com/posts/2026/06/client-side-load-balancing.html
- Raw markdown: raw/zalando/2026-06-22-client-side-load-balancing-at-a-million-requests-per-second-43e91032.md