ZALANDO 2026-06-22

Client-Side Load Balancing at a Million Requests Per Second

Summary

Zalando's Product Read API (PRAPI) team replaced the shared Skipper ingress load balancer on their internal fan-out path with an in-process client-side load balancer (CSLB), eliminating over a million requests per second from shared infrastructure. The article details the full journey: building hash-ring parity with Skipper's xxHash64 algorithm, implementing Kubernetes watch-based service discovery, fixing a slow deployment pipeline, rolling out with percentage-based traffic shifting, inventing N-ring fade-in to eliminate cold-cache scale-up spikes, replacing in-flight request count with occupancy (seconds-of-work-per-second via Little's Law) as the bounded-load signal, experimenting with AZ-aware routing, and hardening the fan-out path with retries, FIFO buffering, and latency-weighted routing.

Key Takeaways

  1. Fan-out amplifies shared-infrastructure risk: A single batch request unpacks into 100 downstream calls through Skipper; latency tracks the slowest of 100 hops, not the median. Removing Skipper from this path eliminated latency spikes that had been misattributed for years (Source: "Skipper and the Fan-Out Problem" section).

  2. Hash parity is the critical migration constraint: Both Skipper and the new CSLB must produce identical consistent-hash rings (xxHash64, 100 virtual nodes per endpoint) to prevent cache fragmentation during the canary period. Unit tests pin this invariant (Source: "Building the Same Hash Ring" section).

  3. Watch-based discovery over polling: The team switched from polling the Kubernetes EndpointSlice API to a watch-based informer with a 2-second debounce that coalesces scaling events into single ring updates, avoiding control-plane overload (Source: "Kubernetes Discovery" section).

  4. Pipeline velocity enables safe experimentation: Median deployment time fell from 289 minutes to 128 minutes (worst case: 4 days 21 hours → ~2 hours) via build caching, collapsing manual traffic steps, and sequenced market-group rollout. Over 100 PRs were deployed in seven weeks (Source: "Fixing the Pipeline First" section).

  5. N-ring fade-in eliminates scale-up spikes: Each HPA scale event creates a new ring that fades in over 30 seconds on a ^2.5 power curve. Multiple concurrent scale events each get independent windows. Pods warm on exactly the traffic they'll serve at steady state (Source: "Eliminating Scale-Up Spikes" section).

  6. Occupancy > in-flight > throughput as a load signal: In-flight count is instantaneous and local, so it misses hot-cache pods racing through 1 ms hits. Throughput (requests/s) overstates load for fast responses. Occupancy (total_occupied_time / window_duration, i.e., Little's Law: L = λW) reveals true load and enables a looser balance factor (1.25 vs 1.10), reducing pod count by 25% (Source: "Taming Pod Occupancy with Bounded Load" section).

  7. Composite signal with latency weighting: effectiveLoad = max(inflight, occupancy) × min(podLatency / globalLatency, 5). Slow pods weigh more; stuck pods (no completed responses) get the full 5× cap immediately (Source: "Little's Law" subsection).

  8. Walk cap prevents ring-wide stampede: The bounded-load walk is capped at 10 hops; if no pod is under the threshold within 10 hops, the request is routed to the least-loaded pod seen. This prevents cascading redistribution during transient network events (Source: "Capping the Walk" section).

  9. AZ-aware routing, promising but paused: Local-zone routing reduces inter-AZ transfer cost but fragments caches. It also required a per-ring-weighted threshold for bounded load during fade-in. The work was paused due to edge-case bugs at the intersection of zone fade-in and N-ring scale-up (Source: "AZ-Aware Routing" section).

  10. Node-level freezes surfaced by owning telemetry: Adding destination pod+node to error logs revealed brief (2-3s) node-level network freezes that had been invisible for years. CSLB's latency multiplier and retry-to-different-node handle them automatically (Source: "Hardening the Fan-Out Path" section).
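Takeaways 6 to 8 can be read together as one routing decision. The following is a minimal sketch, not the article's implementation: the function names, the use of `None` to mark a pod with no completed responses, the threshold defined as balance factor times mean load, and counting the starting pod as the first of the 10 hops are all assumptions made for illustration. Only the formulas and the constants 5, 10, and 1.25 come from the takeaways above.

```python
from typing import Dict, List, Optional

LATENCY_CAP = 5.0      # cap on the latency multiplier (takeaway 7)
MAX_WALK = 10          # cap on the bounded-load walk (takeaway 8)
BALANCE_FACTOR = 1.25  # balance factor quoted for occupancy (takeaway 6)


def occupancy(total_occupied_seconds: float, window_seconds: float) -> float:
    """Seconds of work per second of wall clock (Little's Law, L = lambda * W)."""
    return total_occupied_seconds / window_seconds


def effective_load(inflight: float, occ: float,
                   pod_latency: Optional[float], global_latency: float) -> float:
    """max(inflight, occupancy) * min(podLatency / globalLatency, 5).

    pod_latency is None when the pod has completed no responses (assumed
    representation of a "stuck" pod); it then gets the full cap.
    global_latency is assumed to be greater than zero.
    """
    if pod_latency is None:
        multiplier = LATENCY_CAP
    else:
        multiplier = min(pod_latency / global_latency, LATENCY_CAP)
    return max(inflight, occ) * multiplier


def load_threshold(loads: Dict[str, float]) -> float:
    """Assumed threshold: balance factor times the mean load across pods."""
    return BALANCE_FACTOR * sum(loads.values()) / len(loads)


def pick_pod(ring_order: List[str], start: int,
             loads: Dict[str, float], threshold: float) -> str:
    """Walk clockwise from `start`, examining at most MAX_WALK pods.

    Returns the first pod under the threshold, otherwise the least-loaded
    pod seen during the capped walk.
    """
    best = ring_order[start % len(ring_order)]
    for hop in range(min(MAX_WALK, len(ring_order))):
        pod = ring_order[(start + hop) % len(ring_order)]
        if loads[pod] < threshold:
            return pod
        if loads[pod] < loads[best]:
            best = pod
    return best


if __name__ == "__main__":
    loads = {
        "pod-a": effective_load(4, occupancy(54.0, 60.0), 40.0, 10.0),  # slow pod
        "pod-b": effective_load(1, occupancy(30.0, 60.0), 10.0, 10.0),
        "pod-c": effective_load(1, occupancy(36.0, 60.0), 10.0, 10.0),
    }
    print(pick_pod(["pod-a", "pod-b", "pod-c"], 0, loads, load_threshold(loads)))
```

The cap matters when every pod near the hashed position is overloaded: instead of walking the whole ring and spreading one hot key's traffic everywhere, the walk stops and accepts the best candidate it has already seen.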

Operational Numbers

| Metric | Before | After |
| --- | --- | --- |
| Skipper fleet (PRAPI routes) | 50+ pods | 8 pods |
| Skipper daily cost | ~$450 | ~$110 |
| Pod occupancy range | 0.40–1.30 | 0.60–0.90 (then 1.0–1.5) |
| HPA threshold | 50% CPU | 65% CPU |
| Pod count reduction (occupancy) | baseline | −25% |
| Daily savings (occupancy) | | >$1,000/day |
| Deployment median time | 289 min | 128 min |
| Deployment worst case | 4 days 21 hrs | ~2 hours |
| PRs shipped in 7 weeks | | 100+ |
| Bounded-load walk p99 | | 4 hops |
| N-ring fade-in window | | 30 seconds |
| Fade-in curve | | ^2.5 (power) |
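The fade-in numbers above (30-second window, ^2.5 power curve) translate into a simple weight function. The sketch below is illustrative only: `fade_in_weight` follows the stated curve, while `choose_ring`, which tries rings from newest to oldest and falls back to the fully warmed base ring, is an assumed way of composing several rings that this summary does not describe.

```python
import random
from typing import Callable, Sequence

FADE_IN_SECONDS = 30.0   # fade-in window from the table
FADE_IN_EXPONENT = 2.5   # power-curve exponent from the table


def fade_in_weight(elapsed_seconds: float) -> float:
    """Share of traffic a new ring receives `elapsed_seconds` after its scale event."""
    progress = min(max(elapsed_seconds / FADE_IN_SECONDS, 0.0), 1.0)
    return progress ** FADE_IN_EXPONENT


def choose_ring(created_at: Sequence[float], now: float,
                rand: Callable[[], float] = random.random) -> int:
    """Pick a ring index; `created_at` is ordered oldest to newest.

    Index 0 is the base ring. Each newer ring has its own independent
    fade-in window and is tried newest first (an assumed composition).
    """
    for i in range(len(created_at) - 1, 0, -1):
        if rand() < fade_in_weight(now - created_at[i]):
            return i
    return 0


if __name__ == "__main__":
    for t in (0, 5, 15, 25, 30):
        print(t, round(fade_in_weight(t), 3))
```

Because the exponent is above 1, the curve starts slowly: halfway through the window a new ring receives only about 18% of its eventual traffic, which gives cold caches time to warm before the bulk of requests arrives.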

Architecture

  • Product-sets (batch component): unpacks batch → 100 parallel calls to Products pods
  • Before: Product-sets → Skipper (shared ingress) → Products pods
  • After: Product-sets → in-process CSLB (xxHash64 ring) → Products pods directly
  • Fallback: Skipper path retained as emergency off-switch via ConfigMap
  • Discovery: Kubernetes EndpointSlice watch-based informer, 2s debounce
  • Hash ring: xxHash64, 100 virtual nodes per endpoint, binary search for clockwise nearest
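The hash-ring bullet can be illustrated with a small sketch. Two things here are assumptions rather than facts from the article: the hash function is a stand-in (8-byte BLAKE2b from the Python standard library, not xxHash64, which the standard library does not provide), and the virtual-node key format `endpoint-i` is hypothetical. Real parity with Skipper would require xxHash64 and Skipper's exact key format. The 100 virtual nodes per endpoint and the binary search for the clockwise-nearest point follow the bullet above.

```python
import bisect
import hashlib
from typing import Iterable, List

VIRTUAL_NODES = 100  # virtual nodes per endpoint, as stated above


def hash64(key: str) -> int:
    """Stand-in 64-bit hash (BLAKE2b truncated to 8 bytes). NOT xxHash64."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    def __init__(self, endpoints: Iterable[str]) -> None:
        points = []
        for endpoint in endpoints:
            for i in range(VIRTUAL_NODES):
                # Hypothetical virtual-node key format, for illustration only.
                points.append((hash64(f"{endpoint}-{i}"), endpoint))
        points.sort()  # sorted ring, so the same endpoints always yield the same ring
        self._hashes: List[int] = [h for h, _ in points]
        self._owners: List[str] = [e for _, e in points]

    def lookup(self, key: str) -> str:
        """Return the endpoint owning the first ring point clockwise from the key."""
        if not self._hashes:
            raise LookupError("ring has no endpoints")
        idx = bisect.bisect_left(self._hashes, hash64(key))
        if idx == len(self._hashes):
            idx = 0  # wrap around past the largest point
        return self._owners[idx]


if __name__ == "__main__":
    ring = HashRing(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
    print(ring.lookup("product-12345"))
```

Sorting the points makes the ring independent of the order in which endpoints are discovered, which is the property a parity test between two implementations would need to pin.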

Caveats

  • AZ-aware routing is not in production; its economics (DynamoDB read increase vs inter-AZ savings) are unproven at normal load, and it may only pay for itself during peak events like Cyber Week.
  • The CSLB adds a new failure surface: stale rings if K8s API stalls, watch connections from hundreds of pods, RBAC for EndpointSlices, and on-call ownership of the load-balancer code.
  • Node-level freezes have been observed, but the root cause is unresolved (infrastructure team's domain).
  • The article explicitly advises against building your own CSLB unless you're at the specific edge case of >1M rps internal fan-out on a single path.
