ZALANDO 2026-06-22

Client-Side Load Balancing at a Million Requests Per Second

Summary

Zalando's Product Read API (PRAPI) team replaced the shared Skipper ingress load balancer on their internal fan-out path with an in-process client-side load balancer (CSLB), eliminating over a million requests per second from shared infrastructure. The article details the full journey: building hash-ring parity with Skipper's xxHash64 algorithm, implementing Kubernetes watch-based service discovery, fixing a slow deployment pipeline, rolling out with percentage-based traffic shifting, inventing N-ring fade-in to eliminate cold-cache scale-up spikes, replacing in-flight request count with occupancy (seconds-of-work-per-second via Little's Law) as the bounded-load signal, experimenting with AZ-aware routing, and hardening the fan-out path with retries, FIFO buffering, and latency-weighted routing.

Key Takeaways

Fan-out amplifies shared-infrastructure risk: A single batch request unpacks into 100 downstream calls through Skipper; latency tracks the slowest of 100 hops, not the median. Removing Skipper from this path eliminated latency spikes that had been misattributed for years (Source: "Skipper and the Fan-Out Problem" section).
Hash parity is the critical migration constraint: Both Skipper and the new CSLB must produce identical consistent-hash rings (xxHash64, 100 virtual nodes per endpoint) to prevent cache fragmentation during the canary period. Unit tests pin this invariant (Source: "Building the Same Hash Ring" section).
Watch-based discovery over polling: The team switched from polling the Kubernetes EndpointSlice API to a watch-based informer with a 2-second debounce, which coalesces bursts of scaling events into single ring updates and avoids overloading the control plane (Source: "Kubernetes Discovery" section).
Pipeline velocity enables safe experimentation: Median deployment time fell from 289 minutes to 128 minutes (worst case: 4 days 21 hours → ~2 hours) via build caching, collapsing manual traffic steps, and sequenced market-group rollout. Over 100 PRs deployed in seven weeks (Source: "Fixing the Pipeline First" section).
N-ring fade-in eliminates scale-up spikes: Each HPA scale event creates a new ring that fades in over 30 seconds on a power curve with exponent 2.5. Concurrent scale events each get an independent fade-in window. Pods warm on exactly the traffic they'll serve at steady state (Source: "Eliminating Scale-Up Spikes" section).
Occupancy > in-flight > throughput as a load signal: In-flight count is instantaneous and local, so it misses hot-cache pods racing through 1 ms hits. Throughput (requests/s) overstates load for fast responses. Occupancy (total_occupied_time / window_duration, i.e., Little's Law, L = λW, where L is the average number of requests in service, λ the arrival rate, and W the average time per request) reveals true load and enables a looser balance factor (1.25 vs 1.10), reducing pod count by 25% (Source: "Taming Pod Occupancy with Bounded Load" section).
Composite signal with latency weighting: effectiveLoad = max(inflight, occupancy) × min(podLatency / globalLatency, 5). Slow pods weigh more; stuck pods (no completed responses) get the full 5× cap immediately (Source: "Little's Law" subsection).
Walk cap prevents ring-wide stampede: The bounded-load walk is capped at 10 hops; if no pod within those 10 hops is under the threshold, the request goes to the least-loaded pod seen during the walk. This prevents cascading redistribution during transient network events (Source: "Capping the Walk" section).
AZ-aware routing: promising but paused: Local-zone routing reduces inter-AZ transfer cost but fragments caches. Required a per-ring-weighted threshold for bounded load during fade-in. Paused due to edge-case bugs at the intersection of zone fade-in and N-ring scale-up (Source: "AZ-Aware Routing" section).
Node-level freezes surfaced by owning telemetry: Adding destination pod+node to error logs revealed brief (2-3s) node-level network freezes that had been invisible for years. CSLB's latency multiplier and retry-to-different-node handle them automatically (Source: "Hardening the Fan-Out Path" section).
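
The occupancy signal, the composite effectiveLoad formula, and the capped walk from the takeaways above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the PRAPI implementation: the PodStats container, its field names, the function names, and the way the threshold is derived (balance factor times mean load) are all hypothetical. The constants 5, 10, and 1.25 are the figures quoted in this summary.

```python
from dataclasses import dataclass

LATENCY_CAP = 5.0      # cap on the latency multiplier (from the summary)
MAX_WALK_HOPS = 10     # cap on the bounded-load walk (from the summary)
BALANCE_FACTOR = 1.25  # balance factor quoted for the occupancy signal


@dataclass
class PodStats:  # hypothetical container; field names are illustrative
    inflight: float          # requests currently outstanding to the pod
    occupied_seconds: float  # summed request durations within the window
    completed: int           # responses completed within the window
    latency_sum: float       # summed latency of completed responses (s)


def occupancy(pod, window_seconds):
    """Seconds of work per second: total_occupied_time / window_duration."""
    return pod.occupied_seconds / window_seconds


def effective_load(pod, window_seconds, global_latency):
    """max(inflight, occupancy) * min(podLatency / globalLatency, 5)."""
    if pod.completed == 0:
        # No completed responses: treat as stuck and apply the full cap.
        multiplier = LATENCY_CAP
    else:
        pod_latency = pod.latency_sum / pod.completed
        multiplier = min(pod_latency / global_latency, LATENCY_CAP)
    return max(pod.inflight, occupancy(pod, window_seconds)) * multiplier


def pick_pod(ring_order, loads, threshold, max_hops=MAX_WALK_HOPS):
    """Walk pods in ring order, starting at the key's owner.

    Returns the first pod at or under the threshold; if none is found
    within max_hops, returns the least-loaded pod seen during the walk.
    """
    best = None
    for pod in ring_order[:max_hops]:
        if loads[pod] <= threshold:
            return pod
        if best is None or loads[pod] < loads[best]:
            best = pod
    return best


stats = {
    "pod-a": PodStats(inflight=2, occupied_seconds=30.0, completed=100, latency_sum=20.0),
    "pod-b": PodStats(inflight=4, occupied_seconds=0.0, completed=0, latency_sum=0.0),
    "pod-c": PodStats(inflight=1, occupied_seconds=5.0, completed=50, latency_sum=5.0),
}
loads = {
    name: effective_load(s, window_seconds=10.0, global_latency=0.1)
    for name, s in stats.items()
}
# Assumed threshold: balance factor times the mean effective load.
threshold = BALANCE_FACTOR * sum(loads.values()) / len(loads)
chosen = pick_pod(["pod-b", "pod-a", "pod-c"], loads, threshold)
```

In this example pod-b has four requests in flight and no completed responses, so it receives the full 5× multiplier and the walk steps past it to pod-a, the next pod in ring order that sits under the threshold.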

Operational Numbers

Metric	Before	After
Skipper fleet (PRAPI routes)	50+ pods	8 pods
Skipper daily cost	~$450	~$110
Pod occupancy range	0.40–1.30	0.60–0.90 (then 1.0–1.5)
HPA threshold	50% CPU	65% CPU
Pod count reduction (occupancy)	baseline	−25%
Daily savings (occupancy)	—	>$1,000/day
Deployment median time	289 min	128 min
Deployment worst case	4 days 21 hrs	~2 hours
PRs shipped in 7 weeks	—	100+
Bounded-load walk p99	—	4 hops
N-ring fade-in window	—	30 seconds
Fade-in curve	—	^2.5 (power)
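
The fade-in window and curve in the table can be read as a weight that grows from 0 to 1 over 30 seconds. The sketch below assumes the weight is the elapsed fraction of the window raised to the power 2.5; the function names are hypothetical, and how the weights are combined across rings to pick a route is not covered by this summary, so it is left out.

```python
FADE_IN_SECONDS = 30.0   # fade-in window (from the table)
FADE_IN_EXPONENT = 2.5   # power-curve exponent (from the table)


def fade_in_weight(elapsed_seconds,
                   window=FADE_IN_SECONDS,
                   exponent=FADE_IN_EXPONENT):
    """Share of steady-state traffic a ring receives while fading in.

    Assumed form: (elapsed / window) ** exponent, clamped to [0, 1].
    """
    progress = min(max(elapsed_seconds / window, 0.0), 1.0)
    return progress ** exponent


def ring_weights(scale_event_times, now):
    """One independent fade-in window per scale event (times in seconds)."""
    return [fade_in_weight(now - t) for t in scale_event_times]


# Two scale events 15 seconds apart, observed 30 seconds after the first:
weights = ring_weights([0.0, 15.0], now=30.0)
```

Under this assumption the curve starts slowly: halfway through the window a new ring carries roughly 18% of its eventual share, which leaves most of the ramp for the second half, once caches have begun to fill.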

Architecture

Product-sets (batch component): unpacks batch → 100 parallel calls to Products pods
Before: Product-sets → Skipper (shared ingress) → Products pods
After: Product-sets → in-process CSLB (xxHash64 ring) → Products pods directly
Fallback: Skipper path retained as emergency off-switch via ConfigMap
Discovery: Kubernetes EndpointSlice watch-based informer, 2s debounce
Hash ring: xxHash64, 100 virtual nodes per endpoint, binary search for clockwise nearest
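
The hash-ring line above (virtual nodes plus a binary search for the clockwise-nearest point) can be sketched as follows. Two assumptions are worth stating loudly: Python's standard library has no xxHash64, so this sketch substitutes an 8-byte BLAKE2b digest and therefore does not have hash parity with Skipper; and the virtual-node key format ("endpoint-index") is hypothetical, since the summary does not say how Skipper derives it. The class and function names are illustrative.

```python
import bisect
import hashlib

VIRTUAL_NODES = 100  # virtual nodes per endpoint (from the summary)


def hash64(data):
    """64-bit hash. Stand-in for xxHash64, NOT compatible with Skipper."""
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    def __init__(self, endpoints, virtual_nodes=VIRTUAL_NODES):
        points = []
        for endpoint in endpoints:
            for i in range(virtual_nodes):
                # Hypothetical virtual-node key format.
                points.append((hash64(f"{endpoint}-{i}"), endpoint))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._owners = [ep for _, ep in points]

    def lookup(self, key):
        """Return the endpoint owning the first ring point at or after key."""
        if not self._hashes:
            raise ValueError("ring has no endpoints")
        idx = bisect.bisect_left(self._hashes, hash64(key))
        if idx == len(self._hashes):
            idx = 0  # wrap around past the last point on the ring
        return self._owners[idx]


ring3 = HashRing(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
ring4 = HashRing(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080",
                  "10.0.0.4:8080"])
keys = [f"sku-{i}" for i in range(1000)]
moved = [k for k in keys if ring3.lookup(k) != ring4.lookup(k)]
```

Adding a fourth endpoint only moves keys onto the new endpoint; every other key keeps its owner. That property is why two load balancers that build identical rings can share warm caches during a canary period.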

Caveats

AZ-aware routing is not in production. Its economics (increased DynamoDB reads vs inter-AZ transfer savings) are unproven at normal load, and it may only pay for itself during peak events such as Cyber Week.
The CSLB adds a new failure surface: stale rings if K8s API stalls, watch connections from hundreds of pods, RBAC for EndpointSlices, and on-call ownership of the load-balancer code.
Node-level freezes are observed but root cause is unresolved (infrastructure team's domain).
The article explicitly advises against building your own CSLB unless you're at the specific edge case of >1M rps internal fan-out on a single path.