Zalando — Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster¶
Summary¶
Zalando's platform team runs Skipper as
the default Kubernetes Ingress proxy across 200 clusters with
~180 Skipper instances per cluster serving up to 2M
requests/second against 15,000 Ingresses + 5,000 RouteGroups.
Each Skipper pod independently polled the Kubernetes API for
Ingress and RouteGroup resources. At ~180 replicas that
fan-out became a structural load on etcd and the API server:
etcd was overwhelmed, the API server CPU-throttled, and the
control plane lost the ability to schedule new pods — a
scheduler-level failure reachable by an ingress-fleet growth
curve, not by any per-request load. The remediation was to
insert a new proxy tier — Route Server (the `routesrv`
package on pkg.go.dev)
— between Skipper and the Kubernetes API. Route Server polls the
API once every 3 seconds, parses routes into Eskip, and serves
the compiled routing table to all Skipper instances behind an
HTTP ETag cache: Skipper sends its current ETag; if
unchanged, Route Server replies 304 Not Modified. If Route
Server is unreachable after startup, Skipper keeps serving its
last-known-good routing table (availability ≫ consistency
on the data plane). The change was rolled out tier by tier
across clusters through three explicit flag modes: False
(off), Pre (shadow: run routesrv alongside and diff routing
tables), and Exec (production: Skipper fetches from
routesrv). Results: zero
downtime, zero GMV loss, and Skipper HPA extended from ~180 to
300 pods per cluster, with one Route Server deployment
(capacity ~100 rps) comfortably serving ~300 Skippers.
Key takeaways¶
- Control-plane fan-out of many identical watchers to the
Kubernetes API is itself the overload vector (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
At ~180 Skipper instances per cluster, each polling the API
for the same `Ingress` + `RouteGroup` set, etcd was
"overwhelmed" and the API server's CPU was throttled,
producing control-plane stability risk — "our clusters lost
the ability to schedule new pods effectively, and existing
pod management operations began to fail." The outage surface
is the scheduler, not the ingress data plane. Canonical
instance of concepts/control-plane-fan-out-to-kubernetes-api.
- Insert a single coalescing proxy in front of the shared
dependency (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Route Server (`routesrv`) polls the Kubernetes API every
3 seconds, parses once, and serves the cached routing table
to all ~300 Skippers. The polling rate against the API drops
from `N × once-per-Skipper-poll` to exactly `1 × 3 s`.
- HTTP ETag + 304 is the on-wire protocol (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Skipper includes its current ETag in each update request;
Route Server compares to its computed ETag of the current
routing table and replies `304 Not Modified` when they match,
skipping a payload the size of 15k Ingresses + 5k
RouteGroups. The full payload is sent only on change.
Canonical instance of concepts/etag-conditional-polling
applied to an internal control-plane channel.
- Last-known-good routing table is the availability
guarantee (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
If Route Server is unreachable after Skipper has loaded an
initial routing table, Skipper "will continue to work with
the last known routing table" — "a trade-off between
availability and consistency." Two failure modes are
explicitly enumerated: (a) Skipper starts with no routing
table → the container fails to start under
`-wait-first-route-load` (fail-closed at boot); (b) Skipper
has routes but routesrv goes away → keep serving stale routes
(fail-open post-boot). No automatic fallback to the
pre-routesrv direct-API polling exists yet.
concepts/last-known-good-routing-table.
- 3-second poll interval is the freshness budget
(Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
The interval is the upper bound on how long a new
`Ingress` or `RouteGroup` takes to reach Skipper's routing
table. It's the price paid for coalescing — clients behind a
3-second-polling proxy can't get sub-3-second freshness. The
post doesn't separately justify 3 seconds, but it's the
user-visible propagation floor for every route change at
Zalando. concepts/polling-interval-as-freshness-budget.
- Three-mode flag (`False`/`Pre`/`Exec`) is the rollout shape
(Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Mode `False` keeps the old direct-polling path. Mode `Pre`
runs routesrv in shadow: both Skipper's self-computed routing
table and routesrv's computed table are available via `curl`
endpoints, and operators `git diff` the two Eskip outputs to
catch divergence before any pod in production starts
consuming routesrv. Mode `Exec` switches Skipper to fetch
from routesrv as the production control plane. Clusters were
promoted tier by tier: test → production-low-tier → …
Canonical instance of
patterns/three-mode-rollout-off-shadow-exec; the shadow-diff
step is what let Zalando commit route-table equivalence with
no risk to GMV.
- Kubernetes Informers were explicitly rejected (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Informers (push-based watches) would still require the API
server to fan information out to all 180+ Skippers on every
change. "Since it's a sudden increase in traffic and HPA
won't be able to catch up and scale Kubernetes API and etcd"
— informer fan-out would reproduce the original overload
pattern at change events. The single-proxy coalescer
side-steps this entirely.
- Capacity numbers: one routesrv deployment handles up to 100 rps — equivalent to ~300 Skipper pods at 3-second intervals (300 / 3 s = 100 rps). Skipper's HPA ceiling was extended from ~180 pods (the overload threshold) to 300 pods as a direct consequence of the Route Server rollout.
Systems extracted¶
- systems/zalando-route-server — new. The Go proxy added
between Skipper and the Kubernetes API. Polls the API every
3 seconds, parses Ingress + RouteGroup into Eskip, exposes
an HTTP endpoint (`/routes`) with ETag semantics, and serves
all Skippers in the cluster. Package:
`github.com/zalando/skipper/routesrv`.
- systems/skipper-proxy — pre-existing. Stops watching the
Kubernetes API directly and becomes a Route Server client in
`Exec` mode.
- systems/kubernetes — pre-existing. The shared dependency
whose etcd and API-server CPU is the scaling bottleneck;
informer fan-out rejected as a remediation.
- systems/kube-ingress-aws-controller / systems/external-dns — pre-existing; mentioned as part of the surrounding ingress stack (ALB + DNS + TLS) without being changed.
Concepts extracted¶
- concepts/control-plane-fan-out-to-kubernetes-api — new. The anti-pattern of N identical watchers (Skipper-like data-plane pods) each independently polling or watching the API for the same resources, producing N× load on etcd / apiserver as a function of data-plane replica count.
- concepts/etag-conditional-polling — new. Client sends
last-seen ETag on each request; server replies
`304 Not Modified` if unchanged, full payload otherwise.
Applied here as the routesrv ↔ Skipper wire protocol.
- concepts/last-known-good-routing-table — new. The
availability-over-consistency fallback: a data-plane proxy
keeps serving the last routing table it received if its
control plane goes dark, on the theory that stale routes are
better than no routes.
- concepts/polling-interval-as-freshness-budget — new. The intentional consequence of a coalescing proxy: the poll interval is the user-visible lower bound on how long a config change takes to reach the data plane.
- concepts/thundering-herd — extended. Informer fan-out at change events is explicitly named as reproducing the thundering-herd shape against the API server + etcd, which is why Zalando rejected it as an alternative.
Patterns extracted¶
- patterns/control-plane-proxy-with-etag-cache — new. Decouple data-plane pods from a shared upstream (K8s API, auth server, config store) by inserting a single proxy that polls or watches upstream at its own cadence and serves downstream pods via HTTP ETag / 304. Converts an N× fan-out on the upstream into a 1× poll + N× 304-gated delta channel.
- patterns/three-mode-rollout-off-shadow-exec — new.
A three-position feature-flag shape for rolling out a
component that sits on the critical path: off (legacy
path), shadow (run new component in parallel, diff
outputs via observability), exec (new component is the
production control plane). The shadow mode is the
non-optional middle step — it's what lets the team commit
`routing-table-old == routing-table-new` before any traffic
depends on the new component.
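The three-position flag can be sketched as a small Go enum. The mode names follow the post; the `RouteSources` helper is a hypothetical illustration of which route sources each mode consults versus which one actually feeds the data plane, not Skipper's real configuration surface.

```go
package main

import "fmt"

// RoutesrvMode mirrors the three-position rollout flag from the post.
type RoutesrvMode int

const (
	ModeFalse RoutesrvMode = iota // legacy: poll the Kubernetes API directly
	ModePre                       // shadow: compute both tables, diff, serve legacy
	ModeExec                      // production: fetch routes from routesrv
)

// RouteSources reports which route sources a Skipper pod consults in a
// given mode, and which one serves production traffic. Hypothetical helper.
func RouteSources(m RoutesrvMode) (consulted []string, serving string) {
	switch m {
	case ModeFalse:
		return []string{"kubernetes-api"}, "kubernetes-api"
	case ModePre:
		// Both tables are exposed over HTTP so operators can diff the
		// two Eskip outputs before any production pod depends on routesrv.
		return []string{"kubernetes-api", "routesrv"}, "kubernetes-api"
	case ModeExec:
		return []string{"routesrv"}, "routesrv"
	}
	return nil, ""
}

func main() {
	for _, m := range []RoutesrvMode{ModeFalse, ModePre, ModeExec} {
		consulted, serving := RouteSources(m)
		fmt.Println(m, consulted, serving)
	}
}
```

The invariant worth noticing is in `ModePre`: the new component is consulted but never serving, which is exactly what makes the shadow diff risk-free.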
Operational numbers¶
| Metric | Value |
|---|---|
| Kubernetes clusters | 200 |
| Ingresses | 15,000 |
| RouteGroups | 5,000 |
| Peak traffic | up to 2,000,000 rps |
| Auth service-to-service share | 80–90% of traffic (500k–1M rps) |
| Skipper instances per cluster (pre-fix) | ~180 |
| Skipper HPA ceiling (post-fix) | 300 |
| Route Server poll interval | 3 seconds |
| Route Server capacity per deployment | ~100 rps (≈300 Skippers @ 3 s) |
| Test-cluster bake time before production | 2 weeks |
| Rollout modes | False / Pre / Exec |
| GMV loss during rollout | 0 |
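The capacity and poll-interval rows are linked by one piece of arithmetic, checked here as a tiny Go sketch (the constants are the table's own numbers):

```go
package main

import "fmt"

const (
	skipperPods = 300 // post-fix HPA ceiling
	pollSeconds = 3   // Route Server poll interval
	routesrvRPS = 100 // quoted capacity of one Route Server deployment
)

func main() {
	// Each Skipper polls once per interval, so the steady-state request
	// rate against one Route Server deployment is pods / interval.
	rps := skipperPods / pollSeconds
	fmt.Println(rps == routesrvRPS) // true: 300 / 3 s = 100 rps
}
```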
Caveats¶
- No measurements of the original overload are given. The post states the ~180-Skipper fan-out "began to overwhelm our etcd infrastructure" and caused API-server CPU throttling, but does not quote etcd req/s, API-server CPU %, or the precise Skipper-count threshold at which scheduling failed. The 300-pod HPA cap is quoted post-fix, not the pre-fix ceiling that triggered the project.
- No automatic fallback is implemented yet. The post names
"Automatic Fallback" as future work — today, Route Server
being unavailable at boot means Skipper fails to start
(with `-wait-first-route-load`); post-boot it runs on stale
routes until an operator "fix[es] the Route Server or
disable[s] it." No programmatic rollback to direct-API
polling exists.
- Route Server is itself a new SPOF. The post touches on this
(two failure scenarios, an explicit availability /
consistency trade-off) but does not give the Route Server
deployment shape — replica count, leader election, whether
multiple routesrv pods serve behind a Service, or failure
semantics when one routesrv pod dies mid-serve. "One RouteSRV
deployment can handle up to 100 RPS" implies a single
deployment per cluster but doesn't pin the replica count.
- 3-second poll interval is not justified. It could be lower (tighter freshness budget, more load on the API) or higher (less load, slower change propagation). No experiment / measurement is cited.
- Rollout categorisation is summarised, not detailed. Production tiers are referenced ("tier by tier") but not enumerated — there's no mapping from cluster category to risk profile or traffic share. The 2-week bake in test clusters is the only numeric rollout artefact.
- ETag granularity not specified. The ETag is computed over the full routing table, so any route change produces a 200 + full-payload response for all ~300 Skippers until they re-sync. At 20k total routes this is a substantial payload; the post doesn't quantify it or discuss incremental / per-route ETags.
- Alternatives-considered section is one paragraph. Informers are the only rejected alternative explicitly named. Watch proxies, event-stream caching proxies, or a push-based Route Server (gRPC stream to Skippers) are not discussed.
Source¶
- Original: https://engineering.zalando.com/posts/2025/02/scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster.html
- Raw markdown:
raw/zalando/2025-02-16-scaling-beyond-limits-harnessing-route-server-for-a-stable-c-bd443b75.md
Related¶
- systems/zalando-route-server · systems/skipper-proxy · systems/kubernetes · systems/kube-ingress-aws-controller · systems/external-dns
- concepts/control-plane-fan-out-to-kubernetes-api · concepts/etag-conditional-polling · concepts/last-known-good-routing-table · concepts/polling-interval-as-freshness-budget · concepts/thundering-herd
- patterns/control-plane-proxy-with-etag-cache · patterns/three-mode-rollout-off-shadow-exec
- companies/zalando