Zalando — Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster¶
Summary¶
Zalando's platform team runs Skipper as
the default Kubernetes Ingress proxy across 200 clusters with
~180 Skipper instances per cluster serving up to 2M
requests/second against 15,000 Ingresses + 5,000 RouteGroups.
Each Skipper pod independently polled the Kubernetes API for
Ingress and RouteGroup resources. At ~180 replicas that
fan-out became a structural load on etcd and the API server:
etcd was overwhelmed, the API server CPU-throttled, and the
control plane lost the ability to schedule new pods — a
scheduler-level failure reachable by an ingress-fleet growth
curve, not by any per-request load. The remediation was to
insert a new proxy tier — Route Server (the `routesrv`
package on pkg.go.dev)
— between Skipper and the Kubernetes API. Route Server polls the
API once every 3 seconds, parses routes into Eskip, and serves
the compiled routing table to all Skipper instances behind an
HTTP ETag cache: Skipper sends its current ETag; if
unchanged, Route Server replies 304 Not Modified. If Route
Server is unreachable after startup, Skipper keeps serving its
last-known-good routing table (availability ≫ consistency
on the data plane). The change was rolled out tier by tier
across clusters through three explicit flag modes: False
(off), Pre (shadow: run routesrv alongside and diff routing
tables), and Exec (production: Skipper fetches from
routesrv). Results: zero
downtime, zero GMV loss, and Skipper HPA extended from ~180 to
300 pods per cluster, with one Route Server deployment
(capacity ~100 rps) comfortably serving ~300 Skippers.
Key takeaways¶
- Control-plane fan-out of many identical watchers to the
Kubernetes API is itself the overload vector (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
At ~180 Skipper instances per cluster, each polling the API
for the same `Ingress` + `RouteGroup` set, etcd was
"overwhelmed" and the API server's CPU was throttled,
producing control-plane stability risk — "our clusters lost
the ability to schedule new pods effectively, and existing
pod management operations began to fail." The outage surface
is the scheduler, not the ingress data plane. Canonical
instance of concepts/control-plane-fan-out-to-kubernetes-api.
- Insert a single coalescing proxy in front of the shared
dependency (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Route Server (`routesrv`) polls the Kubernetes API every
3 seconds, parses once, and serves the cached routing table
to all ~300 Skippers. The polling rate against the API drops
from `N × once-per-Skipper-poll` to exactly `1 × 3 s`.
- HTTP ETag + 304 is the on-wire protocol (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Skipper includes its current ETag in each update request;
Route Server compares to its computed ETag of the current
routing table and replies `304 Not Modified` when they match,
skipping a payload the size of 15k Ingresses + 5k
RouteGroups. The full payload is sent only on change.
Canonical instance of concepts/etag-conditional-polling
applied to an internal control-plane channel.
- Last-known-good routing table is the availability
guarantee (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
If Route Server is unreachable after Skipper has loaded an
initial routing table, Skipper "will continue to work with
the last known routing table" — "a trade-off between
availability and consistency." Two failure modes are
explicitly enumerated: (a) Skipper starts with no routing
table → the container fails to start under
`-wait-first-route-load` (fail-closed at boot); (b) Skipper
has routes but routesrv goes away → keep serving stale routes
(fail-open post-boot). No automatic fallback to the
pre-routesrv direct-API polling exists yet.
concepts/last-known-good-routing-table.
- 3-second poll interval is the freshness budget
(Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
The interval is the upper bound on how long a new
`Ingress` or `RouteGroup` takes to reach Skipper's routing
table. It's the price paid for coalescing — clients behind a
3-second-polling proxy can't get sub-3-second freshness. The
post doesn't separately justify 3 seconds, but it's the
user-visible propagation floor for every route change at
Zalando. concepts/polling-interval-as-freshness-budget.
- Three-mode flag (`False`/`Pre`/`Exec`) is the rollout shape
(Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Mode `False` keeps the old direct-polling path. Mode `Pre`
runs routesrv in shadow: both Skipper's self-computed routing
table and routesrv's computed table are available via `curl`
endpoints, and operators `git diff` the two Eskip outputs to
catch divergence before any pod in production starts
consuming routesrv. Mode `Exec` switches Skipper to fetch
from routesrv as the production control plane. Clusters were
promoted tier by tier: test → production-low-tier → …
Canonical instance of
patterns/three-mode-rollout-off-shadow-exec; the shadow-diff
step is what let Zalando commit route-table equivalence with
no risk to GMV.
- Kubernetes Informers were explicitly rejected (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Informers (push-based watches) would still require the API
server to fan information out to all 180+ Skippers on every
change. "Since it's a sudden increase in traffic and HPA
won't be able to catch up and scale Kubernetes API and etcd"
— informer fan-out would reproduce the original overload
pattern at change events. The single-proxy coalescer
side-steps this entirely.
- Capacity numbers: one routesrv deployment handles up to 100 rps — equivalent to ~300 Skipper pods at 3-second intervals (300 / 3 s = 100 rps). Skipper's HPA ceiling was extended from ~180 pods (the overload threshold) to 300 pods as a direct consequence of the Route Server rollout.
Systems extracted¶
- systems/zalando-route-server — new. The Go proxy added
between Skipper and the Kubernetes API. Polls the API every
3 seconds, parses Ingress + RouteGroup into Eskip, exposes
an HTTP endpoint (`/routes`) with ETag semantics, and serves
all Skippers in the cluster. Package:
`github.com/zalando/skipper/routesrv`.
- systems/skipper-proxy — pre-existing. Stops watching the
Kubernetes API directly and becomes a Route Server client in
`Exec` mode.
- systems/kubernetes — pre-existing. The shared dependency
whose etcd and API-server CPU is the scaling bottleneck;
informer fan-out rejected as a remediation.
- systems/kube-ingress-aws-controller / systems/external-dns — pre-existing; mentioned as part of the surrounding ingress stack (ALB + DNS + TLS) without being changed.
Concepts extracted¶
- concepts/control-plane-fan-out-to-kubernetes-api — new. The anti-pattern of N identical watchers (Skipper-like data-plane pods) each independently polling or watching the API for the same resources, producing N× load on etcd / apiserver as a function of data-plane replica count.
- concepts/etag-conditional-polling — new. Client sends
last-seen ETag on each request; server replies
`304 Not Modified` if unchanged, full payload otherwise.
Applied here as the routesrv ↔ Skipper wire protocol.
- concepts/last-known-good-routing-table — new. The
availability-over-consistency fallback: a data-plane proxy
keeps serving the last routing table it received if its
control plane goes dark, on the theory that stale routes are
better than no routes.
- concepts/polling-interval-as-freshness-budget — new. The intentional consequence of a coalescing proxy: the poll interval is the user-visible lower bound on how long a config change takes to reach the data plane.
- concepts/thundering-herd — extended. Informer fan-out at change events is explicitly named as reproducing the thundering-herd shape against the API server + etcd, which is why Zalando rejected it as an alternative.
Patterns extracted¶
- patterns/control-plane-proxy-with-etag-cache — new. Decouple data-plane pods from a shared upstream (K8s API, auth server, config store) by inserting a single proxy that polls or watches upstream at its own cadence and serves downstream pods via HTTP ETag / 304. Converts an N× fan-out on the upstream into a 1× poll + N× 304-gated delta channel.
- patterns/three-mode-rollout-off-shadow-exec — new.
A three-position feature-flag shape for rolling out a
component that sits on the critical path: off (legacy
path), shadow (run new component in parallel, diff
outputs via observability), exec (new component is the
production control plane). The shadow mode is the
non-optional middle step — it's what lets the team commit
`routing-table-old == routing-table-new` before any traffic
depends on the new component.
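The three-position flag can be sketched as a small Go enum. The mode names follow the post; the `RouteSources` helper is a hypothetical illustration of which route sources each mode consults versus which one actually feeds the data plane, not Skipper's real configuration surface.

```go
package main

import "fmt"

// RoutesrvMode mirrors the three-position rollout flag from the post.
type RoutesrvMode int

const (
	ModeFalse RoutesrvMode = iota // legacy: poll the Kubernetes API directly
	ModePre                       // shadow: compute both tables, diff, serve legacy
	ModeExec                      // production: fetch routes from routesrv
)

// RouteSources reports which route sources a Skipper pod consults in a
// given mode, and which one serves production traffic. Hypothetical helper.
func RouteSources(m RoutesrvMode) (consulted []string, serving string) {
	switch m {
	case ModeFalse:
		return []string{"kubernetes-api"}, "kubernetes-api"
	case ModePre:
		// Both tables are exposed over HTTP so operators can diff the
		// two Eskip outputs before any production pod depends on routesrv.
		return []string{"kubernetes-api", "routesrv"}, "kubernetes-api"
	case ModeExec:
		return []string{"routesrv"}, "routesrv"
	}
	return nil, ""
}

func main() {
	for _, m := range []RoutesrvMode{ModeFalse, ModePre, ModeExec} {
		consulted, serving := RouteSources(m)
		fmt.Println(m, consulted, serving)
	}
}
```

The invariant worth noticing is in `ModePre`: the new component is consulted but never serving, which is exactly what makes the shadow diff risk-free.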
Operational numbers¶
| Metric | Value |
|---|---|
| Kubernetes clusters | 200 |
| Ingresses | 15,000 |
| RouteGroups | 5,000 |
| Peak traffic | up to 2,000,000 rps |
| Auth service-to-service share | 80–90% of traffic (500k–1M rps) |
| Skipper instances per cluster (pre-fix) | ~180 |
| Skipper HPA ceiling (post-fix) | 300 |
| Route Server poll interval | 3 seconds |
| Route Server capacity per deployment | ~100 rps (≈300 Skippers @ 3 s) |
| Test-cluster bake time before production | 2 weeks |
| Rollout modes | False / Pre / Exec |
| GMV loss during rollout | 0 |
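The capacity and poll-interval rows are linked by one piece of arithmetic, checked here as a tiny Go sketch (the constants are the table's own numbers):

```go
package main

import "fmt"

const (
	skipperPods = 300 // post-fix HPA ceiling
	pollSeconds = 3   // Route Server poll interval
	routesrvRPS = 100 // quoted capacity of one Route Server deployment
)

func main() {
	// Each Skipper polls once per interval, so the steady-state request
	// rate against one Route Server deployment is pods / interval.
	rps := skipperPods / pollSeconds
	fmt.Println(rps == routesrvRPS) // true: 300 / 3 s = 100 rps
}
```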
Caveats¶
- No measurements of the original overload are given. The post states the ~180-Skipper fan-out "began to overwhelm our etcd infrastructure" and caused API-server CPU throttling, but does not quote etcd req/s, API-server CPU %, or the precise Skipper-count threshold at which scheduling failed. The 300-pod HPA cap is quoted post-fix, not the pre-fix ceiling that triggered the project.
- No automatic fallback is implemented yet. The post names
"Automatic Fallback" as future work — today, Route Server
being unavailable at boot means Skipper fails to start
(with `-wait-first-route-load`); post-boot it runs on stale
routes until an operator "fix[es] the Route Server or
disable[s] it." No programmatic rollback to direct-API
polling exists.
- Route Server is itself a new SPOF. The post touches on this
(two failure scenarios, an explicit availability /
consistency trade-off) but does not give the Route Server
deployment shape — replica count, leader election, whether
multiple routesrv pods serve behind a Service, or failure
semantics when one routesrv pod dies mid-serve. "One RouteSRV
deployment can handle up to 100 RPS" implies a single
deployment per cluster but doesn't pin the replica count.
- 3-second poll interval is not justified. It could be lower (tighter freshness budget, more load on the API) or higher (less load, slower change propagation). No experiment / measurement is cited.
- Rollout categorisation is summarised, not detailed. Production tiers are referenced ("tier by tier") but not enumerated — there's no mapping from cluster category to risk profile or traffic share. The 2-week bake in test clusters is the only numeric rollout artefact.
- ETag granularity not specified. The ETag is computed over the full routing table, so any route change produces a 200 + full-payload response for all ~300 Skippers until they re-sync. At 20k total routes this is a substantial payload; the post doesn't quantify it or discuss incremental / per-route ETags.
- Alternatives-considered section is one paragraph. Informers are the only rejected alternative explicitly named. Watch proxies, event-stream caching proxies, or a push-based Route Server (gRPC stream to Skippers) are not discussed.
Source¶
- Original: https://engineering.zalando.com/posts/2025/02/scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster.html
- Raw markdown:
raw/zalando/2025-02-16-scaling-beyond-limits-harnessing-route-server-for-a-stable-c-bd443b75.md
Related¶
- systems/zalando-route-server · systems/skipper-proxy · systems/kubernetes · systems/kube-ingress-aws-controller · systems/external-dns
- concepts/control-plane-fan-out-to-kubernetes-api · concepts/etag-conditional-polling · concepts/last-known-good-routing-table · concepts/polling-interval-as-freshness-budget · concepts/thundering-herd
- patterns/control-plane-proxy-with-etag-cache · patterns/three-mode-rollout-off-shadow-exec
- companies/zalando