Zalando Route Server (routesrv)¶
Definition¶
Route Server (package: `github.com/zalando/skipper/routesrv`)
is a Go proxy that sits between Skipper
and the Kubernetes API server and serves
compiled routing tables to all Skipper instances in a cluster. It
exists so data-plane Skippers do not each maintain their own watch
or poll against the Kubernetes API — at Zalando's fleet size
(~180–300 Skippers per cluster) that fan-out overwhelms etcd and
CPU-throttles the API server (Source:
sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).
Responsibilities¶
- Poll the Kubernetes API for `Ingress` and `RouteGroup` resources on a 3-second interval.
- Parse and validate resources into Skipper's Eskip routing DSL.
- Compute an ETag over the resulting routing table.
- Serve the routing table via HTTP (`/routes`) with ETag conditional polling: Skipper sends `If-None-Match: <current-etag>`; Route Server replies `304 Not Modified` if the ETag matches, or `200 OK` with the full Eskip payload otherwise.
- Absorb the N× fan-out of ~300 Skippers into a single polling cadence against the API server.
Key numbers¶
| Property | Value |
|---|---|
| Poll interval upstream | 3 seconds |
| Capacity per deployment | up to 100 rps |
| Skippers served per deploy | ≈ 300 (300 pods ÷ 3 s poll interval ≈ 100 rps) |
| Skipper HPA ceiling (after) | 300 pods (up from ~180) |
| Deployed at | Zalando's 200 K8s clusters |
Operating modes (from the rollout flag)¶
- False — Route Server disabled. Skipper polls the Kubernetes API directly; this is the pre-migration path.
- Pre (shadow) — Route Server runs alongside Skipper, but Skipper still polls the API directly. Operators compare the two routing tables with:

  ```shell
  curl 'http://127.0.0.1:9911/routes?limit=10000000000000&nopretty' > skipper_routes.eskip
  curl 'http://127.0.0.1:9090/routes' > routesrv_routes.eskip
  git diff --no-index -- skipper_routes.eskip routesrv_routes.eskip
  ```

  No production traffic depends on routesrv in this mode; it is the shadow stage of the three-mode rollout.
- Exec — Skipper fetches routes from Route Server as the production control plane. Direct API polling from Skipper is disabled.
Failure modes (Route Server unavailable)¶
Two scenarios are enumerated in the launch post:
- Cold start, no initial routing table: with the Skipper flag `-wait-first-route-load` enabled, the Skipper container fails to start. Fail-closed at boot.
- Running Skipper, routesrv goes away: Skipper keeps serving the last known routing table (concepts/last-known-good-routing-table) — an explicit availability-over-consistency trade-off. Stale routes are preferred to no routes. An alert fires; an operator decides whether to fix routesrv or disable it. No automatic fallback to direct-API polling exists today (listed as future work).
Why not Kubernetes Informers?¶
Zalando explicitly rejected informers as the alternative (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster): "this approach would still require Kubernetes API to send information to all Skipper instances, which may lead to the same issues we faced. Since it's a sudden increase in traffic and HPA won't be able to catch up and scale Kubernetes API and etcd." Informers convert the N× poll into an N× push — the API server still talks to 300 clients at change events, which preserves the thundering-herd shape on the control plane. The single-proxy coalescer side-steps this entirely.
Seen in¶
- sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster
— the introduction post. Establishes Route Server as the
remediation for concepts/control-plane-fan-out-to-kubernetes-api
at Zalando's 200-cluster, 15k-ingress, 5k-routegroup,
2M-rps scale. Rolled out via
`False` → `Pre` → `Exec` flag tier by tier, with a 2-week test-cluster bake; zero GMV loss.
Comparable systems¶
- Kubernetes Informer / watch multiplexers (e.g. kube-state-metrics, shared-informer caches) — in-process solutions; routesrv is an out-of-process coalescing proxy with an HTTP+ETag wire protocol.
- Envoy xDS servers — similar "central control plane serves many proxies" pattern, but xDS uses gRPC streams instead of HTTP polling with ETag. routesrv's choice is simpler at the cost of a 3-second freshness floor (concepts/polling-interval-as-freshness-budget).
- External DNS — another controller-style single consumer of Kubernetes API in Zalando's stack. Unlike routesrv it writes to an external system (DNS provider) rather than fanning state back to many data-plane pods.
Related¶
- systems/skipper-proxy · systems/kubernetes · systems/kube-ingress-aws-controller · systems/external-dns
- concepts/control-plane-fan-out-to-kubernetes-api · concepts/etag-conditional-polling · concepts/last-known-good-routing-table · concepts/polling-interval-as-freshness-budget
- patterns/control-plane-proxy-with-etag-cache · patterns/three-mode-rollout-off-shadow-exec
- companies/zalando