Zalando Route Server (routesrv)

Definition

Route Server (package: github.com/zalando/skipper/routesrv) is a Go proxy that sits between Skipper and the Kubernetes API server and serves compiled routing tables to all Skipper instances in a cluster. It exists so data-plane Skippers do not each maintain their own watch or poll against the Kubernetes API — at Zalando's fleet size (~180–300 Skippers per cluster) that fan-out overwhelms etcd and CPU-throttles the API server (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster).

Responsibilities

  • Poll the Kubernetes API for Ingress and RouteGroup resources on a 3-second interval.
  • Parse and validate resources into Skipper's Eskip routing DSL.
  • Compute an ETag over the resulting routing table.
  • Serve the routing table via HTTP (/routes) with ETag conditional polling: Skipper sends If-None-Match: <current-etag> → Route Server replies 304 Not Modified if the ETag matches, or 200 OK with the full Eskip payload otherwise.
  • Absorb the N× fan-out of ~300 Skippers into a single cadence against the API server.

Key numbers

Property                      Value
Poll interval upstream        3 seconds
Capacity per deployment       up to 100 rps
Skippers served per deploy    300 (300 / 3 s = 100 rps)
Skipper HPA ceiling (after)   300 pods (up from ~180)
Deployed at                   Zalando's 200 K8s clusters

Operating modes (from the rollout flag)

  • False — Route Server disabled. Skipper polls the Kubernetes API directly; this is the pre-migration path.
  • Pre (shadow) — Route Server runs alongside Skipper but Skipper still polls the API directly. Operators compare routing tables with:
curl 'http://127.0.0.1:9911/routes?limit=10000000000000&nopretty' > skipper_routes.eskip
curl 'http://127.0.0.1:9090/routes' > routesrv_routes.eskip
git diff --no-index -- skipper_routes.eskip routesrv_routes.eskip

No production traffic depends on routesrv in this mode; it's the shadow stage of the three-mode rollout.
  • Exec — Skipper fetches routes from Route Server as the production control plane. Direct API polling from Skipper is disabled.
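
The three modes differ in whether Route Server runs and where Skipper reads its routes from. The sketch below encodes only that distinction. The Mode type, its string values, and the method names are hypothetical and are not taken from Skipper's or Zalando's configuration.

```go
package main

import "fmt"

// Mode is a hypothetical representation of the rollout flag.
type Mode string

const (
	ModeFalse Mode = "false" // Route Server disabled
	ModePre   Mode = "pre"   // shadow: Route Server runs, Skipper ignores it
	ModeExec  Mode = "exec"  // Route Server is the production route source
)

// routesrvRuns reports whether Route Server is deployed in this mode.
func (m Mode) routesrvRuns() bool {
	return m == ModePre || m == ModeExec
}

// skipperSource names where Skipper loads its routing table from.
func (m Mode) skipperSource() string {
	if m == ModeExec {
		return "routesrv"
	}
	return "kubernetes-api"
}

func main() {
	for _, m := range []Mode{ModeFalse, ModePre, ModeExec} {
		fmt.Printf("%-5s routesrv=%t source=%s\n", m, m.routesrvRuns(), m.skipperSource())
	}
}
```

Only the Pre-to-Exec step changes the production route source, which is why the table comparison is done in Pre.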

Failure modes (Route Server unavailable)

Two scenarios are enumerated in the launch post:

  1. Cold start, no initial routing table: with the Skipper flag -wait-first-route-load enabled, the Skipper container fails to start. Fail-closed at boot.
  2. Running Skipper, routesrv goes away: Skipper keeps serving the last known routing table (concepts/last-known-good-routing-table) — an explicit availability-over-consistency trade-off. Stale routes are preferred to no routes. An alert fires; an operator decides whether to fix routesrv or disable it. No automatic fallback to direct-API polling exists today (listed as future work).

Why not Kubernetes Informers?

Zalando explicitly rejected informers as the alternative (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster): "this approach would still require Kubernetes API to send information to all Skipper instances, which may lead to the same issues we faced. Since it's a sudden increase in traffic and HPA won't be able to catch up and scale Kubernetes API and etcd." Informers convert the N× poll into an N× push — the API server still talks to 300 clients at change events, which preserves the thundering-herd shape on the control plane. The single-proxy coalescer side-steps this entirely.

Comparable systems

  • Kubernetes Informer / watch multiplexers (e.g. kube-state-metrics, shared-informer caches) — in-process solutions; routesrv is an out-of-process coalescing proxy with an HTTP+ETag wire protocol.
  • Envoy xDS servers — similar "central control plane serves many proxies" pattern, but xDS uses gRPC streams instead of HTTP polling with ETag. routesrv's choice is simpler, at the cost of a 3-second freshness floor (concepts/polling-interval-as-freshness-budget).
  • External DNS — another controller-style single consumer of the Kubernetes API in Zalando's stack. Unlike routesrv, it writes to an external system (a DNS provider) rather than fanning state back out to many data-plane pods.