PATTERN
Control-plane proxy with ETag cache¶
Intent¶
Decouple a large data-plane fleet from a shared upstream config source (Kubernetes API, service registry, auth server, any config store) by inserting a single coalescing proxy between them. The proxy polls or watches the upstream at one cadence and serves downstream pods over HTTP using ETag conditional polling. This converts an N× fan-out on the shared upstream into a 1× poll plus an N× 304-gated delta channel on a cheap HTTP tier.
Structure¶
                                  ┌────────────────┐
                                  │ Data-plane pod │
                                  │   (Skipper)    │
                                  └───────▲────────┘
                                          │ HTTP + ETag
                                          │ (every Δ)
┌──────────────┐     poll @ Δ     ┌───────┴────────────┐
│ Upstream     │ ◄─────────────── │ Coalescing proxy   │
│ (K8s API,    │                  │ (Route Server)     │ ◄── ... N more clients
│ etcd, ...)   │                  │ - parse + compile  │
└──────────────┘                  │ - compute ETag     │
                                  │ - serve /routes    │
                                  └────────────────────┘
Exactly one connection against the upstream; N cheap HTTP
connections against the proxy.
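On the proxy side, the request path reduces to "hash the compiled table, compare with If-None-Match". A minimal sketch of that logic in Python — function names and the hash truncation are illustrative, not routesrv's actual implementation:

```python
import hashlib


def table_etag(table):
    """One ETag over the whole routing table (table-wide granularity)."""
    return '"' + hashlib.sha256(table).hexdigest()[:16] + '"'


def serve_routes(table, if_none_match=None):
    """Handle GET /routes: 304 when the client's cached ETag still matches."""
    etag = table_etag(table)
    if if_none_match == etag:
        # Cheap path: no body, no re-parse downstream.
        return 304, {"ETag": etag}, b""
    # Full table plus the new ETag for the client to cache.
    return 200, {"ETag": etag}, table
```

A first fetch returns 200 with the table; repeating the request with the returned ETag yields a body-less 304 until the table actually changes.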
Canonical instance¶
Zalando's Route Server (routesrv) (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster):
- Upstream: Kubernetes API (Ingress + RouteGroup resources).
- Proxy: Route Server — polls the API every 3 seconds, parses into Eskip, computes a table-wide ETag, serves GET /routes with HTTP 200 or 304.
- Data plane: ~300 Skipper pods per cluster.
- Result: ~180 Skippers polling etcd directly (enough load to break pod scheduling) → 1× polling on etcd + ~100 rps of cheap 304s from Route Server to Skipper. HPA ceiling raised from 180 to 300 pods with zero downtime and zero GMV loss.
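The data-plane side of the protocol is a conditional-GET loop that also gives the last-known-good behaviour described under Trade-offs below. A sketch, assuming a fetch(etag) helper that issues GET /routes with If-None-Match and returns (status, new_etag, body) — these names are illustrative, not Skipper's real API:

```python
import time


def poll_loop(fetch, apply_table, interval=3.0, iterations=None):
    """Client loop: conditional GET every `interval` seconds."""
    etag = None
    polls = 0
    while iterations is None or polls < iterations:
        try:
            status, new_etag, body = fetch(etag)
        except OSError:
            # Proxy unreachable: keep routing with the last-known-good table.
            status = None
        if status == 200:
            apply_table(body)   # swap in the freshly fetched table
            etag = new_etag     # cache its ETag for the next poll
        # A 304 (or an outage) leaves the current table in place.
        polls += 1
        time.sleep(interval)
    return etag
```

Note that a proxy outage lands in the same branch as a 304: the pod keeps serving its current table, so the failure is a freshness regression rather than a traffic outage.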
When to apply¶
- Data-plane replica count is high (hundreds+) and each replica needs roughly the same config.
- The config source is a shared scaling bottleneck — typical examples: Kubernetes API + etcd, a database, an auth server, a config repo you'd otherwise poll.
- Change volume is low relative to poll rate (so most polls return 304 — ETag caching pays off).
- Kubernetes Informers are not a sufficient answer — informer push still requires the API server to fan change events out to all N clients, which under a burst reproduces the same thundering herd on etcd and the API server (Zalando's stated reason for rejecting them).
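The fan-out reduction and the "mostly 304s" condition are simple arithmetic; plugging in the numbers from the canonical instance:

```python
replicas = 300    # Skipper pods per cluster (post-rollout HPA ceiling)
interval_s = 3    # poll cadence shared by proxy and clients

# Mostly-304 traffic absorbed by the cheap HTTP tier (the proxy):
proxy_rps = replicas / interval_s       # 300 / 3 = 100 rps

# Load remaining on the shared upstream (K8s API / etcd):
upstream_polls_per_s = 1 / interval_s   # one poller, every 3 s
```

This is where the ~100 rps figure comes from: 300 clients at a 3-second cadence, with almost every response a body-less 304 because change volume is far below the poll rate.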
Trade-offs¶
- Freshness floor equals poll interval. Operators upstream of the proxy can't get sub-interval propagation to data-plane pods — see concepts/polling-interval-as-freshness-budget.
- The proxy is a new SPOF. Mitigate with last-known-good fallback on the data plane so the proxy being down is a freshness regression, not a traffic outage.
- ETag granularity is a real design choice. One ETag per whole table (Zalando's choice) means every change triggers a full refetch by all clients. Per-resource ETags narrow that at the cost of server complexity.
- Rollout risk. Inserting a new control-plane tier on the critical path is high-stakes — pair with three-mode rollout to diff old vs new outputs before cutover.
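The granularity trade-off above can be made concrete. In this sketch (route names and the hash scheme are made up for illustration), a single changed route moves the table-wide ETag — forcing all N clients to refetch everything — while per-resource ETags isolate the change:

```python
import hashlib


def etag(data):
    return hashlib.sha256(data).hexdigest()[:12]


v1 = {"route-a": b"a -> svc1", "route-b": b"b -> svc2"}
v2 = {"route-a": b"a -> svc1", "route-b": b"b -> svc3"}  # only route-b changed


def table_wide(table):
    # One ETag over the whole table, hashed in a deterministic key order.
    return etag(b"|".join(table[k] for k in sorted(table)))


# Per-resource: one ETag per route; only the changed route's ETag moves.
per_v1 = {k: etag(v) for k, v in v1.items()}
per_v2 = {k: etag(v) for k, v in v2.items()}
changed = [k for k in v1 if per_v1[k] != per_v2[k]]
```

The per-resource variant narrows refetch traffic to the changed routes, at the cost of the server tracking many ETags and clients issuing per-resource requests.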
Operational numbers (Zalando)¶
| Dimension | Before | After |
|---|---|---|
| Who polls K8s API | ~180 Skippers | 1 Route Server per cluster |
| Skipper HPA ceiling | ~180 | 300 |
| Poll interval | per-Skipper | 3 seconds, centralised |
| etcd overload | threatening | resolved |
| API-server CPU throttle | yes | resolved |
| GMV loss on rollout | — | 0 |
Contrast with other control-plane shapes¶
- xDS / gRPC streaming (Envoy) — push-based; lower median propagation latency but requires xDS server infrastructure and per-client connection state. ETag polling is simpler and stateless on the server.
- Sidecar informer cache — coalesces at the process boundary, not across replicas; doesn't solve the N-replicas fan-out.
- Event bus (Kafka / Redis topic) — viable but adds unrelated infra; the ETag-proxy pattern is a minimal single-process solution when HTTP is already in the stack.
See also¶
- concepts/control-plane-fan-out-to-kubernetes-api — the anti-pattern this pattern addresses.
- concepts/etag-conditional-polling — the wire protocol.
- concepts/last-known-good-routing-table — the data-plane fallback that makes the proxy non-critical.
- concepts/polling-interval-as-freshness-budget — the externally visible cost.
- patterns/three-mode-rollout-off-shadow-exec — how to roll this out safely when it sits on the critical path.
- systems/zalando-route-server — the canonical instance.