
PATTERN

Control-plane proxy with ETag cache

Intent

Decouple a large data-plane fleet from a shared upstream config source (Kubernetes API, service registry, auth server, any config store) by inserting a single coalescing proxy between them. The proxy polls or watches the upstream at one cadence and serves downstream pods over HTTP using ETag conditional polling. This converts an N× fan-out against the shared upstream into one upstream poll plus an N-client, 304-gated delta channel on a cheap HTTP tier.
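The client side of the 304-gated channel reduces to a small state machine: remember the last ETag, send it back as If-None-Match, and only install a new table on a 200. A minimal sketch, assuming a Skipper-like consumer — `clientState` and `apply` are illustrative names, not Skipper's actual identifiers:

```go
package main

import "fmt"

// clientState tracks the last route table and ETag a data-plane pod has
// seen. On the next poll, etag is sent as the If-None-Match header.
type clientState struct {
	etag  string
	table []byte
}

// apply folds one poll result into the state: a 304 keeps the current
// table untouched, a 200 installs the new table and remembers its ETag.
func (s *clientState) apply(status int, etag string, body []byte) bool {
	if status == 304 {
		return false // upstream unchanged; nothing to recompile
	}
	s.etag, s.table = etag, body
	return true
}

func main() {
	s := &clientState{}
	fmt.Println(s.apply(200, `"v1"`, []byte("routes-v1"))) // true: new table installed
	fmt.Println(s.apply(304, "", nil))                     // false: table unchanged
}
```

Because the common case is a 304 with an empty body, each of the N clients costs the proxy one header comparison per interval rather than a full serialize-and-send.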

Structure

                                           ┌────────────────┐
                                           │ Data-plane pod │
                                           │   (Skipper)    │
                                           └───────▲────────┘
                                                   │ HTTP + ETag
                                                   │ (every Δ)
  ┌──────────────┐     poll @ Δ     ┌──────────────┴─────┐
  │ Upstream     │ ◄─────────────── │ Coalescing proxy   │
  │ (K8s API,    │                  │ (Route Server)     │ ◄── ... N more clients
  │  etcd, ...)  │                  │ - parse + compile  │
  └──────────────┘                  │ - compute ETag     │
                                    │ - serve /routes    │
                                    └────────────────────┘

Exactly one connection against the upstream; N cheap HTTP connections against the proxy.

Canonical instance

Zalando's Route Server (routesrv) (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster):

  • Upstream: Kubernetes API (Ingress + RouteGroup resources).
  • Proxy: Route Server — polls the API every 3 seconds, parses into Eskip, computes a table-wide ETag, serves GET /routes with HTTP 200 or 304.
  • Data plane: ~300 Skipper pods per cluster.
  • Result: ~180-Skipper load on etcd (enough to break pod scheduling) → 1× polling on etcd + 100 rps of cheap 304s from Route Server to Skipper. HPA ceiling raised from 180 to 300 pods with zero downtime, zero GMV loss.
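The proxy's serving loop described above (parse, compute a table-wide ETag, answer GET /routes with 200 or 304) can be sketched in a few lines of Go. This is a hypothetical illustration of the mechanism, not routesrv's actual code; the hash choice and type names are assumptions:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"net/http"
	"sync"
)

// routeCache holds the latest compiled route table and its table-wide
// ETag. A content hash makes the ETag deterministic across restarts.
type routeCache struct {
	mu   sync.RWMutex
	body []byte
	etag string
}

// update is called after each upstream poll once the table is compiled.
func (c *routeCache) update(table []byte) {
	sum := sha256.Sum256(table)
	c.mu.Lock()
	c.body = table
	c.etag = `"` + hex.EncodeToString(sum[:]) + `"`
	c.mu.Unlock()
}

// ServeHTTP answers GET /routes: 304 when the client's If-None-Match
// matches the current ETag, otherwise 200 with the full table.
func (c *routeCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	body, etag := c.body, c.etag
	c.mu.RUnlock()
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Header().Set("ETag", etag)
	w.Write(body)
}

func main() {
	c := &routeCache{}
	c.update([]byte("eskip route table v1"))
	http.Handle("/routes", c)
	// http.ListenAndServe(":9911", nil) would start serving; omitted so
	// the example terminates. The port is illustrative.
}
```

Note the server keeps no per-client state — each request carries everything needed to decide 200 vs 304, which is what keeps the HTTP tier cheap at N clients.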

When to apply

  • Data-plane replica count is high (hundreds+) and each replica needs roughly the same config.
  • The config source is a shared scaling bottleneck — typical examples: Kubernetes API + etcd, a database, an auth server, a config repo you'd otherwise poll.
  • Change volume is low relative to poll rate (so most polls return 304 — ETag caching pays off).
  • Kubernetes Informers are not a sufficient answer — they preserve the N× fan-out against the API server at change events. Zalando's reasoning for rejecting them: informer push still requires the API server to fan out each change to all N clients, so a burst of changes reproduces the same thundering herd on etcd and the API server.
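The "change volume low relative to poll rate" condition is easy to quantify. Under a simple model — every table change forces exactly one full 200 refetch per client, all other polls return 304 — the 304 share follows directly from the poll interval and the change rate. The model and the 20-changes-per-hour figure below are illustrative assumptions, not numbers from the Zalando write-up:

```go
package main

import "fmt"

// notModifiedShare estimates the fraction of polls answered with 304,
// assuming each change costs one 200 per client and changes never land
// in the same poll window. Purely an illustrative model.
func notModifiedShare(pollIntervalSec, changesPerHour float64) float64 {
	pollsPerHour := 3600.0 / pollIntervalSec // per client
	return 1.0 - changesPerHour/pollsPerHour
}

func main() {
	// 3 s polls and a hypothetical 20 route changes per hour:
	fmt.Printf("%.1f%% of polls are 304s\n", 100*notModifiedShare(3, 20))
	// → 98.3% of polls are 304s
}
```

At those rates the ETag cache clearly pays off; if the change rate approached the poll rate, nearly every poll would be a full 200 and the proxy would mostly be re-serializing the table.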

Trade-offs

  • Freshness floor equals poll interval. Operators upstream of the proxy can't get sub-interval propagation to data-plane pods — see concepts/polling-interval-as-freshness-budget.
  • The proxy is a new SPOF. Mitigate with last-known-good fallback on the data plane so the proxy being down is a freshness regression, not a traffic outage.
  • ETag granularity is a real design choice. One ETag per whole table (Zalando's choice) means every change triggers a full refetch by all clients. Per-resource ETags narrow that at the cost of server complexity.
  • Rollout risk. Inserting a new control-plane tier on the critical path is high-stakes — pair with three-mode rollout to diff old vs new outputs before cutover.

Operational numbers (Zalando)

  Dimension                 Before                  After
  Who polls K8s API         ~180 Skippers           1 Route Server per cluster
  Skipper HPA ceiling       ~180                    300
  Poll interval             per-Skipper 3 seconds   3 seconds, centralised
  etcd overload             threatening             resolved
  API-server CPU throttle   yes                     resolved
  GMV loss on rollout       —                       0

Contrast with other control-plane shapes

  • xDS / gRPC streaming (Envoy) — push-based; lower median propagation latency but requires xDS server infrastructure and per-client connection state. ETag polling is simpler and stateless on the server.
  • Sidecar informer cache — coalesces at the process boundary, not across replicas; doesn't solve the N-replicas fan-out.
  • Event bus (Kafka / Redis topic) — viable but adds unrelated infra; the ETag-proxy pattern is a minimal single-process solution when HTTP is already in the stack.
