CONCEPT Cited by 1 source
Last-known-good routing table¶
Definition¶
Last-known-good routing table is the operational policy whereby a data-plane proxy continues serving traffic using the last routing configuration it successfully loaded, even after its control plane becomes unreachable. Stale routes are preferred to no routes; availability beats consistency on the request-serving path.
Canonical instance¶
When Route Server is unavailable, Skipper falls back to its cached routing table (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster). From the post:
> There are 2 possibilities when Route Server is not available: 1. Skipper doesn't have an initial routing table. 2. Skipper has an initial routing table but Route Server is not available to update it. … Skipper will continue to work with the last known routing table. This is a trade-off between availability and consistency.
Failure-mode asymmetry¶
The policy is fail-closed at boot, fail-open post-boot:
| Timing | Skipper flag `-wait-first-route-load` | Behaviour |
|---|---|---|
| Cold start, no routing table in memory | enabled | fail closed — Skipper container does not start |
| Post-boot, routesrv goes away | — | fail open — keep serving stale routes |
Rationale: a Skipper with no routes is not a useful proxy (it 404s every request); a Skipper with yesterday's routes is almost certainly still useful (most routes change rarely, and the routes for the hot URLs haven't changed).
What this trades away¶
- Route removals don't propagate. An `Ingress` deleted while routesrv is down is still served by Skipper until routesrv returns — the stale copy contains the deleted route. For cleanup-driven changes (takedowns, decommissions) this is a correctness gap.
- Feature flags and filter-chain updates don't propagate. Any route-level change — OAuth scope tightening, new rate limit, new backend URL — is silently withheld.
- Staleness is unbounded on the clock. No TTL fires. Skipper keeps serving until an operator intervenes.
What it buys¶
- No cascading outage. If Route Server fails, ingress traffic keeps flowing; Zalando's 2M rps doesn't drop to zero.
- Decoupled blast radius. A bug in the control plane (parse error, metric-blocking deadlock, bad OOM) doesn't propagate to the data plane.
- Graceful operator-in-the-loop recovery. The post describes this plainly: "we get an alert and we decide to either fix the Route Server or disable it and let Skipper work without it." A human chooses; the system doesn't auto-degrade.
Related patterns¶
- Stale-while-revalidate in HTTP/CDN caches is the same idea on a shorter clock.
- Persistent-on-disk service-discovery cache — many service meshes write the last config to disk so a sidecar can come back after a node reboot even if the control plane is slow. routesrv-Skipper doesn't document an on-disk component; the "last known" is in Skipper's process memory.
Unresolved at publish time¶
- No automatic fallback to pre-routesrv direct-API polling — called out as future work in the source post.
- No TTL / max-staleness bound documented — Skipper will hold the stale table indefinitely.
See also¶
- concepts/etag-conditional-polling — the protocol that keeps the table current when the control plane is reachable.
- patterns/control-plane-proxy-with-etag-cache — the architectural pattern this policy is half of (the other half being the ETag coalescer).
- systems/skipper-proxy · systems/zalando-route-server