CONCEPT

Last-known-good routing table

Definition

Last-known-good routing table is the operational policy whereby a data-plane proxy continues serving traffic from the last routing configuration it successfully loaded, even after its control plane becomes unreachable. Stale routes are preferred to no routes; availability beats consistency on the request-serving path.

Canonical instance

When Route Server is unavailable, Skipper falls back to its cached routing table (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster). From the post:

There are 2 possibilities when Route Server is not available: 1. Skipper doesn't have an initial routing table. 2. Skipper has an initial routing table but Route Server is not available to update it. … Skipper will continue to work with the last known routing table. This is a trade-off between availability and consistency.

Asymmetry between the two failure modes

The policy is fail-closed at boot, fail-open post-boot:

Timing                                  Skipper flag -wait-first-route-load   Behaviour
Cold start, no routing table in memory  enabled                               fail closed — Skipper container does not start
Post-boot, routesrv goes away           n/a (flag only gates startup)         fail open — keep serving stale routes

Rationale: a Skipper with no routes is not a useful proxy (it 404s every request); a Skipper with yesterday's routes is almost certainly still useful (most routes change rarely, and the routes for the hot URLs haven't changed).

What this trades away

  • Route removals don't propagate. An Ingress deleted while routesrv is down is still served by Skipper until routesrv returns — the stale copy contains the deleted route. For cleanup-driven changes (takedowns, decommissions) this is a correctness gap.
  • Feature flags and filter-chain updates don't propagate. Any route-level change — OAuth scope tightening, new rate limit, new backend URL — is silently withheld.
  • Staleness is unbounded on the clock. No TTL fires. Skipper keeps serving until an operator intervenes.

What it buys

  • No cascading outage. If Route Server fails, ingress traffic keeps flowing; Zalando's 2M rps doesn't drop to zero.
  • Decoupled blast radius. A bug in the control plane (parse error, metric-blocking deadlock, OOM crash) doesn't propagate to the data plane.
  • Graceful operator-in-the-loop recovery. The post describes this plainly: "we get an alert and we decide to either fix the Route Server or disable it and let Skipper work without it." A human chooses; the system doesn't auto-degrade.
Related patterns

  • Stale-while-revalidate in HTTP/CDN caches is the same idea on a shorter clock.
  • Persistent on-disk service-discovery cache — many service meshes write the last config to disk so a sidecar can come back after a node reboot even if the control plane is slow. routesrv-Skipper doesn't document an on-disk component; the "last known" copy lives only in Skipper's process memory.

Unresolved at publish time

  • No automatic fallback to pre-routesrv direct-API polling — called out as future work in the source post.
  • No TTL / max-staleness bound documented — Skipper will hold the stale table indefinitely.
