CONCEPT

Last-known-good routing table

Definition

Last-known-good routing table is the operational policy whereby a data-plane proxy continues serving traffic from the last routing configuration it successfully loaded, even after its control plane becomes unreachable. Stale routes are preferred to no routes; availability beats consistency on the request-serving path.

Canonical instance

When Route Server is unavailable, Skipper falls back to its cached routing table (Source: sources/2025-02-16-zalando-scaling-beyond-limits-harnessing-route-server-for-a-stable-cluster). From the post:

There are 2 possibilities when Route Server is not available: 1. Skipper doesn't have an initial routing table. 2. Skipper has an initial routing table but Route Server is not available to update it. … Skipper will continue to work with the last known routing table. This is a trade-off between availability and consistency.

Asymmetry between the two failure modes

The policy is fail-closed at boot, fail-open post-boot:

Timing                                  Skipper flag -wait-first-route-load   Behaviour
Cold start, no routing table in memory  enabled                               fail closed — Skipper container does not start
Post-boot, routesrv goes away           n/a (flag only gates startup)         fail open — keep serving stale routes

Rationale: a Skipper with no routes is not a useful proxy (it 404s every request); a Skipper with yesterday's routes is almost certainly still useful (most routes change rarely, and the routes for the hot URLs haven't changed).

What this trades away

  • Route removals don't propagate. An Ingress deleted while routesrv is down is still served by Skipper until routesrv returns — the stale copy contains the deleted route. For cleanup-driven changes (takedowns, decommissions) this is a correctness gap.
  • Feature flags and filter-chain updates don't propagate. Any route-level change — OAuth scope tightening, new rate limit, new backend URL — is silently withheld.
  • Staleness is unbounded on the clock. No TTL fires. Skipper keeps serving until an operator intervenes.

What it buys

  • No cascading outage. If Route Server fails, ingress traffic keeps flowing; Zalando's 2M rps doesn't drop to zero.
  • Decoupled blast radius. A bug in the control plane (parse error, metric-blocking deadlock, OOM crash) doesn't propagate to the data plane.
  • Graceful operator-in-the-loop recovery. The post describes this plainly: "we get an alert and we decide to either fix the Route Server or disable it and let Skipper work without it." A human chooses; the system doesn't auto-degrade.
Related patterns

  • Stale-while-revalidate in HTTP/CDN caches is the same idea on a shorter clock.
  • Persistent on-disk service-discovery cache — many service meshes write the last config to disk so a sidecar can come back after a node reboot even if the control plane is slow. routesrv-Skipper doesn't document an on-disk component; the "last known" copy lives only in Skipper's process memory.

Unresolved at publish time

  • No automatic fallback to pre-routesrv direct-API polling — called out as future work in the source post.
  • No TTL / max-staleness bound documented — Skipper will hold the stale table indefinitely.
