CONCEPT Cited by 1 source

Cache-control-aware grace period

Definition

A cache-control-aware grace period is the interval a publisher of rotating cryptographic key material waits between publishing a new public key and activating it as the signing key, chosen so that any client caching the previous key set (per the HTTP Cache-Control headers on the publisher's endpoint) has had enough time to refresh and see the new key before it is used in the wild.

It is the load-bearing knob in the five-phase JWK rotation described by Zalando's customer identity platform (2025-01-20): the whole reason a new key is published-but-not-yet-active is that clients cache JWKS responses, and the publisher must give those caches a full refresh cycle before starting to sign with the new key.

The principle: the grace period is measured in cache TTLs, not wall-clock minutes. If the JWKS endpoint returns Cache-Control: max-age=3600, the minimum safe grace period is on the order of 3600 seconds plus slack for clock skew, in-flight requests, and client-side implementation variance.
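As a minimal sketch of that arithmetic, the snippet below reads max-age from a Cache-Control header value and adds a slack term. The function names and the 300-second default slack are assumptions made for illustration; the Zalando article does not give a formula or a slack value.

```python
import re


def max_age_seconds(cache_control: str) -> int:
    """Extract the max-age directive (in seconds) from a Cache-Control value."""
    # "s-maxage" does not match here because it has no hyphen before "age".
    match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    if match is None:
        raise ValueError("no max-age directive found")
    return int(match.group(1))


def minimum_grace_seconds(cache_control: str, slack_seconds: int = 300) -> int:
    """Lower bound on the grace period: one full cache TTL plus slack.

    slack_seconds is a hypothetical allowance for clock skew, in-flight
    requests and client-side variance; choose it for your own fleet.
    """
    return max_age_seconds(cache_control) + slack_seconds


print(minimum_grace_seconds("public, max-age=3600"))
```

With the header from the example and the assumed slack, the lower bound comes out at 3900 seconds. This covers only the publisher's header; other caching layers add to it.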

Why it's non-trivial

From the Zalando article:

"To avoid any immediate disruptions, we incorporate a grace period, allowing clients ample time to fetch the latest set of JWKs — cache control headers matter!" (Source: sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando)

The emphatic "cache control headers matter!" is the tell. Two independent systems govern how long a client trusts a stale JWKS:

  1. The publisher's Cache-Control: max-age on the JWKS response.
  2. The client's own refresh policy — many OIDC libraries add their own minimum refresh interval (often to avoid hammering the IdP on every request).

The grace period must accommodate both, plus any CDN or intermediate proxy caching layered on top. If the JWKS endpoint sits behind a CDN with its own TTL, the worst-case effective cache age is roughly CDN_ttl + client_max_age + client_refresh_min.
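The worst-case sum above can be sketched as a small helper. The layer names and the TTL values are hypothetical, and summing the layers is a deliberately pessimistic assumption: caches that honour the Age header may add less staleness than this.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CacheLayer:
    name: str
    max_staleness_seconds: int  # longest time this layer may serve an old JWKS


def worst_case_cache_age(layers: Iterable[CacheLayer]) -> int:
    """Pessimistic bound: every layer serves content at the end of its TTL."""
    return sum(layer.max_staleness_seconds for layer in layers)


# Hypothetical values, for illustration only.
layers = [
    CacheLayer("cdn_ttl", 300),
    CacheLayer("client_max_age", 3600),
    CacheLayer("client_refresh_min", 600),
]
print(worst_case_cache_age(layers))
```

For these assumed values the bound is 4500 seconds, so a grace period sized from the publisher's max-age alone (3600 seconds) would fall short.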

The failure mode when grace is too short

Consider a rotation where the publisher:

  1. Publishes the new public key at T=0.
  2. Waits 5 minutes.
  3. Starts signing with the new private key at T=5min.

If the JWKS endpoint advertises max-age=3600 (one hour), a client that fetched JWKS at T=-55min is fine: its cache expires at T=5min, and its next fetch returns the new key. The problem is every client that fetched after T=-55min but before T=0. A client that fetched at T=-1min holds a cache that stays fresh until T=59min, and unless it refetches on an unknown kid it will not look again before then. Any JWT signed in the T=5min → T=60min window can therefore arrive at a verifier with a kid it cannot resolve, because:

  • Its cached JWKS (fetched before T=0) lacks the new key.
  • Its cache is still within its TTL, so it does not refresh.
  • The token fails verification; the request returns 401.
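The timeline can be checked with a small model. It assumes a verifier refreshes exactly when its cache TTL expires and never refetches on an unknown kid; both are simplifying assumptions, and the function names are hypothetical. Times are in minutes relative to publication at T=0.

```python
def rejects_token(last_fetch_min: float, token_time_min: float,
                  max_age_min: float = 60, publish_min: float = 0) -> bool:
    """True if a verifier with this cache state cannot resolve the new kid."""
    # A JWKS fetched before publication does not contain the new key.
    cache_lacks_new_key = last_fetch_min < publish_min
    # The verifier only refetches once the cached copy has expired.
    cache_still_fresh = token_time_min < last_fetch_min + max_age_min
    return cache_lacks_new_key and cache_still_fresh


def last_unsafe_minute(max_age_min: float = 60, publish_min: float = 0) -> float:
    """Upper bound on when a pre-publication cache can still be fresh."""
    return publish_min + max_age_min


# Token signed right at activation (T=5min), two different verifiers:
print(rejects_token(last_fetch_min=-1, token_time_min=5))   # fetched just before publication
print(rejects_token(last_fetch_min=-55, token_time_min=5))  # cache expires exactly at T=5min
```

Under these assumptions the verifier that fetched at T=-1min rejects the token and the one that fetched at T=-55min accepts it, which is why the unsafe window runs until roughly publication time plus max-age rather than ending at activation.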

The grace period must exceed publisher_max_age + longest_expected_client_refresh_policy so that every caching layer has had a full opportunity to refresh before the first token signed by the new key lands.

The failure mode when grace is too long

A grace period longer than necessary is the safer failure mode. The cost is that the old private key stays in signing use for longer, so the window in which a compromise of that key could be used to forge tokens is extended, but only by the excess grace. In practice the grace is therefore chosen with generous headroom. The Zalando article doesn't quote a specific grace interval; the principle is that it must be larger than the maximum conceivable client cache age, including intermediate caching infrastructure.

Design consequences

Short cache TTLs accelerate rotation. A publisher that wants to rotate weekly must set JWKS max-age to much less than one week, or the grace period alone consumes the whole rotation cadence. JWKS TTLs in the 5-60 minute range are common, which keeps the grace period on the order of minutes to a few hours and allows multi-rotation-per-day cadences if desired.
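To make the relationship between TTL and cadence concrete, the sketch below computes what fraction of a rotation interval the grace period alone would consume. The function name and the 300-second slack are assumptions for illustration, and the calculation ignores every other phase of the rotation.

```python
WEEK_SECONDS = 7 * 24 * 3600


def grace_share_of_cadence(max_age_seconds: int,
                           rotation_interval_seconds: int,
                           slack_seconds: int = 300) -> float:
    """Fraction of the rotation interval spent waiting out the grace period."""
    grace = max_age_seconds + slack_seconds  # one cache TTL plus assumed slack
    return grace / rotation_interval_seconds


# Weekly rotation with a one-week max-age: the grace exceeds the cadence.
print(grace_share_of_cadence(WEEK_SECONDS, WEEK_SECONDS) > 1)
# Weekly rotation with a one-hour max-age: the grace is a small fraction.
print(grace_share_of_cadence(3600, WEEK_SECONDS) < 0.01)
```

A share above 1 means the next rotation would be due before the current key could safely be activated.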

Falling back to refetch on unknown kid is an escape hatch, not a design point. Many OIDC libraries treat an unknown kid as a trigger to refetch JWKS once before giving up. This makes hard grace-period violations partially recoverable (the next request succeeds after the refetch) but leaves the first request broken and adds load to the JWKS endpoint. The cache-control-aware grace period is what ensures the unknown-kid path is exceptional rather than normal.
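A client-side cache with this fallback might look like the sketch below. It is not modelled on any particular OIDC library: the class name, the injected fetch_keys callable and the minimum refetch interval are all hypothetical. Unlike the behaviour described above, this version refetches inline, so the first request can succeed when the throttle allows it; a library that refetches in the background would still fail that first request.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches a kid -> key mapping and refetches once on an unknown kid."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]],
                 max_age_seconds: float,
                 min_refetch_interval_seconds: float = 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_keys = fetch_keys
        self._max_age = max_age_seconds
        self._min_refetch = min_refetch_interval_seconds
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = dict(self._fetch_keys())
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[str]:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            # Normal path: the cached JWKS has expired (or was never fetched).
            self._refresh()
        elif kid not in self._keys and now - self._fetched_at >= self._min_refetch:
            # Fallback path: unknown kid, throttled to protect the endpoint.
            self._refresh()
        return self._keys.get(kid)  # None means the token cannot be verified
```

The throttle is the reason the fallback cannot replace a proper grace period: a verifier that refetched recently keeps rejecting the new kid until the minimum interval has passed.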

Trade-off: rotation cadence vs. JWKS endpoint load. Shorter JWKS TTLs → faster permissible rotation → more JWKS fetches per client per day. Zalando's JWKS endpoint at accounts.zalando.com/.well-known/jwk_uris is a high-traffic endpoint precisely because it governs fleet-wide trust; CDN fronting is the standard answer to amortise the resulting load (at the cost of another cache layer that must be accounted for in the grace period).
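The load side of this trade-off is simple arithmetic. The sketch below assumes every client refetches exactly once per TTL and that nothing sits in front of the origin; the fleet size of 10,000 clients is a made-up number, not a figure from the Zalando article.

```python
SECONDS_PER_DAY = 86_400


def fetches_per_day(clients: int, max_age_seconds: int) -> float:
    """Origin JWKS fetches per day if each client refetches once per TTL."""
    return clients * SECONDS_PER_DAY / max_age_seconds


# Hypothetical fleet of 10,000 verifying clients.
print(fetches_per_day(10_000, 300))   # 5-minute TTL
print(fetches_per_day(10_000, 3600))  # 60-minute TTL
```

Under these assumptions, moving from a 60-minute to a 5-minute TTL multiplies origin fetches by twelve (240,000 versus 2,880,000 per day), which is the load a CDN is meant to absorb.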

Generalisation

The principle generalises to any rotation where a publisher cannot coordinate with consumers:

  • TLS certificate rotation: publish the new cert well before the old one expires; the overlap window plays the role of the grace period.
  • Config fanout ([[concepts/s3-signal-bucket-as-config-fanout]]): new config is visible in the storage layer before any service consumes it; the "visible window" must exceed the poll interval.
  • DNS rotation: the old address must keep serving until the old A record's TTL has expired in resolver caches, because resolvers keep returning the cached record until then.

Each case has the same structural shape: publisher propagation latency must be bounded and must be less than the grace period chosen for the activation step.

Seen in

  • systems/zalando-oidc-identity-provider — the 2025-01-20 article names "cache control headers" as load-bearing in the rotation lifecycle. Canonical JWK-rotation instance of this principle.
