CONCEPT Cited by 1 source
Cache-control-aware grace period¶
Definition¶
A cache-control-aware grace period is the interval a
publisher of rotating cryptographic key material waits between
publishing a new public key and activating it as the
signing key, chosen so that any client caching the previous
key set (per the HTTP Cache-Control headers on the publisher's
endpoint) has had enough time to refresh and see the new key
before it is used in the wild.
It is the load-bearing knob in the five-phase JWK rotation described by Zalando's customer identity platform (2025-01-20): the whole reason a new key is published-but-not-yet-active is that clients cache JWKS responses, and the publisher must give those caches a full refresh cycle before starting to sign with the new key.
The principle: grace period is measured in cache TTLs, not
wall-clock minutes. If the JWKS endpoint returns
Cache-Control: max-age=3600, the minimum safe grace period is
on the order of 3600 seconds plus slack for clock skew,
in-flight requests, and client-side implementation variance.
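The rule of thumb above can be sketched in a few lines. This is a minimal illustration, not Zalando's implementation: the function names and the 300-second default slack are assumptions chosen for the example, and only the `max-age` directive is considered.

```python
from typing import Optional


def parse_max_age(cache_control: str) -> Optional[int]:
    """Return the max-age value (seconds) from a Cache-Control header, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip().strip('"')
        if name.strip().lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def min_grace_seconds(cache_control: str, slack_seconds: int = 300) -> int:
    """Lower bound on the grace period: one full cache TTL plus slack.

    slack_seconds is a hypothetical allowance for clock skew, in-flight
    requests, and client implementation variance.
    """
    max_age = parse_max_age(cache_control)
    if max_age is None:
        raise ValueError("no max-age directive; cache lifetime is unknown")
    return max_age + slack_seconds


print(min_grace_seconds("public, max-age=3600"))
```

With `max-age=3600` and the assumed 300 seconds of slack, the lower bound comes out at 3900 seconds.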
Why it's non-trivial¶
From the Zalando article:
"To avoid any immediate disruptions, we incorporate a grace period, allowing clients ample time to fetch the latest set of JWKs — cache control headers matter!" (Source: sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando)
The emphatic "cache control headers matter!" is the tell. Two independent systems govern how long a client trusts a stale JWKS:
- The publisher's Cache-Control: max-age on the JWKS response.
- The client's own refresh policy: many OIDC libraries add their own minimum refresh interval (often to avoid hammering the IdP on every request).
The grace period must accommodate both plus any CDN or
intermediate proxy caching layered on top. If the JWKS endpoint
sits behind a CDN with its own TTL, effective cache age is
CDN_ttl + client_max_age + client_refresh_min.
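As a worked instance of that sum, the sketch below adds up the three layers. The class name and the example TTL values are hypothetical; the point is only that the layers stack in the worst case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLayers:
    """TTLs, in seconds, of each caching layer between publisher and verifier."""
    cdn_ttl: int
    client_max_age: int
    client_refresh_min: int

    def worst_case_age(self) -> int:
        # Worst case: each layer serves a copy that is about to expire,
        # so the individual TTLs add up rather than overlap.
        return self.cdn_ttl + self.client_max_age + self.client_refresh_min


layers = CacheLayers(cdn_ttl=300, client_max_age=3600, client_refresh_min=600)
print(layers.worst_case_age())
```

For these assumed values the effective cache age is 4500 seconds, so a grace period sized to the publisher's `max-age` alone would fall short by 15 minutes.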
The failure mode when grace is too short¶
Consider a rotation where the publisher:
- Publishes the new public key at T=0.
- Waits 5 minutes.
- Starts signing with the new private key at T=5min.
If the JWKS endpoint advertises max-age=3600 (one hour), then
clients that fetched JWKS just before T=0 hold a cache that
lacks the new key and will not expire until just before
T=60min. Unless such a client refetches on an unknown kid, any
JWT signed in the T=5min → T=60min window arrives at it with a
kid it cannot resolve, because:
- Their cached JWKS (fetched just before T=0) lacks the new key.
- Their cache is still within its TTL so they don't refresh.
- The token fails verification; the request returns 401.
The grace period must exceed publisher_max_age +
longest_expected_client_refresh_policy so that every caching
layer has had a full opportunity to refresh before the first
token signed by the new key lands.
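The timeline can be checked with a small calculation. This is a sketch under stated assumptions: times are in seconds relative to publication at T=0, the client honours `max-age` exactly, and it does not refetch on an unknown kid. The function name is hypothetical.

```python
from typing import Optional, Tuple


def failure_window(last_fetch: int, max_age: int,
                   activation: int) -> Optional[Tuple[int, int]]:
    """Return the (start, end) interval in which this client rejects
    tokens signed by the new key, or None if it never does."""
    if last_fetch >= 0:
        return None  # fetched at or after publication: already has the new key
    cache_expiry = last_fetch + max_age
    if cache_expiry <= activation:
        return None  # cache refreshes before the new key starts signing
    return (activation, cache_expiry)


# Client fetched one minute before publication; grace is only 5 minutes.
print(failure_window(last_fetch=-60, max_age=3600, activation=300))
# Same client, but grace covers the full max-age plus slack.
print(failure_window(last_fetch=-60, max_age=3600, activation=3900))
```

The first call yields a window from 300 to 3540 seconds; the second yields `None`, because the cache has expired before activation.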
The failure mode when grace is too long¶
A grace period longer than necessary is the safer failure mode. Its cost is that the old private key remains the active signing key for longer, which extends the window in which a compromise of that key could be used to forge tokens, but only by the excess grace. In practice the grace is therefore chosen with generous headroom. The Zalando article doesn't quote their specific grace interval; the principle is that it must be larger than the maximum conceivable client cache age, including intermediate caching infrastructure.
Design consequences¶
Short cache TTLs accelerate rotation. A publisher that wants
to rotate weekly must set JWKS max-age to much less than one
week, or the grace period alone consumes the whole rotation
cadence. JWKS TTLs in the 5-60 minute range are common, which
bounds the grace period to somewhere between minutes and about
an hour and allows multi-rotation-per-day cadences if desired.
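A rough upper bound on cadence follows from the arithmetic. The sketch below assumes the grace period is the only thing limiting how often a rotation can start, which ignores token lifetime and old-key retirement; the function name and the 300-second slack are illustrative.

```python
SECONDS_PER_DAY = 86_400


def max_rotations_per_day(max_age_s: int, slack_s: int = 300) -> int:
    """Upper bound on rotations per day if each rotation must wait
    one grace period (max-age plus slack) before activating."""
    return SECONDS_PER_DAY // (max_age_s + slack_s)


for ttl in (300, 3600, 7 * SECONDS_PER_DAY):
    print(ttl, max_rotations_per_day(ttl))
```

Under these assumptions a 5-minute TTL permits up to 144 rotations per day, a 1-hour TTL up to 22, and a 1-week TTL none at all.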
Falling back to refetch on an unknown kid is an escape
hatch, not a design point. Many OIDC libraries treat an
unknown kid as a trigger to refetch JWKS once before giving
up. This makes hard grace-period violations partially
recoverable (the next request succeeds after the refetch) but
leaves the first request broken and adds load to the JWKS
endpoint. The cache-control-aware grace period is what ensures
the unknown-kid path is exceptional rather than normal.
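The fallback can be sketched as a small client-side cache. This is not the API of any particular OIDC library: `JwksCache`, its parameters, and the cooldown that rate-limits unknown-kid refetches are all assumptions made for illustration, and `fetch` stands in for whatever retrieves and parses the JWKS into a kid-to-key mapping.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches a kid -> key mapping; refetches on expiry or on an unknown kid."""

    def __init__(self, fetch: Callable[[], Dict[str, dict]], max_age: float,
                 refetch_cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._cooldown = refetch_cooldown  # limits load on the JWKS endpoint
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _refresh(self) -> None:
        self._keys = self._fetch()
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[dict]:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._refresh()  # normal path: the cache has expired
        elif kid not in self._keys and now - self._fetched_at >= self._cooldown:
            self._refresh()  # fallback path: unknown kid, outside the cooldown
        return self._keys.get(kid)
```

The cooldown is the design choice worth noting: without it, a stream of tokens carrying bogus kids would turn every verification into a JWKS fetch.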
Trade-off: rotation cadence vs. JWKS endpoint load. Shorter
JWKS TTLs → faster permissible rotation → more JWKS fetches per
client per day. Zalando's JWKS endpoint at
accounts.zalando.com/.well-known/jwk_uris is a high-traffic
endpoint precisely because it governs fleet-wide trust; CDN
fronting is the standard answer to amortise the resulting load
(at the cost of another cache layer that must be accounted for
in the grace period).
Generalisation¶
The principle generalises to any rotation where a publisher cannot coordinate with consumers:
- TLS certificate rotation — publish the new cert well before the old one expires; the overlap window plays the role of the grace period.
- Config fanout ([[concepts/s3-signal-bucket-as-config-fanout]]) — new config is visible in the storage layer before any service consumes it; the "visible window" must exceed the poll interval.
- DNS rotation — after the new A record is published, the old address must keep serving until the old record's TTL has expired from resolver caches; only then can it be withdrawn.
Each case has the same structural shape: publisher propagation latency must be bounded and must be less than the grace period chosen for the activation step.
Seen in¶
- systems/zalando-oidc-identity-provider — the 2025-01-20 article names "cache control headers" as load-bearing in the rotation lifecycle. Canonical JWK-rotation instance of this principle.
See also¶
- concepts/signing-key-rotation-lifecycle — the six-phase sequence in which this grace period sits between phases 2 and 4.
- concepts/jwk-json-web-key — the JWKS surface whose HTTP caching semantics define the grace-period lower bound.
- concepts/retirement-plus-lifespan-plus-buffer-formula — the symmetric gate on the other side of the lifecycle (when to drop a retired key).
- patterns/phased-automated-jwk-rotation — the automated implementation that encodes this grace period as a fixed delay in the rotation state machine.