PATTERN Cited by 1 source
Phased automated JWK rotation
Pattern
Run the six-phase signing-key rotation lifecycle as a scheduled, fully automated loop over a JWKS endpoint, so that planned rotations are invisible to downstream verifiers and never require human coordination with client fleets.
The pattern combines four discrete principles — automation, scheduling, secure key storage, seamless transition — into a single repeatable rotation primitive owned by the identity provider (IdP) and expressed as a state machine over:
- the set of keys currently in each lifecycle phase (generate / publish / grace / activate / retire / drop);
- the JWKS endpoint that publishes the public half of every key in phases publish through retire;
- the two hard-gate timers (grace period and retirement-plus-lifespan-plus-buffer).
Canonical wiki instance: Zalando's customer-identity OIDC IdP
at accounts.zalando.com/.well-known/jwk_uris, described in
the 2025-01-20 article (Source:
sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando).
Problem
Long-lived cryptographic signing keys are ticking time bombs (concepts/long-lived-key-risk). If the private half of a JWT signing key is compromised, an attacker can forge new tokens with it, and every token already signed by it becomes untrustworthy until it expires. Rotation shrinks the exposure window, but rotation is traditionally hard for three reasons:
- Key distribution. Historically rotating a public key required coordinated out-of-band distribution to every client — slow, error-prone, and often skipped.
- Ordering. Naïve rotation (swap old for new atomically) breaks verifiers whose cached key set lags the switch; the next request with the new `kid` fails signature verification and returns 401.
- Human intervention cost. Manual rotation ceremonies are rare by design (ceremony overhead), but that rarity extends the compromise-exposure window, the exact property rotation was meant to shrink.
The pattern solves all three by making rotation a continuous, self-driving background process keyed on calendar cadence, not on incident-response triggers.
Shape
Inputs (IdP-controlled, configurable)
- `rotation_cadence`: how often a new key is generated (e.g. weekly, monthly).
- `jwks_max_age`: HTTP `Cache-Control: max-age` advertised on JWKS responses.
- `grace_period`: time between publishing a new key and activating it; must satisfy `grace_period ≥ jwks_max_age + downstream_cache_layers + client_refresh_min_policy`.
- `max_token_lifespan`: longest `exp - iat` the IdP issues for any JWT (access tokens, refresh tokens, any class).
- `safety_buffer`: slack added to the drop-time formula for clock skew, in-flight requests, implementation variance.
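As a rough illustration, the inputs and the grace-period floor could be captured in a small configuration object. This is a minimal sketch, not the Zalando implementation: the class name `RotationConfig`, its methods, the modelling of `downstream_cache_layers` and `client_refresh_min_policy` as single aggregate durations, and every concrete value are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RotationConfig:
    rotation_cadence: timedelta
    jwks_max_age: timedelta
    grace_period: timedelta
    max_token_lifespan: timedelta
    safety_buffer: timedelta
    # Assumption: each remaining term of the floor is one aggregate duration.
    downstream_cache_layers: timedelta = timedelta(0)
    client_refresh_min_policy: timedelta = timedelta(0)

    def grace_floor(self) -> timedelta:
        """Smallest grace period that covers every known cache layer."""
        return (
            self.jwks_max_age
            + self.downstream_cache_layers
            + self.client_refresh_min_policy
        )

    def validate(self) -> None:
        """Reject configurations that break the constraints stated above."""
        if self.grace_period < self.grace_floor():
            raise ValueError("grace_period is below the cache-derived floor")
        if self.rotation_cadence <= self.grace_period:
            raise ValueError("rotation_cadence must exceed grace_period")


# Illustrative values only.
config = RotationConfig(
    rotation_cadence=timedelta(days=7),
    jwks_max_age=timedelta(hours=1),
    grace_period=timedelta(hours=6),
    max_token_lifespan=timedelta(days=30),
    safety_buffer=timedelta(hours=1),
    downstream_cache_layers=timedelta(hours=1),
    client_refresh_min_policy=timedelta(hours=2),
)
config.validate()  # passes: the floor is 4 hours and grace is 6 hours
```

Validating at start-up rather than at rotation time means a misconfigured grace period is caught before any key is published.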
State machine per key
For each key K generated by the loop:
| Phase | Private key | JWKS entry | Signs | Verifies | Transition trigger |
|---|---|---|---|---|---|
| Generate | created | absent | no | no | rotation schedule fires |
| Publish | exists | present | no | no | immediate after generate |
| Grace | exists | present | no | no | grace_period elapsed |
| Activate | exists | present | yes | yes | next rotation schedule |
| Retire | exists | present | no | yes | successor activated |
| Drop | destroyed | absent | no | no | retirement_time + max_token_lifespan + safety_buffer elapsed |
At any moment, the JWKS endpoint advertises the publish + grace + active + retired keys — typically 3-5 keys in steady-state depending on cadence and retention.
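The table can be read as a handful of phase sets. The sketch below encodes the "JWKS entry", "Signs" and "Verifies" columns and derives the advertised key set from them; the names `Phase`, `next_phase`, `jwks_kids` and `signing_kid` are hypothetical helpers chosen for this example, and the single-active-key check reflects the single-active-key model described on this page.

```python
from enum import Enum
from typing import Dict, List


class Phase(Enum):
    GENERATE = 1
    PUBLISH = 2
    GRACE = 3
    ACTIVATE = 4
    RETIRE = 5
    DROP = 6


# Columns of the table, expressed as sets of phases.
HAS_JWKS_ENTRY = {Phase.PUBLISH, Phase.GRACE, Phase.ACTIVATE, Phase.RETIRE}
SIGNS = {Phase.ACTIVATE}
VERIFIES = {Phase.ACTIVATE, Phase.RETIRE}


def next_phase(phase: Phase) -> Phase:
    """Phases are strictly sequential; DROP is terminal."""
    if phase is Phase.DROP:
        raise ValueError("DROP is terminal")
    return Phase(phase.value + 1)


def jwks_kids(keys: Dict[str, Phase]) -> List[str]:
    """Key IDs the JWKS endpoint should advertise right now."""
    return sorted(kid for kid, phase in keys.items() if phase in HAS_JWKS_ENTRY)


def signing_kid(keys: Dict[str, Phase]) -> str:
    """The one key allowed to sign (single-active-key model)."""
    active = [kid for kid, phase in keys.items() if phase in SIGNS]
    if len(active) != 1:
        raise RuntimeError(f"expected exactly one active key, found {len(active)}")
    return active[0]
```

For example, with keys `{"k0": Phase.DROP, "k1": Phase.RETIRE, "k2": Phase.ACTIVATE, "k3": Phase.GRACE}`, `jwks_kids` returns `k1`, `k2` and `k3`, and `signing_kid` returns `k2`.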
Loop (per rotation schedule)

    every <rotation_cadence>:
        K_new = generate_key_pair()
        publish(K_new)                 # phase 2
        wait(grace_period)             # phase 3
        K_old_active = current_active()
        activate(K_new)                # phase 4
        retire(K_old_active)           # phase 5
        schedule_drop(
            K_old_active,
            at = now + max_token_lifespan + safety_buffer,
        )                              # phase 6 (deferred)
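One way to make the loop concrete without a blocking `wait` is to record timestamps per key and apply whichever transitions are due on each scheduler tick. The sketch below takes that event-driven approach, which is a substitution for the blocking pseudocode rather than the source's design; `Key`, `Rotator`, `start_rotation` and `tick` are hypothetical names, and key generation and storage are left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Key:
    kid: str
    published_at: datetime
    activated_at: Optional[datetime] = None
    drop_at: Optional[datetime] = None  # set when the key is retired


@dataclass
class Rotator:
    grace_period: timedelta
    max_token_lifespan: timedelta
    safety_buffer: timedelta
    keys: Dict[str, Key] = field(default_factory=dict)  # keys still in the JWKS
    active_kid: Optional[str] = None

    def start_rotation(self, kid: str, now: datetime) -> None:
        """Phases 1-2: register a freshly generated key and publish it."""
        self.keys[kid] = Key(kid=kid, published_at=now)

    def tick(self, now: datetime) -> None:
        """Apply every transition that is due at `now`."""
        # Phase 6: drop retired keys whose deferred drop time has passed.
        due = [k.kid for k in self.keys.values()
               if k.drop_at is not None and now >= k.drop_at]
        for kid in due:
            del self.keys[kid]
        # Phases 3-5: activate keys whose grace period elapsed, oldest first.
        ready = sorted(
            (k for k in self.keys.values()
             if k.activated_at is None
             and now >= k.published_at + self.grace_period),
            key=lambda k: k.published_at,
        )
        for new_key in ready:
            old_key = self.keys.get(self.active_kid)
            new_key.activated_at = now
            self.active_kid = new_key.kid
            if old_key is not None:
                # Retire the predecessor and schedule its deferred drop.
                old_key.drop_at = now + self.max_token_lifespan + self.safety_buffer

    def jwks_kids(self) -> List[str]:
        return sorted(self.keys)
```

A scheduler would call `start_rotation` once per `rotation_cadence` and `tick` more frequently; because `tick` only compares timestamps, it is safe to call repeatedly and after a restart, provided the key records are persisted.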
Invariants preserved
- Every `kid` a verifier sees in a token was in the JWKS before that token was signed (enforced by publish → grace → activate ordering).
- Every `kid` used to sign a still-valid token is still in the JWKS (enforced by retire-to-drop deferral via the formula).
Any compression of the sequence violates one of these; see concepts/signing-key-rotation-lifecycle#why-this-ordering-is-non-negotiable for the full proof.
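The two invariants reduce to timestamp comparisons, which makes the worst cases easy to check by hand. The sketch below does so for one illustrative key timeline; the function names and all dates and durations are assumptions for the example, and the first check tests publication time only, not whether every verifier cache has actually refreshed.

```python
from datetime import datetime, timedelta


def published_before_signing(published_at: datetime, token_iat: datetime) -> bool:
    """Invariant 1: the key was in the JWKS before the token was signed."""
    return published_at < token_iat


def published_until_expiry(dropped_at: datetime, token_exp: datetime) -> bool:
    """Invariant 2: the key stays in the JWKS until the token has expired."""
    return token_exp <= dropped_at


# Illustrative timeline (all values are assumptions).
grace_period = timedelta(hours=6)
max_token_lifespan = timedelta(days=30)
safety_buffer = timedelta(hours=1)

published_at = datetime(2025, 1, 1)
activated_at = published_at + grace_period
retired_at = activated_at + timedelta(days=7)
dropped_at = retired_at + max_token_lifespan + safety_buffer

# Worst cases: the first token signed at activation, and the last token
# signed at the moment of retirement with the maximum lifespan.
first_iat = activated_at
last_exp = retired_at + max_token_lifespan

assert published_before_signing(published_at, first_iat)
assert published_until_expiry(dropped_at, last_exp)

# Compressing the sequence breaks an invariant.
assert not published_before_signing(published_at, published_at)  # no grace
assert not published_until_expiry(retired_at, last_exp)          # immediate drop
```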
Context — when to use
- Production OIDC / JWT identity providers with a verifier fleet you cannot directly coordinate with.
- Any signed-artifact system with a publish-then-verify model and cache-based public-key distribution (TLS cert rotation in a PKI, code-signing key rotation, SAML IdP certificate rotation — same structural shape).
- Federation trust anchors where manual rotation ceremonies would be rare enough to extend the compromise-exposure window beyond acceptable.
Context — when NOT to use
- Emergency rotation after a key-compromise incident. The pattern is a preventive control; compromise requires immediate revocation with accepted token-invalidation. Do not try to shoehorn emergency rotation into this scheduled loop: use a separate revocation path (kill the retire-to-drop timer, drop the key immediately, accept 401s).
- Very short-lived systems where the IdP itself lives less than a rotation cycle. The loop's asymptotic benefit only manifests across many cycles.
- Hardware-root-of-trust keys where physical ceremony is the security property (root CA HSMs, bootloader firmware signing keys). The pattern's automation principle is actively undesirable at that layer.
Consequences
Good
- Seamless client experience on planned rotations — zero client outage, zero retry storms, zero cached-key staleness exposure.
- Shrinks compromise-exposure window to one rotation cadence — the longer a compromised key can be used unnoticed, the larger the blast radius; cadence directly bounds this.
- Operationally cheap — once the loop is running, a rotation is a cron execution, not an on-call event.
- Composable with short token lifespans: a short `exp - iat` directly shortens the retained-key-drop window, which reduces steady-state JWKS cardinality.
Bad
- Grace-period cost. Every rotation consumes `grace_period` of wall-clock time before the new key becomes active. This bounds rotation cadence from below: you cannot rotate more often than every `grace_period + ε` without the new-key-not-yet-active windows overlapping.
- JWKS steady-state cardinality grows with retention. JWKS key count ≈ rotation frequency × `retention_window`, i.e. `retention_window` divided by the interval between rotations. Small for most configurations (3-5 keys) but needs attention for very-high-cadence or very-long-lifespan deployments.
- Cache-layer opacity. Any caching layer the IdP operator didn't account for (a new CDN, a new proxy, a new client library with its own refresh policy) silently extends the required grace period. Underestimating grace → unknown-`kid` 401s in the wild.
- `max_token_lifespan` is a structural knob. Long-lived refresh tokens directly extend how long retired keys must stay published. Teams that issue 30-day refresh tokens live with 30-day retention obligations per rotation.
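To put numbers on the cardinality bullet, the function below gives an upper bound on the number of keys the JWKS can hold at once: one active key, the keys still waiting out their grace period, and the retired keys not yet dropped. It is a back-of-the-envelope sketch under the assumption that rotations fire at a fixed interval; the function name and the sample configurations are made up for illustration.

```python
import math
from datetime import timedelta


def jwks_key_count_upper_bound(
    rotation_cadence: timedelta,
    grace_period: timedelta,
    max_token_lifespan: timedelta,
    safety_buffer: timedelta,
) -> int:
    """Upper bound on simultaneously published keys at a fixed cadence."""
    retention = max_token_lifespan + safety_buffer
    # Retired keys are dropped `retention` after retirement, and one key is
    # retired per rotation interval.
    retired = math.ceil(retention / rotation_cadence)
    # Keys published but not yet activated (at most one if grace < cadence).
    pending = math.ceil(grace_period / rotation_cadence)
    return 1 + pending + retired


# Weekly rotation with 30-day refresh tokens: up to 7 keys.
weekly = jwks_key_count_upper_bound(
    timedelta(days=7), timedelta(days=1), timedelta(days=30), timedelta(days=1)
)

# 30-day rotation with 1-hour tokens: up to 3 keys.
monthly = jwks_key_count_upper_bound(
    timedelta(days=30), timedelta(days=1), timedelta(hours=1), timedelta(hours=1)
)
```

The first configuration shows how a long `max_token_lifespan` dominates the count even at a moderate cadence.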
Neutral
- Requires `kid`-in-JWT-header. Without it, the drop-time formula isn't computable and the pattern degrades to measurement-based heuristics. Fortunately this is standard JWT practice.
- Private-key storage surface expands. Several keys in different lifecycle phases coexist at any moment; they all need secure storage. The Zalando post defers this to "industry best practices"; HSM, KMS, and split-custody are all valid implementations.
Tradeoffs
- Rotation cadence vs JWKS endpoint load. Shorter cadence requires shorter `jwks_max_age` (because grace must fit inside cadence), which means more JWKS fetches per client per day, which means more endpoint load. CDN fronting amortises this but adds another cache layer that must be accounted for in grace.
- Grace period vs compromise-exposure window. A longer grace period is always safe from a correctness perspective but extends the window during which a compromised old private key could still be used to forge tokens. In practice grace is chosen with generous headroom because the correctness-cost of too-short is acute (401s) while the security-cost of too-long is marginal (compromise is rare and the cadence itself is the primary control).
- Single-active-key vs multi-active-key models. The Zalando pattern is single-active-key (clean lifecycle, simpler reasoning). Multi-active-key variants support deployments where not all signing instances have picked up the new key simultaneously but require the lifecycle to overlap across generations.
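The cadence-versus-load tradeoff can be estimated with simple arithmetic. The sketch below assumes each client refetches the JWKS exactly once per `jwks_max_age` and that no CDN or shared cache absorbs requests, so it approximates origin load in the worst case; the function names and figures are illustrative.

```python
from datetime import timedelta


def jwks_fetches_per_client_per_day(jwks_max_age: timedelta) -> float:
    """Fetches per day if a client refreshes once per max-age expiry."""
    return timedelta(days=1) / jwks_max_age


def fleet_requests_per_second(clients: int, jwks_max_age: timedelta) -> float:
    """Average JWKS request rate for a fleet with no shared cache."""
    return clients / jwks_max_age.total_seconds()


hourly = jwks_fetches_per_client_per_day(timedelta(hours=1))         # 24.0
five_minute = jwks_fetches_per_client_per_day(timedelta(minutes=5))  # 288.0
rate = fleet_requests_per_second(10_000, timedelta(minutes=5))       # about 33.3
```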
Related patterns and concepts
- concepts/signing-key-rotation-lifecycle — the phase-level state machine this pattern automates.
- concepts/jwk-json-web-key — the key-distribution substrate this pattern operates over.
- concepts/cache-control-aware-grace-period: how the `grace_period` parameter gets its floor.
- concepts/retirement-plus-lifespan-plus-buffer-formula: the arithmetic that schedules phase-6 drops.
- concepts/long-lived-key-risk — the risk framing that motivates making rotation structural rather than ad hoc.
- concepts/oidc-identity-federation — the OIDC surface where this pattern is most commonly instantiated.
- systems/zalando-oidc-identity-provider — canonical wiki instance.
Seen in
- sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando — Zalando's Customer Authentication Experience team canonicalises the pattern in public prose, with the four principles (automation / scheduled / secure / seamless) named explicitly and the six-phase lifecycle + drop-time formula described verbatim.
Generalisation
The publish → grace → activate → retire → deferred-drop shape generalises to any rotating public artefact where the publisher cannot coordinate with consumers:
- TLS certificate rotation — publish new cert before old cert expires; overlap window = grace + retention.
- DNS rotation — new record in DNS caches before old record's TTL expires.
- Feature-flag deprecations — new flag value rolled out before old code path removed; retention = time until every client has upgraded past the flag reference.
- Schema versioning — new schema version visible to readers before writers flip; retention = time until every stored artefact has been re-written to new schema.
Each case shares the same structural invariant: publisher propagation latency must be bounded and must be less than the grace period chosen for the activation step.