PATTERN Cited by 1 source

Phased automated JWK rotation

Pattern

Run the six-phase signing-key rotation lifecycle as a scheduled, fully automated loop over a JWKS endpoint, so that planned rotations are invisible to downstream verifiers and never require human coordination with client fleets.

The pattern combines four discrete principles — automation, scheduling, secure key storage, seamless transition — into a single repeatable rotation primitive owned by the identity provider (IdP) and expressed as a state machine over:

  • the set of keys currently in each lifecycle phase (generate / publish / grace / activate / retire / drop);
  • the JWKS endpoint that publishes the public half of every key in phases publish through retire;
  • the two hard-gate timers (grace period and retirement-plus-lifespan-plus-buffer).

Canonical wiki instance: Zalando's customer-identity OIDC IdP at accounts.zalando.com/.well-known/jwk_uris, described in the 2025-01-20 article (Source: sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando).

Problem

Long-lived cryptographic signing keys are ticking time bombs (concepts/long-lived-key-risk). If the private half of a JWT signing key is compromised, an attacker can forge tokens that verify under it, and every token already signed by it becomes untrustworthy until that token expires. Rotation shrinks the exposure window, but rotation is traditionally hard for three reasons:

  1. Key distribution. Historically rotating a public key required coordinated out-of-band distribution to every client — slow, error-prone, and often skipped.
  2. Ordering. Naïve rotation (swap old for new atomically) breaks verifiers whose cached key set lags the switch; the next request with the new kid fails signature verification and returns 401.
  3. Human intervention cost. Manual rotation ceremonies are rare by design (ceremony overhead) but that rarity extends the compromise-exposure window — the exact property rotation was meant to shrink.

The pattern solves all three by making rotation a continuous, self-driving background process keyed on calendar cadence, not on incident-response triggers.

Shape

Inputs (IdP-controlled, configurable)

  • rotation_cadence — how often a new key is generated (e.g. weekly, monthly).
  • jwks_max_age — HTTP Cache-Control: max-age advertised on JWKS responses.
  • grace_period — time between publishing a new key and activating it; must satisfy grace_period ≥ jwks_max_age + downstream_cache_layers + client_refresh_min_policy.
  • max_token_lifespan — longest exp - iat the IdP issues for any JWT (access tokens, refresh tokens, any class).
  • safety_buffer — slack added to the drop-time formula for clock skew, in-flight requests, implementation variance.
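
As an illustration of how these inputs constrain each other, the sketch below models them as a Python dataclass and checks the grace-period inequality. It is not taken from the Zalando article: the class name, the `problems` method, and the choice to express `downstream_cache_layers` and `client_refresh_min_policy` as durations (`downstream_cache_ttl`, `client_refresh_min`) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RotationConfig:
    rotation_cadence: timedelta    # interval between key generations
    jwks_max_age: timedelta        # Cache-Control max-age on JWKS responses
    grace_period: timedelta        # delay between publish and activate
    max_token_lifespan: timedelta  # longest exp - iat the IdP issues
    safety_buffer: timedelta       # slack for clock skew, in-flight requests
    # Assumption: the two remaining terms of the inequality, as durations.
    downstream_cache_ttl: timedelta = timedelta(0)
    client_refresh_min: timedelta = timedelta(0)

    def min_grace_period(self) -> timedelta:
        """Right-hand side of the grace_period inequality."""
        return self.jwks_max_age + self.downstream_cache_ttl + self.client_refresh_min

    def retention_window(self) -> timedelta:
        """How long a retired key stays published before it is dropped."""
        return self.max_token_lifespan + self.safety_buffer

    def problems(self) -> List[str]:
        """Return human-readable violations; an empty list means the config is consistent."""
        issues = []
        if self.grace_period < self.min_grace_period():
            issues.append("grace_period is shorter than worst-case cache staleness")
        if self.rotation_cadence <= self.grace_period:
            issues.append("rotation_cadence must be longer than grace_period")
        return issues
```

Running `problems()` at deploy time, rather than discovering a too-short grace period through 401s, is one way to make the inequality a hard gate.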

State machine per key

For each key K generated by the loop:

Phase     Private key  JWKS entry  Signs  Verifies  Transition trigger
Generate  created      absent      no     no        rotation schedule fires
Publish   exists       present     no     no        immediate after generate
Grace     exists       present     no     no        grace_period elapsed
Activate  exists       present     yes    yes       next rotation schedule
Retire    exists       present     no     yes       successor activated
Drop      destroyed    absent      no     no        retirement_time + max_token_lifespan + safety_buffer elapsed

At any moment, the JWKS endpoint advertises the publish + grace + active + retired keys — typically 3-5 keys in steady-state depending on cadence and retention.
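
The per-phase columns above can be encoded as plain sets, which makes the JWKS contents a pure function of the key-to-phase mapping. This is a minimal sketch; the `Phase` enum and the helper names are hypothetical, not an API from the source.

```python
from enum import Enum
from typing import Dict, List


class Phase(Enum):
    GENERATE = 1
    PUBLISH = 2
    GRACE = 3
    ACTIVATE = 4
    RETIRE = 5
    DROP = 6


# One set per table column.
IN_JWKS = {Phase.PUBLISH, Phase.GRACE, Phase.ACTIVATE, Phase.RETIRE}
SIGNS = {Phase.ACTIVATE}
VERIFIES = {Phase.ACTIVATE, Phase.RETIRE}


def jwks_kids(keys: Dict[str, Phase]) -> List[str]:
    """Return the kids whose public half the JWKS endpoint should advertise."""
    return sorted(kid for kid, phase in keys.items() if phase in IN_JWKS)


def signing_kid(keys: Dict[str, Phase]) -> str:
    """Return the active kid; the single-active-key model expects exactly one."""
    active = [kid for kid, phase in keys.items() if phase in SIGNS]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active key, found {len(active)}")
    return active[0]
```

For example, a mapping with one dropped, one retired, one active, and one grace-phase key yields three advertised kids and a single signing kid.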

Loop (per rotation schedule)

every <rotation_cadence>:
    K_new = generate_key_pair()         # phase 1
    publish(K_new)                      # phase 2
    wait(grace_period)                  # phase 3
    K_old_active = current_active()
    activate(K_new)                     # phase 4
    retire(K_old_active)                # phase 5
    schedule_drop(
        K_old_active,
        at = now + max_token_lifespan + safety_buffer,
    )                                    # phase 6 (deferred)
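
The pseudocode above blocks in `wait(grace_period)`. One way to make the same loop restartable is to store timestamps on each key and advance phases from a periodic tick. The sketch below is an assumption-laden illustration, not the Zalando implementation: `Key`, `Rotator`, `start_rotation`, and `tick` are hypothetical names, key material is omitted, and time is passed in explicitly so the behaviour can be simulated.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Key:
    kid: str
    published_at: datetime
    activated_at: Optional[datetime] = None
    retired_at: Optional[datetime] = None
    drop_at: Optional[datetime] = None


@dataclass
class Rotator:
    grace_period: timedelta
    max_token_lifespan: timedelta
    safety_buffer: timedelta
    keys: List[Key] = field(default_factory=list)
    issued: int = 0

    def start_rotation(self, now: datetime) -> Key:
        """Phases 1-2: generate a key and publish it immediately."""
        self.issued += 1
        key = Key(kid=f"key-{self.issued}", published_at=now)
        self.keys.append(key)
        return key

    def active(self) -> Optional[Key]:
        """Return the key that currently signs, if any."""
        for k in self.keys:
            if k.activated_at is not None and k.retired_at is None:
                return k
        return None

    def tick(self, now: datetime) -> None:
        """Advance every key whose timer has elapsed; safe to call repeatedly."""
        for key in self.keys:
            grace_over = now - key.published_at >= self.grace_period  # phase 3
            if key.activated_at is None and grace_over:
                old = self.active()
                if old is not None:
                    # Phase 5: retire the predecessor and schedule its drop.
                    old.retired_at = now
                    old.drop_at = now + self.max_token_lifespan + self.safety_buffer
                key.activated_at = now  # phase 4
        # Phase 6: remove retired keys whose deferred drop time has passed.
        self.keys = [k for k in self.keys if k.drop_at is None or now < k.drop_at]

    def jwks(self) -> List[str]:
        """Kids currently advertised (every key that has not been dropped)."""
        return [k.kid for k in self.keys]
```

A scheduler would call `start_rotation` once per `rotation_cadence` and `tick` far more often; because all state lives in timestamps, a crash between ticks does not lose a rotation in progress.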

Invariants preserved

  1. Every kid a verifier sees in a token was in the JWKS before that token was signed (enforced by publish → grace → activate ordering).
  2. Every kid used to sign a still-valid token is still in the JWKS (enforced by retire-to-drop deferral via the formula).

Any compression of the sequence violates one of these; see concepts/signing-key-rotation-lifecycle#why-this-ordering-is-non-negotiable for the full proof.
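
To make the two invariants concrete, the sketch below checks them for a single token against a single key's timeline. The function names and argument layout are hypothetical; the drop-time formula is the one from the state-machine table. The worst case for invariant 2 is a token signed at the instant of retirement with the maximum lifespan.

```python
from datetime import datetime, timedelta
from typing import Tuple


def drop_time(retired_at: datetime, max_token_lifespan: timedelta,
              safety_buffer: timedelta) -> datetime:
    """Earliest moment the retired key may leave the JWKS."""
    return retired_at + max_token_lifespan + safety_buffer


def check_invariants(published_at: datetime, grace_period: timedelta,
                     dropped_at: datetime, token_iat: datetime,
                     token_exp: datetime) -> Tuple[bool, bool]:
    """Return (invariant_1, invariant_2) for one token signed by one key."""
    # 1: the key had been published for at least grace_period before signing.
    inv1 = published_at + grace_period <= token_iat
    # 2: the key is still published for the token's whole validity.
    inv2 = token_exp <= dropped_at
    return inv1, inv2
```

Dropping the key at retirement time instead of at `drop_time(...)` makes the second element false for any token that is still valid, which is the failure the deferral exists to prevent.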

Context — when to use

  • Production OIDC / JWT identity providers with a verifier fleet you cannot directly coordinate with.
  • Any signed-artifact system with a publish-then-verify model and cache-based public-key distribution (TLS cert rotation in a PKI, code-signing key rotation, SAML IdP certificate rotation — same structural shape).
  • Federation trust anchors where manual rotation ceremonies would be rare enough to extend the compromise-exposure window beyond acceptable limits.

Context — when NOT to use

  • Emergency rotation after a key-compromise incident. The pattern is a preventive control; a compromise requires immediate revocation and accepting that outstanding tokens become invalid. Do not try to shoehorn emergency rotation into this scheduled loop; use a separate revocation path (kill the retire-to-drop timer, drop the key immediately, accept 401s).
  • Very short-lived systems where the IdP itself lives less than a rotation cycle. The loop's asymptotic benefit only manifests across many cycles.
  • Hardware-root-of-trust keys where physical ceremony is the security property (root CA HSMs, bootloader firmware signing keys). The pattern's automation principle is actively undesirable at that layer.

Consequences

Good

  • Seamless client experience on planned rotations — zero client outage, zero retry storms, zero cached-key staleness exposure.
  • Shrinks compromise-exposure window to one rotation cadence — the longer a compromised key can be used unnoticed, the larger the blast radius; cadence directly bounds this.
  • Operationally cheap — once the loop is running, a rotation is a cron execution, not an on-call event.
  • Composable with short token lifespans — short exp - iat directly shortens the retained-key-drop window, which reduces steady-state JWKS cardinality.

Bad

  • Grace-period cost. Every rotation consumes grace_period of wall-clock before the new key becomes active. This bounds rotation cadence from below: you cannot rotate more often than grace_period + ε without the new-key-not-yet-active windows overlapping.
  • JWKS steady-state cardinality grows with retention. With rotation_cadence read as the interval between rotations, retention_window / rotation_cadence ≈ the number of retired keys still published, on top of the active key and any key in its grace period. Small for most configurations (3-5 keys) but needs attention for very-high-cadence or very-long-lifespan deployments.
  • Cache-layer opacity. Any caching layer the IdP operator didn't account for (a new CDN, a new proxy, a new client library with its own refresh policy) silently extends the required grace period. Underestimating grace → unknown-kid 401s in the wild.
  • max_token_lifespan is a structural knob. Long-lived refresh tokens directly extend how long retired keys must stay published. Teams that issue 30-day refresh tokens live with 30-day retention obligations per rotation.
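
The cardinality and retention points above reduce to simple arithmetic. The sketch below gives a rough upper bound on how many keys the JWKS advertises at once, assuming rotation_cadence is the interval between rotations, a single active key, and at most one key in its grace period. The function name is hypothetical.

```python
import math
from datetime import timedelta


def max_jwks_keys(rotation_cadence: timedelta, grace_period: timedelta,
                  max_token_lifespan: timedelta, safety_buffer: timedelta) -> int:
    """Upper bound on simultaneously published keys under the stated assumptions."""
    retention = max_token_lifespan + safety_buffer
    # A key is retired once per cadence and stays published for `retention`.
    retired = math.ceil(retention / rotation_cadence)
    # One published-but-not-yet-active key while a grace window is open.
    pending = 1 if grace_period > timedelta(0) else 0
    return 1 + pending + retired  # the 1 is the active key
```

With weekly rotation, a 6-hour grace period, 1-day tokens, and a 1-hour buffer the bound is 3 keys; switching to 30-day tokens with a 1-day buffer raises it to 7, which illustrates how token lifespan drives retention.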

Neutral

  • Requires kid-in-JWT-header. Without it, the drop-time formula isn't computable and the pattern degrades to measurement-based heuristics. Fortunately this is standard JWT practice.
  • Private-key storage surface expands. Several keys in different lifecycle phases coexist at any moment; they all need secure storage. The Zalando post defers this to "industry best practices" — HSM, KMS, split-custody are all valid implementations.
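
On the verifier side, the kid requirement amounts to reading the kid from the JWT header and picking the matching entry from the JWKS document. The sketch below uses only the Python standard library and performs no signature verification; `jwt_kid` and `select_key` are hypothetical helper names, and the JWKS is assumed to be the usual `{"keys": [...]}` JSON object already fetched and parsed.

```python
import base64
import json
from typing import Any, Dict


def jwt_kid(token: str) -> str:
    """Extract the kid from a compact JWT's header without verifying anything."""
    header_b64 = token.split(".")[0]
    # base64url in JWTs is unpadded; restore padding before decoding.
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    header = json.loads(base64.urlsafe_b64decode(padded))
    return header["kid"]


def select_key(token: str, jwks: Dict[str, Any]) -> Dict[str, Any]:
    """Return the JWK whose kid matches the token header, or raise KeyError."""
    kid = jwt_kid(token)
    for jwk in jwks.get("keys", []):
        if jwk.get("kid") == kid:
            return jwk
    # In this pattern an unknown kid means the cached JWKS is stale
    # or the key has already been dropped.
    raise KeyError(f"unknown kid: {kid}")
```

A real verifier would hand the selected JWK to a JWT library for signature and claim validation; what it does on `KeyError` (refetch the JWKS once, then reject) is a policy choice outside this sketch.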

Tradeoffs

  • Rotation cadence vs JWKS endpoint load. Shorter cadence requires shorter jwks_max_age (because grace must fit inside cadence), which means more JWKS fetches per client per day, which means more endpoint load. CDN fronting amortises this but adds another cache layer that must be accounted for in grace.
  • Grace period vs compromise-exposure window. A longer grace period is always safe from a correctness perspective but extends the window during which a compromised old private key could still be used to forge tokens. In practice grace is chosen with generous headroom because the correctness cost of a too-short grace period is acute (401s) while the security cost of a too-long one is marginal (compromise is rare and the cadence itself is the primary control).
  • Single-active-key vs multi-active-key models. The Zalando pattern is single-active-key (clean lifecycle, simpler reasoning). Multi-active-key variants support deployments where not all signing instances have picked up the new key simultaneously but require the lifecycle to overlap across generations.

Seen in

  • sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando: Zalando's customer-identity OIDC IdP, the canonical instance of this pattern.

Generalisation

The publish → grace → activate → retire → deferred-drop shape generalises to any rotating public artefact where the publisher cannot coordinate with consumers:

  • TLS certificate rotation — publish new cert before old cert expires; overlap window = grace + retention.
  • DNS rotation — new record in DNS caches before old record's TTL expires.
  • Feature-flag deprecations — new flag value rolled out before old code path removed; retention = time until every client has upgraded past the flag reference.
  • Schema versioning — new schema version visible to readers before writers flip; retention = time until every stored artefact has been re-written to new schema.

Each case shares the same structural invariant: publisher propagation latency must be bounded and must be less than the grace period chosen for the activation step.
