Zalando — JSON Web Keys (JWK): Rotating Cryptographic Keys at Zalando

Summary

Zalando's Customer Authentication Experience team describes how their OpenID Connect (OIDC) identity provider (IdP) rotates its JWT signing keys automatically via its JWKS endpoint at accounts.zalando.com/.well-known/jwk_uris. The post is a short, mechanism-level canonicalisation of the generate → publish → grace → activate → retire → drop lifecycle for signing keys, including the verbatim drop_time = retirement_time + max_token_lifespan + safety_buffer formula, and the explicit emphasis that cache-control headers are the load-bearing knob that makes the transition invisible to clients. Four design principles are named: automation, scheduled rotation, secure key management, and seamless rotation (zero client impact on planned rotations). The post's canonical value is not scale numbers (none disclosed) but the ordering discipline: the six-phase lifecycle + two hard gates (grace before activation, lifespan+buffer before drop) is the minimum viable shape for rotating a long-lived federation trust anchor without breaking verifiers.
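The drop-time formula quoted above is plain timestamp arithmetic. A minimal Python sketch follows; the durations are hypothetical placeholders, since the post discloses no real values, and the function name simply mirrors the formula's terms.

```python
from datetime import datetime, timedelta, timezone


def drop_time(retirement_time: datetime,
              max_token_lifespan: timedelta,
              safety_buffer: timedelta) -> datetime:
    """Earliest moment a retired key's public part may leave the JWKS."""
    return retirement_time + max_token_lifespan + safety_buffer


# Hypothetical values for illustration only.
retired_at = datetime(2025, 1, 20, 12, 0, tzinfo=timezone.utc)
print(drop_time(retired_at, timedelta(hours=1), timedelta(minutes=15)))
# 2025-01-20 13:15:00+00:00
```

Every input is known to the IdP at retirement time, so the result can be stored alongside the key record instead of being re-derived or measured later.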

Key takeaways

  1. Static secrets are the failure mode being designed out. "Static secrets are evil. Whether secret keys hard-coded in source code, tokens without expiry or plaintext API keys referenced in configuration files, static secrets are ticking time bombs. The same is true for cryptographic key material in the context of JSON Web Tokens (JWTs) and OpenID Connect (OIDC)." The IdP's signing key is the most load-bearing long-lived key in the entire customer-auth architecture — if its private part leaks, "anyone could forge fake tokens … all tokens signed with the leaked key would become untrustworthy." Rotation is the structural defence (Source: sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando). Canonical instance of concepts/long-lived-key-risk applied to OIDC IdP signing keys (tier-4 in the priority ladder: federation trust anchors).

  2. JWK is the key-distribution web primitive that makes rotation cheap. "Identity providers (IdPs) like ours commonly use JWKs to distribute public key material via well-known and specified URIs. Clients can use the key material to e.g. verify digitally signed JSON Web Tokens (JWTs) issued by the IdP." JWK (RFC 7517) is part of the JOSE family. Without JWK's web-native JSON format and kid-indexed set, every rotation would require coordinated distribution of new public keys to every client, the historical pain point that leads many teams to skip rotation entirely. See concepts/jwk-json-web-key.

  3. The four principles the rotation system rests on are named explicitly: "Automation: New keys are generated and old keys are retired automatically, eliminating manual intervention and ensuring consistency. Scheduled Rotation: Keys are rotated on a regular basis to minimize the window of vulnerability. Secure Key Management: Our keys are securely stored and managed using industry best practices to protect them from unauthorized access. Seamless Rotation: Planned rotations are transparent to clients and do not result in any kind of access revocation or token invalidation." Automation + Seamless are the two the lifecycle mechanism operationalises; Scheduled + Secure are the operational context. Canonicalised as patterns/phased-automated-jwk-rotation.

  4. The six-phase rotation lifecycle, verbatim: "First, a new key pair is generated. We then publish the public key portion of this new pair on our JWK endpoint, making it available to our clients. To avoid any immediate disruptions, we incorporate a grace period, allowing clients ample time to fetch the latest set of JWKs – cache control headers matter! After this period, the new key is being elected as the new active signing key. The previous active key is being retired, meaning it's no longer used for signing new tokens, but its public key remains available on the JWK endpoint to ensure that previously issued tokens can still be verified. Finally, once a retired key surpasses the maximum lifetime of any token it might have signed, we remove its public key from the JWK endpoint." This is the canonical public prose description of what the wiki canonicalises as generate → publish → grace → activate → retire → drop. See concepts/signing-key-rotation-lifecycle for the full state-machine analysis and both hard gates.

  5. "Cache control headers matter!" — the grace period is measured in cache TTLs, not clock-time. The emphatic tell points at the load-bearing knob: JWKS responses are cached by clients (and often by CDNs, OIDC libraries with their own minimum-refresh policy, and intermediate proxies). The grace period before activating the new key must exceed the publisher's Cache-Control: max-age plus any downstream cache layer plus client-library refresh minimums. If the grace is too short, a JWT signed with the new key arrives at verifiers whose cached JWKS still lacks the new kid → 401. Canonicalised as concepts/cache-control-aware-grace-period.

  6. The drop-time formula is a pure arithmetic function of IdP-controlled knobs: "We simply take the time the key was retired, add the maximum token lifespan, and add a little extra time just to be safe. At that point, any token signed with that key will have expired, so it's safe to remove the key from our public list." Because the IdP sets exp - iat on issuance and every JWT carries a kid, "when is it safe to drop retired key K?" is computable at retirement time without polling verifiers or measuring token usage. Two design choices (kid in the header and an IdP-controlled lifespan) are what make the formula a calculation, not a measurement. Canonicalised as concepts/retirement-plus-lifespan-plus-buffer-formula. Design consequence: the retention obligation for a retired key scales with the longest-lived token it may have signed, so short lifespans (access tokens at minutes-to-hours vs refresh tokens at days-to-weeks) directly shorten it, which is one reason to keep access-token lifespans low.

  7. Why the ordering is non-negotiable. Compressing the sequence breaks verifiers in predictable ways: skip publish/grace and verifiers see an unknown kid; skip the retirement window (drop the old public key as soon as it stops signing) and tokens signed just before the switch fail verification even though they have not expired. The lifecycle preserves two verifier-facing invariants: (a) every kid in a token was in the JWKS before the token was signed, and (b) every kid carried by an unexpired token is still in the JWKS. Both invariants hold under the six-phase ordering; each compression breaks at least one of them. See concepts/signing-key-rotation-lifecycle for full analysis of invariants and compression-failure modes.
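Takeaways 5 and 7 can be illustrated with a toy verifier cache. This is a sketch under stated assumptions, not Zalando's implementation: times are integer seconds, CachingVerifier, min_grace_seconds and first_new_token_accepted are hypothetical names, the TTL values are invented, and the verifier refetches only on cache expiry (many real OIDC libraries also refetch on an unknown kid, which softens the failure shown here).

```python
from dataclasses import dataclass


def min_grace_seconds(max_age: int, downstream_ttls: list[int],
                      client_refresh_min: int) -> int:
    """Lower bound on the grace period: the sum of every cache layer's TTL."""
    return max_age + sum(downstream_ttls) + client_refresh_min


@dataclass
class CachingVerifier:
    ttl: int                                  # seconds a fetched JWKS is reused
    fetched_at: int = -1                      # -1 means "never fetched"
    cached_kids: frozenset = frozenset()

    def knows(self, kid: str, now: int, published: frozenset) -> bool:
        # Assumption: refetch only when the cached copy has expired.
        if self.fetched_at < 0 or now - self.fetched_at >= self.ttl:
            self.cached_kids, self.fetched_at = published, now
        return kid in self.cached_kids


def first_new_token_accepted(grace: int, ttl: int = 300) -> bool:
    publish_at = 10
    verifier = CachingVerifier(ttl=ttl)
    # Worst case: the cache was filled just before the new key was published.
    verifier.knows("old", now=publish_at - 1, published=frozenset({"old"}))
    activate_at = publish_at + grace
    # The first token signed with the new key arrives right at activation.
    return verifier.knows("new", now=activate_at,
                          published=frozenset({"old", "new"}))


print(min_grace_seconds(300, [60], 30))      # 390
print(first_new_token_accepted(grace=20))    # False: cached JWKS lacks "new"
print(first_new_token_accepted(grace=300))   # True: cache expired and refetched
```

With a grace shorter than the cache TTL, invariant (a) holds at the publisher but not from the verifier's point of view, which is the gap the grace period exists to close.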

Systems and concepts surfaced

Systems

Concepts

Patterns

  • patterns/phased-automated-jwk-rotation — the automated system-level pattern that encodes the lifecycle as a scheduled loop over the JWKS endpoint; rolls up the four principles (automation / scheduled / secure / seamless) into a single repeatable rotation primitive.
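One way to read "a scheduled loop over the JWKS endpoint" is a periodic tick that advances each key through its phases and enforces both gates. The sketch below is an illustration under assumptions, not the post's implementation: Phase, KeyRecord, tick and jwks_kids are hypothetical names, times are integer seconds, the durations are invented, key generation and private-key storage are omitted, and it follows the single-active-key model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    PUBLISHED = "published"   # in the JWKS, waiting out the grace period
    ACTIVE = "active"         # the one key signing new tokens
    RETIRED = "retired"       # still in the JWKS, no longer signing
    DROPPED = "dropped"       # removed from the JWKS


@dataclass
class KeyRecord:
    kid: str
    published_at: int
    phase: Phase = Phase.PUBLISHED
    retired_at: Optional[int] = None


def tick(keys: list[KeyRecord], now: int, grace: int,
         max_token_lifespan: int, safety_buffer: int) -> None:
    """One pass of the scheduled loop; mutates key phases in place."""
    # Gate 2: drop only after retirement + max lifespan + buffer.
    for k in keys:
        if (k.phase is Phase.RETIRED and k.retired_at is not None
                and now >= k.retired_at + max_token_lifespan + safety_buffer):
            k.phase = Phase.DROPPED
    # Gate 1: promote a published key only after the grace period.
    ready = [k for k in keys
             if k.phase is Phase.PUBLISHED and now >= k.published_at + grace]
    if ready:
        for k in keys:
            if k.phase is Phase.ACTIVE:
                k.phase, k.retired_at = Phase.RETIRED, now
        ready[0].phase = Phase.ACTIVE


def jwks_kids(keys: list[KeyRecord]) -> list[str]:
    """kids currently served on the JWKS endpoint."""
    return [k.kid for k in keys if k.phase is not Phase.DROPPED]


keys = [KeyRecord("k1", published_at=0, phase=Phase.ACTIVE),
        KeyRecord("k2", published_at=100)]
for now in (150, 200, 269, 270):
    tick(keys, now, grace=100, max_token_lifespan=60, safety_buffer=10)
    print(now, [(k.kid, k.phase.value) for k in keys], jwks_kids(keys))
# k2 becomes active at 200 (k1 retired); k1 leaves the JWKS at 270.
```

Keeping the phase and timestamps on each record makes both gates simple comparisons, so the loop can run on any schedule without extra state.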

Operational numbers

The post is pedagogy-altitude and discloses no operational numbers:

  • No JWKS cache-control max-age value.
  • No rotation cadence (daily? weekly? monthly?).
  • No absolute grace-period duration.
  • No max-token-lifespan value or safety buffer length.
  • No fleet-size / rps framing for the JWKS endpoint.
  • No per-rotation key-count ceiling (expected steady-state JWKS cardinality).

This is consistent with Zalando Engineering posts at the pedagogy + design-principles altitude (contrast with concrete-numbers posts like the 2024-12-05 OPA-in-Skipper ingest, the 2025-02-16 Route Server ingest, or the 2023-01-30 1,200-playbooks ingest).

The diagram image at img01.ztat.net/engineering-blog/posts/2025/01/images/json-web-key-rotation.png is a schematic of the six-phase lifecycle; no additional numerical content.

Caveats

  • Pedagogy altitude, not incident retrospective. No production incident, no rotation-gone-wrong story, no operational numbers. The post is useful for canonicalising the shape of the lifecycle but provides no evidence about edge cases under fleet-scale load.
  • No emergency-rotation discussion. The post describes scheduled rotation only. Emergency rotation (private-key compromise) has different structure — it requires immediately invalidating outstanding tokens, which is the opposite of seamless. The post silently avoids this distinction; covered in concepts/signing-key-rotation-lifecycle#boundary-conditions.
  • Single-active-key model assumed. Some IdPs rotate with overlap (two keys actively signing for a window) to support deployments where not all signing instances have picked up the new key simultaneously. Zalando's article describes the strict single-active-key model; multi-active is a generalisation not discussed here.
  • Implementation details of "secure key management" opaque. HSM? KMS? Split-custody? The post says "industry best practices" and stops there. No disclosure about private-key storage surface, access controls, or ceremony requirements.
  • No mention of cross-region or multi-region IdP behaviour. Zalando is Europe-centric; global/multi-region IdP setups introduce cache-invalidation + clock-skew considerations the post doesn't address.
  • Closing recruiting pitch. Standard Zalando Engineering callout at the end; doesn't affect the architectural substance but signals the post's primary audience is recruiting-adjacent rather than incident-postmortem-adjacent.

Scope notes

Tier-2 Zalando, on-scope. The post is decidedly thin on numbers and operational detail, but the architectural content (the six-phase lifecycle, the two gates, the formula, the four principles) is the canonical public prose-level description of how a production OIDC IdP rotates its signing keys without breaking verifiers. This is load-bearing identity-infrastructure content. Per AGENTS.md scope rules:

  • "distributed systems internals, scaling trade-offs, infrastructure architecture, production incidents, storage / networking / streaming design" — covers infrastructure architecture for the IdP signing-key surface; the lifecycle + gates + formula are the mechanism-level architecture.
  • Not product PR (no product launch, no "introducing"); not hiring-focused (recruiting callout is incidental, not the centre of gravity); not pure ML.

Borderline-case reasoning: the post is short and could be mistaken for a primer, but the ordered discipline it canonicalises is the architectural substrate that every subsequent JWT / OIDC / federation-identity ingest on the wiki references. Skipping it would leave the four concept pages (JWK, signing-key-rotation-lifecycle, cache-control-aware-grace-period, retirement-plus-lifespan-plus-buffer-formula), the one pattern page (phased-automated-jwk-rotation), and the one system page (zalando-oidc-identity-provider) without their canonical source anchor.

Source
