Zalando — JSON Web Keys (JWK): Rotating Cryptographic Keys at Zalando¶
Summary¶
Zalando's Customer Authentication Experience team describes how their OpenID Connect (OIDC) identity provider (IdP) rotates its JWT signing keys automatically via its JWKS endpoint at accounts.zalando.com/.well-known/jwk_uris. The post is a short, mechanism-level canonicalisation of the generate → publish → grace → activate → retire → drop lifecycle for signing keys, including the verbatim drop_time = retirement_time + max_token_lifespan + safety_buffer formula, and the explicit emphasis that cache-control headers are the load-bearing knob that makes the transition invisible to clients. Four design principles are named: automation, scheduled rotation, secure key management, and seamless rotation (zero client impact on planned rotations). The post's canonical value is not scale numbers (none disclosed) but the ordering discipline: the six-phase lifecycle + two hard gates (grace before activation, lifespan + buffer before drop) is the minimum viable shape for rotating a long-lived federation trust anchor without breaking verifiers.
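The drop-time formula is plain date arithmetic. A minimal sketch in Python, assuming illustrative values for the token lifespan and safety buffer (the post discloses neither); the function and variable names are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def drop_time(retirement_time: datetime,
              max_token_lifespan: timedelta,
              safety_buffer: timedelta) -> datetime:
    """Earliest moment a retired key's public part may leave the JWKS."""
    return retirement_time + max_token_lifespan + safety_buffer


# Illustrative values only; the post does not publish real durations.
retired = datetime(2025, 1, 20, 12, 0, tzinfo=timezone.utc)
earliest_drop = drop_time(retired,
                          max_token_lifespan=timedelta(hours=1),
                          safety_buffer=timedelta(minutes=15))
print(earliest_drop.isoformat())  # 2025-01-20T13:15:00+00:00
```

Every input is known to the IdP at retirement time, so the result can be scheduled immediately rather than discovered by observing traffic.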
Key takeaways¶
- Static secrets are the failure mode being designed out. "Static secrets are evil. Whether secret keys hard-coded in source code, tokens without expiry or plaintext API keys referenced in configuration files, static secrets are ticking time bombs. The same is true for cryptographic key material in the context of JSON Web Tokens (JWTs) and OpenID Connect (OIDC)." The IdP's signing key is the most load-bearing long-lived key in the entire customer-auth architecture: if its private part leaks, "anyone could forge fake tokens … all tokens signed with the leaked key would become untrustworthy." Rotation is the structural defence (Source: sources/2025-01-20-zalando-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando). Canonical instance of concepts/long-lived-key-risk applied to OIDC IdP signing keys (tier-4 in the priority ladder: federation trust anchors).
- JWK is the key-distribution web primitive that makes rotation cheap. "Identity providers (IdPs) like ours commonly use JWKs to distribute public key material via well-known and specified URIs. Clients can use the key material to e.g. verify digitally signed JSON Web Tokens (JWTs) issued by the IdP." JWK (RFC 7517) is part of the JOSE family. Without JWK's web-native JSON format and kid-indexed set, every rotation would require coordinated distribution of new public keys to every client, a historical pain point that leads many teams to skip rotation entirely. See concepts/jwk-json-web-key.
- The four principles the rotation system rests on are named explicitly: "Automation: New keys are generated and old keys are retired automatically, eliminating manual intervention and ensuring consistency. Scheduled Rotation: Keys are rotated on a regular basis to minimize the window of vulnerability. Secure Key Management: Our keys are securely stored and managed using industry best practices to protect them from unauthorized access. Seamless Rotation: Planned rotations are transparent to clients and do not result in any kind of access revocation or token invalidation." Automation and Seamless are the two the lifecycle mechanism operationalises; Scheduled and Secure are the operational context. Canonicalised as patterns/phased-automated-jwk-rotation.
- The six-phase rotation lifecycle, verbatim: "First, a new key pair is generated. We then publish the public key portion of this new pair on our JWK endpoint, making it available to our clients. To avoid any immediate disruptions, we incorporate a grace period, allowing clients ample time to fetch the latest set of JWKs – cache control headers matter! After this period, the new key is being elected as the new active signing key. The previous active key is being retired, meaning it's no longer used for signing new tokens, but its public key remains available on the JWK endpoint to ensure that previously issued tokens can still be verified. Finally, once a retired key surpasses the maximum lifetime of any token it might have signed, we remove its public key from the JWK endpoint." This is the canonical public prose description of what the wiki canonicalises as generate → publish → grace → activate → retire → drop. See concepts/signing-key-rotation-lifecycle for the full state-machine analysis and both hard gates.
- "Cache control headers matter!": the grace period is measured in cache TTLs, not clock-time. The emphatic aside points at the load-bearing knob: JWKS responses are cached by clients (and often by CDNs, intermediate proxies, and OIDC libraries with their own minimum-refresh policy). The grace period before activating the new key must exceed the publisher's Cache-Control: max-age plus any downstream cache layer plus client-library refresh minimums. If the grace is too short, a JWT signed with the new key arrives at verifiers whose cached JWKS still lacks the new kid, and verification fails (typically surfacing as a 401). Canonicalised as concepts/cache-control-aware-grace-period.
- The drop-time formula is a pure arithmetic function of IdP-controlled knobs: "We simply take the time the key was retired, add the maximum token lifespan, and add a little extra time just to be safe. At that point, any token signed with that key will have expired, so it's safe to remove the key from our public list." Because the IdP sets exp - iat on issuance and every JWT carries a kid, "when is it safe to drop retired key K?" is computable at retirement time without polling verifiers or measuring token usage. Two design choices (kid in the header and an IdP-controlled lifespan) are what make the formula a calculation, not a measurement. Canonicalised as concepts/retirement-plus-lifespan-plus-buffer-formula. Design consequence: short token lifespans (access tokens minutes-to-hours vs refresh tokens days-to-weeks) directly shorten the retention obligation for retired keys, which is one reason to keep access-token lifespans low.
- Why the ordering is non-negotiable. Compressing the sequence breaks verifiers in predictable ways: skip publish/grace and verifiers see an unknown kid; skip retire (drop the old key at activation) and tokens signed in the last window suddenly fail verification even though they haven't expired. The lifecycle preserves two verifier-facing invariants: (a) every kid in a token was in the JWKS before the token was signed, and (b) every kid carried by a still-unexpired token is still in the JWKS. Both invariants are preserved by the six-phase ordering; neither is preserved by any compression. See concepts/signing-key-rotation-lifecycle for full analysis of invariants and compression-failure modes.
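The two hard gates can be expressed as checks over a key's timestamps. The sketch below is one possible encoding, not Zalando's implementation: the KeySchedule record, the gate_violations helper, and all durations are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class KeySchedule:
    """Lifecycle timestamps for one signing key (hypothetical record)."""
    published_at: datetime
    activated_at: datetime
    retired_at: datetime
    dropped_at: datetime


def gate_violations(s: KeySchedule, grace: timedelta,
                    max_token_lifespan: timedelta,
                    safety_buffer: timedelta) -> list[str]:
    """Return the rules a schedule breaks; an empty list means it is safe."""
    problems = []
    # Gate 1: verifiers need the full grace period to pick up the new kid.
    if s.activated_at - s.published_at < grace:
        problems.append("activated before grace period elapsed")
    # Ordering: a key cannot retire before it was active.
    if s.retired_at < s.activated_at:
        problems.append("retired before activation")
    # Gate 2: the key must outlive every token it may have signed.
    if s.dropped_at - s.retired_at < max_token_lifespan + safety_buffer:
        problems.append("dropped before lifespan + buffer elapsed")
    return problems


# Illustrative durations only; the post discloses none.
grace, lifespan, buffer = timedelta(hours=1), timedelta(hours=1), timedelta(minutes=15)
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
ok = KeySchedule(t0, t0 + timedelta(hours=2),
                 t0 + timedelta(days=30), t0 + timedelta(days=30, hours=2))
rushed = KeySchedule(t0, t0, t0 + timedelta(days=30), t0 + timedelta(days=30))

print(gate_violations(ok, grace, lifespan, buffer))      # []
print(gate_violations(rushed, grace, lifespan, buffer))  # both gates violated
```

The rushed schedule compresses publish into activate and retire into drop, which are the two compressions the ordering argument above rules out.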
Systems and concepts surfaced¶
Systems¶
- systems/zalando-oidc-identity-provider: Zalando's customer-identity-platform OIDC IdP, which publishes its JWKS at accounts.zalando.com/.well-known/jwk_uris. First canonical wiki instance.
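On the verifier side, a JWKS document is a JSON object with a keys array (RFC 7517), and the kid in a JWT's header selects the entry to verify against. A minimal stdlib-only sketch of that lookup follows; the sample key IDs, the truncated modulus, and the helper names are placeholders, and no signature verification is performed.

```python
import base64
import json
from typing import Optional


def jwt_kid(token: str) -> str:
    """Read the kid from a JWT's header segment without verifying the token."""
    header_b64 = token.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(padded))["kid"]


def find_key(jwks: dict, kid: str) -> Optional[dict]:
    """Return the JWK whose kid matches, or None if the set lacks it."""
    return next((k for k in jwks.get("keys", []) if k.get("kid") == kid), None)


# Placeholder JWKS as a verifier might have cached it (values are made up).
jwks = {"keys": [{"kty": "RSA", "kid": "key-2025-01", "use": "sig",
                  "alg": "RS256", "n": "placeholder-modulus", "e": "AQAB"}]}

# Build a token header naming a kid the cached set does not contain yet.
header = base64.urlsafe_b64encode(
    json.dumps({"alg": "RS256", "kid": "key-2025-02"}).encode()
).rstrip(b"=").decode()
token = header + ".e30.signature-placeholder"

print(jwt_kid(token))                  # key-2025-02
print(find_key(jwks, jwt_kid(token)))  # None: the unknown-kid failure case
```

The None result is the situation the grace period exists to prevent: a token signed with a key the verifier's cached set has not yet picked up.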
Concepts¶
- concepts/jwk-json-web-key: the JSON key-distribution substrate standardised by RFC 7517 and part of the JOSE family; the thing published at the JWKS endpoint. Canonical Zalando instance.
- concepts/signing-key-rotation-lifecycle: the six-phase generate → publish → grace → activate → retire → drop sequence with two hard gates. Canonical source.
- concepts/cache-control-aware-grace-period: the load-bearing knob between publish and activate; grace measured in cache TTLs, not wall-clock.
- concepts/retirement-plus-lifespan-plus-buffer-formula: the arithmetic rule that gates retire → drop: drop_time = retirement_time + max_token_lifespan + safety_buffer.
- concepts/long-lived-key-risk: the IdP signing key is the canonical federation trust anchor (tier-4 in the priority ladder). Its breadth (every relying party) and structural impact are why rotation discipline is non-optional.
- concepts/oidc-identity-federation: the OIDC framework that consumes JWKs as its key-distribution substrate; Zalando's customer-auth IdP is a first-party instance.
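The cache-control-aware grace period can be estimated as a lower bound: the publisher's max-age plus every cache layer behind it. The sketch below assumes a hypothetical header value and hypothetical downstream durations, since the post publishes none of them.

```python
import re
from datetime import timedelta


def max_age(cache_control: str) -> timedelta:
    """Extract max-age from a Cache-Control header value (zero if absent)."""
    match = re.search(r"\bmax-age=(\d+)", cache_control, re.IGNORECASE)
    return timedelta(seconds=int(match.group(1))) if match else timedelta(0)


def min_grace(cache_control: str,
              downstream_cache: timedelta,
              client_refresh_min: timedelta) -> timedelta:
    """Lower bound for the publish-to-activate gap; the real grace should exceed it."""
    return max_age(cache_control) + downstream_cache + client_refresh_min


# Illustrative inputs only.
lower_bound = min_grace("public, max-age=300",
                        downstream_cache=timedelta(seconds=60),
                        client_refresh_min=timedelta(minutes=10))
print(lower_bound)  # 0:16:00
```

Treating the sum as a floor rather than the target leaves room for clock skew and for caches the publisher does not know about.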
Patterns¶
- patterns/phased-automated-jwk-rotation — the automated system-level pattern that encodes the lifecycle as a scheduled loop over the JWKS endpoint; rolls up the four principles (automation / scheduled / secure / seamless) into a single repeatable rotation primitive.
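As a rough illustration of the scheduled-loop idea, the sketch below advances a small in-memory key set one scheduler pass at a time under the single-active-key model. It is an assumption-laden toy, not the post's implementation: the Key record, the tick function, the state names, and all durations are hypothetical, and key generation and storage are omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Key:
    kid: str
    published_at: datetime
    state: str = "published"  # published -> active -> retired
    retired_at: Optional[datetime] = None


def tick(keys: list[Key], now: datetime, grace: timedelta,
         max_token_lifespan: timedelta, safety_buffer: timedelta) -> list[Key]:
    """Run one scheduler pass; return the keys that stay in the JWKS."""
    # Activate: promote a published key once its grace period has elapsed,
    # retiring the previously active key in the same step.
    for key in keys:
        if key.state == "published" and now - key.published_at >= grace:
            for other in keys:
                if other.state == "active":
                    other.state, other.retired_at = "retired", now
            key.state = "active"
    # Drop: remove retired keys once no token they signed can still be valid.
    return [k for k in keys
            if not (k.state == "retired"
                    and now >= k.retired_at + max_token_lifespan + safety_buffer)]


# Illustrative timeline; none of these durations come from the post.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
keys = [Key("old", t0 - timedelta(days=30), state="active"), Key("new", t0)]
knobs = dict(grace=timedelta(hours=1),
             max_token_lifespan=timedelta(hours=1),
             safety_buffer=timedelta(minutes=15))

history = []
for offset in (timedelta(minutes=30), timedelta(hours=1), timedelta(hours=2, minutes=15)):
    keys = tick(keys, t0 + offset, **knobs)
    history.append([(k.kid, k.state) for k in keys])
    print(offset, history[-1])
```

In this run the new key stays published at 30 minutes, becomes active at one hour while the old key is retired, and the old key leaves the set at 2 hours 15 minutes, once lifespan plus buffer has elapsed since its retirement.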
Operational numbers¶
The post is pedagogy-altitude and discloses no operational numbers:
- No JWKS cache-control max-age value.
- No rotation cadence (daily? weekly? monthly?).
- No absolute grace-period duration.
- No max-token-lifespan value or safety buffer length.
- No fleet-size / rps framing for the JWKS endpoint.
- No per-rotation key-count ceiling (expected steady-state JWKS cardinality).
This is consistent with Zalando Engineering posts at the pedagogy + design-principles altitude (contrast with concrete-numbers posts like the 2024-12-05 OPA-in-Skipper ingest, the 2025-02-16 Route Server ingest, or the 2023-01-30 1,200-playbooks ingest).
The diagram image at img01.ztat.net/engineering-blog/posts/2025/01/images/json-web-key-rotation.png is a schematic of the six-phase lifecycle; it carries no additional numerical content.
Caveats¶
- Pedagogy altitude, not incident retrospective. No production incident, no rotation-gone-wrong story, no operational numbers. The post is useful for canonicalising the shape of the lifecycle but provides no evidence about edge cases under fleet-scale load.
- No emergency-rotation discussion. The post describes scheduled rotation only. Emergency rotation (private-key compromise) has a different structure: it requires immediately invalidating outstanding tokens, which is the opposite of seamless. The post does not draw this distinction; it is covered in concepts/signing-key-rotation-lifecycle#boundary-conditions.
- Single-active-key model assumed. Some IdPs rotate with overlap (two keys actively signing for a window) to support deployments where not all signing instances have picked up the new key simultaneously. Zalando's article describes the strict single-active-key model; multi-active is a generalisation not discussed here.
- Implementation details of "secure key management" opaque. HSM? KMS? Split-custody? The post says "industry best practices" and stops there. No disclosure about private-key storage surface, access controls, or ceremony requirements.
- No mention of cross-region or multi-region IdP behaviour. Zalando is Europe-centric; global/multi-region IdP setups introduce cache-invalidation + clock-skew considerations the post doesn't address.
- Closing recruiting pitch. Standard Zalando Engineering callout at the end; doesn't affect the architectural substance but signals the post's primary audience is recruiting-adjacent rather than incident-postmortem-adjacent.
Scope notes¶
Tier-2 Zalando, on-scope. The post is decidedly thin on numbers and operational detail, but the architectural content (the six-phase lifecycle, the two gates, the formula, the four principles) is the canonical public prose-level description of how a production OIDC IdP rotates its signing keys without breaking verifiers. This is load-bearing identity-infrastructure content. Per AGENTS.md scope rules:
- "distributed systems internals, scaling trade-offs, infrastructure architecture, production incidents, storage / networking / streaming design" — covers infrastructure architecture for the IdP signing-key surface; the lifecycle + gates + formula are the mechanism-level architecture.
- Not product PR (no product launch, no "introducing"); not hiring-focused (recruiting callout is incidental, not the centre of gravity); not pure ML.
Borderline-case reasoning: the post is short and could be mistaken for a primer, but the ordered discipline it canonicalises is the architectural substrate that every subsequent JWT / OIDC / federation-identity ingest on the wiki references. Skipping it would leave the four concept pages (JWK, signing-key-rotation-lifecycle, cache-control-aware-grace-period, retirement-plus-lifespan-plus-buffer-formula), the one pattern page (phased-automated-jwk-rotation), and the one system page (zalando-oidc-identity-provider) without their canonical source anchor.
Source¶
- Original: https://engineering.zalando.com/posts/2025/01/automated-json-web-key-rotation.html
- Raw markdown: raw/zalando/2025-01-20-json-web-keys-jwk-rotating-cryptographic-keys-at-zalando-3c7967e2.md
Related¶
- systems/zalando-oidc-identity-provider
- concepts/jwk-json-web-key
- concepts/signing-key-rotation-lifecycle
- concepts/cache-control-aware-grace-period
- concepts/retirement-plus-lifespan-plus-buffer-formula
- concepts/long-lived-key-risk
- concepts/oidc-identity-federation
- patterns/phased-automated-jwk-rotation
- companies/zalando