CONCEPT Cited by 1 source
Playbook expiry date¶
A playbook expiry date is a per-playbook timestamp that triggers a re-review of the playbook once reached. The expiry-and-renew loop is a forcing function against silent drift — the failure mode where a documented playbook stops working between authoring and next use because the trigger query broke, a config path changed, a downstream dependency was retired, or the business-owner approval no longer matches the current business context.
Definition¶
Zalando's framing:
"Expiry date – allows to nudge teams to re-review playbooks that will expire soon." — sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks
At Zalando, the expiry date is a field in the per-playbook metadata (emitted as JSON alongside the rendered docs site). The application-review workflow reads the JSON and shows, per application, whether any of its playbooks are expired or approaching expiry — turning the freshness question into part of the periodic reliability audit, not a per-team clean-up task.
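A minimal sketch of how a consumer of that JSON might classify playbooks, assuming a flat list of records. The field names (`playbook`, `application`, `expiry_date`), the sample records, and the 30-day warning window are illustrative assumptions; the source does not publish Zalando's schema or thresholds.

```python
import json
from datetime import date, timedelta

# Hypothetical metadata export; field names and values are assumptions,
# not Zalando's actual schema.
SAMPLE_METADATA = """
[
  {"playbook": "disable-outfits", "application": "catalog", "expiry_date": "2024-01-15"},
  {"playbook": "shed-metrics-tier-3", "application": "metrics-ingestion", "expiry_date": "2024-03-10"},
  {"playbook": "static-fallback", "application": "catalog", "expiry_date": "2025-02-01"}
]
"""


def expiry_status(expiry: date, today: date, warn_days: int = 30) -> str:
    """Classify one playbook as expired, expiring_soon, or fresh."""
    if expiry <= today:
        # The date has been reached: the playbook is due for re-review.
        return "expired"
    if expiry - today <= timedelta(days=warn_days):
        # Inside the (assumed) warning window: nudge the owning team.
        return "expiring_soon"
    return "fresh"


def classify(metadata_json: str, today: date, warn_days: int = 30) -> dict:
    """Map each playbook name in the JSON export to its expiry status."""
    records = json.loads(metadata_json)
    return {
        r["playbook"]: expiry_status(
            date.fromisoformat(r["expiry_date"]), today, warn_days
        )
        for r in records
    }
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the classification deterministic, so the same export can be re-evaluated for any audit date.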
The drift modes an expiry guards against¶
- Trigger drift. "Metrics ingestion SLO at risk" was the trigger in 2021; in 2024 the metrics ingestion SLO was renamed / redefined / removed. The playbook trigger now references a query that never fires.
- Steps drift. "Edit configs/metrics/tiers.yaml" was the mitigation in 2021; in 2024 the config moved to a different service with a different path. The steps are unrunnable.
- Impact drift. "Business impact: outfits won't be shown on the catalog" was accurate in 2021; in 2024 outfits have been replaced with a different surface and the impact is now nothing (or much more, if outfits power a new revenue stream).
- Approval drift. The business owner who pre-approved the playbook left; the current owner hasn't been asked; the pre-approval semantics are stale (concepts/pre-approved-degradation-procedure coverage gone).
- Dependency drift. The playbook disables calls to service X; service X has been replaced by service Y; disabling X does nothing, and Y is now the one overloaded.
Expiry doesn't prevent any of these drifts; it requires the owning team to look at the playbook on a cadence. Each look is an opportunity to notice the drift before the next incident.
Choosing the cadence¶
The right expiry cadence balances review burden against drift-risk:
- Too frequent (quarterly). Review becomes rubber-stamping; teams ignore the notifications; the forcing function loses force.
- Too infrequent (3 years). Drift accumulates beyond what a quick fix can repair; re-approval with the current business owner becomes a redesign exercise.
- Often-chosen cadence: annual. It matches the typical service-ownership review cycle, fits reliability work driven by annual peak events, and leaves enough breathing room that review isn't a chore.
Zalando doesn't publicly name its default expiry cadence in this post; the mechanism (per-playbook date, review-workflow surfacing) is described without the specific value.
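The renew half of the loop can be sketched as a function that sets the next expiry date when a review completes. The function name `next_expiry` and the one-year default are assumptions for illustration; the annual value reflects the often-chosen cadence discussed above, not a figure Zalando has published.

```python
from datetime import date


def next_expiry(reviewed_on: date, cadence_years: int = 1) -> date:
    """Return the expiry date set by a review completed on `reviewed_on`.

    The one-year default is an assumed cadence, not Zalando's value.
    """
    target_year = reviewed_on.year + cadence_years
    try:
        return reviewed_on.replace(year=target_year)
    except ValueError:
        # Only 29 February can fail here: the target year is not a leap
        # year, so fall back to 28 February.
        return reviewed_on.replace(year=target_year, day=28)
```

Anchoring the new date to the review date, rather than to the previous expiry date, means a late review does not shorten the next interval.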
Interaction with the application-review workflow¶
Expiry dates are surfaced as part of the application-review workflow rather than via standalone notifications:
"During the application review process it's indicated per application (from certain criticality tier onward) whether there are any playbooks defined for it and whether any of these are expired."
This design choice is load-bearing: by making expiry visibility a property of the application's reliability posture, the freshness check inherits the weight of the overall readiness review. The owning team sees "your app has 4 playbooks, 2 are expired" in the same audit UI where they see "your service has no on-call rotation set" or "your service has no SLOs". The expired-playbook finding is hard to ignore when it sits beside findings that are already scrutinised.
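The per-application roll-up behind a finding like "4 playbooks, 2 are expired" might look like the sketch below. The `Playbook` type, the `review_findings` function, the numeric tier scale (lower number meaning more critical), and the `max_tier` cut-off are all hypothetical; the source says only that the check applies "from certain criticality tier onward".

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Playbook:
    name: str
    application: str
    expiry_date: date


def review_findings(playbooks, app_tiers, today, max_tier=2):
    """Summarise playbook freshness per application for a review.

    Assumption: tiers are integers where a lower number is more critical,
    and only applications at tier <= max_tier are checked.
    """
    by_app = defaultdict(list)
    for pb in playbooks:
        by_app[pb.application].append(pb)

    findings = {}
    for app, tier in app_tiers.items():
        if tier > max_tier:
            continue  # below the criticality threshold: not reviewed
        owned = by_app.get(app, [])
        expired = sum(1 for pb in owned if pb.expiry_date <= today)
        # An in-scope app with no playbooks still gets a finding (0 defined).
        findings[app] = {"playbooks": len(owned), "expired": expired}
    return findings
```

Iterating over the application list rather than over the playbooks is deliberate: it lets the review report critical applications that have no playbooks at all, which the quoted workflow also indicates.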
Canonicalised pattern¶
The overall shape — metadata-file with expiry + review-gate that consumes it — is canonicalised as patterns/playbook-metadata-integrated-with-app-reviews. The expiry field is one lever of that pattern; the app-assignment field is the other.
Seen in¶
- Zalando incident playbooks (2019-present) — sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks — expiry dates emitted in per-playbook JSON metadata; application-review workflow surfaces expiring / expired playbooks per application, making freshness part of reliability-posture audits.
Related¶
- concepts/incident-playbook — the thing being expired.
- concepts/pre-approved-degradation-procedure — the approval that expiry renews.
- concepts/production-readiness-review — the review gate that consumes expiry signal.
- patterns/playbook-metadata-integrated-with-app-reviews — the overall mechanism.
- companies/zalando — axis 28.