
PATTERN Cited by 1 source

Playbook metadata integrated with app reviews

Intent

Make incident playbooks a first-class part of the application-review workflow by emitting a structured metadata file alongside the rendered playbook docs, and consuming that metadata in the existing application-review UI to surface "does this application have playbooks?" and "are any expired?" checks per application.

Context

Writing playbooks is optional until it isn't. Without this integration, three failure modes appear:

  1. Critical applications have no playbooks. The docs site gets populated by enthusiastic teams, but other teams never onboard. During an incident on a playbook-less service, the responder is back to ad-hoc decision making.
  2. Playbook presence is invisible. Even when playbooks exist, incident responders don't know which applications have them without searching the docs site. During an incident, search is not what you want.
  3. Playbooks drift silently. Without a surfaced expiry indicator, stale playbooks go unnoticed until their next (failed) use.

Solution

  1. Emit a JSON metadata file from the playbook doc build, one entry per playbook, carrying the structured front-matter fields: owning application(s), author team, trigger, MTTR, business-impact, operational-impact, and expiry date.
  2. Consume the JSON in the existing application-review workflow (production-readiness review, service-catalog UI, on-call readiness audit). Per application, show:
       • Count of playbooks assigned.
       • Count of expired / expiring-soon playbooks.
       • Direct links to the playbook docs.
  3. Make playbook presence mandatory for applications above a certain criticality tier (concepts/service-tier-classification): if a tier-1 application has zero playbooks, its review flags a failing check.
  4. Trigger re-review on expiry. The review workflow treats expired playbooks as a failing check, pushing the owning team to refresh or retire them.
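
The consuming side of the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow described in the source: the field names (`applications`, `expiry`, `url`), the example URLs, the 30-day "expiring soon" window, and the convention that tier 1 is the most critical tier are all assumptions made for the example.

```python
import json
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical metadata, shaped like the JSON a doc build might emit.
METADATA_JSON = """
[
  {"title": "Checkout overload", "applications": ["checkout"],
   "expiry": "2024-01-15", "url": "https://playbooks.example/checkout-overload"},
  {"title": "Cart DB failover", "applications": ["cart", "checkout"],
   "expiry": "2024-12-01", "url": "https://playbooks.example/cart-db-failover"}
]
"""

# Hypothetical tier assignments; a real system would read these from the service catalog.
APPLICATION_TIERS = {"checkout": 1, "cart": 2, "search": 1}


def review_playbooks(playbooks, tiers, today, soon=timedelta(days=30), mandatory_tier=1):
    """Build the per-application playbook summary shown in the review."""
    # Index playbooks by application; a playbook may belong to several applications.
    by_app = defaultdict(list)
    for pb in playbooks:
        for app in pb["applications"]:
            by_app[app].append(pb)

    report = {}
    for app, tier in tiers.items():
        assigned = by_app.get(app, [])
        expiries = [date.fromisoformat(pb["expiry"]) for pb in assigned]
        expired = sum(1 for d in expiries if d < today)
        expiring_soon = sum(1 for d in expiries if today <= d <= today + soon)
        # Assumption: lower tier number means higher criticality.
        missing_mandatory = tier <= mandatory_tier and not assigned
        report[app] = {
            "count": len(assigned),
            "expired": expired,
            "expiring_soon": expiring_soon,
            "links": [pb["url"] for pb in assigned],
            # The check fails on a missing mandatory playbook or on any expired playbook.
            "passing": not missing_mandatory and expired == 0,
        }
    return report


report = review_playbooks(json.loads(METADATA_JSON), APPLICATION_TIERS, today=date(2024, 6, 1))
for app, row in sorted(report.items()):
    print(app, row["count"], row["expired"], "PASS" if row["passing"] else "FAIL")
```

In this sample data, `checkout` fails because one of its two playbooks has expired, `search` fails because it is tier-1 with no playbooks, and `cart` passes. Passing `today` explicitly keeps the check deterministic and easy to test.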

Benefits

  • Visibility. Which applications have playbooks? Which playbooks are expired? Shown in the place teams already look for reliability posture.
  • Shared enforcement substrate. The production-readiness-review workflow already gates applications on observability, on-call rotation, SLOs — adding playbook presence uses the same machinery.
  • Forcing function for onboarding. Mandatory at tier-1 makes writing playbooks part of the cost of operating a tier-1 service, not a side project.
  • Forcing function for freshness. Expiry surfaced in a routine review makes drift detectable.
  • Secondary benefit: broader ownership. Zalando noted that making playbooks mandatory for certain criticality tiers "partially increased the scope of the playbooks beyond the key emergency procedures while at the same time providing training to our engineers in the authoring of playbooks and thinking about the overload and failure scenarios that can occur." (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks) The mandatory check teaches more engineers to think in playbook shape.

Costs and caveats

  • Needs a JSON emission step. The build (mkdocs or similar) has to produce structured output alongside the rendered docs. Not free; not hard either.
  • Needs an existing application-review workflow. If the org doesn't already have one (service-catalog UI, PRR tooling, readiness-audit dashboard), this pattern doesn't apply — build the review workflow first.
  • Criticality-tier integration is load-bearing. Without the "mandatory at tier-1" rule, the review just shows information; nobody acts on it. The rule turns surfaced state into enforced policy.
  • False-positive churn. An aggressively short expiry cadence spams the review dashboard with failing checks. See concepts/playbook-expiry-date for cadence-tuning trade-offs.
  • Owner-of-record complexity. If an application is owned jointly by two teams, whose playbook count does the review show? The metadata schema has to accommodate multi-ownership, or attribution disputes will surface at review time.
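
To give a sense of the size of the emission step, here is a rough Python sketch that turns one playbook's front-matter into a metadata entry. It is an illustration under stated assumptions, not the build described in the source: the front-matter keys (`applications`, `team`, `trigger`, `expiry`), the file path, and the sample playbook are hypothetical, and the hand-rolled `key: value` parser stands in for the YAML parser and doc-build hook a real pipeline would more likely use. Storing `applications` as a list is one way to let a single playbook count toward several applications.

```python
import json


def parse_front_matter(markdown_text):
    """Return the `key: value` pairs between the leading '---' fences, or {} if none."""
    lines = markdown_text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def to_metadata_entry(path, markdown_text):
    """Build one JSON-serializable metadata entry for a playbook source file."""
    fm = parse_front_matter(markdown_text)
    return {
        "path": path,
        # Comma-separated names become a list, so one playbook can cover several applications.
        "applications": [a.strip() for a in fm.get("applications", "").split(",") if a.strip()],
        "team": fm.get("team", ""),
        "trigger": fm.get("trigger", ""),
        "expiry": fm.get("expiry", ""),
    }


# Hypothetical playbook source file.
SAMPLE = """---
applications: cart, checkout
team: team-cart
trigger: Cart database primary unavailable
expiry: 2024-12-01
---
# Cart DB failover
Steps go here.
"""

entries = [to_metadata_entry("cart/db-failover.md", SAMPLE)]
print(json.dumps(entries, indent=2))
```

A real build would run `to_metadata_entry` over every playbook source file and write the resulting list to a single JSON file next to the rendered site.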

Known uses

  • Zalando, 2019-present — JSON metadata generated alongside the mkdocs-rendered incident playbooks site. Application-review workflow shows per-application playbook assignment and expiry state. Mandatory for applications above a certain criticality tier. Canonical wiki instance. (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)

Anti-patterns

  • Docs-only — playbooks exist in a rendered site but aren't connected to the review workflow; no pressure to cover tier-1 apps, no pressure to refresh expiring playbooks.
  • Mandatory without mechanism — review requires playbooks but no automated check confirms their presence; becomes a paperwork exercise.
  • Metadata without consumer — JSON emitted, no UI consumes it, engineers don't see their state.