CONCEPT

Incident playbook¶

An incident playbook is a written, structured, pre-approved emergency procedure that an incident responder can execute to put a system into a known degraded state during an outage — without needing real-time approval from the business owner of the affected functionality. It is distinct from a diagnostic runbook (which helps a responder understand what is happening) and from an unplanned-failover playbook (which is a specific failover sequence); a playbook is the authorisation artefact plus the action sequence for a specific mitigation.

Definition¶

Zalando's canonical framing:

"Our Incident Playbooks cover emergency procedures to initiate in case a certain set of conditions is met, for example when one of our systems is overloaded and the existing resiliency measures (e.g. circuit breakers) are insufficient to mitigate the observed customer impact. In such cases there are often measures we can take, though they will degrade the customer experience. These emergency procedures are pre-approved by the respective Business Owner of the underlying functionality, allowing for quicker incident response without the need for explicit decision making while critical issues are ongoing."

Three defining properties:

Pre-approved by the business owner, in advance of any incident — the business-impact trade-off has already been judged acceptable. Canonicalised as concepts/pre-approved-degradation-procedure.
Conditional on a named trigger. A playbook is keyed to a specific observable condition ("catalog latency > X", "TSDB ingestion SLO at risk"), not a generic "the site is slow". The trigger is part of the playbook, not external context.
Reversible and bounded. Every playbook names its mean time to recover (seconds → minutes), operational impact (what load reduces / what capacity is freed), and business impact (what the customer experiences). Reversibility is assumed — after the incident, the system returns to the pre-playbook state.
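The "conditional on a named trigger" property can be illustrated with a minimal sketch in which the trigger is a threshold on one observable metric. The class name `ThresholdTrigger`, the metric name `catalog_p99_latency_ms`, and the threshold value are hypothetical and chosen only for illustration; the source does not describe how Zalando encodes triggers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdTrigger:
    """A named, observable condition that justifies running one playbook.

    Hypothetical model: the trigger is "metric above threshold".
    """

    metric: str
    threshold: float

    def fires(self, observed: dict[str, float]) -> bool:
        # A metric that is not being reported never fires the trigger;
        # the responder should not act on a condition they cannot observe.
        value = observed.get(self.metric)
        return value is not None and value > self.threshold


# Illustrative values only.
catalog_latency = ThresholdTrigger(metric="catalog_p99_latency_ms", threshold=800.0)
```

Because the trigger lives inside the playbook, a generic symptom such as "the site is slow" has no trigger object to evaluate and so cannot justify executing it.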

The fixed structure¶

Every Zalando playbook is authored in the same shape:

Title — what mitigation the playbook implements.
Trigger — the symptom that justifies executing the playbook.
Mean time to recover — wall-clock from config-change to observable impact.
Operational health impact — what load / request rate / capacity is freed (often quantified).
Business impact — what the customer sees / doesn't see.
Steps — the actual sequence to execute.

The shared structure is load-bearing: incident commanders and cross-team responders read the same format across dozens of teams and hundreds of services, so collaboration during an outage doesn't require re-parsing each team's prose.
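One way to picture the six-field shape is as a typed record plus a completeness check. This is a sketch under assumptions, not Zalando's tooling: the source says playbooks are authored in Markdown, and the names `Playbook` and `missing_fields`, as well as all example values, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Playbook:
    """One playbook in the fixed six-field shape."""

    title: str                        # what mitigation the playbook implements
    trigger: str                      # symptom that justifies executing it
    mean_time_to_recover: timedelta   # config change to observable impact
    operational_impact: str           # load / request rate / capacity freed
    business_impact: str              # what the customer sees or no longer sees
    steps: tuple[str, ...]            # the sequence to execute


def missing_fields(playbook: Playbook) -> list[str]:
    """Return the names of fields left empty, in declaration order."""
    return [name for name, value in vars(playbook).items() if not value]


# Illustrative placeholder values, not taken from a real playbook.
example = Playbook(
    title="Disable outfits on catalog pages",
    trigger="catalog latency > X",
    mean_time_to_recover=timedelta(minutes=1),
    operational_impact="Fewer downstream calls per catalog request",
    business_impact="Customers do not see outfit recommendations",
    steps=("Set the (hypothetical) outfits feature toggle to off",),
)
```

A check like `missing_fields` is the kind of guard a review step could run so that no playbook reaches responders with an empty trigger or step list.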

Playbook vs runbook¶

Axis	Runbook	Playbook
Goal	Diagnose / recover	Degrade gracefully
Approval	Operator judgement	Pre-approved business-impact trade-off
Output	Back-to-normal state	Known-degraded state
Trigger	Varied / often open-ended	Specific named condition
Reversibility	Assumes recovery	Explicitly reversible
Typical owner	Service team	Business + engineering owners

The distinction matters: a runbook asks "how do I fix this?"; a playbook asks "what can I turn off to survive this while the fix takes longer?". The two compose: a runbook may point at playbook X as a stabilisation step while diagnosis continues.

Scope: system-wide, not per-service¶

Zalando's explicit design decision:

"More often than not, our playbooks cover the whole system (a few microservices) instead of its individual components being covered through separate procedures. When the bigger system context is considered, there are more options available to mitigate issues."

Canonicalised as concepts/system-level-playbook-scope. Per-microservice playbooks miss the compositional options — "disable feature A in service X to protect service Y" is a system-level playbook, not a per-service one.

Ordering within a playbook set¶

Playbooks for the same system are ordered by business impact: when a responder is walking the set during an overload, they apply the least-impactful mitigation first and escalate only if needed. Zalando's catalog / product-listing example applies outfit disabling first (small UX impact), then sponsored product disabling, then teaser disabling. Ordering is a property of the playbook set, not of individual playbooks.
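The "least-impactful first, escalate only if needed" walk can be sketched as a sort followed by a loop that stops as soon as the trigger clears. The numeric `business_impact_rank`, the function names, and the callbacks are assumptions made for this sketch; the source only states the ordering principle and the outfit, sponsored product, teaser sequence.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Mitigation:
    name: str
    business_impact_rank: int  # hypothetical scale: lower = less customer impact


def walk_playbook_set(
    mitigations: Iterable[Mitigation],
    apply: Callable[[Mitigation], None],
    trigger_still_firing: Callable[[], bool],
) -> list[str]:
    """Apply mitigations in ascending business impact until the trigger clears."""
    applied: list[str] = []
    for mitigation in sorted(mitigations, key=lambda m: m.business_impact_rank):
        if not trigger_still_firing():
            break  # system is healthy again, do not degrade further
        apply(mitigation)
        applied.append(mitigation.name)
    return applied


# The set is deliberately listed out of order: ordering belongs to the set,
# so the walk sorts it rather than trusting authoring order.
catalog_set = [
    Mitigation("disable teasers", 3),
    Mitigation("disable outfits", 1),
    Mitigation("disable sponsored products", 2),
]
```

In a real incident `trigger_still_firing` would re-evaluate the playbook's trigger after each step has had its stated mean time to recover to take effect.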

Freshness: expiry dates¶

Each playbook carries an expiry date (concepts/playbook-expiry-date); the application-review workflow flags playbooks approaching or past expiry so that the owning team re-verifies that the trigger, steps, and downstream dependencies are still valid. The failure mode this guards against is silent drift: the trigger query breaks, a config path changes, or a downstream dependency is retired, and the playbook then fails without warning during the next incident.
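The approaching / past-expiry flagging can be sketched as a three-way classification. The 30-day warning window, the status strings, and the choice to treat the expiry date itself as still valid are assumptions for illustration; the source does not state these details.

```python
from datetime import date, timedelta


def expiry_status(
    expires_on: date,
    today: date,
    warn_window: timedelta = timedelta(days=30),  # assumed window, not from the source
) -> str:
    """Classify a playbook as 'expired', 'expiring_soon', or 'valid'."""
    if today > expires_on:
        return "expired"
    if expires_on - today <= warn_window:
        return "expiring_soon"
    return "valid"
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the check deterministic, which makes it easy to exercise in a review pipeline.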

Relationship to graceful degradation¶

An incident playbook is the operational artefact that realises concepts/graceful-degradation for a specific degradation option. Where graceful degradation names the architectural property ("the system continues in a reduced-but-useful mode"), the playbook names the mechanism ("if X, then execute Y to reach the degraded state") and the authorisation (the business owner approved Y's trade-off in advance). Without playbooks, graceful-degradation hooks exist but go unused during incidents because the responder can't get permission in time.

Relationship to alert fatigue¶

Pre-approved playbooks reduce the decision-making load on the incident responder, which is one of the structural interventions against alert fatigue. A page that arrives with a linked playbook converts to "execute playbook X" in seconds; a page without a playbook is cognitively open-ended.

Seen in¶

Zalando, 2019-2023: scaled from zero to 1,200+ playbooks across 100+ on-call teams and 850+ applications (about 1.41 per application, ~12 per team). Canonical wiki instance. Managed via Markdown + mkdocs + GitHub + CODEOWNERS; metadata exposed as JSON into the application-review workflow; mandatory above a certain criticality tier. Two worked examples: catalog-pages (outfit-first ordering) and ZMON (three-tier metrics drop).
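To make "metadata exposed as JSON" and "mandatory above a certain criticality tier" concrete, here is a hedged sketch of a review-side check. The JSON schema (`id`, `criticality`, `playbooks`), the convention that a higher number means more critical, and the function name are all hypothetical; the source does not publish the actual metadata format.

```python
from __future__ import annotations

import json


def apps_missing_playbooks(metadata_json: str, mandatory_from: int) -> list[str]:
    """Return ids of applications that must have a playbook but list none.

    Assumes a JSON array of objects with hypothetical keys:
    'id' (str), 'criticality' (int, higher = more critical), 'playbooks' (list).
    """
    applications = json.loads(metadata_json)
    return sorted(
        app["id"]
        for app in applications
        if app["criticality"] >= mandatory_from and not app["playbooks"]
    )


# Illustrative metadata only.
sample = json.dumps(
    [
        {"id": "catalog", "criticality": 3, "playbooks": ["disable-outfits"]},
        {"id": "checkout", "criticality": 3, "playbooks": []},
        {"id": "internal-tool", "criticality": 1, "playbooks": []},
    ]
)
```

With this sample and `mandatory_from=2`, only the application that is both critical and playbook-less is reported; the low-criticality one is exempt.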

Related¶

concepts/pre-approved-degradation-procedure — the authorisation-semantics half.
concepts/playbook-ordering-by-business-impact — the sequencing property.
concepts/playbook-expiry-date — the freshness property.
concepts/system-level-playbook-scope — the scope decision.
concepts/graceful-degradation — the architectural property playbooks operationalise.
concepts/production-readiness-review — the review gate that playbook metadata feeds.
concepts/unplanned-failover-playbook — the specific failover subtype (primary crash → promote replica).
concepts/alert-fatigue — reduced by pre-approved procedures.
patterns/playbooks-as-markdown-with-codeowners — the management substrate.
patterns/playbook-metadata-integrated-with-app-reviews — how playbooks become mandatory.
patterns/drop-non-critical-metrics-under-tsdb-overload — canonical worked example.