# Zalando — How we manage our 1,200 incident playbooks

## Summary
Retrospective from Zalando engineering (2023-01-30) on how their Incident Playbook program scaled from ad-hoc knowledge-in-heads in 2019 to 1,200+ playbooks authored across 100+ on-call teams covering 850+ applications by January 2023. Playbooks are pre-approved emergency procedures — the business owner has pre-approved the degraded state in advance, so incident responders can execute the procedure without real-time cross-team decision making. Each playbook carries a fixed structure: trigger (what symptom to watch for), mean time to recover, operational health impact (e.g. "reduces request rate to Elasticsearch by x%"), business impact (e.g. "outfits won't be shown on the catalog"), and steps. Two worked examples anchor the post: catalog / product-listing pages (playbook sequence ordered least business impact first — disable outfits, then sponsored products, then teasers — when Elasticsearch load spikes) and ZMON (Zalando's monitoring system, backed by KairosDB on Cassandra, with a three-tier metrics criticality scheme introduced during Cyber Week pre-scaling so that tier-3 and tier-2 metrics can be dropped on demand when the TSDB is overloaded, yielding 40% load reduction to keep the Cyber-Week dashboards working).
The management story is the load-bearing architectural content: the playbook corpus lives as Markdown files in a git repo rendered by mkdocs, with directory structure auto-generated from OpsGenie on-call teams so every team has a pre-seeded skeleton. Reviews started with a committed 3-person SRE-guild review rotation (explaining guidelines, aligning to a common standard); once enough examples existed the review authority was delegated to departmental representatives via GitHub CODEOWNERS. A pull-request template with check-box self-verification nudges new contributors through the playbook guidelines, and the "1-line summary" TODO in the template's first line turned out to be a small but effective context lever for reviewers. Alongside the rendered docs, a JSON metadata file is emitted with every build; the application-review workflow consumes it to answer "does this application have a playbook assigned?" and "are any of its playbooks expired?" — playbook presence is mandatory for applications above a certain criticality tier, and expiry dates force re-review cadence. The closing stance is a first-principles design recommendation: have every team imagine failure scenarios and draft playbooks; writing the playbook often exposes resilience-mechanism gaps that drive engineering changes upstream. "If used often enough, playbooks should be ideally automated."
## Key takeaways
- Incident playbooks are pre-approved emergency procedures, not diagnostic runbooks. Verbatim: "Our Incident Playbooks cover emergency procedures to initiate in case a certain set of conditions is met … In such cases there are often measures we can take, though they will degrade the customer experience. These emergency procedures are pre-approved by the respective Business Owner of the underlying functionality, allowing for quicker incident response without the need for explicit decision making while critical issues are ongoing." The business-owner pre-approval is the load-bearing distinction from generic runbooks — canonicalised as concepts/pre-approved-degradation-procedure. The playbook is the authorisation artefact, not the troubleshooting artefact. (Source: sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)
- Playbooks at Zalando's scale land at roughly 1.41 per application and 12 per on-call team. Numbers: 1,200+ playbooks / 850+ applications in on-call scope / 100+ on-call teams ≈ 1.41 playbooks per application / ~12 playbooks per team. Most applications have only a handful; the distribution is skewed (some systems have dozens, the long tail has one or two). Gives a concrete production-scale data point for any org sizing its own playbook program.
- The playbook library grew in Cyber-Week-driven step functions. The playbook-count-over-time chart shows Q3 spikes every year corresponding to Cyber Week preparations — a direct quantitative instance of the annual peak event as a capability forcing function that Zalando canonicalised elsewhere. Each Cyber Week, the org added new playbooks as one of its invested capabilities; the ratchet held year-over-year.
- Playbook scope is usually the system, not the microservice. Verbatim: "More often than not, our playbooks cover the whole system (a few microservices) instead of its individual components being covered through separate procedures. When the bigger system context is considered, there are more options available to mitigate issues." Canonicalises concepts/system-level-playbook-scope as the design decision — per-service playbooks lose the compositional mitigations ("disable A to protect B") that system-scope playbooks unlock.
- Playbook structure is fixed and makes cross-team collaboration easier during outages. Every playbook names its conditions / business impact / operational impact / MTTR / steps. The consistency is load-bearing: "by having playbooks in a single location, our Incident Responders and Incident Commanders have easy access to all available emergency procedures in a consistent format. This simplifies collaboration across teams during outages." Complementary to the pre-approval — the same people handling an outage across multiple services read the same shape and don't re-parse each team's prose.
- Playbooks are ordered by business impact, and applied least-impact-first. The canonical catalog-pages example: Elasticsearch starts struggling, so the playbooks are sorted such that outfit-call disabling (small UX impact) is applied first, sponsored-product disabling next, teaser disabling third, and so on. The sorting is a property of the playbook set for one system, not of individual playbooks — canonicalised as concepts/playbook-ordering-by-business-impact. Exercised during the Cyber-Week catalog-latency incident: "one part of the team was busy troubleshooting the issue, another part of the team executed multiple of the prepared playbooks in sequence in order to mitigate the customer impact." (A minimal code sketch follows this list.)
- ZMON + KairosDB worked example canonicalises metrics-tiering-by-criticality. Zalando's monitoring system (systems/zmon) ingests metrics into KairosDB (a TSDB on Cassandra). Cyber-Week pre-scaling pushed multi-factor metrics-rate growth past the Cassandra cluster's ingestion capacity; the playbook that emerged classifies metrics into three criticality tiers and, on TSDB overload, drops tier-3 then tier-2 metrics to keep tier-1 (the ones that anchor Cyber-Week dashboards) flowing. Operational impact: 40% load reduction on the metrics TSDB; MTTR: 2 minutes after config update; business impact: none (the dropped metrics are non-critical). Pattern canonicalised as patterns/drop-non-critical-metrics-under-tsdb-overload plus concept concepts/metrics-tiering-by-criticality. Still in effect in 2023 despite the metrics store having been replaced — the mitigation "outlived the system it was written for". (Sketched in code after this list.)
-
**The playbook corpus runs on mkdocs + GitHub
-
CODEOWNERS. Architecture: a git repo holds one Markdown file per playbook; mkdocs renders it into a documentation site; the directory structure is generated from OpsGenie on-call teams so every team starts with a pre-seeded skeleton ("always a skeleton available for every team to contribute their playbooks to"). Canonical instance of patterns/playbooks-as-markdown-with-codeowners — Markdown structure + versioned git history + PR reviews + CODEOWNERS delegation. The trade-off is explicit: "managing structured data in markdown is not ideal, despite the ability to use front matter for metadata. However, managing playbooks in a code repository provides us with easy means for cross-team reviews using pull requests."
- Review authority evolved grassroots → SRE guild → CODEOWNERS delegation. 2019: a 3-person SRE review team reviewed all playbooks, "committed throughout the year to explain the purpose/guidance of the playbooks and align these to a common standard." With enough examples + organisational knowledge, Zalando switched to CODEOWNERS to delegate reviews to departmental representatives skilled in operational excellence — a textbook move from "central team is the bottleneck" → "central team sets standard, distributed reviewers enforce." Same organisational-pattern shape as the Phase-2→Phase-3 SRE organizational evolution (concepts/sre-organizational-evolution).
- Playbook metadata is integrated with the application-review workflow via a generated JSON artefact. Alongside the rendered mkdocs site, the build emits a JSON file with playbook metadata: "Application – links playbooks to the involved applications; Expiry date – allows to nudge teams to re-review playbooks that will expire soon." The application review process — Zalando's production-readiness review mechanism for services above a certain criticality tier — consumes this JSON to answer whether each application has playbooks defined and whether they are expired. Canonical instance of patterns/playbook-metadata-integrated-with-app-reviews. Made mandatory for applications of a certain criticality, pushing the practice beyond the volunteer early-adopter set. (A sketch of the JSON emission follows this list.)
- Expiry dates force re-review cadence. The metadata includes a per-playbook expiry date; the application review process flags playbooks approaching or past expiry so the owning team is pushed to re-verify and re-approve. Canonicalised as concepts/playbook-expiry-date — a freshness forcing-function against the failure mode where a playbook's trigger query breaks, its mitigation script's config path changes, or the downstream dependency is retired, and nobody notices until the playbook fails during an incident. (An expiry-check sketch follows this list.)
- Writing playbooks exposes resilience-mechanism gaps that drive upstream engineering changes. Verbatim: "Imagining how to react to such scenarios by putting the system into a degraded state, trading off availability over customer experience, can spark interesting conversations about resilience mechanisms that can be built into the software. These conversations drive engineers to make changes to their design to fundamentally improve availability, or at least, to ensure their software facilitates easier intervention." The playbook-authoring exercise is itself a production-readiness lever; "how would you mitigate this?" as a question reliably surfaces "we couldn't mitigate this today" answers.
- Closing stance: automate playbooks that are used often enough. "If used often enough, playbooks should be ideally automated." Zalando does not claim to have done this at scale yet in this post; the stance is aspirational and implicitly sets the roadmap direction — from human-executed → button-click → automatic response. Keeping the human-authored playbooks as the authoritative artefact even after automation shows up is the likely longer-term shape (auto-executor needs an authoritative source of truth anyway).
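
To make the fixed structure and the least-impact-first ordering concrete, here is a minimal Python sketch. The field names, impact ranks, the MTTR values for the sponsored-products and teaser playbooks, and the toggle-flipping steps are illustrative assumptions, not Zalando's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Playbook:
    """Hypothetical representation of the fixed playbook structure."""
    name: str
    trigger: str                   # symptom to watch for
    business_impact: str           # e.g. "outfits won't be shown on the catalog"
    operational_impact: str        # e.g. "reduces request rate to Elasticsearch by x%"
    mttr_minutes: int              # mean time to recover once the procedure is applied
    steps: list[str] = field(default_factory=list)
    business_impact_rank: int = 0  # lower = less customer impact (illustrative field)

# The catalog example: three pre-approved degradations for one system,
# kept as a set so responders can apply them least-impact-first.
catalog_playbooks = [
    Playbook("Disable sponsored products", "Elasticsearch load spike on catalog pages",
             "sponsored products disappear from listings",
             "reduces request rate to Elasticsearch", 3,
             ["flip sponsored-products feature toggle off"], business_impact_rank=2),
    Playbook("Disable outfits", "Elasticsearch load spike on catalog pages",
             "outfits won't be shown on the catalog",
             "reduces request rate to Elasticsearch by x%", 3,
             ["flip outfit-calls feature toggle off"], business_impact_rank=1),
    Playbook("Disable teasers", "Elasticsearch load spike on catalog pages",
             "teasers disappear from catalog pages",
             "reduces request rate to Elasticsearch", 3,
             ["flip teaser feature toggle off"], business_impact_rank=3),
]

def mitigation_sequence(playbooks: list[Playbook]) -> list[Playbook]:
    """Order one system's playbook set least-business-impact first."""
    return sorted(playbooks, key=lambda p: p.business_impact_rank)

for pb in mitigation_sequence(catalog_playbooks):
    print(f"{pb.business_impact_rank}. {pb.name}: {pb.business_impact} (MTTR ~{pb.mttr_minutes} min)")
```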
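
The ZMON mitigation reduces to a filter keyed on a criticality tier. A hedged sketch, assuming each metric carries an explicit tier attribute and that the playbook's config change simply lowers the highest tier still accepted on the TSDB write path:

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    tier: int     # 1 = critical (anchors Cyber-Week dashboards), 2 = important, 3 = nice-to-have
    value: float

def keep_metric(metric: Metric, max_tier_accepted: int) -> bool:
    """Decide whether a metric is still written to the TSDB.

    Normal operation: max_tier_accepted = 3 (ingest everything).
    TSDB overloaded:  the playbook lowers it to 2, then 1, shedding the
    least critical tiers first; tier-1 metrics always get through.
    """
    return metric.tier <= max_tier_accepted

# Hypothetical metric names for illustration.
incoming = [
    Metric("checkout.order_rate", tier=1, value=1200.0),
    Metric("catalog.es_latency_p99", tier=2, value=310.0),
    Metric("jvm.gc_pause_detail", tier=3, value=12.0),
]

# Config value flipped by the playbook during overload.
MAX_TIER_DURING_OVERLOAD = 2
accepted = [m for m in incoming if keep_metric(m, MAX_TIER_DURING_OVERLOAD)]
print([m.name for m in accepted])   # the tier-3 metric is shed; tiers 1 and 2 keep flowing
```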
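
The pre-seeded per-team skeleton and the CODEOWNERS delegation can both be derived from the on-call team list. A hypothetical sketch: the docs/playbooks layout, the reviewer-group names, and the premise that the team list has already been exported from OpsGenie are all assumptions.

```python
from pathlib import Path

# Assumed to have been exported from the OpsGenie on-call team directory;
# team and reviewer-group names are hypothetical.
oncall_teams = {
    "team-catalog": "@example-org/dept-catalog-reviewers",
    "team-zmon": "@example-org/dept-platform-reviewers",
}

DOCS_ROOT = Path("docs/playbooks")
SKELETON = "# Playbooks for {team}\n\nNo playbooks yet. See the playbook guidelines.\n"

def generate_skeleton_and_codeowners() -> str:
    """Pre-seed one directory per on-call team and build matching CODEOWNERS lines."""
    codeowners_lines = []
    for team, reviewers in sorted(oncall_teams.items()):
        team_dir = DOCS_ROOT / team
        team_dir.mkdir(parents=True, exist_ok=True)
        index = team_dir / "index.md"
        if not index.exists():          # never clobber playbooks a team already wrote
            index.write_text(SKELETON.format(team=team))
        # Delegate review authority for this directory to departmental representatives.
        codeowners_lines.append(f"/docs/playbooks/{team}/ {reviewers}")
    return "\n".join(codeowners_lines) + "\n"

if __name__ == "__main__":
    Path("CODEOWNERS").write_text(generate_skeleton_and_codeowners())
```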
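
The JSON metadata artefact emitted alongside the mkdocs build can be approximated by scraping each playbook's front matter. The post names only two fields (application link and expiry date); the exact field names, the use of YAML front-matter parsing via PyYAML, and the output file name are assumptions:

```python
import json
from pathlib import Path

import yaml  # PyYAML; front-matter parsing is an assumption about the build pipeline

def read_front_matter(path: Path) -> dict:
    """Return the YAML front-matter block (between the leading '---' fences) as a dict."""
    text = path.read_text()
    if not text.startswith("---"):
        return {}
    _, front, _ = text.split("---", 2)
    return yaml.safe_load(front) or {}

def build_metadata(docs_root: Path = Path("docs/playbooks")) -> list[dict]:
    """Emit one record per playbook: the application it covers and when it expires."""
    records = []
    for md_file in sorted(docs_root.rglob("*.md")):
        meta = read_front_matter(md_file)
        records.append({
            "playbook": str(md_file),
            "application": meta.get("application"),
            "expiry_date": meta.get("expiry_date"),   # e.g. "2023-09-30"
        })
    return records

if __name__ == "__main__":
    Path("playbooks.json").write_text(json.dumps(build_metadata(), indent=2, default=str))
```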
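
On the consuming side, the application-review workflow needs two answers per application: does it have a playbook at all, and are any of its playbooks expired? A minimal sketch over the JSON artefact from the previous snippet, with the same assumed record shape and a hypothetical application id:

```python
import json
from datetime import date
from pathlib import Path

def review_application(app_id: str, metadata_path: Path = Path("playbooks.json"),
                       today: date | None = None) -> dict:
    """Answer the two application-review questions for a single application."""
    today = today or date.today()
    records = json.loads(metadata_path.read_text())
    mine = [r for r in records if r.get("application") == app_id]
    expired = [r["playbook"] for r in mine
               if r.get("expiry_date") and date.fromisoformat(str(r["expiry_date"])) < today]
    return {
        "application": app_id,
        "has_playbook": bool(mine),       # mandatory above a certain criticality tier
        "expired_playbooks": expired,     # nudges the owning team to re-review and re-approve
    }

print(review_application("catalog-service"))   # hypothetical application id
```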
## Architectural numbers
| Metric | Value | Source in post |
|---|---|---|
| Total playbooks | 1,200+ | opening paragraph |
| Applications in on-call scope | 850+ | opening paragraph |
| On-call teams | 100+ | opening paragraph |
| Playbooks per application (avg) | 1.41 | opening paragraph |
| Playbooks per team (avg) | ~12 | opening paragraph |
| Catalog example: outfit-disable MTTR | 3 minutes after config update | Catalog example |
| Catalog example: reduced request rate to Elasticsearch | x% (anonymised in post) | Catalog example |
| ZMON example: MTTR | 2 minutes after config update | ZMON example |
| ZMON example: load reduction on TSDB | 40% | ZMON example |
| ZMON example: metrics criticality tiers | 3 (tier-1, tier-2, tier-3) | ZMON example |
| Initial SRE review team size | 3 reviewers | mid-post |
| Program start year | 2019 (Cyber Week prep) | opening |
| Post publication date | 2023-01-30 | article metadata |
## Caveats
- No execution / automation numbers. The post doesn't quantify how many playbooks have been executed during real incidents, or how many human-authored playbooks have been converted to automatic responses. "If used often enough, playbooks should be ideally automated" is a stance, not a progress report.
- No per-team playbook-distribution breakdown. A histogram is shown ("Number of applications per playbook count"), but no underlying table data is captured on the wiki.
- Pre-approval mechanism not detailed. Verbatim: "pre-approved by the respective Business Owner" — but the post doesn't disclose how approval is captured, where it's recorded, or how revocation works.
- CODEOWNERS mechanism sketched, not deep-dived. The departmental-representatives switch is named but the granularity of the CODEOWNERS file (per-directory? per-team?) isn't shown.
- JSON schema for metadata not shown. The Application and Expiry date fields are named, but the schema, generation pipeline, and consumer implementation aren't specified.
- Application review-workflow integration is opaque. "It's indicated per application (from certain criticality tier onward) whether there are any playbooks defined" — the tier-gating, UI shape, and enforcement mechanism are undescribed.
- ZMON is marked as no longer the authoritative metrics store ("this playbook is still in place today, even though we changed our metrics storage") — the post does not disclose what replaced KairosDB+Cassandra.
## Source
- Original: https://engineering.zalando.com/posts/2023/01/how-we-manage-our-1200-incident-playbooks.html
- Raw markdown: raw/zalando/2023-01-30-how-we-manage-our-1200-incident-playbooks-427ff5d1.md
## Related
- Concept: concepts/incident-playbook — the headline primitive.
- Concept: concepts/pre-approved-degradation-procedure — what distinguishes a playbook from a runbook.
- Concept: concepts/playbook-ordering-by-business-impact — apply least-impact first.
- Concept: concepts/playbook-expiry-date — freshness forcing-function.
- Concept: concepts/system-level-playbook-scope — why playbooks are usually at system, not microservice, scope.
- Concept: concepts/metrics-tiering-by-criticality — the ZMON example's generalisation.
- Concept: concepts/graceful-degradation — the umbrella property that pre-approved playbooks operationalise.
- Concept: concepts/production-readiness-review — the review-workflow gate that playbook metadata integrates with.
- Pattern: patterns/playbooks-as-markdown-with-codeowners — the mkdocs + GitHub + CODEOWNERS shape.
- Pattern: patterns/playbook-metadata-integrated-with-app-reviews — JSON metadata + app-review integration.
- Pattern: patterns/drop-non-critical-metrics-under-tsdb-overload — the ZMON worked example generalised.
- Pattern: patterns/annual-peak-event-as-capability-forcing-function — Cyber-Week growth arc.
- System: systems/zalando-incident-playbooks-site — the mkdocs-rendered documentation site.
- System: systems/zmon — Zalando's monitoring system (worked example).
- System: systems/opsgenie — on-call-team directory source.
- Company: companies/zalando — axis 28 (incident-playbook library at fleet scale).