Zalando incident playbooks site¶
The Zalando incident playbooks site is the internal documentation site that renders Zalando's fleet-wide incident-playbook corpus: 1,200+ playbooks authored by 100+ on-call teams across 850+ applications as of the 2023-01 retrospective. It is a canonical wiki instance of the Markdown + CODEOWNERS playbook-management shape.
Architecture¶
- Git repository of Markdown files — one file per playbook.
- mkdocs renderer produces the documentation site.
- Directory structure auto-generated from OpsGenie's on-call-team list — every team gets a pre-seeded empty directory to populate.
- GitHub CODEOWNERS delegates review authority to departmental representatives skilled in operational excellence (not per-team service owners, not a central SRE team).
- Pull-request template with check-box self-verification nudges new contributors through the playbook guidelines. The template's first line is a TODO for a "1-line summary of changes", a small but effective way to give reviewers context.
- JSON metadata file emitted per build alongside the rendered site, carrying per-playbook fields (application, expiry date, trigger, MTTR, impacts, steps). Consumed by the application-review workflow per patterns/playbook-metadata-integrated-with-app-reviews.
- Review authority evolved over time: from a 3-person SRE guild (2019) to departmental CODEOWNERS (post-2019, exact date not disclosed).
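The directory pre-seeding step in the list above can be sketched as follows. This is a minimal illustration, not Zalando's actual tooling: the team list is passed in as plain strings (fetching it from OpsGenie is out of scope here), and `slugify`, `seed_team_directories`, and the placeholder `index.md` file are hypothetical names. The placeholder is an assumption made because Git does not track empty directories.

```python
import tempfile
from pathlib import Path


def slugify(team_name: str) -> str:
    """Turn a team name into a filesystem-friendly directory name."""
    cleaned = "".join(ch.lower() if ch.isalnum() else "-" for ch in team_name)
    # Collapse runs of "-" and trim them from both ends.
    return "-".join(part for part in cleaned.split("-") if part)


def seed_team_directories(docs_root: Path, team_names: list[str]) -> list[Path]:
    """Create one directory per team; return only the newly created ones."""
    created = []
    for name in team_names:
        team_dir = docs_root / slugify(name)
        if team_dir.exists():
            continue  # never touch a directory a team has already populated
        team_dir.mkdir(parents=True)
        # Assumption: a placeholder page so that Git tracks the directory.
        (team_dir / "index.md").write_text(f"# {name} playbooks\n", encoding="utf-8")
        created.append(team_dir)
    return created


if __name__ == "__main__":
    # Hypothetical team names; in practice the list would come from OpsGenie.
    teams = ["Catalog Search", "ZMON", "Checkout & Payments"]
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "docs"
        first = seed_team_directories(root, teams)
        second = seed_team_directories(root, teams)  # re-run creates nothing new
        print([p.name for p in first])
        print(len(second))
```

Skipping directories that already exist keeps the step safe to re-run on every build, so newly created on-call teams get a directory without disturbing existing content.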
Playbook schema¶
Every playbook carries:
| Field | Purpose |
|---|---|
| Title | Brief description of the mitigation. |
| Trigger | Observable condition justifying execution. |
| Mean time to recover | Wall-clock from config-change to observable impact. |
| Operational health impact | Load / capacity freed; often quantified. |
| Business impact | Customer-visible consequence. |
| Steps | Actual sequence to execute. |
| Application (metadata) | Which application(s) the playbook covers. |
| Expiry date (metadata) | When the playbook must be re-reviewed. |
The schema is enforced by convention, a CI lint, and the PR template check-boxes.
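A CI lint over this schema might look roughly like the sketch below. The source does not describe the actual file layout, so the layout here is an assumption: a `# ` title, `key: value` metadata lines before the first section, and one `## ` heading per schema field. The names `parse_playbook`, `lint_playbook`, `application`, `expiry_date`, and the example playbook content are hypothetical.

```python
import json
import re
from datetime import date

REQUIRED_SECTIONS = [
    "Trigger",
    "Mean time to recover",
    "Operational health impact",
    "Business impact",
    "Steps",
]
REQUIRED_METADATA = ["application", "expiry_date"]


def parse_playbook(markdown: str) -> dict:
    """Extract the title, metadata lines, and '## ' sections from a playbook."""
    record = {"title": None, "metadata": {}, "sections": {}}
    current = None  # name of the section being read, if any
    for line in markdown.splitlines():
        if line.startswith("# ") and record["title"] is None:
            record["title"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            record["sections"][current] = []
        elif current is None and (m := re.match(r"^(\w+):\s*(.+)$", line)):
            record["metadata"][m.group(1)] = m.group(2).strip()
        elif current is not None:
            record["sections"][current].append(line)
    record["sections"] = {k: "\n".join(v).strip() for k, v in record["sections"].items()}
    return record


def lint_playbook(markdown: str, today: date) -> list[str]:
    """Return a list of human-readable schema problems (empty means OK)."""
    record = parse_playbook(markdown)
    problems = []
    if not record["title"]:
        problems.append("missing title")
    for name in REQUIRED_SECTIONS:
        if not record["sections"].get(name):
            problems.append(f"missing or empty section: {name}")
    for key in REQUIRED_METADATA:
        if key not in record["metadata"]:
            problems.append(f"missing metadata: {key}")
    expiry = record["metadata"].get("expiry_date")
    if expiry:
        try:
            if date.fromisoformat(expiry) < today:
                problems.append(f"expired on {expiry}")
        except ValueError:
            problems.append(f"expiry_date is not an ISO date: {expiry}")
    return problems


# Hypothetical playbook that omits the "Operational health impact" section.
EXAMPLE = """\
# Disable teasers on catalog pages
application: catalog-api
expiry_date: 2023-06-30

## Trigger
Elasticsearch overload on product-listing pages.

## Mean time to recover
A few minutes.

## Business impact
Teasers are hidden.

## Steps
1. Flip the teaser toggle.
"""

if __name__ == "__main__":
    print(lint_playbook(EXAMPLE, today=date(2023, 1, 30)))
    print(json.dumps(parse_playbook(EXAMPLE)["metadata"]))
```

The same parsed record could feed the per-build JSON metadata file, so the lint and the metadata export would agree on what a valid playbook is.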
Scale (as of 2023-01)¶
| Metric | Value |
|---|---|
| Playbooks | 1,200+ |
| Applications covered | 850+ |
| On-call teams | 100+ |
| Playbooks per application (avg) | 1.41 |
| Playbooks per team (avg) | ~12 |
| Distribution | Most apps have few; long tail has one or two. |
Growth pattern: Q3 spikes every year, tracking Cyber-Week preparations — canonical instance of annual peak event as capability forcing function.
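The two averages in the table above follow directly from the headline counts. Because the counts are lower bounds ("1,200+", "850+", "100+"), the ratios are approximate; the quick check below uses the bounds as if they were exact.

```python
# Headline counts from the Scale table, treated as exact for the arithmetic.
playbooks = 1200
applications = 850
teams = 100

per_app = playbooks / applications
per_team = playbooks / teams

print(f"playbooks per application: {per_app:.2f}")  # 1.41
print(f"playbooks per team: {per_team:.0f}")        # 12
```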
Worked examples in the 2023 post¶
- Catalog / product-listing pages: a playbook set for Elasticsearch-overload mitigation, ordered least-impact first (outfit-disable → sponsored-product-disable → teaser-disable). During a Cyber-Week catalog-latency incident, the set was executed in sequence while another sub-team diagnosed the root cause.
- ZMON: a metrics-ingestion-SLO playbook built on three-tier metrics criticality. Dropping tier-3 and tier-2 metrics on demand under TSDB overload yielded a 40% load reduction with a 2-minute MTTR and zero business impact. Canonical instance of patterns/drop-non-critical-metrics-under-tsdb-overload.
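The tiered metric-shedding idea in the ZMON example can be illustrated with a small sketch. It is not ZMON's implementation: the `Metric` type, the metric names, and the use of a simple metric count as a stand-in for ingestion load are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    tier: int  # 1 = most critical, 3 = least critical


def shed_metrics(metrics: list[Metric], max_tier_kept: int) -> list[Metric]:
    """Keep only metrics at or above the criticality cut-off."""
    return [m for m in metrics if m.tier <= max_tier_kept]


def degrade_until_healthy(metrics: list[Metric], capacity: int) -> list[Metric]:
    """Drop tier 3 first, then tier 2, until the load fits within capacity.

    Tier-1 metrics are never dropped, even if they alone exceed capacity.
    """
    kept = list(metrics)
    for max_tier in (3, 2, 1):
        kept = shed_metrics(metrics, max_tier)
        if len(kept) <= capacity:  # metric count is a stand-in for load
            break
    return kept


if __name__ == "__main__":
    # Hypothetical metric names and tier assignments.
    incoming = [
        Metric("checkout.errors", 1),
        Metric("api.latency.p99", 2),
        Metric("jvm.gc.pause", 3),
        Metric("debug.cache.hits", 3),
        Metric("orders.placed", 1),
    ]
    print([m.name for m in degrade_until_healthy(incoming, capacity=3)])
```

Shedding the least critical tier first mirrors the least-impact-first ordering of the catalog playbook set: each step trades away the cheapest signal before touching anything more important.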
Seen in¶
- Zalando incident playbooks (2019-present) — sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks — canonical wiki instance.
Related¶
- concepts/incident-playbook — the unit this site renders.
- concepts/pre-approved-degradation-procedure — the semantic property each playbook carries.
- concepts/playbook-expiry-date — one of the metadata fields emitted.
- concepts/production-readiness-review — the consumer workflow.
- patterns/playbooks-as-markdown-with-codeowners — the management pattern this site instantiates.
- patterns/playbook-metadata-integrated-with-app-reviews — the consumer integration.
- systems/opsgenie — directory source of truth.
- systems/zmon — one of the worked-example systems whose playbook lives here.
- companies/zalando — axis 28.