
SYSTEM Cited by 1 source

Zalando incident playbooks site

The Zalando incident playbooks site is the internal documentation site that renders Zalando's fleet-wide incident-playbook corpus: 1,200+ playbooks authored by 100+ on-call teams across 850+ applications as of the 2023-01 retrospective. It is a canonical wiki instance of the Markdown + CODEOWNERS playbook-management shape.

Architecture

  • Git repository of Markdown files — one file per playbook.
  • mkdocs renderer produces the documentation site.
  • Directory structure auto-generated from OpsGenie's on-call-team list — every team gets a pre-seeded empty directory to populate.
  • GitHub CODEOWNERS delegates review authority to departmental representatives skilled in operational excellence (not per-team service owners, not a central SRE team).
  • Pull-request template with check-box self-verification nudges new contributors through playbook guidelines. First line is a TODO for a "1-line summary of changes" — small but effective reviewer-context lever.
  • JSON metadata file emitted per build alongside the rendered site, carrying per-playbook fields (application, expiry date, trigger, MTTR, impacts, steps). Consumed by the application-review workflow per patterns/playbook-metadata-integrated-with-app-reviews.
  • Review authority evolved over time: from a 3-person SRE guild (2019) to departmental CODEOWNERS (post-2019, exact date not disclosed).
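
The directory seeding and CODEOWNERS delegation described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual tooling: the team list is passed in as plain strings rather than fetched from OpsGenie, and the function names, the `/playbooks/<team>/` path layout, the `.gitkeep` placeholder, and the reviewer handles are all assumptions.

```python
from pathlib import Path


def seed_team_directories(root: Path, teams: list[str]) -> list[Path]:
    """Create one directory per on-call team; return only the newly created ones.

    `teams` is assumed to be a list of directory-safe team slugs that some
    other step has already exported from the on-call tool.
    """
    created = []
    for team in sorted(teams):
        team_dir = root / team
        if not team_dir.exists():
            team_dir.mkdir(parents=True)
            # Git does not track empty directories, so add a placeholder file.
            (team_dir / ".gitkeep").touch()
            created.append(team_dir)
    return created


def render_codeowners(team_to_reviewers: dict[str, list[str]]) -> str:
    """Render CODEOWNERS lines mapping each team directory to its reviewers.

    The reviewers stand in for departmental representatives; the mapping
    itself is hypothetical input.
    """
    lines = []
    for team in sorted(team_to_reviewers):
        owners = " ".join(team_to_reviewers[team])
        lines.append(f"/playbooks/{team}/ {owners}")
    return "\n".join(lines) + "\n"
```

Keeping the seeding idempotent (existing directories are skipped) means the job can be re-run whenever the team list changes without touching playbooks that teams have already written.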

Playbook schema

Every playbook carries:

| Field | Purpose |
| --- | --- |
| Title | Brief description of the mitigation. |
| Trigger | Observable condition justifying execution. |
| Mean time to recover | Wall-clock from config-change to observable impact. |
| Operational health impact | Load / capacity freed; often quantified. |
| Business impact | Customer-visible consequence. |
| Steps | Actual sequence to execute. |
| Application (metadata) | Which application(s) the playbook covers. |
| Expiry date (metadata) | When the playbook must be re-reviewed. |

Schema enforced by convention + CI lint + PR template check-boxes.

Scale (as of 2023-01)

| Metric | Value |
| --- | --- |
| Playbooks | 1,200+ |
| Applications covered | 850+ |
| On-call teams | 100+ |
| Playbooks per application (avg) | 1.41 |
| Playbooks per team (avg) | ~12 |
| Distribution | Most apps have few; long tail has one or two. |
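
The two averages follow directly from the headline counts. Because the counts are reported as lower bounds ("1,200+", "850+", "100+"), the check below treats the bounds as exact values, which is an assumption; the real averages may differ slightly.

```python
# Headline counts, taken at their stated lower bounds (an assumption).
playbooks = 1200
applications = 850
teams = 100

per_application = playbooks / applications  # about 1.41
per_team = playbooks / teams                # 12.0

print(f"playbooks per application: {per_application:.2f}")
print(f"playbooks per team: {per_team:.0f}")
```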

Growth pattern: Q3 spikes every year, tracking Cyber-Week preparations — canonical instance of annual peak event as capability forcing function.

Worked examples in the 2023 post

  1. Catalog / product-listing pages — playbook set for Elasticsearch-overload mitigation. Ordered outfit-disable → sponsored-product-disable → teaser-disable (least-impact first). Cyber-Week catalog-latency incident used the set in sequence while another sub-team diagnosed root cause.
  2. ZMON — metrics-ingestion-SLO playbook with three-tier metrics criticality. Tier-3 + tier-2 dropped on demand under TSDB overload → 40% load reduction, 2-minute MTTR, zero business impact. Canonical instance of patterns/drop-non-critical-metrics-under-tsdb-overload.
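
The tier-based shedding in the ZMON example can be illustrated with a small filter. This is a sketch under assumptions, not ZMON's implementation: the `Metric` type, the `filter_ingest` function, and the sample metric names are hypothetical, and tier 1 is assumed to be the most critical tier that is always kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    tier: int  # assumed convention: 1 = most critical, 3 = least critical


def filter_ingest(metrics: list[Metric], dropped_tiers: set[int]) -> list[Metric]:
    """Return the metrics that are still ingested once the given tiers are shed."""
    if 1 in dropped_tiers:
        # Guard rail for this sketch: the most critical tier is never dropped.
        raise ValueError("tier 1 metrics are never dropped in this sketch")
    return [m for m in metrics if m.tier not in dropped_tiers]


# Made-up sample data, for illustration only.
metrics = [
    Metric("checkout.latency", 1),
    Metric("cache.hit_ratio", 2),
    Metric("debug.gc_pause", 3),
]

# Shed the least critical tier first, then the next one if overload persists.
after_first_step = filter_ingest(metrics, {3})
after_second_step = filter_ingest(metrics, {2, 3})
print([m.name for m in after_second_step])
```

Shedding by tier rather than by individual metric keeps the playbook step a single switch per tier, which mirrors the least-impact-first ordering used in the catalog example.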

