PATTERN Cited by 1 source
Playbooks as markdown with CODEOWNERS¶
Intent¶
Manage a fleet-scale corpus of incident playbooks (hundreds to thousands of documents, written by dozens to hundreds of teams) using Markdown files in a git repository rendered by mkdocs or similar, with GitHub CODEOWNERS delegating review authority to departmental representatives instead of routing every review through a central SRE team.
Context¶
Playbooks live at the intersection of three pressures that collaboration tools struggle to cover simultaneously:
- Structured data — each playbook has the same fields (trigger, MTTR, impacts, steps). A wiki is too free-form; a UI-driven form is too rigid.
- Cross-team authoring — hundreds of teams contribute. Central-team authorship doesn't scale; review by a single team becomes the bottleneck.
- Versioning + review — playbook steps mutate as systems evolve. You need diffs, review trails, and rollback — the shape that git provides natively.
Solution¶
- One Markdown file per playbook, in a git repo. Structure is imposed by convention + template + CI lint (e.g. every playbook's front matter must carry trigger, mttr, impact, steps; the file must start with a # Title H1). Markdown front matter holds the metadata that the metadata JSON build output consumes.
- mkdocs (or similar) renders the repo into a searchable documentation site. Readers browse by team → service → playbook; incident commanders search by trigger keyword.
- Directory structure auto-generated from the on-call source of truth (OpsGenie in Zalando's case, PagerDuty / Opslevel / internal registry elsewhere): every on-call team has a pre-seeded empty directory, so contributing is a matter of filling in an obvious location rather than creating a new path from scratch.
- Review via pull requests. Each playbook change is a PR; CI runs lint checks (required fields, expiry-date sanity, step syntax).
- CODEOWNERS delegates review authority to departmental representatives skilled in operational excellence, not to individual service teams. One reviewer can approve cross-team playbooks without rotating through every contributing team's owners.
- PR template with check-box self-verification nudges new contributors through the playbook guidelines. First line of the template is a TODO for a "1-line summary of changes" — gives reviewers context cheaply.
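The CI lint step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not Zalando's actual linter: the function names (`split_front_matter`, `lint_playbook`) are hypothetical, the required keys are assumed to be `trigger`, `mttr`, `impact`, `steps` (taking the MTTR field named in the Context section), and the front matter is assumed to be a `---`-fenced block of top-level `key: value` lines. A real gate would more likely use a full YAML parser.

```python
import re

# Assumed field names; the source lists trigger, MTTR, impact(s), and steps.
REQUIRED_FIELDS = ("trigger", "mttr", "impact", "steps")


def split_front_matter(text):
    """Return (metadata dict, body lines) for a '---'-fenced front-matter document."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, lines
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            meta = {}
            for line in lines[1:i]:
                # Only top-level "key: value" lines; indented lines are ignored.
                m = re.match(r"^([A-Za-z_][\w-]*):\s*(.*)$", line)
                if m:
                    meta[m.group(1)] = m.group(2).strip()
            return meta, lines[i + 1:]
    return {}, lines  # unterminated front matter is treated as missing


def lint_playbook(text):
    """Return a list of human-readable lint errors (empty list means the file passes)."""
    meta, body = split_front_matter(text)
    errors = [
        f"missing front-matter field: {field}"
        for field in REQUIRED_FIELDS
        if field not in meta
    ]
    # Assumption: the H1 rule applies to the first non-blank line after the front matter.
    first = next((line for line in body if line.strip()), "")
    if not re.match(r"^# \S", first):
        errors.append("body must start with a '# Title' H1")
    return errors


GOOD = """---
trigger: checkout error rate above 5%
mttr: 15m
impact: customers cannot place orders
steps: see below
---
# Disable recommendations widget

1. Flip the feature toggle.
"""

BAD = """---
trigger: x
impact: y
steps: z
---
Some text without a title
"""

print(lint_playbook(GOOD))
print(lint_playbook(BAD))
```

In a CI job, a non-empty error list for any changed file would fail the pull request, which is what turns the convention into an enforced schema.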
Benefits¶
- Cross-team review scales. PR mechanics are universal at GitHub / GitLab / internal equivalents; every engineer knows how to use them.
- Audit trail by construction. Every change has a commit author, a PR reviewer, a merge timestamp, and a diff. Pre-approval attestations inherit git's audit properties.
- Rollback is free. If a change broke a playbook, revert the commit.
- Structure through convention + CI, not UI constraint. Markdown + front matter + lint rules deliver the structured-data benefit without locking the corpus into a particular tool.
- Review delegation without central bottleneck. CODEOWNERS moves review authority from one central team to department-scoped reviewers, following Zalando's 2019→2023 evolution: "When we started in 2019 we had a team of 3 reviewers, who as part of the playbook reviews were committed throughout the year to explain the purpose/guidance of the playbooks and align these to a common standard. With sufficient examples and knowledge spread across the organization, we switched to using CODEOWNERS to delegate the reviews to representatives of the departments, skilled in operational excellence." (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)
- Composes with CI linting + metadata export. The same build step that renders the docs emits the expiry-aware metadata JSON consumed by the application-review workflow.
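The metadata export mentioned in the last bullet could look roughly like the sketch below. Everything specific here is an assumption for illustration: the source does not describe the JSON schema, so the `expires` front-matter key, the output keys, and the `build_metadata` function are hypothetical. The input is taken to be a mapping from repo-relative path to already-parsed front matter.

```python
import json
from datetime import date


def build_metadata(playbooks, today):
    """Serialise playbook front matter to JSON, flagging expired entries.

    playbooks: mapping of repo-relative path -> front-matter dict.
    The 'expires' key (ISO date string) is a hypothetical field name.
    """
    records = []
    for path, meta in sorted(playbooks.items()):
        expires = date.fromisoformat(meta["expires"])
        records.append({
            "path": path,
            "trigger": meta["trigger"],
            "expires": expires.isoformat(),
            "expired": expires < today,  # the "expiry-aware" part of the export
        })
    return json.dumps(records, indent=2)


example = {
    "team-a/checkout/disable-widget.md": {
        "trigger": "checkout error rate above 5%",
        "expires": "2023-01-01",
    },
}
print(build_metadata(example, today=date(2023, 6, 1)))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the build step deterministic and easy to test.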
Costs and caveats¶
- Markdown is not ideal for structured data. Zalando's explicit acknowledgment: "managing structured data in markdown is not ideal, despite the ability to use front matter for metadata. However, managing playbooks in a code repository provides us with easy means for cross-team reviews using pull requests." The trade-off is chosen in favour of git-native review.
- Lint rules become the schema. You need the CI gate to catch missing fields / malformed expiry dates / empty steps. Without it, the convention drifts.
- Search quality is only as good as the rendering. mkdocs search is fine for a few hundred files; at thousands, consider full-text search with field-aware queries.
- Incident-time access is a dependency. During an incident, the docs site has to be reachable. Playbooks that mitigate the failure modes that would break the docs site itself (network partition, authentication outage) need an out-of-band copy.
- Template discipline matters. A PR template with the TODO "1-line summary of changes" is called out by Zalando as a "small but effective" reviewer-context lever — small process nudges that make a big difference to review quality.
- CODEOWNERS maintenance. The departmental reviewer list has to be kept current as people change roles; stale CODEOWNERS means stale approvals.
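The CODEOWNERS maintenance cost above lends itself to a small automated check. The sketch below assumes a roster of current departmental reviewers is available from somewhere (the `current_reviewers` set is a stand-in; the source does not say where Zalando keeps this list) and reports CODEOWNERS rules that still name someone who is no longer on it. The handles, paths, and function names are hypothetical. Parsing covers only the basic GitHub CODEOWNERS shape: one path pattern per line followed by owners, with `#` comments.

```python
def parse_codeowners(text):
    """Return a list of (pattern, [owners]) tuples, skipping blanks and comments."""
    rules = []
    for raw in text.splitlines():
        tokens = raw.split()
        # Drop everything from the first token that starts a comment.
        for i, token in enumerate(tokens):
            if token.startswith("#"):
                tokens = tokens[:i]
                break
        if len(tokens) >= 2:
            rules.append((tokens[0], tokens[1:]))
    return rules


def stale_owners(codeowners_text, current_reviewers):
    """Map each pattern to the owners that are not in the current reviewer roster."""
    stale = {}
    for pattern, owners in parse_codeowners(codeowners_text):
        gone = [owner for owner in owners if owner not in current_reviewers]
        if gone:
            stale[pattern] = gone
    return stale


CODEOWNERS = """
# Departmental representatives (hypothetical handles)
/playbooks/logistics/ @alice @bob
/playbooks/payments/  @carol
"""

print(stale_owners(CODEOWNERS, {"@alice", "@carol"}))
```

Run on a schedule, a check like this surfaces stale reviewers before they become stale approvals.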
Known uses¶
- Zalando — 1,200+ playbooks, 100+ on-call teams, 850+ applications. Directory layout auto-generated from OpsGenie; review via CODEOWNERS to departmental representatives; PR template with self-verification checklist and 1-line-summary TODO. Canonical wiki instance. (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)
Related patterns¶
- patterns/playbook-metadata-integrated-with-app-reviews — the consumer of the metadata this pattern emits.
- patterns/template-project-nudges-consistency — the same "defaults + nudge" organisational lever applied to service templates.
- concepts/sre-organizational-evolution — the grassroots → committed-team → delegated-reviewers arc this pattern sits inside.
Anti-patterns¶
- Central-team-as-bottleneck review forever — Zalando started there in 2019 and the post explicitly names the switch to delegated review as a scaling requirement.
- Per-service-team review only — misses the cross-team consistency that the central team or departmental-rep layer provides.
- Out-of-git-system for playbooks — loses PR mechanics / audit trail / diffs.
- UI-driven structured playbook authoring with no git export — wins structured-data but loses cross-team review substrate.
Related¶
- concepts/incident-playbook — the unit being managed.
- concepts/pre-approved-degradation-procedure — the approval semantics this pattern stores.
- concepts/playbook-expiry-date — the freshness field this pattern emits.
- concepts/sre-organizational-evolution — the org-evolution shape this pattern's review-delegation matches.
- systems/zalando-incident-playbooks-site — Zalando's rendered artefact.
- systems/opsgenie — the on-call directory source.
- companies/zalando — axis 28.