Playbooks as markdown with CODEOWNERS

Intent

Manage a fleet-scale corpus of incident playbooks (hundreds to thousands of documents, written by dozens to hundreds of teams) using Markdown files in a git repository rendered by mkdocs or similar, with GitHub CODEOWNERS delegating review authority to departmental representatives instead of routing every review through a central SRE team.

Context

Playbooks live at the intersection of three pressures that collaboration tools struggle to cover simultaneously:

  • Structured data — each playbook has the same fields (trigger, MTTR, impacts, steps). A wiki is too free-form; a UI-driven form is too rigid.
  • Cross-team authoring — hundreds of teams contribute. Central-team authorship doesn't scale; review by a single team becomes the bottleneck.
  • Versioning + review — playbook steps mutate as systems evolve. You need diffs, review trails, and rollback — the shape that git provides natively.

Solution

  1. One Markdown file per playbook, in a git repo. Structure is imposed by convention + template + CI lint (e.g. every playbook's front matter must carry trigger, mttr, impact, steps; the document body must start with a # Title H1). The front matter holds the metadata that the build step exports as metadata JSON.
  2. mkdocs (or similar) renders the repo into a searchable documentation site. Readers browse by team → service → playbook; incident commanders search by trigger keyword.
  3. Directory structure auto-generated from the on-call source of truth (OpsGenie in Zalando's case, PagerDuty / Opslevel / internal registry elsewhere): every on-call team has a pre-seeded empty directory, so contributing is a matter of filling in an obvious location rather than creating a new path from scratch.
  4. Review via pull requests. Each playbook change is a PR; CI runs lint checks (required fields, expiry-date sanity, step syntax).
  5. CODEOWNERS delegates review authority to departmental representatives skilled in operational excellence, not to individual service teams. One reviewer can approve cross-team playbooks without rotating through every contributing team's owners.
  6. A PR template with check-box self-verification nudges new contributors through the playbook guidelines. The first line of the template is a TODO for a "1-line summary of changes", which gives reviewers context cheaply.
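
The CI lint in steps 1 and 4 can be sketched as follows. This is a minimal illustration, not Zalando's implementation: the field names in REQUIRED_FIELDS, the `---` front matter delimiters, and the function names are assumptions, and the hand-rolled parser only understands top-level `key: value` lines (a real gate would more likely use a YAML library).

```python
import re

# Assumed field names, following the fields listed under Context.
REQUIRED_FIELDS = ("trigger", "mttr", "impact", "steps")


def split_front_matter(text):
    """Return (metadata, body) for a document opening with a '---' block.

    Only top-level 'key: value' lines are read; nested or list lines
    (indented or starting with '-') are skipped.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        if ":" in line and not line.startswith((" ", "\t", "-")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


def lint_playbook(text):
    """Return a list of human-readable lint errors (empty if the file passes)."""
    errors = []
    meta, body = split_front_matter(text)
    for field in REQUIRED_FIELDS:
        if field not in meta:
            errors.append(f"missing front matter field: {field}")
    # The first non-blank line after the front matter must be an H1.
    first = next((l for l in body.splitlines() if l.strip()), "")
    if not re.match(r"^# \S", first):
        errors.append("body must start with a '# Title' H1")
    return errors
```

A CI job would run `lint_playbook` over every changed file and fail the pull request if any error list is non-empty.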

Benefits

  • Cross-team review scales. PR mechanics are universal at GitHub / GitLab / internal equivalents; every engineer knows how to use them.
  • Audit trail by construction. Every change has a commit author, a PR reviewer, a merge timestamp, and a diff. Pre-approval attestations inherit git's audit properties.
  • Rollback is free. If a change broke a playbook, revert the commit.
  • Structure through convention + CI, not UI constraint. Markdown + front matter + lint rules deliver the structured-data benefit without locking the corpus into a particular tool.
  • Review delegation without central bottleneck. CODEOWNERS moves review authority from one central team to department-scoped reviewers, following Zalando's 2019→2023 evolution: "When we started in 2019 we had a team of 3 reviewers, who as part of the playbook reviews were committed throughout the year to explain the purpose/guidance of the playbooks and align these to a common standard. With sufficient examples and knowledge spread across the organization, we switched to using CODEOWNERS to delegate the reviews to representatives of the departments, skilled in operational excellence." (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)
  • Composes with CI linting + metadata export. The same build step that renders the docs emits the expiry-aware metadata JSON consumed by the application-review workflow.
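
As a rough sketch of what an expiry-aware metadata export could look like: the source does not describe the JSON shape, so the `expires` field (assumed to be an ISO date in the front matter), the output keys, and the function name below are all hypothetical.

```python
import json
from datetime import date


def build_metadata(playbooks, today):
    """Serialize playbook front matter to JSON, flagging expired entries.

    `playbooks` maps a repo-relative path to its parsed front matter dict.
    The 'expires' key is an assumed field holding an ISO date string.
    """
    entries = []
    for path, meta in sorted(playbooks.items()):
        expires = date.fromisoformat(meta["expires"])
        entries.append({
            "path": path,
            "trigger": meta.get("trigger", ""),
            "expires": expires.isoformat(),
            "expired": expires < today,
        })
    return json.dumps({"playbooks": entries}, indent=2)
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the export reproducible in CI and easy to test.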

Costs and caveats

  • Markdown is not ideal for structured data. Zalando's explicit acknowledgment: "managing structured data in markdown is not ideal, despite the ability to use front matter for metadata. However, managing playbooks in a code repository provides us with easy means for cross-team reviews using pull requests." The trade-off is chosen in favour of git-native review.
  • Lint rules become the schema. You need the CI gate to catch missing fields / malformed expiry dates / empty steps. Without it, the convention drifts.
  • Search quality is only as good as the rendering. mkdocs search is fine for a few hundred files; at thousands, consider full-text search with field-aware queries.
  • Incident-time access is a dependency. During an incident, the docs site has to be reachable. Playbooks that mitigate the failure modes that would break the docs site itself (network partition, authentication outage) need an out-of-band copy.
  • Template discipline matters. Zalando calls out the PR template's TODO for a "1-line summary of changes" as a "small but effective" way to give reviewers context. Small process nudges like this can noticeably improve review quality.
  • CODEOWNERS maintenance. The departmental reviewer list has to be kept current as people change roles; stale CODEOWNERS means stale approvals.
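
One way to reduce CODEOWNERS drift is to generate the file from a single department-to-reviewer mapping instead of editing it by hand. The sketch below assumes a `/playbooks/<department>/` layout and a hypothetical `render_codeowners` helper; neither comes from the source. It emits the standard `pattern @owner ...` line format that GitHub reads from a CODEOWNERS file.

```python
def render_codeowners(reviewers_by_department, root="/playbooks"):
    """Render CODEOWNERS text from a {department: [handles]} mapping.

    The directory layout (root/department/) is an assumption for illustration.
    Raises ValueError if a department has no reviewers, so a stale mapping
    fails the build instead of silently leaving a directory unowned.
    """
    lines = ["# Generated file: edit the department reviewer mapping instead."]
    for department in sorted(reviewers_by_department):
        owners = reviewers_by_department[department]
        if not owners:
            raise ValueError(f"department {department!r} has no reviewers")
        handles = " ".join("@" + owner.lstrip("@") for owner in owners)
        lines.append(f"{root}/{department}/ {handles}")
    return "\n".join(lines) + "\n"
```

Running this in CI and failing when the committed CODEOWNERS differs from the generated text would surface role changes as a visible diff.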

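Known uses

  • Zalando: incident playbooks kept as Markdown in a code repository, with directories seeded from OpsGenie and reviews delegated to departmental representatives via CODEOWNERS (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)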

Anti-patterns

  • Keeping the central team as the review bottleneck forever: Zalando started there in 2019, and the post explicitly names the switch to delegated review as a scaling requirement.
  • Per-service-team review only: misses the cross-team consistency that the central team or departmental-rep layer provides.
  • Keeping playbooks outside git: loses PR mechanics, audit trail, and diffs.
  • UI-driven structured playbook authoring with no git export: gains structured data but loses the cross-team review substrate.