CONCEPT Cited by 1 source
System-level playbook scope
System-level playbook scope is the design decision to write incident playbooks at the system altitude (several collaborating microservices) rather than the per-microservice altitude. The trade-off is more available mitigations (system-wide feature disables, cross-service capacity shifts, integration-level cut-offs) at the cost of more cross-team coordination at authoring time.
Zalando's framing
"More often than not, our playbooks cover the whole system (a few microservices) instead of its individual components being covered through separate procedures. When the bigger system context is considered, there are more options available to mitigate issues." — sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks
The claim is load-bearing: mitigation options are a function of scope, not of authoring effort. A playbook scoped to service A alone has access only to service A's mitigation levers (disable a feature flag, drop a cache tier, reduce a worker count). A playbook scoped to the five-service catalog system has access to all five services' levers plus integration-level mitigations (short-circuit the call from service B to service C; swap service D's dependency on service E for a cached fallback).
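The scope argument can be sketched as a small model: each mitigation lever touches a set of services, and a playbook can only document levers that fall entirely inside its scope. This is an illustrative sketch, not Zalando's tooling; the `Lever` type, the `levers_in_scope` helper, and the lever list are all assumptions built from the examples in the paragraph above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Lever:
    """A hypothetical mitigation action and the services it touches."""
    name: str
    services: frozenset[str]


def levers_in_scope(levers: list[Lever], scope: set[str]) -> list[Lever]:
    """Return the levers whose touched services all lie inside the playbook scope."""
    frozen_scope = frozenset(scope)
    return [lever for lever in levers if lever.services <= frozen_scope]


# Illustrative levers only; service names A..E mirror the prose example.
LEVERS = [
    Lever("disable feature flag in A", frozenset({"A"})),
    Lever("reduce worker count in A", frozenset({"A"})),
    Lever("drop cache tier in B", frozenset({"B"})),
    Lever("short-circuit call from B to C", frozenset({"B", "C"})),
    Lever("swap D's dependency on E for a cached fallback", frozenset({"D", "E"})),
]

service_scope = levers_in_scope(LEVERS, {"A"})
system_scope = levers_in_scope(LEVERS, {"A", "B", "C", "D", "E"})

print(len(service_scope), len(system_scope))  # 2 5
```

The integration-level levers (B to C, D to E) only appear once the scope contains both endpoints, which is the sense in which mitigation options grow with scope rather than with authoring effort.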
Why per-service scope loses mitigations
Three structural reasons:
- Service boundaries are not failure boundaries. An Elasticsearch overload on the catalog system could be triggered by any of the services that query it. A per-service playbook for service A offers only service-A-specific mitigations; the highest-leverage move ("disable outfit calls from ALL catalog services") is a cross-service action.
- Compositional mitigations don't exist at service scope. Turning off service X to protect service Y is a two-service fact; neither service's per-service playbook can legitimately document it without referencing the other.
- Business-impact reasoning is at system scope. The business owner who pre-approves a degradation (concepts/pre-approved-degradation-procedure) reasons about customer-visible functionality ("outfits on catalog pages"), not about individual services. The pre-approval has to be expressed at the altitude the owner thinks at.
The trade-off: authoring coordination
System-level scope requires playbook authors to understand and coordinate across multiple teams. Three costs:
- Ownership is compound. No single team owns the system; the playbook needs consensus from all services' teams that the steps are safe to execute.
- Updates propagate more slowly. A service-internal change that would naturally update a per-service playbook may go unnoticed at system scope; the playbook-expiry forcing function has to compensate by prompting periodic re-review.
- Review gets harder. CODEOWNERS at system scope includes representatives from every contributing team, which makes every review a cross-team negotiation.
Zalando's mitigation is organisational — the playbook review authority is delegated to departmental representatives skilled in operational excellence rather than per-team service owners, so one reviewer can approve cross-team impact without rotating through N individual approvals.
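The review-cost difference can be sketched by counting who must approve a system-scope playbook under each model. This is a hypothetical illustration: only the four service names come from the source; the team names, the delegate name, and the `required_reviewers` helper are assumptions.

```python
from __future__ import annotations

# Hypothetical ownership data; only the service names appear in the source.
TEAM_OF = {
    "article-grid": "team-grid",
    "outfit": "team-outfit",
    "sponsored-product": "team-ads",
    "teaser": "team-teaser",
}

# Hypothetical delegation: every team in the department maps to one representative.
DELEGATE_OF = {team: "catalog-dept-representative" for team in TEAM_OF.values()}


def required_reviewers(
    services: list[str],
    team_of: dict[str, str],
    delegate_of: dict[str, str] | None = None,
) -> set[str]:
    """Return the reviewers needed for a playbook covering the given services.

    Without delegation, every owning team must approve. With delegation,
    each team's approval is replaced by its departmental representative.
    """
    teams = {team_of[service] for service in services}
    if delegate_of is None:
        return teams
    return {delegate_of[team] for team in teams}


catalog_services = list(TEAM_OF)
per_team = required_reviewers(catalog_services, TEAM_OF)
delegated = required_reviewers(catalog_services, TEAM_OF, DELEGATE_OF)

print(len(per_team), len(delegated))  # 4 1
```

Under these assumed mappings, the per-team model needs one approval per contributing team, while the delegated model collapses them into a single reviewer.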
When per-service scope is right
Not every playbook belongs at system scope:
- Deployment-layer mitigations (rolling restart, version rollback, scale adjustment) are per-service by construction — they don't need cross-team context.
- Dependency-specific degradations (fall back from Redis to DB; cached response on auth-service timeout) live at the service that has the dependency.
- Infrastructure-level runbooks (cloud-provider failover, region switch, storage pool migration) are per-fleet, not per-system.
The rule: playbooks that degrade customer-visible functionality are system-scope; playbooks that preserve customer-visible functionality by absorbing a failure are service-scope.
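The rule above reduces to a single question about the playbook's effect on customers. A minimal sketch follows, assuming a boolean answer to that question is available at authoring time; the `Scope` enum and `recommended_scope` function are hypothetical names, and per-fleet infrastructure runbooks fall outside this two-way split.

```python
from enum import Enum


class Scope(Enum):
    SYSTEM = "system"
    SERVICE = "service"


def recommended_scope(degrades_customer_visible_functionality: bool) -> Scope:
    """Apply the rule: degrading playbooks are system-scope, absorbing ones are service-scope."""
    if degrades_customer_visible_functionality:
        return Scope.SYSTEM
    return Scope.SERVICE


# Examples drawn from the text: disabling outfits removes visible functionality,
# while a Redis-to-DB fallback absorbs the failure without changing what customers see.
print(recommended_scope(True).value)   # system
print(recommended_scope(False).value)  # service
```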
Relationship to distributed-monolith risk
If a system's services are so tightly coupled that every playbook is inherently system-scope, and no service can be mitigated independently, the system has distributed-monolith symptoms. The playbook scope is diagnostic: a healthy microservices architecture has a mix of system-scope and service-scope playbooks. A playbook set that is entirely system-scope is a smell.
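The diagnostic can be sketched as a check over a system's playbook inventory. This assumes each playbook carries a scope label of `"system"` or `"service"`; that labelling and the `all_system_scope_smell` function are hypothetical, and the check flags only the extreme case named in the text rather than any particular ratio.

```python
from __future__ import annotations


def all_system_scope_smell(playbook_scopes: list[str]) -> bool:
    """Return True when a non-empty playbook set contains only system-scope playbooks."""
    return bool(playbook_scopes) and all(scope == "system" for scope in playbook_scopes)


mixed = ["system", "service", "system"]
coupled = ["system", "system", "system"]

print(all_system_scope_smell(mixed))    # False
print(all_system_scope_smell(coupled))  # True
```

An empty inventory returns `False` here, on the assumption that having no playbooks is a different problem from having only system-scope ones.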
Seen in
- Zalando catalog playbooks (2023) — sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks — canonical explicit statement of the scope decision; catalog-pages playbook set is system-scope across the article-grid + outfit + sponsored-product + teaser services.
Related
- concepts/incident-playbook — the unit being scoped.
- concepts/playbook-ordering-by-business-impact — ordering operates within a system-scoped set.
- concepts/graceful-degradation — the property the system-scope playbook mitigations realise.
- concepts/critical-business-operation — business-owner's reasoning altitude; aligns with system scope.
- concepts/distributed-monolith — the anti-pattern where all playbooks end up system-scope.
- companies/zalando — axis 28.