CONCEPT Cited by 1 source

Playbook ordering by business impact

Playbook ordering by business impact is the discipline of sorting the set of incident playbooks for a given system from least business impact → most business impact, so that an incident responder walking the set during a capacity / load / dependency outage applies the cheapest mitigations first and escalates to larger ones only if needed.

The Zalando catalog example

Zalando's catalog / product-listing pages pull data from multiple sources: article grid, outfit recommendations, sponsored products, teasers. Each source costs load on Elasticsearch + downstream APIs. The playbook set is ordered:

  1. Disable outfit calls: small UX impact ("outfits won't be shown as part of the catalog pages").
  2. Disable sponsored products: monetisation impact, the next tier of business impact.
  3. Disable teasers: editorial content impact, one tier further up.
  4. (later steps degrade further if needed.)

The incident that validated the ordering: "In one of our evening Cyber Week shifts, we encountered performance degradation resulting in increased latencies, which was hard to diagnose. While one part of the team was busy troubleshooting the issue, another part of the team executed multiple of the prepared playbooks in sequence in order to mitigate the customer impact." (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)

Why ordering is a property of the set, not the playbook

Individual playbooks can't self-order — "disable outfits" doesn't know whether "disable sponsored products" exists as an alternative. The ordering is an editorial decision at the playbook-set level, made once by the team that owns the system. Canonical design: per-system playbook directories with an explicit preferred order in the index, reviewed at authoring time, refreshed when new playbooks are added.
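
As a rough illustration of ordering living at the set level, the sketch below models a per-system index that assigns each playbook an explicit rank. The `Playbook` and `PlaybookIndex` names, the `rank` field, and the playbook identifiers are all assumptions made for this example; the source does not describe Zalando's actual index format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Playbook:
    name: str
    rank: int             # position in the preferred order; 1 = least business impact
    business_impact: str  # free-text note written at authoring time


@dataclass
class PlaybookIndex:
    system: str
    playbooks: list[Playbook] = field(default_factory=list)

    def add(self, playbook: Playbook) -> None:
        # Reject duplicate ranks so that adding a playbook forces an ordering review.
        if any(p.rank == playbook.rank for p in self.playbooks):
            raise ValueError(f"rank {playbook.rank} already used in {self.system}")
        self.playbooks.append(playbook)

    def ordered(self) -> list[Playbook]:
        # The order comes from the index, never from the playbooks themselves.
        return sorted(self.playbooks, key=lambda p: p.rank)


# Hypothetical index for the catalog example; insertion order is irrelevant.
catalog = PlaybookIndex("catalog")
catalog.add(Playbook("disable-teasers", 3, "editorial content not shown"))
catalog.add(Playbook("disable-outfits", 1, "outfits not shown on catalog pages"))
catalog.add(Playbook("disable-sponsored-products", 2, "monetisation impact"))
ordered_names = [p.name for p in catalog.ordered()]
```

Rejecting a duplicate rank is one possible way to make "refreshed when new playbooks are added" mechanical: a new playbook cannot join the set without someone deciding where it sits relative to the others.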

The least-impact-first rule

The rule is: if two mitigations yield comparable operational impact, apply the smaller-business-impact one first. Three structural reasons:

  1. Reversibility is cheaper. If disable outfits suffices, the team never pays the disable sponsored products cost.
  2. Business-owner optics. Escalating degradations in order of visible impact signals discipline to stakeholders — you took the smallest hit available before taking a bigger one.
  3. Fail-forward information. If playbook 1 doesn't resolve the incident, that's diagnostic signal — the root cause is larger than the first mitigation addressed. Skipping to the biggest playbook first loses that signal.
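
The rule can be sketched as a loop that applies playbooks in order and stops as soon as the system recovers. This is a toy model under stated assumptions: `walk_playbooks`, the latency numbers, the per-playbook savings, and the 500 ms recovery threshold are all invented for illustration and do not come from the source.

```python
from typing import Callable


def walk_playbooks(
    ordered: list[tuple[str, Callable[[], None]]],
    recovered: Callable[[], bool],
) -> tuple[list[str], bool]:
    """Apply playbooks from least to most business impact until recovery.

    Returns the names of the playbooks applied and whether the system recovered.
    """
    applied: list[str] = []
    for name, apply in ordered:
        if recovered():
            break  # cheaper mitigations sufficed; skip the costlier ones
        apply()
        applied.append(name)
    return applied, recovered()


# Toy model: each disabled feature removes a fixed amount of latency (made-up numbers).
state = {"latency_ms": 900}
savings = {
    "disable-outfits": 200,
    "disable-sponsored-products": 300,
    "disable-teasers": 250,
}


def make_action(name: str) -> Callable[[], None]:
    def action() -> None:
        state["latency_ms"] -= savings[name]
    return action


playbooks = [(name, make_action(name)) for name in savings]
applied, ok = walk_playbooks(playbooks, lambda: state["latency_ms"] <= 500)
```

In this run the first two playbooks are applied and the third is never paid for. The `applied` list doubles as the fail-forward record: it shows which mitigations were tried and proved insufficient on their own.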

Counter-example: time-critical triggers

When the trigger is a hard deadline (TSDB overflow at a specific load, lock timeout, quorum loss), ordering may be overridden by the fastest-effective-playbook-first rule — a playbook that costs more business impact but takes effect in 10 seconds is preferred over a smaller one that takes effect in 3 minutes. The Zalando ZMON example tilts this way: dropping non-critical metrics has zero business impact, so ordering and time-criticality agree; the general case may not.
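
A minimal sketch of how a deadline might override the default ordering follows. It implements a variant of the rule: among playbooks that take effect before the deadline it still prefers the least business impact, and only falls back to the fastest playbook when nothing lands in time. The `Candidate` type, the numeric impact scores, and `pick_next` are hypothetical; only the 10-second and 3-minute figures echo the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    business_impact: int      # hypothetical score; lower is cheaper
    seconds_to_effect: float  # hypothetical estimate of time until it takes effect


def pick_next(candidates: list[Candidate], deadline_s: Optional[float]) -> Candidate:
    """Pick the least-impact playbook unless a deadline rules it out."""
    if deadline_s is None:
        # No hard deadline: plain least-impact-first.
        return min(candidates, key=lambda c: c.business_impact)
    in_time = [c for c in candidates if c.seconds_to_effect <= deadline_s]
    if in_time:
        # Among playbooks that land before the deadline, still prefer least impact.
        return min(in_time, key=lambda c: c.business_impact)
    # Nothing lands in time: fall back to whichever takes effect fastest.
    return min(candidates, key=lambda c: c.seconds_to_effect)


candidates = [
    Candidate("small-but-slow", business_impact=1, seconds_to_effect=180),
    Candidate("large-but-fast", business_impact=5, seconds_to_effect=10),
]
no_deadline = pick_next(candidates, None)
tight_deadline = pick_next(candidates, 60)
```

With no deadline the cheaper playbook wins; with 60 seconds to spare, the costlier but faster one is chosen because the cheaper one would arrive too late.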

When the rule is wrong

  • Unknown dependency order. If disabling a feature inadvertently increases load on a shared component (because it changes query patterns), the wrong-ordering risk is real. The fix is to validate ordering in game days — execute each playbook in sequence in a rehearsal environment and confirm operational impact matches the doc.
  • Regulatory / contractual floor. Some features can't be disabled without legal review (billing, consent). Those playbooks shouldn't be in the ordering at all; the floor goes in the trigger, not the sequence.
  • Correlated business impact. If "disable sponsored" and "disable teasers" together take out 80% of the revenue-driving UI but each alone is acceptable, the ordering has to surface that "don't execute both" constraint explicitly: the playbook set needs a "combined impact" annotation.
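
One possible shape for a "combined impact" annotation is a list of forbidden combinations checked before each playbook is executed. The sketch below assumes this representation; `check_combination`, `forbidden_combos`, and the playbook identifiers are hypothetical names, not part of any documented tooling.

```python
def check_combination(
    applied: set[str],
    candidate: str,
    forbidden: list[frozenset[str]],
) -> list[frozenset[str]]:
    """Return the forbidden combinations that applying `candidate` would complete."""
    after = applied | {candidate}
    return [combo for combo in forbidden if candidate in combo and combo <= after]


# Annotation on the playbook set: these two must not be active at the same time.
forbidden_combos = [frozenset({"disable-sponsored-products", "disable-teasers"})]

# Sponsored products are already disabled, so teasers would complete the forbidden pair.
violations = check_combination(
    {"disable-outfits", "disable-sponsored-products"},
    "disable-teasers",
    forbidden_combos,
)

# Only outfits are disabled, so disabling teasers alone is acceptable.
safe = check_combination({"disable-outfits"}, "disable-teasers", forbidden_combos)
```

A non-empty result would be the point at which the responder stops and escalates rather than continuing down the sequence.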

Seen in
