CONCEPT Cited by 1 source

Pre-approved degradation procedure

A pre-approved degradation procedure is a mitigation step whose business-impact trade-off has been judged and approved by the business owner of the affected functionality in advance of any incident, so that the on-call responder can execute it during an outage without real-time cross-functional decision making. The pre-approval is the load-bearing artefact: it turns a "should we do this?" question (slow, cross-timezone, cross-team) into a "do this" instruction (fast, single-owner).

Definition

The canonical framing from Zalando:

"These emergency procedures are pre-approved by the respective Business Owner of the underlying functionality, allowing for quicker incident response without the need for explicit decision making while critical issues are ongoing." (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks)

The semantic move: approval is a property of the degradation option, not of each incident. The business owner pre-approves "disable outfits from catalog pages when Elasticsearch outfit-query CPU > 90%" once; every future incident that matches the trigger executes without re-asking.

Why pre-approval matters

Real-time approval during an incident is often the largest hidden source of latency in incident response. It fails in three ways:

  1. Responder can't reach the business owner: 3 AM, different timezone, phone off. The outage continues while pages go unanswered.
  2. Business owner doesn't have the context: "what's the impact if we do this?" is a question the responder is supposed to answer, but the responder is also mid-incident. Context-gathering takes longer than the incident can afford.
  3. Nobody wants to own the decision: cross-team accountability for a degradation is diffuse. "Should we turn off sponsored products?" requires the product team, the monetisation team, the platform team, and an exec to all agree, at least in principle.

Pre-approval collapses all three into a document reviewed at design-time, when the analysis is not under outage-clock pressure.

What the business owner actually approves

The approval is scoped to a specific trigger and business impact, together with the expected recovery time:

  • Trigger — the observable condition that justifies executing (e.g. "metrics-ingestion SLO at risk of breach").
  • Business impact — what the customer experiences (e.g. "outfits won't be shown on catalog pages", "user dashboards will show tier-1 metrics only").
  • MTTR — how long the degradation lasts (both how fast it takes effect and how fast it's reversed after).

The approval is not a blank check to degrade any time — it's bounded to the trigger. If the trigger doesn't fire, the procedure isn't executed. The business owner is approving "in the circumstance where X is the only alternative", not "whenever you feel like it".
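
A minimal sketch of how a trigger-bounded approval could be represented. The `Playbook` record, the `may_execute` helper, and the `es_outfit_query_cpu_pct` metric name are hypothetical, loosely modelled on the outfits example above; nothing here reflects Zalando's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Playbook:
    """Hypothetical record of one pre-approved degradation option."""
    name: str
    trigger: Callable[[Mapping[str, float]], bool]  # observable condition
    business_impact: str                            # what the customer sees
    approved_by: str                                # business owner role


def may_execute(playbook: Playbook, metrics: Mapping[str, float]) -> bool:
    """The approval only covers executions where the trigger has fired."""
    return playbook.trigger(metrics)


# Illustrative instance; the metric name and threshold are assumptions.
disable_outfits = Playbook(
    name="disable-outfits-on-catalog",
    trigger=lambda m: m.get("es_outfit_query_cpu_pct", 0.0) > 90.0,
    business_impact="Outfits won't be shown on catalog pages",
    approved_by="head of catalog experience",
)

print(may_execute(disable_outfits, {"es_outfit_query_cpu_pct": 95.0}))
print(may_execute(disable_outfits, {"es_outfit_query_cpu_pct": 40.0}))
```

The point of keeping the trigger inside the record is that the responder never evaluates "should we?" by judgment: if the predicate is false, the approval simply does not apply.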

Revocation and review

The business context changes over time (new product launches, new revenue contracts, new regulatory obligations) so the pre-approval has to be periodically reconfirmed. Zalando's mechanism is the expiry date on each playbook: once it expires, the team is nudged to re-run the approval with the current business owner. Implicit in this is that approvals don't bind the business indefinitely — the playbook is a renewable authorisation, not a permanent grant.
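
The expiry-date mechanism could be checked along the following lines. This is a sketch under assumptions: the `approval_status` function, the 30-day warning window, and the `due_soon` state are illustrative and not described in the source, which only states that an expired playbook prompts the team to re-run the approval.

```python
from datetime import date, timedelta


def approval_status(expires_on: date, today: date, warn_days: int = 30) -> str:
    """Classify a playbook approval by its expiry date.

    'expired'  -> the team should re-run the approval with the current owner
    'due_soon' -> (assumed) early nudge before the expiry date is reached
    'valid'    -> nothing to do
    """
    if today > expires_on:
        return "expired"
    if expires_on - today <= timedelta(days=warn_days):
        return "due_soon"
    return "valid"


expiry = date(2024, 6, 30)
print(approval_status(expiry, date(2024, 1, 1)))
print(approval_status(expiry, date(2024, 6, 15)))
print(approval_status(expiry, date(2024, 7, 1)))
```

A periodic job running a check like this over every playbook is one plausible way to produce the renewal nudge; how the nudge is actually delivered is not specified in the source.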

Scope-matching

The responder's authority to execute a pre-approved degradation is only as wide as the owner's scope. For a system-wide playbook (system-level scope) covering multiple microservices, the approver is typically the senior owner of the user-facing business capability ("head of catalog experience"), not the individual service-team leads. Same altitude-matching as the top-down SLO ownership rule on operation-based SLOs — the authorisation scope has to match the business-impact scope, or the approval doesn't cover what the playbook actually does.
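
Scope-matching can be illustrated as a subset check: the approval is valid only if everything the playbook touches falls inside the approver's scope. The service names and the `approval_covers` helper below are hypothetical, and modelling scope as a set of services is itself a simplifying assumption.

```python
from typing import AbstractSet


def approval_covers(approver_scope: AbstractSet[str],
                    impacted_services: AbstractSet[str]) -> bool:
    """True if every impacted service lies within the approver's scope."""
    return impacted_services <= approver_scope


# Hypothetical scopes: a capability owner versus a single service-team lead.
head_of_catalog = {"catalog-api", "outfit-service", "search-frontend"}
outfit_team_lead = {"outfit-service"}

# A system-level playbook that touches more than one microservice.
playbook_impact = {"catalog-api", "outfit-service"}

print(approval_covers(head_of_catalog, playbook_impact))
print(approval_covers(outfit_team_lead, playbook_impact))
```

With these sets, the capability owner's approval covers the playbook while the service-team lead's does not, which mirrors the altitude-matching rule described above.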

Relationship to operational concepts

  • Graceful degradation (concepts/graceful-degradation) is the architectural property; pre-approval is the operational enabler. Without pre-approval, designed-in degradation hooks sit unused because the responder can't get permission in time.
  • Circuit breakers trip without human approval; pre-approved degradation procedures are the human-authorised layer above that, for the mitigations that a circuit breaker can't cover (feature-flag flips, module disables, capacity shedding).
  • Blast radius (concepts/blast-radius) is bounded by the pre-approval: the approver has consciously said yes to the approved radius, which means the responder can execute a larger-impact playbook (dropping a tier-2 feature fleet-wide) that would otherwise require emergency leadership input.
  • Alert fatigue (concepts/alert-fatigue) reduces because the page arrives with a pre-authorised response, not an open-ended troubleshooting task.

Seen in

  • Zalando incident playbooks, 2019-present (sources/2023-01-30-zalando-how-we-manage-our-1200-incident-playbooks): 1,200+ playbooks, each pre-approved by the relevant Business Owner. Canonical wiki instance. Expiry-date-based renewal. Application-review workflow requires playbook presence for applications above a certain criticality tier.