CONCEPT

Change-triggered incident rate

Definition

Change-triggered incident rate is the fraction of an organisation's customer-facing incidents that are caused by the organisation's own changes — code deploys, config rollouts, schema migrations, infra mutations — as opposed to external causes (hardware faults, network issues, dependency failures, traffic spikes, customer-induced overload).

The rate is computed per unit of impact (incidents, customer impact hours, severity-weighted minutes) and is the load-bearing statistic that justifies investing in deploy-safety infrastructure over alternative reliability axes.
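As a minimal sketch, the rate is the same computation under different impact weights; only the weight function changes between counting incidents and counting customer-impact hours. The incident log and its numbers below are illustrative, not Slack's data:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    change_triggered: bool  # causal attribution from incident review
    impact_hours: float     # customer impact hours

# Hypothetical incident log for illustration.
incidents = [
    Incident(True, 4.0),   # bad deploy
    Incident(True, 1.5),   # config rollout
    Incident(False, 0.5),  # upstream dependency failure
    Incident(False, 2.0),  # hardware fault
]

def change_triggered_rate(incidents, weight=lambda i: 1.0):
    """Fraction of total impact attributed to self-induced change."""
    total = sum(weight(i) for i in incidents)
    change = sum(weight(i) for i in incidents if i.change_triggered)
    return change / total if total else 0.0

by_count = change_triggered_rate(incidents)                            # 2/4 = 0.5
by_hours = change_triggered_rate(incidents, lambda i: i.impact_hours)  # 5.5/8 = 0.6875
```

Note that the two weightings disagree even on this tiny log, which is why the choice of impact unit matters.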

Canonical disclosure

Slack's 2025-10-07 Deploy Safety retrospective canonicalises the metric, quoted verbatim (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

"The increasing majority (73%) of customer facing incidents were triggered by Slack-induced change, particularly code deploys."

The 73% figure is the load-bearing input to Slack's subsequent investment decision: rather than invest in, e.g., capacity planning, chaos engineering, dependency hardening, or customer-throttling controls, Slack chose to invest in deploy safety because the "majority" of impact flowed from self-induced change.

Why the rate matters

The distribution of incident causes governs where reliability investment should go. Three canonical shapes:

  1. Change-dominant (Slack 2023: 73%). Invest in deploy safety, rollback automation, metrics-based rollouts, staged rollouts.
  2. External-dominant (classical early-stack). Invest in resilience to dependency failure, graceful degradation, load shedding.
  3. Capacity-dominant. Invest in forecasting, overprovision, auto-scaling, traffic admission control.
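The shape can be read off a per-cause impact tally: whichever category holds the largest share of severity-weighted impact is where the next unit of reliability investment goes. A sketch, with illustrative numbers (not Slack's):

```python
# Hypothetical severity-weighted customer-impact hours per cause category.
impact_by_cause = {"change": 73.0, "external": 18.0, "capacity": 9.0}

total = sum(impact_by_cause.values())
shares = {cause: hours / total for cause, hours in impact_by_cause.items()}

# The dominant category drives the investment decision.
dominant = max(impact_by_cause, key=impact_by_cause.get)  # "change"
```

A change-dominant share like this one points at deploy safety; an external-dominant share would point at graceful degradation instead.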

Most organisations at scale drift change-dominant as their platform matures: external-dependency hardening is front-loaded during scale-up; capacity is handled by load shedding and auto-scaling; what remains is the organisation's own changes, which accumulate faster as feature velocity compounds.

Measurement considerations

  • Classification ambiguity. "Change-triggered" requires post-hoc incident analysis. Some incidents trigger during a deploy but would have happened anyway (a latent bug surfaced by timing, not caused by the code). Others are triggered by a config change from a different team's pipeline: should the incident be attributed to the deploying team or the platform team?
  • Causal depth. A deploy can be a proximate trigger for a latent capacity issue. Counting it as change-triggered over-credits deploy safety; counting it as capacity-driven under-credits it.
  • Severity-weighting. Counting by incident count under-weights a single high-sev change-triggered outage against many low-sev infrastructure blips. Slack's choice was severity-filtered hours: "high severity and selected medium severity".
  • Filter selectivity. Slack's "selected" medium-severity curation is a manual judgement call, which creates reproducibility questions across quarters.
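The severity-filtering bullet above can be sketched directly. The curated medium-severity set and the incident records are illustrative assumptions; the structure mirrors Slack's stated choice of "high severity and selected medium severity" hours:

```python
# Manual curation set, standing in for Slack's "selected medium severity".
SELECTED_MEDIUM = {"INC-7"}

def counted_hours(incidents):
    """Severity-filtered customer-impact hours: all high-sev, curated medium-sev."""
    return sum(
        i["hours"] for i in incidents
        if i["sev"] == "high"
        or (i["sev"] == "medium" and i["id"] in SELECTED_MEDIUM)
    )

incidents = [
    {"id": "INC-1", "sev": "high",   "hours": 6.0},
    {"id": "INC-7", "sev": "medium", "hours": 2.0},  # curated in
    {"id": "INC-9", "sev": "medium", "hours": 3.0},  # curated out
    {"id": "INC-4", "sev": "low",    "hours": 0.5},  # low severity excluded
]

filtered = counted_hours(incidents)  # 6.0 + 2.0 = 8.0
```

The `SELECTED_MEDIUM` set is exactly where the reproducibility question lives: change the curation and the denominator moves.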

Relationship to wiki primitives

  • concepts/customer-impact-hours-metric — Slack's program metric composes change-triggered incident rate with severity-filtered customer-impact hours as the top-line metric.
  • concepts/dora-metrics — DORA's change failure rate is the closely-adjacent industry metric (fraction of deploys causing incidents). Change-triggered-incident-rate is the inverse framing (fraction of incidents from deploys). DORA measures deploy-originated failure upstream of incidents; this metric measures incident origin downstream of failures.
  • concepts/feedback-control-loop-for-rollouts — the canonical mitigation for the change-dominant cause distribution: measure during the change, halt when the signal moves.
  • patterns/fast-rollback — the recovery capability that bounds the incident duration given a change has triggered one.
  • patterns/staged-rollout — the blast-radius cap that bounds the incident severity of a change-triggered incident.
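The DORA relationship above is easiest to see as two ratios over the same events, with different denominators. Numbers below are illustrative:

```python
# Hypothetical counts for one measurement window.
deploys = 200                  # all deploys
incident_causing_deploys = 6   # deploys that triggered an incident
incidents_total = 10           # all customer-facing incidents
incidents_from_change = 7      # incidents causally attributed to change

# DORA change failure rate: failures per deploy (upstream framing).
change_failure_rate = incident_causing_deploys / deploys             # 0.03

# Change-triggered incident rate: change-origin share of incidents
# (downstream framing).
change_triggered_incident_rate = incidents_from_change / incidents_total  # 0.7
```

A team can have an excellent change failure rate and still be change-dominant in incident origin, which is why the two metrics answer different investment questions.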

Program-metric implication

Choosing to track change-triggered-incident-rate (or its severity-weighted hours variant) as the program metric forces the organisation to maintain per-incident causal classification rigor. Slack's retrospective implicitly makes this discipline load-bearing on the program:

  • Incident review must produce a causal attribution, reviewable after-the-fact.
  • Severity level assigned at incident open reflects expected impact, not final impact; the medium-severity filter requires "careful post-hoc analysis" per Slack.
  • Change attribution requires incident-to-deploy correlation infrastructure (deploy markers on time series is the thin version; full-causal-chain reconstruction is the rigorous version).
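The thin version of incident-to-deploy correlation can be sketched as a time-window join: flag any deploy that landed shortly before the incident opened as a candidate cause. The window size and event shapes are illustrative assumptions, not a Slack implementation:

```python
from datetime import datetime, timedelta

# Illustrative lookback window for "deploy shortly before incident open".
CORRELATION_WINDOW = timedelta(minutes=30)

def candidate_deploys(incident_start, deploys):
    """Deploys landing within the window preceding the incident open time."""
    return [
        d for d in deploys
        if timedelta(0) <= incident_start - d["at"] <= CORRELATION_WINDOW
    ]

deploys = [
    {"service": "api", "at": datetime(2025, 10, 7, 12, 5)},
    {"service": "web", "at": datetime(2025, 10, 7, 9, 0)},
]

hits = candidate_deploys(datetime(2025, 10, 7, 12, 20), deploys)
# Only the 12:05 "api" deploy falls inside the 30-minute window.
```

This is correlation, not causation; full-causal-chain reconstruction still needs the post-hoc analysis described above.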

Caveats

  • Not a complete reliability metric. The rate says nothing about duration: an organisation with a 99% change-triggered rate that recovers from each incident in minutes can deliver far less customer impact than one with a 0% rate whose external incidents run for hours.
  • Survivorship bias in low-change-frequency teams. Teams that deploy less often have a naturally lower change-triggered-incident-count; interpreting this as "better" is the exact trap DORA warns against.
  • Rate is not rate of progress. A 73% rate that becomes 50% could reflect more non-change incidents, not fewer change-triggered ones. Track both the rate and the absolute number.
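The last caveat is pure denominator arithmetic; a sketch with illustrative counts makes it concrete:

```python
# The rate can fall while change-triggered incidents do not improve at all.
before = {"change": 73, "other": 27}   # 73 / 100
after  = {"change": 73, "other": 73}   # 73 / 146: only non-change incidents grew

def rate(d):
    """Change-triggered share of all incidents."""
    return d["change"] / (d["change"] + d["other"])

# rate(before) == 0.73, rate(after) == 0.50, yet "change" is unchanged at 73.
```

Tracking the absolute `change` count alongside the rate disambiguates the two readings.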
