CONCEPT
Change-triggered incident rate¶
Definition¶
Change-triggered incident rate is the fraction of an organisation's customer-facing incidents that are caused by the organisation's own changes — code deploys, config rollouts, schema migrations, infra mutations — as opposed to external causes (hardware faults, network issues, dependency failures, traffic spikes, customer-induced overload).
The rate is computed per unit of impact (incidents, customer impact hours, severity-weighted minutes) and is the load-bearing statistic that justifies investing in deploy-safety infrastructure over alternative reliability axes.
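A minimal sketch of the computation, assuming hypothetical incident records with a binary change/external cause label and per-incident customer-impact hours (both field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    cause: str           # "change" (self-induced) or "external"
    impact_hours: float  # customer-impact hours attributed to the incident

def change_triggered_rate(incidents, unit="count"):
    # unit="count" weights each incident equally;
    # unit="impact_hours" weights each incident by customer-impact hours.
    weight = (lambda i: 1.0) if unit == "count" else (lambda i: i.impact_hours)
    total = sum(weight(i) for i in incidents)
    change = sum(weight(i) for i in incidents if i.cause == "change")
    return change / total if total else 0.0

incidents = [
    Incident("change", 4.0),
    Incident("change", 1.0),
    Incident("external", 2.0),
]
print(round(change_triggered_rate(incidents), 2))                       # 0.67
print(round(change_triggered_rate(incidents, unit="impact_hours"), 2))  # 0.71
```

The two units can disagree, which is why the choice of impact unit is part of the metric's definition, not an implementation detail.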
Canonical disclosure¶
Slack's 2025-10-07 Deploy Safety retrospective canonicalises the metric; verbatim (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):
"The increasing majority (73%) of customer facing incidents were triggered by Slack-induced change, particularly code deploys."
The 73% ratio is the load-bearing input to Slack's subsequent investment decision — rather than invest in, e.g., capacity planning, chaos engineering, dependency hardening, or customer-throttling controls, Slack chose to invest in deploy-safety because the "majority" of impact flowed from self-induced change.
Why the rate matters¶
The distribution of incident causes governs where reliability investment should go. Three canonical shapes:
- Change-dominant (Slack 2023: 73%). Invest in deploy safety, rollback automation, metrics-based rollouts, staged rollouts.
- External-dominant (classical early-stack). Invest in resilience to dependency failure, graceful degradation, load shedding.
- Capacity-dominant. Invest in forecasting, overprovision, auto-scaling, traffic admission control.
Most organisations at scale drift change-dominant as their platform matures: external-dependency hardening is front-loaded during scale-up; capacity is load-shed or auto-scaled away as a cause; what's left is the organisation's own changes — which accumulate faster as feature velocity compounds.
Measurement considerations¶
- Classification ambiguity. "Change-triggered" requires post-hoc incident analysis. Some incidents trigger during a deploy but would have happened anyway (a latent bug surfaced by timing, not caused by the code). Others are triggered by a config change from a different team's pipeline — should they be attributed to the deploying team or to the platform team?
- Causal depth. A deploy can be a proximate trigger for a latent capacity issue. Counting it as change-triggered over-credits deploy safety; counting it as capacity-driven under-credits it.
- Severity-weighting. Counting incidents equally under-weights a single high-severity change-triggered outage against many low-severity infrastructure blips. Slack's choice was severity-filtered hours: "high severity and selected medium severity".
- Filter selectivity. Slack's "selected" medium-severity curation is a manual judgement call, which creates reproducibility questions across quarters.
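The severity-filtered variant above can be sketched as follows; the incident IDs, severities, hours, and the `selected_mediums` set are all hypothetical, with the manual curation step made explicit:

```python
# Hypothetical incident records; severity 1 = high, 2 = medium, 3 = low.
incidents = [
    {"id": "INC-1", "severity": 1, "cause": "change",   "impact_hours": 6.0},
    {"id": "INC-2", "severity": 2, "cause": "change",   "impact_hours": 2.0},
    {"id": "INC-3", "severity": 2, "cause": "external", "impact_hours": 3.0},
    {"id": "INC-4", "severity": 3, "cause": "change",   "impact_hours": 0.5},
]

# The manual judgement call: which medium-severity incidents count.
selected_mediums = {"INC-3"}

def in_scope(inc):
    # All high-severity incidents, plus the curated medium-severity set.
    return inc["severity"] == 1 or inc["id"] in selected_mediums

scoped = [i for i in incidents if in_scope(i)]
change_hours = sum(i["impact_hours"] for i in scoped if i["cause"] == "change")
total_hours = sum(i["impact_hours"] for i in scoped)
print(change_hours, total_hours, round(change_hours / total_hours, 2))  # 6.0 9.0 0.67
```

Swapping which medium-severity incidents land in `selected_mediums` moves the rate, which is exactly the reproducibility concern: the metric depends on a curation set that must be held consistent across quarters.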
Relationship to wiki primitives¶
- concepts/customer-impact-hours-metric — Slack's program metric composes change-triggered incident rate with severity-filtered customer-impact hours as the top-line metric.
- concepts/dora-metrics — DORA's change failure rate is the closely-adjacent industry metric (fraction of deploys causing incidents). Change-triggered-incident-rate is the inverse framing (fraction of incidents from deploys). DORA measures deploy-originated failure upstream of incidents; this metric measures incident origin downstream of failures.
- concepts/feedback-control-loop-for-rollouts — the canonical mitigation for the change-dominant cause distribution: measure during the change, halt when the signal moves.
- patterns/fast-rollback — the recovery capability that bounds the incident duration given a change has triggered one.
- patterns/staged-rollout — the blast-radius cap that bounds the incident severity of a change-triggered incident.
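The DORA relationship can be made concrete with a hypothetical quarter (all numbers invented for illustration) — the two metrics share ingredients but divide by different denominators:

```python
# Hypothetical quarter, numbers invented for illustration.
deploys = 1000
failing_deploys = 20    # deploys that caused a customer-facing incident
incidents = 30          # all customer-facing incidents
change_incidents = 22   # incidents traced back to a change

# DORA change failure rate: failure fraction per deploy (upstream view).
cfr = failing_deploys / deploys
# Change-triggered incident rate: change fraction per incident (downstream view).
ctir = change_incidents / incidents

print(f"change failure rate: {cfr:.0%}")              # change failure rate: 2%
print(f"change-triggered incident rate: {ctir:.0%}")  # change-triggered incident rate: 73%
```

A team can have a very low change failure rate and still be change-dominant in incident origin, which is why the two framings justify different investments.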
Program-metric implication¶
Choosing to track change-triggered-incident-rate (or its severity-weighted hours variant) as the program metric forces the organisation to maintain per-incident causal classification rigor. Slack's retrospective implicitly makes this discipline load-bearing on the program:
- Incident review must produce a causal attribution, reviewable after-the-fact.
- Severity assigned at incident open reflects estimated impact, not final impact — the medium-severity filter requires "careful post-hoc analysis" per Slack.
- Change attribution requires incident-to-deploy correlation infrastructure (deploy markers overlaid on time series is the thin version; full causal-chain reconstruction is the rigorous version).
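The thin version of incident-to-deploy correlation can be sketched as a window lookup over deploy markers; the services, timestamps, and one-hour window below are all hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical deploy markers: (service, deploy time).
deploys = [
    ("api", datetime(2025, 3, 1, 10, 0)),
    ("web", datetime(2025, 3, 1, 14, 30)),
]
WINDOW = timedelta(hours=1)

def candidate_trigger(incident_start):
    # Most recent deploy that precedes the incident within WINDOW.
    # This surfaces a *candidate* trigger only; confirming causality
    # still requires post-hoc incident review.
    preceding = [(svc, t) for svc, t in deploys
                 if t <= incident_start <= t + WINDOW]
    return max(preceding, key=lambda d: d[1], default=None)

print(candidate_trigger(datetime(2025, 3, 1, 14, 41)))  # ('web', ...)
print(candidate_trigger(datetime(2025, 3, 2, 9, 0)))    # None
```

The window size is itself a classification knob: too wide and unrelated deploys get blamed, too narrow and slow-onset change incidents are missed.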
Caveats¶
- Not a complete reliability metric. The rate carries no duration information: a 0% change-triggered rate with unmitigated incident durations can be worse for customers than a 99% rate whose incidents are cut to a hundredth of their duration.
- Survivorship bias in low-change-frequency teams. Teams that deploy less often have a naturally lower change-triggered-incident-count; interpreting this as "better" is the exact trap DORA warns against.
- Rate is not rate of progress. A 73% rate that becomes 50% could reflect more non-change incidents, not fewer change-triggered ones. Track both the rate and the absolute number.
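The denominator effect in the last caveat, as a worked example (numbers hypothetical):

```python
# Quarter 1: 73 change-triggered incidents, 27 others -> 73% rate.
# Quarter 2: the same 73 change-triggered incidents, but 73 others -> 50% rate,
# with zero improvement in change safety.
q1 = 73 / (73 + 27)
q2 = 73 / (73 + 73)
print(q1, q2)  # 0.73 0.5
```

The rate halved while the absolute change-triggered count stayed flat, which is why both numbers need to be tracked.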
Seen in¶
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — 73% of Slack's customer-facing incidents were change-triggered at program start (mid-2023); primary justification for the 18-month Deploy Safety Program.