
PATTERN

Automated detect-remediate within 10 minutes

Problem

Customer interruptions become qualitatively worse past a threshold — for Slack, "about 10 minutes" — at which point customers stop treating the event as a "blip" and start treating it as an outage. Any change-triggered incident that resolves in under 10 minutes is below the customer's disruption threshold; any incident that lasts longer is above it.

A mean-time-to-remediate (MTTR) target of 10 minutes is not achievable with human-mediated remediation: by the time an on-call engineer is paged, reads the alert, logs in, identifies the change, and triggers rollback, the 10 minutes are gone. At best, human remediation operates at a 20-30 minute floor for change-triggered incidents.

Solution

Compose three substrate pieces to drive MTTR below 10 minutes for the automated path:

  1. Continuous deploy-rollout metric monitoring — fast (seconds-to-minutes), sensitive, customer-aligned signals watched per deploy phase.
  2. Automated rollback decision — the monitoring system triggers rollback on metric-alarm, without waiting for human on-call engagement.
  3. Fast rollback mechanism — the patterns/fast-rollback substrate that can revert to the last-known-good state within seconds once triggered.
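The three pieces compose into a single control loop. A minimal sketch in Python, assuming hypothetical `metrics_alarming` and `rollback` hooks (the names, intervals, and structure are illustrative, not from Slack's disclosure):

```python
import time

AUTO_REMEDIATE_BUDGET_S = 600  # the 10-minute automated-path target
POLL_INTERVAL_S = 15           # fast, per-deploy-phase metric polling

def watch_deploy(deploy_id, metrics_alarming, rollback,
                 budget_s=AUTO_REMEDIATE_BUDGET_S, poll_s=POLL_INTERVAL_S):
    """Piece 1 polls metrics; pieces 2+3 trigger rollback with no human in the loop."""
    started = time.monotonic()
    while time.monotonic() - started < budget_s:
        if metrics_alarming(deploy_id):   # sensitive, customer-aligned signal fired
            rollback(deploy_id)           # fast revert to last-known-good
            return "rolled_back"
        time.sleep(poll_s)
    return "promoted"                     # deploy survived the watch window
```

The key design point is that the alarm check feeds the rollback trigger directly; paging a human happens after, not before, remediation.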

Pair this with a 20-minute target for the manual path, for cases where automation is not available (first deploys of new services, services not yet onboarded to the automated pipeline, decisions that require human judgement).

Canonical disclosure

Slack's 2025-10-07 Deploy Safety retrospective canonicalises the pattern; verbatim (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

*"Reducing impact time from deployments

  • Automated detection & remediation within 10 minutes
  • Manual detection & remediation within 20 minutes"*

Motivation verbatim:

"we also received customer feedback that interruptions became more disruptive after about 10 minutes – something they would treat as a 'blip'"

Evidence of the automation vs manual delta verbatim:

"What we needed was automatic instead of manual remediation. Once automatic rollbacks were introduced we observed dramatic improvement in results."

Composition structure

The 10/20 minute pair decomposes MTTR into two regimes with different capability requirements:

10-minute automated regime (fully-instrumented services):

  • Metric monitoring infrastructure in place, pre-deploy.
  • Deploy phases fine-grained enough that signal appears within a few minutes.
  • Rollback trigger wired to metric alarm.
  • Rollback mechanism tested and routine.

20-minute manual regime (human-in-the-loop services):

  • Incident detection may be on-call page, customer report, or dashboard check.
  • Rollback decision requires human judgement.
  • Rollback mechanism must be fast enough that 20 minutes is achievable.
  • Manual-rollback tooling must be usable — see the fluency requirement below.
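The two regimes read as a capability checklist: a service earns the 10-minute target only when every automated-regime requirement holds. A hypothetical sketch (flag names and the dict shape are assumptions for illustration):

```python
AUTOMATED_TARGET_MIN = 10
MANUAL_TARGET_MIN = 20

def remediation_target(service):
    """Map a service's substrate capabilities to its MTTR regime and target."""
    automated_ready = all((
        service.get("metric_monitoring", False),       # in place pre-deploy
        service.get("fine_grained_phases", False),     # signal within minutes
        service.get("alarm_wired_to_rollback", False), # trigger wired to alarm
        service.get("rollback_tested_routine", False), # mechanism exercised
    ))
    if automated_ready:
        return ("automated", AUTOMATED_TARGET_MIN)
    return ("manual", MANUAL_TARGET_MIN)
```

Any missing capability drops the service to the 20-minute manual regime; there is no intermediate target.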

Required sub-investments

Slack's disclosed investment sequence for Webapp backend, which reached the 10-minute auto regime:

  1. Q1: "Engineer automatic metric monitoring."
  2. Q2: "Confirm customer-impact alignment via automatic alerts and manual rollback actions." (20-min manual regime first.)
  3. Q3-Q4: "Invest in automatic deployments and rollback."
  4. Q4+: "Prove success with many automatic rollbacks keeping customer impact below 10 minutes."

The ordering matters: monitoring first, manual-rollback second, automatic-rollback third. Trying to wire automatic rollback before proving the alert→manual-rollback loop works leads to automation-that-over-fires or automation-that-misses.
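The ordering can be made an explicit gate: enable automatic rollback only after the alert-to-manual-rollback loop has demonstrably fired correctly. A hypothetical sketch; the thresholds and record shape are illustrative, not Slack's:

```python
def ready_for_auto_rollback(manual_history, min_reps=10, min_precision=0.9):
    """manual_history: (alert_fired, rollback_was_correct) pairs from the manual regime."""
    alert_driven = [correct for fired, correct in manual_history if fired]
    if len(alert_driven) < min_reps:
        return False  # too few alert->rollback reps to trust the signal
    precision = sum(alert_driven) / len(alert_driven)
    # low precision means the wired-up automation would over-fire
    return precision >= min_precision
```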

Fluency requirement

Slack's retrospective explicitly calls out that the manual substrate must be exercised often, not reserved for incidents. Verbatim:

"Use the tooling often, not just for the infrequent worst case scenarios. Incidents are stressful and we found that without frequent use to build fluency, confidence, and comfort the processes and tools won't become routine. It would be as if you didn't build the tool/capability in the first place."

This is a canonical sibling to patterns/always-be-failing-over-drill (PlanetScale failover drills) and concepts/chaos-engineering — the underlying discipline is practice-the-recovery-path-often-enough-that-it-becomes-routine.

Relationship to wiki primitives

Operational numbers disclosed

Slack's retrospective reports a 90% reduction in customer impact hours from change-triggered incidents from peak (Feb-Apr 2024) to Jan 2025, with the automatic-rollback introduction named as the load-bearing "dramatic" inflection point. The per-project delta for the 10-min auto target specifically is not disclosed.

Caveats

  • The 10-minute number is organisation-specific. Slack's customer-feedback research named 10 minutes; another product with different customer expectations might have 5 minutes (live collaboration) or 30 minutes (async tooling).
  • Automated rollback is not always safe. Schema migrations, data writes, and state changes may not be rollback-able without data loss. The 10-min auto target applies to deploys where rollback is reversible; other changes need compensating mechanisms (feature flags, backward-compatibility, expand/migrate/contract — see patterns/expand-migrate-contract).
  • Rollback can itself cause incidents. A rollback to a previous version that has since accumulated forward-only state dependencies can trigger a second incident.
  • MTTR-below-10-min targets must be validated. A program that claims 10-min MTTR needs to track the actual distribution, not just the mean: a bimodal distribution with half of incidents at 2 minutes and half at 30 minutes has a mean of 16 minutes, yet half of all incidents blow well past the 10-minute threshold.
  • Automation has its own failure modes. The auto-remediation path must itself be monitored — a rollback pipeline that silently stops working because of a config change is worse than a working manual path.
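The distribution caveat can be made concrete with a few lines of arithmetic. A toy sample (the numbers are illustrative, not Slack's data):

```python
import statistics

# Bimodal MTTR sample: half the incidents resolve in 2 minutes, half in 30.
mttr_minutes = [2] * 10 + [30] * 10

mean_mttr = statistics.mean(mttr_minutes)  # 16.0: the mean alone looks tolerable
within_target = sum(m <= 10 for m in mttr_minutes) / len(mttr_minutes)
# within_target == 0.5: only half the incidents actually meet the 10-minute
# target, so track the fraction within target (or a high percentile), not the mean.
```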

Seen in
