
PATTERN Cited by 4 sources

Fast rollback

Ability to revert a change to a known-good state quickly — ideally within seconds — without re-running the full CI/CD pipeline. Paired with staged-rollout to bound and then undo the blast radius of a bad change.

Design implications

  • Previous versions must be retrievable. The system must retain the last-known-good state (config value, binary, schema) and its associated metadata.
  • Rollback path cannot require the same approvals as rollout. If rollback goes through the same PR+review+CI path as forward rollout, it's not fast. Config platforms commonly expose a UI "emergency flow" that skips normal gates — trading audit-strictness for minutes-to-mitigate.
  • Auditability survives the bypass. Emergency rollbacks should still be logged and reviewable; "auditable after the fact" is the usual compromise.
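  • Idempotent, safe re-apply. If rollback is triggered mid-rollout, some subscribers will already be on the new version while others are still on the old one. The system should converge all of them on the known-good state, and re-applying the rollback to a subscriber that is already there should be a no-op.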
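
The four implications above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not any platform's actual API: `ConfigStore`, `publish`, `mark_known_good`, `apply_to`, and `emergency_rollback` are hypothetical names, and state lives in memory only.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ConfigStore:
    """Hypothetical in-memory store; every name here is illustrative."""
    versions: dict[int, Any] = field(default_factory=dict)   # all retained versions
    known_good: Optional[int] = None                          # last-known-good version
    target: Optional[int] = None                              # version subscribers should run
    subscribers: dict[str, Optional[int]] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def publish(self, value: Any, approved_by: str) -> int:
        """Forward rollout: gated on a reviewer approval."""
        if not approved_by:
            raise PermissionError("forward rollout requires an approval")
        version = len(self.versions) + 1
        self.versions[version] = value          # old versions are never overwritten
        self.target = version
        self.audit_log.append(
            {"action": "publish", "version": version, "actor": approved_by}
        )
        return version

    def mark_known_good(self, version: int) -> None:
        if version not in self.versions:
            raise KeyError(version)
        self.known_good = version

    def apply_to(self, subscriber: str) -> bool:
        """Converge one subscriber on the target; returns False if already there."""
        if self.subscribers.get(subscriber) == self.target:
            return False
        self.subscribers[subscriber] = self.target
        return True

    def emergency_rollback(self, actor: str, reason: str) -> int:
        """Bypass path: no approval gate, but always logged for later review."""
        if self.known_good is None:
            raise RuntimeError("no known-good version retained")
        self.target = self.known_good
        self.audit_log.append(
            {"action": "emergency_rollback", "version": self.known_good,
             "actor": actor, "reason": reason}
        )
        return self.known_good


store = ConfigStore()
v1 = store.publish({"timeout_ms": 500}, approved_by="reviewer")
store.mark_known_good(v1)
for name in ("a", "b", "c"):
    store.apply_to(name)

store.publish({"timeout_ms": 5}, approved_by="reviewer")
store.apply_to("a")                      # rollout has reached only "a"
store.emergency_rollback(actor="oncall", reason="error spike")
for name in ("a", "b", "c"):
    store.apply_to(name)                 # only "a" changes; "b" and "c" are no-ops
```

The rollback only moves the `target` pointer back to a retained version, so it needs no rebuild, and `apply_to` compares state rather than replaying steps, which is what makes a mid-rollout revert safe to repeat.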

Seen in

  • sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — fully automated variant. Slack's 18-month Deploy Safety Program canonicalises fast rollback as the rollback-mechanism sub-requirement for the 10-min automated MTTR goal (patterns/automated-detect-remediate-within-10-minutes). "What we needed was automatic instead of manual remediation. Once automatic rollbacks were introduced we observed dramatic improvement in results." In this variant the rollback trigger is a metric alarm rather than human judgement, in contrast with Airbnb Sitar's human-mediated UI emergency-bypass variant. It is load-bearing for Slack's 90% reduction in customer impact hours from change-triggered incidents. The pattern was generalised from the Webapp backend to the Webapp frontend, then systematised across substrates via the centralised deployment orchestration system inspired by ReleaseBot and AWS Pipelines.
  • sources/2026-02-18-airbnb-sitar-dynamic-configuration — Airbnb Sitar treats fast rollback as a first-class control-plane feature. Each rollout stage evaluates and can trigger rollback; for true emergencies, sitar-portal offers a UI bypass of CI/CD, with full audit logs.
  • sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — two-phase network-fleet instance. BGP re-announcement on revert at 22:20 UTC was near-instant — prefixes were back in the global routing table almost immediately. Server-side IP bindings were slow-by-design: ~23% of edge servers had been reconfigured to drop bindings during the outage and had to go back through the change-management progressive rollout, normally a multi-hour operation. Cloudflare accelerated the rollout after validating in testing locations, restoring normal traffic at 22:54 UTC.
  • sources/2026-01-19-cloudflare-what-came-first-the-cname-or-the-a-record — single-commit single-path instance. 8 min from incident declaration (18:19 UTC) to revert start (18:27 UTC) on the 1.1.1.1 CNAME-ordering regression. The full fleet-wide revert took 88 more minutes (18:27 → 19:55 UTC) because the revert still had to propagate through Cloudflare's normal change-management pipeline. The fast declaration-to-revert time was possible because the change was a single-commit refactor (PartialChain::fill_cache in the cache-merge path) on a single code path; contrast with the 2025-07-14 1.1.1.1 outage, where the revert also had to back out in-flight server-side IP-binding changes.
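
The metric-alarm-driven trigger in the Slack entry can be illustrated with a small decision function. The sources above do not describe the actual alarm logic, so this is only a sketch under assumptions: `RollbackPolicy`, `should_roll_back`, the 5% threshold, and the three-sample streak are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RollbackPolicy:
    error_rate_threshold: float = 0.05   # hypothetical alarm threshold
    consecutive_breaches: int = 3        # hypothetical debounce window


def should_roll_back(error_rates: Iterable[float], policy: RollbackPolicy) -> bool:
    """Return True once the error rate breaches the threshold for N samples in a row."""
    streak = 0
    for rate in error_rates:
        # A healthy sample resets the streak, so one noisy spike does not trigger.
        streak = streak + 1 if rate > policy.error_rate_threshold else 0
        if streak >= policy.consecutive_breaches:
            return True
    return False


policy = RollbackPolicy()
sustained = should_roll_back([0.01, 0.08, 0.09, 0.10], policy)
transient = should_roll_back([0.01, 0.08, 0.02, 0.09, 0.01], policy)
```

Requiring a run of consecutive breaches trades a little detection latency for fewer spurious rollbacks; with short sampling intervals the decision still fits comfortably inside a minutes-scale remediation budget.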