PATTERN Cited by 4 sources
Fast rollback
Ability to revert a change to a known-good state quickly — ideally within seconds — without re-running the full CI/CD pipeline. Paired with staged-rollout to bound and then undo the blast radius of a bad change.
Design implications
- Previous versions must be retrievable. The system must retain the last-known-good state (config value, binary, schema) and its associated metadata.
- Rollback path cannot require the same approvals as rollout. If rollback goes through the same PR+review+CI path as forward rollout, it's not fast. Config platforms commonly expose a UI "emergency flow" that skips normal gates — trading audit-strictness for minutes-to-mitigate.
- Auditability survives the bypass. Emergency rollbacks should still be logged and reviewable; "auditable after the fact" is the usual compromise.
- Idempotent, safe re-apply. If rollback is triggered mid-rollout, the system should handle subscribers at mixed states correctly.
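
The four implications above can be sketched together in a few lines. This is a minimal illustration, not any cited platform's implementation: `ConfigStore`, `rollout`, `deliver`, and `emergency_rollback` are hypothetical names, and the `approved_by` check is a stand-in for a real PR, review, and CI gate.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConfigStore:
    """Hypothetical in-memory config store used only to illustrate the pattern."""
    versions: list = field(default_factory=list)     # every applied value is retained
    current: int = -1                                # index of the target version
    last_known_good: int = -1                        # index of the previous target
    subscribers: dict = field(default_factory=dict)  # subscriber name -> version index
    audit_log: list = field(default_factory=list)    # append-only change record

    def _log(self, action, actor, detail):
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "actor": actor,
            "detail": detail,
        })

    def rollout(self, value, approved_by):
        # Forward path: gated on an approver (assumed stand-in for review + CI).
        if not approved_by:
            raise PermissionError("forward rollout requires an approver")
        self.versions.append(value)
        self.last_known_good = self.current
        self.current = len(self.versions) - 1
        self._log("rollout", approved_by, f"v{self.current}")
        return self.current

    def deliver(self, subscriber):
        # Point one subscriber at the current target version.
        self.subscribers[subscriber] = self.current

    def emergency_rollback(self, actor, reason):
        # Rollback path: no approval gate, but still recorded in the audit log.
        if self.last_known_good < 0:
            raise RuntimeError("no last-known-good version retained")
        self.current = self.last_known_good
        # Assign an absolute target, so subscribers at mixed versions converge
        # and repeating the rollback changes nothing further.
        for name in self.subscribers:
            self.subscribers[name] = self.current
        self._log("emergency_rollback", actor, reason)
        return self.versions[self.current]


store = ConfigStore()
store.rollout({"timeout_ms": 500}, approved_by="reviewer")
store.deliver("host-a")
store.deliver("host-b")
store.rollout({"timeout_ms": 5}, approved_by="reviewer")
store.deliver("host-a")  # mid-rollout: host-a is on v1, host-b is still on v0
restored = store.emergency_rollback(actor="oncall", reason="latency spike")
```

Because the rollback assigns a target version rather than applying a "step back one" delta, triggering it a second time leaves every subscriber where it already is; only the audit log grows.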
Seen in
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — Fully-automated altitude variant. Slack's 18-month Deploy Safety Program canonicalises fast rollback as the rollback-mechanism sub-requirement for the [[patterns/automated-detect-remediate-within-10-minutes|10-min automated MTTR]] goal. "What we needed was automatic instead of manual remediation. Once automatic rollbacks were introduced we observed dramatic improvement in results." The fully-automated altitude — where the rollback trigger is metric-alarm-driven, not human-judgement-driven — contrasts with Airbnb Sitar's human-mediated UI emergency-bypass variant. Load-bearing on Slack's 90% reduction in customer impact hours from change-triggered incidents. Pattern was generalised from Webapp backend to Webapp frontend, then systematised across substrates via the centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines.
- sources/2026-02-18-airbnb-sitar-dynamic-configuration — Airbnb Sitar treats fast rollback as a first-class control-plane feature. Each rollout stage evaluates and can trigger rollback; for true emergencies, sitar-portal offers a UI bypass of CI/CD, with full audit logs.
- sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — two-phase network-fleet instance. BGP re-announcement on revert at 22:20 UTC was near-instant: prefixes were back in the global routing table almost immediately. Server-side IP bindings were slow-by-design: ~23% of edge servers had been reconfigured to drop bindings during the outage and had to go back through the change-management progressive rollout, normally a multi-hour operation. Cloudflare accelerated the rollout after validating in testing locations, restoring normal traffic at 22:54 UTC.
- sources/2026-01-19-cloudflare-what-came-first-the-cname-or-the-a-record — single-commit single-path instance. 8 min from incident-declaration (18:19 UTC) to revert-start (18:27 UTC) on the [[systems/cloudflare-1-1-1-1-resolver|1.1.1.1]] CNAME-ordering regression. Full fleet-wide revert took 88 more minutes (18:27 → 19:55 UTC) because the revert still had to propagate through Cloudflare's normal change-management pipeline. Fast declaration-to-revert was enabled by the change being a single-commit refactor (PartialChain::fill_cache in the cache-merge path) on a single code path; contrast with the 2025-07-14 1.1.1.1 outage, where the revert also had to back out in-flight server-side IP-binding changes.
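
The metric-alarm-driven variant described in the Slack entry above, where a health signal rather than a human decides to revert, can be sketched as a staged rollout loop. Everything here is an assumption made for illustration: the function name `staged_rollout`, the callback parameters, the stage fractions, and the 1% error threshold are hypothetical and are not taken from any of the cited sources.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    apply_fraction: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,  # illustrative threshold, not a recommended value
) -> bool:
    """Advance stage by stage; revert automatically when the metric breaches."""
    for fraction in stages:
        apply_fraction(fraction)
        # The health check runs after every stage, so a bad change is caught
        # while its blast radius is still bounded by the current fraction.
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True


# Simulated environment: the change starts failing once 25% of traffic sees it.
state = {"fraction": 0.0, "rolled_back": False}


def apply_fraction(fraction: float) -> None:
    state["fraction"] = fraction


def error_rate() -> float:
    return 0.002 if state["fraction"] < 0.25 else 0.08


def rollback() -> None:
    state["fraction"] = 0.0
    state["rolled_back"] = True


completed = staged_rollout([0.01, 0.10, 0.25, 1.0], apply_fraction, error_rate, rollback)
```

In this simulation the loop stops at the 25% stage and calls `rollback` without waiting for a person; a human-mediated variant would instead surface the breach and leave the decision to an operator.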
Related
- patterns/staged-rollout
- patterns/progressive-configuration-rollout
- patterns/emergency-bypass
- patterns/automated-detect-remediate-within-10-minutes
- patterns/centralised-deployment-orchestration-across-systems
- systems/slack-deploy-safety-program
- systems/slack-releasebot
- concepts/feedback-control-loop-for-rollouts