
PATTERN Cited by 3 sources

Fast rollback

The ability to revert a change to a known-good state quickly, ideally within seconds, without re-running the full CI/CD pipeline. Paired with staged-rollout: the staged rollout bounds the blast radius of a bad change, and fast rollback undoes it.

Design implications

  • Previous versions must be retrievable. The system must retain the last-known-good state (config value, binary, schema) and its associated metadata.
  • The rollback path cannot require the same approvals as the rollout. If a rollback goes through the same PR, review, and CI path as a forward rollout, it is not fast. Config platforms commonly expose a UI "emergency flow" that skips the normal gates, trading up-front audit strictness for a shorter time to mitigate.
  • Auditability survives the bypass. Emergency rollbacks should still be logged and reviewable; "auditable after the fact" is the usual compromise.
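  • Idempotent, safe re-apply. If a rollback is triggered mid-rollout, subscribers will be at mixed versions. Applying the rollback must converge all of them on the known-good state, and re-applying it to a subscriber that is already there must be a no-op.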
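
The four implications above can be sketched together in a few lines. This is a minimal in-memory illustration, not Sitar's or any real platform's API: ConfigStore, publish, apply, and emergency_rollback are hypothetical names, and the normal review and CI gate is modeled as a simple boolean flag.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConfigStore:
    """Hypothetical config store used only to illustrate the pattern."""
    history: list = field(default_factory=list)      # every published version is retained
    audit_log: list = field(default_factory=list)    # append-only record of all changes
    subscribers: dict = field(default_factory=dict)  # subscriber name -> version it runs

    def publish(self, value, actor, approved):
        """Forward rollout: must pass the normal gate (modeled as a flag)."""
        if not approved:
            raise PermissionError("forward rollout requires approval")
        self.history.append(value)
        version = len(self.history) - 1
        self._audit("publish", actor, version)
        return version

    def apply(self, subscriber, version):
        """Idempotent: applying the version already in place is a no-op."""
        if self.subscribers.get(subscriber) == version:
            return False
        self.subscribers[subscriber] = version
        return True

    def emergency_rollback(self, actor, reason, target_version):
        """Skips the approval gate, but is always logged for later review."""
        if not 0 <= target_version < len(self.history):
            raise ValueError("target version was not retained")
        self._audit("emergency_rollback", actor, target_version, reason)
        # Converge every subscriber, whichever version it is on right now.
        return [s for s in sorted(self.subscribers) if self.apply(s, target_version)]

    def _audit(self, action, actor, version, reason=None):
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action, "actor": actor,
            "version": version, "reason": reason,
        })


store = ConfigStore()
good = store.publish({"timeout_ms": 500}, actor="alice", approved=True)
for name in ("a", "b", "c"):
    store.apply(name, good)
bad = store.publish({"timeout_ms": 5}, actor="bob", approved=True)
store.apply("a", bad)  # rollout halted partway: "a" runs bad, "b" and "c" run good
changed = store.emergency_rollback("oncall", "timeouts spiking", good)
print(changed)  # ['a'] : only the subscriber that had moved is touched
```

Calling emergency_rollback a second time with the same target changes no subscriber but still appends an audit entry, which is the "auditable after the fact" compromise in miniature.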

Seen in

  • sources/2026-02-18-airbnb-sitar-dynamic-configuration: Airbnb Sitar treats fast rollback as a first-class control-plane feature. Each rollout stage is evaluated and can trigger a rollback; for true emergencies, sitar-portal offers a UI bypass of CI/CD, with full audit logs.
  • sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025: two-phase network-fleet instance. BGP re-announcement on revert at 22:20 UTC was near-instant: prefixes were back in the global routing table almost immediately. Restoring server-side IP bindings was slow by design: ~23% of edge servers had been reconfigured to drop bindings during the outage and had to go back through the change-management progressive rollout, normally a multi-hour operation. Cloudflare accelerated the rollout after validating in testing locations, restoring normal traffic at 22:54 UTC.
  • sources/2026-01-19-cloudflare-what-came-first-the-cname-or-the-a-record: single-commit single-path instance. 8 min from incident declaration (18:19 UTC) to revert start (18:27 UTC) on the 1.1.1.1 CNAME-ordering regression. The full fleet-wide revert took 88 more minutes (18:27 → 19:55 UTC) because the revert still had to propagate through Cloudflare's normal change-management pipeline. The fast declaration-to-revert time was enabled by the change being a single-commit refactor (PartialChain::fill_cache in the cache-merge path) on a single code path; contrast with the 2025-07-14 1.1.1.1 outage, where the revert also had to back out in-flight server-side IP-binding changes.
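
The intervals in the two Cloudflare entries can be checked with a few lines of arithmetic. The timestamps are the ones quoted above; the helper name minutes_between and the phase labels are illustrative only.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# (start, end) pairs taken from the entries above.
phases = {
    "1.1.1.1 outage: revert to traffic restored": ("22:20", "22:54"),
    "CNAME regression: declaration to revert start": ("18:19", "18:27"),
    "CNAME regression: revert start to fleet-wide": ("18:27", "19:55"),
}

for label, (start, end) in phases.items():
    print(f"{label}: {minutes_between(start, end)} min")
# 1.1.1.1 outage: revert to traffic restored: 34 min
# CNAME regression: declaration to revert start: 8 min
# CNAME regression: revert start to fleet-wide: 88 min
```

In both incidents the decision to revert was quick, and most of the elapsed time went to propagating the revert through the fleet.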
Last updated · 200 distilled / 1,178 read