
CONCEPT Cited by 1 source

Drill muscle memory

An operational-readiness discipline where an organisation regularly exercises its emergency procedures — break-glass authorisation, backup communication channels, failover procedures, disaster-recovery scripts — so that when an incident strikes, operators can execute the procedure under pressure without having to consult docs, re-learn the tooling, or discover that the procedure itself has drifted.

The phrase "muscle memory" is Cloudflare's vocabulary, from the 2026-05-01 post "Code Orange: Fail Small is complete":

While automation keeps these pathways functional, drills like these ensure our engineers have the muscle memory to use them under pressure.

Why automation isn't sufficient

Automated tests and CI can verify that the emergency pathway works in the abstract — the backup-authorisation flow grants access, the emergency script produces the right output, the failover script fails traffic over. They cannot verify:

  • The operators know the pathway exists under incident pressure.
  • The operators can find the documentation when their normal dashboards are down.
  • The operators recognise the signal that the normal pathway has failed and the emergency pathway is needed.
  • The organisational coordination around the emergency (paging, decision authority, comms) actually functions.
  • The handoffs across time zones and teams work when someone is unavailable.

All of those are human-in-the-loop failure modes that only exercise can surface. The drill is a test of organisational readiness, not of the procedure itself.

Two tiers of drill

  1. Small-team exercises. A single team practises its runbook against a scenario. Catches team-local gaps (docs out of date, tooling broken, contacts stale).
  2. Engineering-wide drill. 100+ team members simulate a large-scale incident. Catches cross-team gaps (unclear ownership, handoff frictions, comms cadence) that small-team exercises don't surface.

Cloudflare's Code Orange drill sequence went through both:

After small-team exercises, we conducted an engineering-wide drill on April 7, 2026, involving more than 200 team members.

Drills are not real incidents

A load-bearing property: the drill must not cause customer impact. A well-designed backup-authorisation pathway, a well-designed drill, and fail-open module behaviour (see concepts/fail-stale) should together produce a no-op at the customer tier while still exercising operator muscle memory. If the drill risks an actual outage, the organisation is drilling the wrong thing; the drill itself should be safe to run.
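One way to make the "safe to run" property explicit is to have the drill plan declare which steps could touch customers and refuse to start if any do. The following is a minimal sketch under that assumption, not a description of Cloudflare's tooling; `DrillStep`, `run_drill`, `UnsafeDrillError`, and the step names are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DrillStep:
    name: str                       # hypothetical step label
    action: Callable[[], None]      # what the operator or script does in the drill
    customer_facing: bool = False   # True if running it could affect customers


class UnsafeDrillError(Exception):
    """Raised when a drill plan contains a step that could cause customer impact."""


def run_drill(steps: List[DrillStep]) -> Dict[str, float]:
    # Validate the whole plan before executing anything, so an unsafe
    # step late in the plan cannot leave the drill half-run.
    unsafe = [s.name for s in steps if s.customer_facing]
    if unsafe:
        raise UnsafeDrillError(f"customer-facing steps in drill plan: {unsafe}")

    # Execute each step and record how long it took, in seconds.
    timings: Dict[str, float] = {}
    for step in steps:
        start = time.monotonic()
        step.action()
        timings[step.name] = time.monotonic() - start
    return timings


# Illustrative plan; the actions are placeholders that do nothing.
plan = [
    DrillStep("request-break-glass-access", lambda: None),
    DrillStep("join-backup-comms-channel", lambda: None),
]
timings = run_drill(plan)
```

Validating before executing is the design choice that matters here: the drill either runs in full as a customer-tier no-op or does not run at all. The recorded timings give a rough measure of how quickly operators got through each step.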

Sibling framing: concepts/chaos-engineering runs controlled failure injections as a production-latent-hazard-detection discipline. Drill-muscle-memory is the human-readiness discipline that composes with it: chaos injects the event; the drill is how operators practise responding.

The alternative is latent hazard

Emergency pathways that have never been exercised are latent: they look right on paper but may not work when invoked, and the first invocation discovers the bugs. The 2025-12-05 Cloudflare incident is a canonical wiki instance of a rulesets-engine killswitch path detonating the first time it was exercised — a seven-year-old dormant code path that would have been found by the first drill if one had run.
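A simple way to keep latent pathways visible is to track when each one was last exercised by a human and flag those never drilled or drilled too long ago. The sketch below assumes a hypothetical inventory and an arbitrary 90-day threshold; the pathway names and dates are made up for illustration and do not come from the source post.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Pathway:
    name: str                        # hypothetical pathway label
    last_exercised: Optional[date]   # None means never exercised in a drill


def stale_pathways(
    pathways: List[Pathway],
    today: date,
    max_age: timedelta = timedelta(days=90),  # assumed threshold, not from the source
) -> List[str]:
    """Return names of pathways never drilled, or not drilled within max_age."""
    stale = []
    for p in pathways:
        if p.last_exercised is None or today - p.last_exercised > max_age:
            stale.append(p.name)
    return stale


# Illustrative inventory with made-up entries.
inventory = [
    Pathway("break-glass-auth", date(2026, 4, 7)),
    Pathway("backup-comms", date(2025, 11, 1)),
    Pathway("killswitch", None),
]
overdue = stale_pathways(inventory, today=date(2026, 5, 1))
print(overdue)  # ['backup-comms', 'killswitch']
```

A never-exercised pathway is treated as stale regardless of age, which matches the point above: the first invocation is where the bugs are discovered, so it should happen in a drill rather than in an incident.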

Canonical wiki instance

sources/2026-05-01-cloudflare-code-orange-fail-small-complete — Cloudflare's Code Orange programme concluded with a 2026-04-07 engineering-wide drill involving 200+ team members, preceded by small-team exercises. The drill stress-tested the backup-authorisation pathways across 18 key services. The "muscle memory" framing is explicit in the post.

