
PATTERN

Continuous fault injection in production

Run fault-injection drills in production, during business hours, under engineer supervision — as a continuous, automated, observable practice — rather than as a one-off game day in a staging environment. The cloud makes the drill almost free; business-hours scheduling makes post-drill learning possible.

The pattern

Netflix's 2011 TechBlog post on the Simian Army states the discipline directly:

"By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice."

Three composable rules:

  1. Induce failure in production, not staging. Only production has the real traffic patterns, real dependency graph, real Auto Scaling group (ASG) capacity, and real observability stack.
  2. Induce failure during business hours, not off-hours. The failure will happen off-hours anyway; the drill should happen when engineers can observe, debug, and architect the fix.
  3. Run continuously, not as a one-off event. Once-a-year game days don't catch regressions introduced in the intervening year. The discipline is about catching drift.
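The three rules compose into a small scheduled agent. A minimal sketch, assuming a hypothetical `terminate` callback (in practice this would wrap a cloud API such as EC2 instance termination) and a 09:00–17:00 Mon–Fri business-hours window; all names here are illustrative, not Netflix's actual implementation:

```python
import random
from datetime import datetime

def in_business_hours(now: datetime) -> bool:
    # Rule 2: inject only when engineers are present to observe and debug.
    return now.weekday() < 5 and 9 <= now.hour < 17

def pick_victim(instances: list[str], rng: random.Random) -> str:
    # Rule 1: victims come from the live production fleet, not a staging copy.
    return rng.choice(instances)

def run_drill(now: datetime, instances: list[str], terminate, rng=None):
    """One drill iteration. Rule 3: invoke this on a schedule (e.g. an
    hourly cron) so that regressions are caught as drift, not once a year."""
    rng = rng or random.Random()
    if not in_business_hours(now) or not instances:
        return None  # outside the window, or nothing to kill
    victim = pick_victim(instances, rng)
    terminate(victim)
    return victim
```

Passing `terminate` as a callback keeps the scheduling logic testable: a unit test can record the victim instead of actually killing an instance.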

Why business hours

  • Engineers are present to interpret unexpected failures and drive the architectural fix.
  • Observability is attended — graphs are being watched, not just alerting.
  • On-call cost is lower — at 3am the responder is half-asleep; at 2pm they're up.
  • Cross-team coordination is possible — a drill that surfaces a cross-team dependency can be fixed same-day, not filed as a Monday-morning ticket.

Why production

  • Staging doesn't have real traffic patterns — the dependency failure mode might not manifest under synthetic load.
  • Staging doesn't have real dependency counts — fewer downstream consumers means fewer timeout-cascade opportunities.
  • Staging doesn't have real ASG / LB behaviour under load — cloud provider rebalancing dynamics only show at production scale.
  • Staging drills don't validate production resilience — they validate the drill's correctness, not the production architecture's correctness.

Preconditions

Running fault injection continuously in production is safe only if:

  • The fleet is designed for graceful degradation (concepts/graceful-degradation).
  • Blast radius (concepts/blast-radius) is bounded — a misbehaving chaos agent cannot cause a cascading outage.
  • Observability + auto-abort catches customer-visible degradation and halts the drill.
  • Service teams have opted in (or opt-out mechanism exists with known consequences).
  • A kill-switch pauses all chaos agents globally in seconds.

Without these, "continuous fault injection in production" is indistinguishable from "self-induced outage."
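The auto-abort and kill-switch preconditions can be sketched as a gate that every chaos agent consults before injecting. This is an illustrative design, not a real library: `error_rate_fn` stands in for whatever observability query returns the current customer-visible error rate, and the 1% threshold is an assumed example value:

```python
class ChaosController:
    """Gates every injection behind a global kill-switch and an auto-abort
    check on customer-visible degradation."""

    def __init__(self, error_rate_fn, abort_threshold: float = 0.01):
        self._error_rate = error_rate_fn   # e.g. a metrics-backend query
        self._threshold = abort_threshold  # assumed 1% customer-visible errors
        self._killed = False               # global kill-switch state

    def kill(self) -> None:
        # The global kill-switch: flipped in seconds by a human or a watchdog,
        # it pauses every agent that consults this controller.
        self._killed = True

    def may_inject(self) -> bool:
        if self._killed:
            return False
        if self._error_rate() > self._threshold:
            # Auto-abort: customer-visible degradation halts the drill and
            # latches the kill-switch until a human re-arms it.
            self._killed = True
            return False
        return True
```

Latching the switch on auto-abort (rather than re-checking each cycle) matters: once a drill has caused visible degradation, resuming automatically would re-trigger the same failure.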

Canonical instances

Contrast: game days

A game day is a scheduled, human-operated, announced chaos drill — typically involving a cross-team tabletop + actual failure injection. Continuous fault injection in production is the automated, ongoing cousin. The two are complementary:

  • Game days cover complex, low-frequency, high-coordination scenarios (region loss, DNS outage, backup-restore exercise).
  • Continuous fault injection covers the routine, per-service failure modes.

Netflix's Simian Army is the continuous kind; AWS Game Days and Chaos Kong run-throughs are the episodic kind.
