
PATTERN

Continuous fault injection in production

Run fault-injection drills in production, during business hours, under engineer supervision — as a continuous, automated, observable practice — rather than as a one-off game day in a staging environment. The cloud makes the drill almost free; business-hours scheduling makes post-drill learning possible.

The pattern

Netflix's 2011 TechBlog post on the Simian Army states the discipline directly:

"By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice."

Three composable rules:

  1. Induce failure in production, not staging. Only production has the real traffic patterns, real dependency graph, real Auto Scaling group (ASG) capacity, and real observability stack.
  2. Induce failure during business hours, not off-hours. The failure will happen off-hours anyway; the drill should happen when engineers can observe, debug, and architect the fix.
  3. Run continuously, not as a one-off event. Once-a-year game days don't catch regressions introduced in the intervening year. The discipline is about catching drift.
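The three rules compose into a small scheduled agent. A minimal sketch, assuming a hypothetical `terminate` callback (in practice this would wrap a cloud API such as EC2 instance termination) and a 09:00–17:00 Mon–Fri business-hours window; all names here are illustrative, not Netflix's actual implementation:

```python
import random
from datetime import datetime

def in_business_hours(now: datetime) -> bool:
    # Rule 2: inject only when engineers are present to observe and debug.
    return now.weekday() < 5 and 9 <= now.hour < 17

def pick_victim(instances: list[str], rng: random.Random) -> str:
    # Rule 1: victims come from the live production fleet, not a staging copy.
    return rng.choice(instances)

def run_drill(now: datetime, instances: list[str], terminate, rng=None):
    """One drill iteration. Rule 3: invoke this on a schedule (e.g. an
    hourly cron) so that regressions are caught as drift, not once a year."""
    rng = rng or random.Random()
    if not in_business_hours(now) or not instances:
        return None  # outside the window, or nothing to kill
    victim = pick_victim(instances, rng)
    terminate(victim)
    return victim
```

Passing `terminate` as a callback keeps the scheduling logic testable: a unit test can record the victim instead of actually killing an instance.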

Why business hours

  • Engineers are present to interpret unexpected failures and drive the architectural fix.
  • Observability is attended — graphs are being watched, not just alerting.
  • On-call cost is lower — at 3am the responder is half-asleep; at 2pm they're up.
  • Cross-team coordination is possible — a drill that surfaces a cross-team dependency can be fixed same-day, not filed as a Monday-morning ticket.

Why production

  • Staging doesn't have real traffic patterns — the dependency failure mode might not manifest under synthetic load.
  • Staging doesn't have real dependency counts — fewer downstream consumers means fewer timeout-cascade opportunities.
  • Staging doesn't have real ASG / LB behaviour under load — cloud provider rebalancing dynamics only show at production scale.
  • Staging drills don't validate production resilience — they validate the drill's correctness, not the production architecture's correctness.

Preconditions

Running fault injection continuously in production is safe only if:

  • The fleet is designed for graceful degradation (concepts/graceful-degradation).
  • Blast radius (concepts/blast-radius) is bounded — a misbehaving chaos agent cannot cause a cascading outage.
  • Observability + auto-abort catches customer-visible degradation and halts the drill.
  • Service teams have opted in (or opt-out mechanism exists with known consequences).
  • A kill-switch pauses all chaos agents globally in seconds.

Without these, "continuous fault injection in production" is indistinguishable from "self-induced outage."
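The auto-abort and kill-switch preconditions can be sketched as a gate that every chaos agent consults before injecting. This is an illustrative design, not a real library: `error_rate_fn` stands in for whatever observability query returns the current customer-visible error rate, and the 1% threshold is an assumed example value:

```python
class ChaosController:
    """Gates every injection behind a global kill-switch and an auto-abort
    check on customer-visible degradation."""

    def __init__(self, error_rate_fn, abort_threshold: float = 0.01):
        self._error_rate = error_rate_fn   # e.g. a metrics-backend query
        self._threshold = abort_threshold  # assumed 1% customer-visible errors
        self._killed = False               # global kill-switch state

    def kill(self) -> None:
        # The global kill-switch: flipped in seconds by a human or a watchdog,
        # it pauses every agent that consults this controller.
        self._killed = True

    def may_inject(self) -> bool:
        if self._killed:
            return False
        if self._error_rate() > self._threshold:
            # Auto-abort: customer-visible degradation halts the drill and
            # latches the kill-switch until a human re-arms it.
            self._killed = True
            return False
        return True
```

Latching the switch on auto-abort (rather than re-checking each cycle) matters: once a drill has caused visible degradation, resuming automatically would re-trigger the same failure.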

Canonical instances

Contrast: game days

A game day is a scheduled, human-operated, announced chaos drill — typically involving a cross-team tabletop + actual failure injection. Continuous fault injection in production is the automated, ongoing cousin. The two are complementary:

  • Game days cover complex, low-frequency, high-coordination scenarios (region loss, DNS outage, backup-restore exercise).
  • Continuous fault injection covers the routine, per-service failure modes.

Netflix's Simian Army is the continuous kind; AWS Game Days and Chaos Kong run-throughs are the episodic kind.
