Random instance failure injection
Random instance failure injection is the primitive operation of chaos engineering: pick a production instance at random, kill it, and verify that the fleet survives without customer impact. The canonical implementation is Netflix's Chaos Monkey, as described in the 2011 Simian Army post.
The primitive
"A tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact." — Netflix, The Netflix Simian Army (2011).
Three essential properties:
- Random selection — not targeted. The drill exercises the architectural expectation that any instance is safe to lose, not just the one the operator thinks is safest.
- Production environment — not staging. The drill has to exercise real traffic, real dependencies, real ASG behaviour.
- Real termination — the instance actually dies (not a simulated kill). Partial kills leave the fleet in an unrepresentative state.
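The three properties above can be sketched in a few lines. This is a minimal illustration, not Chaos Monkey's actual code: the function name `inject_random_instance_failure`, the `terminate` callback, and the instance IDs are all hypothetical. In a real drill, `terminate` would wrap the cloud provider's terminate call and `instance_ids` would be read from the production fleet.

```python
import random
from typing import Callable, Optional, Sequence


def inject_random_instance_failure(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one instance uniformly at random and terminate it.

    Returns the ID of the terminated instance, or None if the fleet is empty.
    """
    if not instance_ids:
        return None
    rng = rng or random.Random()
    # Random selection: every instance is equally likely, so the operator
    # cannot steer the drill toward the instance they believe is safest.
    victim = rng.choice(list(instance_ids))
    # Real termination: delegated to the caller-supplied callback
    # (hypothetical here; it stands in for a cloud API call).
    terminate(victim)
    return victim


# Usage with an in-memory stand-in for the fleet (hypothetical IDs).
fleet = ["i-0a", "i-0b", "i-0c", "i-0d"]
terminated = []
victim = inject_random_instance_failure(fleet, terminated.append, random.Random(7))
```

Passing the random generator in as a parameter keeps the selection reproducible in tests while leaving production runs unseeded.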
Why random over targeted
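Targeted fault injection tests a specific hypothesis ("we think losing this one instance is safe because …"). Random fault injection tests a fleet-wide invariant ("we claim losing any instance is safe"). The invariant is the stronger claim: it implies every targeted hypothesis, while no single targeted drill implies it. Random selection samples the invariant over repeated runs rather than proving it in one.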
Targeted injection is still useful — it's what later Netflix tools (FIT, ChAP) and the Principles of Chaos Engineering literature emphasise for hypothesis-driven drills. But the random primitive is the foundational one, and the one Chaos Monkey implements.
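A small worked calculation shows why repeated random drills probe the fleet-wide invariant. It rests on simplifying assumptions that are not from the Netflix post: `unsafe` of the `fleet_size` instances are secretly unsafe to lose, each drill picks uniformly at random, and the fleet is restored between drills so picks are independent. Under those assumptions, the chance that at least one of `drills` runs hits an unsafe instance is 1 - (1 - unsafe/fleet_size)^drills. The function name is hypothetical.

```python
def detection_probability(fleet_size: int, unsafe: int, drills: int) -> float:
    """Probability that at least one of `drills` uniform random kills
    lands on one of the `unsafe` instances in a fleet of `fleet_size`.

    Assumes independent drills against a fleet restored to full size.
    """
    if fleet_size <= 0 or not 0 <= unsafe <= fleet_size or drills < 0:
        raise ValueError("need fleet_size > 0, 0 <= unsafe <= fleet_size, drills >= 0")
    miss_once = 1.0 - unsafe / fleet_size  # one drill misses every unsafe instance
    return 1.0 - miss_once ** drills


# One unsafe instance hidden in a fleet of 100.
after_one = detection_probability(100, 1, 1)
after_hundred = detection_probability(100, 1, 100)
```

With one unsafe instance in a fleet of 100, a single drill finds it about 1% of the time and 100 drills find it roughly 63% of the time. A targeted drill against an instance already believed safe would, under the same assumptions, never surface it.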
Failure-mode coverage
Random instance kill covers a common cloud failure mode, the loss of a single instance (EC2 instance health degradation, hardware failure, or a maintenance event), but does not cover:
- AZ-level failure — requires concepts/availability-zone-failure-drill / systems/netflix-chaos-gorilla.
- Dependency degradation — requires latency injection (see systems/netflix-latency-monkey).
- Grey failure (concepts/grey-failure) — instances that are degraded but not dead; the binary instance-kill primitive doesn't reproduce the partial-health failure mode.
- Control-plane failure — random-instance-kill doesn't exercise failure of the orchestration plane itself.
Random instance kill is the first primitive, not the only one. The Simian Army shape is the generalisation — a family of failure-mode primitives, each narrowly scoped.
Operational discipline
- Business hours, under observation. Netflix's posture: the drill runs when engineers can respond. See patterns/continuous-fault-injection-in-production.
- Opt-out + kill-switch. A sane injection platform lets services opt out and provides a global pause; the 2011 Simian Army post doesn't describe these, but they are preconditions for safe production drills.
- Graceful degradation as prerequisite. The fleet's expectation that losing any instance is non-impactful has to be architecturally true before the monkey runs in production. See concepts/graceful-degradation.
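The first two guard rails above can be expressed as a gate that runs before every injection. This is a sketch under stated assumptions, not a description of Netflix's tooling: the `DrillPolicy` class, its field names, the weekday 09:00 to 17:00 window, and the service names are all hypothetical, and `now` is assumed to be the local time of the on-call team.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Set


@dataclass
class DrillPolicy:
    paused: bool = False  # global kill-switch: True stops all drills
    opted_out: Set[str] = field(default_factory=set)  # services excluded from drills
    start_hour: int = 9   # assumed business-hours window (inclusive)
    end_hour: int = 17    # assumed business-hours window (exclusive)

    def allows(self, service: str, now: datetime) -> bool:
        """Return True only if a drill against `service` may run at `now`."""
        if self.paused:
            return False
        if service in self.opted_out:
            return False
        is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
        in_hours = self.start_hour <= now.hour < self.end_hour
        return is_weekday and in_hours


# Usage with hypothetical service names; 2024-01-03 is a Wednesday.
policy = DrillPolicy(opted_out={"billing"})
weekday_morning = datetime(2024, 1, 3, 10, 0)
can_drill_api = policy.allows("api", weekday_morning)
can_drill_billing = policy.allows("billing", weekday_morning)
```

Checking the kill-switch first means a global pause overrides every other setting, which is the behaviour an operator would expect during an incident.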
Seen in
- systems/netflix-chaos-monkey — the canonical tool.
- sources/2026-01-02-netflix-the-netflix-simian-army — the canonical foundational reference.