CONCEPT Cited by 1 source

Random instance failure injection¶

Random instance failure injection is the primitive operation of chaos engineering: pick a production instance at random and kill it, then verify the fleet survives without customer impact. Canonical implementation: Netflix's Chaos Monkey (2011).

The primitive¶

"A tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact." — Netflix, The Netflix Simian Army (2011).

Three essential properties:

Random selection — not targeted. The drill exercises the architectural expectation that any instance is safe to lose, not just the one the operator thinks is safest.
Production environment — not staging. The drill has to exercise real traffic, real dependencies, and real Auto Scaling group (ASG) behaviour.
Real termination — the instance actually dies (not a simulated kill). Partial kills leave the fleet in an unrepresentative state.
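
The three properties can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Chaos Monkey's actual implementation: `Instance`, `pick_and_kill`, and the `terminate` callback are hypothetical names, and the in-memory fleet stands in for whatever cloud API really lists and terminates instances.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one running instance."""
    instance_id: str
    environment: str  # e.g. "production" or "staging"


def pick_and_kill(
    fleet: Sequence[Instance],
    terminate: Callable[[Instance], None],
    rng: Optional[random.Random] = None,
) -> Optional[Instance]:
    """Terminate one production instance chosen uniformly at random.

    Returns the victim, or None when the fleet has no production instance.
    """
    if rng is None:
        rng = random.Random()
    # Production environment: staging instances are never candidates.
    candidates = [i for i in fleet if i.environment == "production"]
    if not candidates:
        return None
    # Random selection: every candidate is equally likely to be picked.
    victim = rng.choice(candidates)
    # Real termination: the callback is assumed to kill the instance for real.
    terminate(victim)
    return victim
```

Passing a seeded `random.Random` makes a drill reproducible in tests; in production the default unseeded generator keeps the choice unpredictable to the operator.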

Why random over targeted¶

Targeted fault injection tests a specific hypothesis ("we think losing this one instance is safe because …"). Random fault injection tests a fleet-wide invariant ("we claim losing any instance is safe"). The invariant is the stronger claim: if it holds, every single-instance hypothesis holds too, but not the reverse.

Targeted injection is still useful — it's what later Netflix tools (FIT, ChAP) and the Principles of Chaos Engineering literature emphasise for hypothesis-driven drills. But the random primitive is the foundational one, and the one Chaos Monkey implements.
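
One way to see why repeated random drills approach the fleet-wide invariant: under simplifying assumptions (one kill per drill, uniform and independent selection, a constant fleet size n), the chance that a particular long-lived instance is never selected across k drills is (1 - 1/n)^k, which shrinks towards zero as k grows. The helper below is a hypothetical sketch of that arithmetic, not something taken from the Netflix post.

```python
def probability_never_selected(fleet_size: int, drills: int) -> float:
    """Chance that one given instance survives every drill untouched.

    Assumes one uniformly random kill per drill, independent drills,
    and a fleet size that stays constant between drills.
    """
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    if drills < 0:
        raise ValueError("drills must be non-negative")
    # Each drill spares the instance with probability 1 - 1/n.
    return (1.0 - 1.0 / fleet_size) ** drills
```

A targeted drill, by contrast, only ever says something about the instance the operator chose.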

Failure-mode coverage¶

Random instance kill covers a common cloud failure mode, the outright loss of a single instance (EC2 hardware failure, a maintenance event, or health degradation severe enough to take the instance out of service), but does not cover:

AZ-level failure — requires concepts/availability-zone-failure-drill / systems/netflix-chaos-gorilla.
Dependency degradation — requires latency injection (see systems/netflix-latency-monkey).
Grey failure (concepts/grey-failure) — instances that are degraded but not dead; the binary instance-kill primitive doesn't reproduce the partial-health failure mode.
Control-plane failure — random-instance-kill doesn't exercise failure of the orchestration plane itself.

Random instance kill is the first primitive, not the only one. The Simian Army shape is the generalisation — a family of failure-mode primitives, each narrowly scoped.
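
The coverage picture above can be written down as a small lookup. This is only an illustrative sketch: the `FailureMode` enum, the `PRIMITIVE_FOR` table, and `uncovered_modes` are hypothetical names, and the table simply restates the list in this section, with `None` marking modes for which this page names no primitive.

```python
from enum import Enum
from typing import Dict, List, Optional, Set


class FailureMode(Enum):
    INSTANCE_LOSS = "instance loss"
    AZ_FAILURE = "AZ-level failure"
    DEPENDENCY_DEGRADATION = "dependency degradation"
    GREY_FAILURE = "grey failure"
    CONTROL_PLANE_FAILURE = "control-plane failure"


# Failure mode -> page identifier of the primitive that exercises it.
PRIMITIVE_FOR: Dict[FailureMode, Optional[str]] = {
    FailureMode.INSTANCE_LOSS: "systems/netflix-chaos-monkey",
    FailureMode.AZ_FAILURE: "systems/netflix-chaos-gorilla",
    FailureMode.DEPENDENCY_DEGRADATION: "systems/netflix-latency-monkey",
    FailureMode.GREY_FAILURE: None,
    FailureMode.CONTROL_PLANE_FAILURE: None,
}


def uncovered_modes(deployed: Set[str]) -> List[FailureMode]:
    """List failure modes that no deployed primitive exercises.

    Modes mapped to None are always reported, since no primitive exists.
    """
    return [m for m in FailureMode if PRIMITIVE_FOR[m] not in deployed]
```

Running only the instance-kill primitive leaves four of the five modes unexercised, which is the argument for a family of narrowly scoped tools.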

Operational discipline¶

Business hours, under observation. Netflix's posture: the drill runs when engineers can respond. See patterns/continuous-fault-injection-in-production.
Opt-out + kill-switch. A safe injection platform lets services opt out and provides a global pause; the 2011 Simian Army post doesn't describe either mechanism, but both are preconditions for safe production drills.
Graceful degradation as prerequisite. The fleet's expectation that losing any instance is non-impactful has to be architecturally true before the monkey runs in production. See concepts/graceful-degradation.
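
The first two safeguards can be combined into a single pre-flight check. The sketch below is an assumption-laden illustration, not a description of Netflix's tooling: `DrillPolicy` and `may_inject` are hypothetical names, the 09:00 to 15:00 weekday window is an arbitrary placeholder, and `now` is taken to be a naive datetime in the on-call team's local time.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Set


@dataclass
class DrillPolicy:
    """Hypothetical safety policy consulted before every injection."""
    paused: bool = False  # global kill-switch
    opted_out: Set[str] = field(default_factory=set)  # services that opted out
    window_start: time = time(9, 0)   # assumed start of staffed hours
    window_end: time = time(15, 0)    # assumed end of staffed hours


def may_inject(policy: DrillPolicy, service: str, now: datetime) -> bool:
    """Return True only when every safeguard allows a drill right now."""
    if policy.paused:
        return False  # the global pause overrides everything else
    if service in policy.opted_out:
        return False  # the service owner has opted out
    if now.weekday() >= 5:
        return False  # Saturday (5) and Sunday (6) are outside business hours
    return policy.window_start <= now.time() < policy.window_end
```

Checking the kill-switch first means a paused platform never evaluates anything else, so the pause cannot be bypassed by a misconfigured window or opt-out list.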

Seen in¶

systems/netflix-chaos-monkey — the canonical tool.
sources/2026-01-02-netflix-the-netflix-simian-army — the foundational reference.