CONCEPT Cited by 1 source

Random instance failure injection

Random instance failure injection is the primitive operation of chaos engineering: pick a production instance at random and kill it, then verify the fleet survives without customer impact. Canonical implementation: Netflix's Chaos Monkey (2011).

The primitive

"A tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact." — Netflix, The Netflix Simian Army (2011).

Three essential properties:

  • Random selection — not targeted. The drill exercises the architectural expectation that any instance is safe to lose, not just the one the operator thinks is safest.
  • Production environment — not staging. The drill has to exercise real traffic, real dependencies, real ASG behaviour.
  • Real termination — the instance actually dies (not a simulated kill). Partial kills leave the fleet in an unrepresentative state.
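A minimal sketch of the primitive, under stated assumptions: the fleet is a plain list of instance IDs, and `terminate` is a hypothetical callback standing in for whatever call actually ends an instance in your environment. Neither the function name nor its signature comes from Chaos Monkey itself.

```python
import random
from typing import Callable, Optional, Sequence


def kill_random_instance(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one instance uniformly at random and terminate it.

    `terminate` is a hypothetical hook: in a real drill it would issue an
    actual termination against production, not a simulated one.
    """
    if not instance_ids:
        return None  # empty fleet: nothing to kill
    if rng is None:
        rng = random.Random()
    # Uniform choice: no instance is favoured as "the safe one".
    victim = rng.choice(list(instance_ids))
    terminate(victim)
    return victim


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # illustrative IDs only
    killed = []
    # Here the callback only records the victim; a real one would end it.
    print(kill_random_instance(fleet, killed.append), killed)
```

Passing the random generator in explicitly keeps a drill reproducible in tests while leaving production runs unseeded.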

Why random over targeted

Targeted fault injection tests a specific hypothesis ("we think losing this one instance is safe because …"). Random fault injection tests a fleet-wide invariant ("we claim losing any instance is safe"). The invariant is strictly stronger.

Targeted injection is still useful — it's what later Netflix tools (FIT, ChAP) and the Principles of Chaos Engineering literature emphasise for hypothesis-driven drills. But the random primitive is the foundational one, and the one Chaos Monkey implements.
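The gap between the hypothesis and the invariant can be made concrete with a toy capacity model. This is an illustrative sketch only: the per-instance capacities, the `demand` figure, and all function names are assumptions, and real survivability depends on far more than summed capacity.

```python
from typing import Dict


def survives_loss(capacity: Dict[str, int], lost: str, demand: int) -> bool:
    """True if the instances other than `lost` can still serve `demand`."""
    remaining = sum(c for name, c in capacity.items() if name != lost)
    return remaining >= demand


def targeted_hypothesis(capacity: Dict[str, int], instance: str, demand: int) -> bool:
    """'We think losing this one instance is safe.'"""
    return survives_loss(capacity, instance, demand)


def fleet_invariant(capacity: Dict[str, int], demand: int) -> bool:
    """'We claim losing any instance is safe.'"""
    return all(survives_loss(capacity, name, demand) for name in capacity)


if __name__ == "__main__":
    # Hypothetical fleet: two small instances and one large one.
    fleet = {"small-1": 10, "small-2": 10, "big-1": 40}
    demand = 30
    print(targeted_hypothesis(fleet, "small-1", demand))  # the hand-picked case
    print(fleet_invariant(fleet, demand))                 # the fleet-wide claim
```

In this toy fleet the targeted check on `small-1` passes while the invariant fails, because losing `big-1` leaves too little capacity. Whenever the invariant holds, every targeted check holds too, which is the sense in which it is the stronger claim.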

Failure-mode coverage

Random instance kill covers the most common cloud failure mode (EC2 instance health degradation, hardware failure, or a maintenance event), but it does not cover every failure mode.

Random instance kill is the first primitive, not the only one. The Simian Army shape is the generalisation — a family of failure-mode primitives, each narrowly scoped.

Operational discipline

  • Business hours, under observation. Netflix's posture: the drill runs when engineers can respond. See patterns/continuous-fault-injection-in-production.
  • Opt-out + kill-switch. A sane injection platform lets services opt out and provides a global pause; the 2011 Simian Army post doesn't describe these, but they are preconditions for safe production drills.
  • Graceful degradation as prerequisite. The fleet's expectation that losing any instance is non-impactful has to be architecturally true before the monkey runs in production. See concepts/graceful-degradation.
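The first two guards above can be sketched as a single pre-flight check. Everything here is assumed for illustration: the function name, the 09:00 to 15:00 weekday window, and the shape of the opt-out set are not taken from the Netflix post. The third item, graceful degradation, is an architectural property and cannot be reduced to a check like this.

```python
from datetime import datetime, time
from typing import Set


def may_inject(
    service: str,
    now: datetime,
    opted_out: Set[str],
    globally_paused: bool,
    start: time = time(9, 0),   # assumed start of the observed window
    end: time = time(15, 0),    # assumed end of the observed window
) -> bool:
    """Return True only if a drill against `service` is allowed right now."""
    if globally_paused:
        return False  # kill-switch: stops all injection everywhere
    if service in opted_out:
        return False  # the service has opted out of drills
    if now.weekday() >= 5:
        return False  # Saturday or Sunday: nobody is watching
    # Inside the window means engineers are around to respond.
    return start <= now.time() < end


if __name__ == "__main__":
    monday_morning = datetime(2024, 1, 8, 10, 30)
    print(may_inject("playback-api", monday_morning, {"billing"}, False))
```

Checking the global pause first means a single flag overrides every other setting, which is the behaviour an operator needs during an unrelated incident.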

Seen in
