Netflix Chaos Monkey
Chaos Monkey is the founding member of Netflix's Simian Army and the canonical fault-injection tool for cloud-native resilience testing. It randomly disables production instances to verify that the architecture survives the most common class of cloud failure, the loss of an individual node, without customer impact. It was introduced in Netflix's 2011 TechBlog post (Source: sources/2026-01-02-netflix-the-netflix-simian-army).
Purpose
The 2011 post describes Chaos Monkey as "a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact." The name evokes "the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption."
Design principles
- Run during business hours under engineer supervision. "By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won't even notice." See patterns/continuous-fault-injection-in-production.
- Random selection, not targeted. Chaos Monkey does not pick a specific instance based on a hypothesis; it picks randomly, exercising the expectation that any instance is safe to lose. See concepts/random-instance-failure-injection.
- Production-only, not staging. The whole premise is that the drill happens in the real environment with real traffic; a drill in staging does not show that production degrades gracefully.
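The first two principles above (a supervised business-hours window and uniformly random selection) can be illustrated with a minimal Python sketch. This is not Netflix's implementation: the window bounds, the function names `in_business_hours` and `pick_victim`, and the instance IDs are all hypothetical, since the 2011 post discloses no schedule or API.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence

# Hypothetical supervision window; the 2011 post gives no actual schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(15, 0)


def in_business_hours(now: datetime) -> bool:
    """Return True on a weekday inside the assumed supervision window."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(
    instances: Sequence[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Pick one instance uniformly at random, or None outside the window.

    No hypothesis drives the choice: every instance is assumed safe to lose.
    """
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)
```

Passing the clock and the random generator in as arguments keeps the sketch deterministic under test; a real scheduler would also need the rate limits and kill switch that the 2011 post leaves unspecified.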
Rollout posture (from the 2011 post)
The 2011 announcement does not disclose:
- Instance-kill rate / schedule.
- Opt-in vs. opt-out for service teams.
- Blast-radius bounds (limits per ASG / cluster / region).
- Kill-switch or pause mechanism.
Later Netflix work and the 2012 open-source release fill these in. This stub captures the 2011 design intent only; deeper architectural details belong in future ingests of later Netflix posts about Chaos Monkey v2 / SimianArmy / FIT / ChAP.
The flat-tire analogy
"Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud." Chaos Monkey is the scheduled driveway flat tire.
Relationship to Chaos Gorilla
systems/netflix-chaos-gorilla is the same idea scaled up one failure-domain level: where Chaos Monkey takes out a single instance, Chaos Gorilla takes out an entire Availability Zone. Together, the pair validates the two most common failure-domain boundaries on AWS at the time: instance and AZ.
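The difference in failure domain can be sketched in Python as two target selectors over the same fleet. This is an illustration only: the `fleet` mapping, the function names, and the random choice of zone are assumptions, not details from the 2011 post.

```python
import random
from typing import Dict, List


def monkey_targets(fleet: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """Instance-level failure: one random instance from the whole fleet."""
    all_instances = [inst for zone in fleet.values() for inst in zone]
    return [rng.choice(all_instances)] if all_instances else []


def gorilla_targets(fleet: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """AZ-level failure: every instance in one zone (zone chosen at random here)."""
    if not fleet:
        return []
    zone = rng.choice(sorted(fleet))  # sorted for a reproducible ordering
    return list(fleet[zone])


# Hypothetical fleet: Availability Zone name -> instance IDs.
fleet = {
    "us-east-1a": ["i-01", "i-02"],
    "us-east-1b": ["i-03", "i-04", "i-05"],
}
```

With this fleet, `monkey_targets` always returns a single instance ID, while `gorilla_targets` returns the full instance list of whichever zone was picked.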
Operational numbers
None disclosed in the 2011 post.
Seen in
- sources/2026-01-02-netflix-the-netflix-simian-army — the canonical founding reference.