

Netflix Simian Army

The Simian Army is Netflix's family of automated agents — each with a narrow failure-mode or abnormal-condition domain — that continuously exercise and validate the fault-tolerance of Netflix's AWS production environment. Introduced in the 2011 TechBlog post "The Netflix Simian Army" by Yury Izrailevsky and Ariel Tseitlin (Source: sources/2026-01-02-netflix-the-netflix-simian-army), the Simian Army is the canonical origin of chaos engineering as a production discipline.

Design shape

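The Simian Army is a fleet of narrowly-focused agents rather than one monolithic fault-injection engine. Each simian owns one dimension, as described in the 2011 post:

  • Chaos Monkey: randomly terminates production instances.
  • Latency Monkey: injects artificial delays into client-server communication to simulate service degradation.
  • Chaos Gorilla: simulates the outage of an entire availability zone.
  • Conformity Monkey: finds instances that do not follow best practices (for example, not belonging to an auto-scaling group) and shuts them down.
  • Doctor Monkey: uses health checks and other signals to detect unhealthy instances and remove them from service.
  • Janitor Monkey: finds unused resources and disposes of them.
  • Security Monkey: finds security violations and checks that certificates are valid.
  • 10-18 Monkey: detects localization and internationalization problems in instances serving multiple regions and languages.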

See patterns/simian-army-shape for the generalised pattern.

Design principles (from the 2011 post)

  • "Constantly test our ability to actually survive these 'once in a blue moon' failures." Designing for fault tolerance is not enough; the design has to be exercised in production continuously.
  • Induce failure in business hours, with engineers on hand. "By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them." See patterns/continuous-fault-injection-in-production.
  • The cloud makes drills almost free. The flat-tire analogy: practising your spare-tire change in the driveway on Sunday is expensive in the real world, "but can be (almost) free and automated in the cloud." Chaos engineering is a cloud-native discipline because the test itself has near-zero cost.
  • Graceful degradation is a prerequisite. "We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments." The Simian Army exercises paths the architecture must already have. See concepts/graceful-degradation.
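
The business-hours and graceful-degradation principles above can be sketched together. This is a minimal illustration, not Netflix's implementation: the injection window (09:00 to 15:00, Monday to Friday), the function names, the exception type, and the fallback titles are all assumptions made for the example.

```python
import random
from datetime import datetime

# Hypothetical static fallback used when the personalized dependency fails.
FALLBACK_TITLES = ["popular-title-1", "popular-title-2"]


class DependencyUnavailable(Exception):
    """Raised when a (real or injected) dependency failure occurs."""


def in_injection_window(now: datetime, open_hour: int = 9, close_hour: int = 15) -> bool:
    """True Monday to Friday from open_hour (inclusive) to close_hour (exclusive)."""
    return now.weekday() < 5 and open_hour <= now.hour < close_hour


def call_with_injected_failure(dependency, now, rng, failure_rate=0.1):
    """Call dependency(), failing on purpose inside the window with probability failure_rate."""
    if in_injection_window(now) and rng.random() < failure_rate:
        raise DependencyUnavailable("injected failure")
    return dependency()


def recommendations(fetch_personalized, now, rng, failure_rate=0.1):
    """Graceful degradation: serve a static list when the dependency is unavailable."""
    try:
        return call_with_injected_failure(fetch_personalized, now, rng, failure_rate)
    except DependencyUnavailable:
        return FALLBACK_TITLES
```

Failures are only injected while engineers are assumed to be on hand, and the caller still returns a usable (if less personalized) answer when one occurs. The injector exercises the fallback path; it does not create it.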

Vocabulary overload

The Simian Army groups two different kinds of agent under one umbrella:

  • Fault injectors: Chaos Monkey, Latency Monkey, Chaos Gorilla — actively create failure conditions.
  • Drift detectors: Conformity Monkey, Security Monkey, Janitor Monkey, Doctor Monkey, 10-18 Monkey — detect abnormal conditions and correct them (usually by terminating the offending instance).

The unifying abstraction is "automated agent that enforces a property of the production environment" — whether the property is "must survive instance loss" or "must conform to ASG policy" or "must have valid certificates."
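
One way to picture this unifying abstraction is as a single agent type whose only varying part is the rule that selects instances to act on. The sketch below is illustrative only: `Instance`, `Agent`, `fault_injector`, and `drift_detector` are hypothetical names, termination is simulated by flipping a flag, and none of this reflects the actual SimianArmy code.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical stand-in for a production instance."""
    instance_id: str
    in_asg: bool = True    # assumed conformity property
    running: bool = True


@dataclass
class Agent:
    """A name plus a rule choosing which running instances to terminate."""
    name: str
    select: Callable[[List[Instance]], List[Instance]]

    def run(self, fleet: List[Instance]) -> List[str]:
        running = [inst for inst in fleet if inst.running]
        targets = self.select(running)
        for inst in targets:
            inst.running = False  # simulated termination
        return [inst.instance_id for inst in targets]


def fault_injector(rng: random.Random) -> Agent:
    """Enforces 'must survive instance loss' by terminating one random instance."""
    return Agent("fault-injector", lambda running: [rng.choice(running)] if running else [])


def drift_detector() -> Agent:
    """Enforces 'must conform to ASG policy' by terminating instances outside an ASG."""
    return Agent("drift-detector", lambda running: [i for i in running if not i.in_asg])
```

Both kinds of agent share the same `run` loop. The fault injector's selection rule is random, while the drift detector's is a predicate on instance state.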

History and maturity

The 2011 post explicitly notes that at time of publication "parts of the Simian Army have already been built, but much remains an aspiration — waiting for talented engineers to join the effort and make it a reality." Later Netflix work, including the open-sourcing of Chaos Monkey in 2012, filled in much of the implementation. The vocabulary predates the adoption of "chaos engineering" as the name of a discipline in the mid-2010s; the Simian Army, and Chaos Monkey in particular, is the practice the discipline was later named after.

Operational numbers

None disclosed in the 2011 post.

Open-source lineage

The lineage runs from Chaos Monkey (open-sourced in 2012) and SimianArmy (the broader repository) to the later FIT (Failure Injection Testing) and ChAP (Chaos Automation Platform) layers, which succeeded the Simian Army for targeted failure injection at Netflix. The 2011 post is the taxonomic ancestor of all of them.

