NETFLIX Tier 1

Netflix — The Netflix Simian Army (2011)

Netflix's 2011-07-19 post (republished on Medium; ingested 2026-01-02) by Yury Izrailevsky (Director, Cloud & Systems Infrastructure) and Ariel Tseitlin (Director, Cloud Solutions) introduces the Simian Army — a family of automated agents that continuously induce failures and detect abnormal conditions in Netflix's AWS production environment, to prove the cloud architecture is genuinely resilient to the failures it claims to tolerate. This is the canonical foundational post on chaos engineering at Netflix — Chaos Monkey was introduced here, along with seven sibling simians covering latency, conformity, health, cost/waste, security, localization, and availability-zone failure.

Summary

The thesis: "just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these 'once in a blue moon' failures." The cloud makes this testing "almost free and automated" — unlike physical-datacenter drills. Netflix built Chaos Monkey first, a tool that randomly disables production instances during business hours under engineer supervision, so weaknesses surface at 2pm on a Wednesday when people can respond, rather than at 3am on a Sunday. Inspired by that success, the team "started creating new simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them; a virtual Simian Army to keep our cloud safe, secure, and highly available."
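
The post describes intent only, not implementation. As a minimal sketch of the two ideas in this paragraph — random instance kill, gated to a supervised business-hours window — consider the following (all names are hypothetical; the real Chaos Monkey terminates EC2 instances via AWS APIs, not entries in a Python list):

```python
import random
from datetime import datetime

def is_business_hours(now):
    """Drill window: weekdays, 9am-5pm, when engineers can respond."""
    return now.weekday() < 5 and 9 <= now.hour < 17

def chaos_monkey(fleet, now, rng=random):
    """Randomly pick one production instance to disable, but only
    inside the supervised business-hours window."""
    if not is_business_hours(now) or not fleet:
        return None
    victim = rng.choice(fleet)
    fleet.remove(victim)  # stand-in for an EC2 terminate call
    return victim

# A Wednesday-afternoon drill kills one instance under supervision...
fleet = ["i-01", "i-02", "i-03"]
killed = chaos_monkey(fleet, datetime(2011, 7, 20, 14, 0))
assert killed is not None and killed not in fleet
# ...while a 3am Sunday run does nothing: that slot belongs to real failures.
assert chaos_monkey(fleet, datetime(2011, 7, 24, 3, 0)) is None
```

The gate is the point: the failure is identical either way, but the drill version happens when the people who can fix the weakness are watching.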

Eight named simians, each with a focused failure or abnormal-condition domain:

  • Chaos Monkey — randomly disables production instances.
  • Latency Monkey — injects artificial delays in RESTful client-server traffic; very large delays simulate node / service downtime without taking instances down.
  • Conformity Monkey — finds instances violating best practices (e.g. not in an auto-scaling group) and shuts them down.
  • Doctor Monkey — consumes health-check + external signals (CPU load, etc.) to detect unhealthy instances, removes them from service, and eventually terminates them.
  • Janitor Monkey — searches for and disposes of unused resources to keep the environment free of clutter and waste.
  • Security Monkey — an extension of Conformity Monkey focused on security violations (improperly configured AWS security groups, expiring SSL/DRM certificates); terminates offending instances.
  • 10-18 Monkey — short for l10n-i18n; detects configuration and runtime problems serving customers across geographic regions with different languages and character sets.
  • Chaos Gorilla — same idea as Chaos Monkey, but simulates the outage of an entire Amazon availability zone. Verifies Netflix's services re-balance to remaining AZs without user-visible impact or manual intervention.

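Latency Monkey's trick — degradation and outage on one dial — can be sketched as a wrapper around a client call (hypothetical names; the real monkey injects delays into RESTful traffic server-side, not via a Python decorator):

```python
import time

class SimulatedOutage(Exception):
    """Raised when injected latency exceeds the caller's timeout."""

def with_injected_latency(call, delay_s, timeout_s):
    """Wrap a client-server call with artificial delay, Latency Monkey
    style. A delay beyond the caller's timeout behaves like the
    dependency being down, without touching the instance itself."""
    def wrapped(*args, **kwargs):
        if delay_s >= timeout_s:
            raise SimulatedOutage(f"injected {delay_s}s > timeout {timeout_s}s")
        time.sleep(delay_s)
        return call(*args, **kwargs)
    return wrapped

def get_recommendations(user):  # hypothetical downstream dependency
    return ["title-a", "title-b"]

# Small delay: dependency is slow but alive.
slow = with_injected_latency(get_recommendations, delay_s=0.01, timeout_s=1.0)
assert slow("u1") == ["title-a", "title-b"]

# Very large delay: indistinguishable from the dependency being down.
down = with_injected_latency(get_recommendations, delay_s=30, timeout_s=1.0)
try:
    down("u1")
except SimulatedOutage:
    pass  # the caller's graceful-degradation fallback would run here
```
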
The post is a declaration of design intent, not an implementation deep-dive. It explicitly notes that "parts of the Simian Army have already been built, but much remains an aspiration." No code, no architecture diagram, no numbers (no failure-injection rates, no MTTR data, no fleet size, no cost figures). Its importance is taxonomic: it names the full family of failure-mode agents before a single paper on chaos engineering as a discipline existed in the literature.

Key takeaways

  1. Testing fault tolerance continuously in production is the load-bearing claim. "We have to constantly test our ability to actually survive these 'once in a blue moon' failures." Designing for fault tolerance is not enough; the design has to be continuously exercised. See concepts/chaos-engineering. (Source: sources/2026-01-02-netflix-the-netflix-simian-army)

  2. Business-hours induction is a deliberate choice. Chaos Monkey runs "in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems." Failure at 3am on a Sunday is the adversary; failure at 2pm on a Wednesday is the drill. See patterns/continuous-fault-injection-in-production.

  3. Random instance-kill is the canonical fault-injection primitive. Chaos Monkey "randomly disables our production instances to make sure we can survive this common type of failure without any customer impact." Canonical instance: concepts/random-instance-failure-injection / systems/netflix-chaos-monkey.

  4. The flat-tire analogy is the philosophy. "One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud." The quote captures the economic inversion the cloud makes possible for fault-tolerance validation.

  5. Latency injection simulates service outage without instance loss. "By making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system." Latency Monkey decouples the tested service's dependency failure from the rest of the fleet's experience — a strictly more surgical tool than instance kill.

  6. Conformity + Security cover "abnormal-condition detection" rather than active fault induction. Instances are terminated for being outside policy (not in an auto-scaling group; open security group; expiring cert). This overloads the Simian Army vocabulary — not every monkey injects a failure; some detect and correct drift. See systems/netflix-conformity-monkey + systems/netflix-security-monkey.

  7. Chaos Gorilla validates cross-AZ failover automatically. "Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention." Canonical instance of concepts/availability-zone-failure-drill.

  8. The design posture is a fleet of narrowly-focused agents, not one big chaos engine. Each simian owns one failure mode or one abnormal condition. The composition is the Simian Army shape — see patterns/simian-army-shape. Contrast with monolithic fault-injection frameworks that try to parameterise over all failure modes.

  9. Graceful degradation is a design prerequisite. "We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even regionally-redundant deployments." The Simian Army exercises the graceful-degradation paths the architecture must already have in place. See concepts/graceful-degradation.

  10. Much of the 2011 Simian Army was aspirational. "Parts of the Simian Army have already been built, but much remains an aspiration — waiting for talented engineers to join the effort and make it a reality." The post is partly a recruiting artefact. Later Netflix writing — and the 2012 open-sourcing of Chaos Monkey — fills in the implementation.
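
The Chaos Gorilla check in takeaway 7 — drop a whole AZ, verify capacity re-balances with no manual step — can be sketched as (zone names and even-split policy are illustrative assumptions, not from the post):

```python
def rebalance_on_az_loss(capacity_by_az, failed_az):
    """Chaos Gorilla drill, sketched: drop one availability zone and
    redistribute its capacity evenly across the survivors. The invariant
    checked is that total serving capacity is preserved automatically."""
    lost = capacity_by_az.pop(failed_az, 0)
    survivors = list(capacity_by_az)
    if not survivors:
        raise RuntimeError("no surviving zones: region-level failure")
    share, rem = divmod(lost, len(survivors))
    for i, az in enumerate(survivors):
        capacity_by_az[az] += share + (1 if i < rem else 0)
    return capacity_by_az

fleet = {"us-east-1a": 30, "us-east-1b": 30, "us-east-1c": 30}
after = rebalance_on_az_loss(fleet, "us-east-1a")
assert "us-east-1a" not in after
assert sum(after.values()) == 90  # capacity preserved across survivors
```
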

Named simians (full enumeration)

| Simian | Domain | Action |
| --- | --- | --- |
| Chaos Monkey | Instance-level failure | Randomly terminates production instances |
| Latency Monkey | Service degradation + dependency failure | Injects artificial delays in RESTful client/server traffic; very large delays ≈ outage |
| Conformity Monkey | Best-practice drift | Finds non-conformant instances (e.g. not in an ASG) and shuts them down |
| Doctor Monkey | Unhealthy-instance detection | Consumes health checks + CPU signals; evicts, then eventually terminates, unhealthy instances |
| Janitor Monkey | Cost / waste | Finds and disposes of unused resources |
| Security Monkey | Security-posture drift | Extends Conformity Monkey; finds misconfigured security groups + expiring certs; terminates offenders |
| 10-18 Monkey | Localization / internationalization | Detects config + runtime problems across geographies / languages / character sets |
| Chaos Gorilla | Availability-zone-level failure | Simulates full AZ outage; verifies automatic re-balance across remaining AZs |

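Doctor Monkey's pipeline — classify from health signals, evict from service, eventually terminate — is not specified in the post; a sketch under assumed thresholds (the CPU cutoff and failed-check limit are hypothetical):

```python
def doctor_monkey(instances, cpu_threshold=0.95, failed_checks_limit=3):
    """Doctor Monkey, sketched: classify each instance from health-check
    failures and CPU load, remove unhealthy ones from service, and mark
    long-unhealthy ones for eventual termination."""
    in_service, evicted, to_terminate = [], [], []
    for inst in instances:
        unhealthy = inst["failed_checks"] > 0 or inst["cpu"] >= cpu_threshold
        if not unhealthy:
            in_service.append(inst["id"])
        elif inst["failed_checks"] >= failed_checks_limit:
            to_terminate.append(inst["id"])  # "eventually terminates them"
        else:
            evicted.append(inst["id"])  # out of service, kept for diagnosis
    return in_service, evicted, to_terminate

instances = [
    {"id": "i-01", "cpu": 0.40, "failed_checks": 0},
    {"id": "i-02", "cpu": 0.99, "failed_checks": 1},
    {"id": "i-03", "cpu": 0.50, "failed_checks": 5},
]
ok, out, gone = doctor_monkey(instances)
assert ok == ["i-01"] and out == ["i-02"] and gone == ["i-03"]
```
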
Operational numbers

None disclosed. The post is an announcement / declaration of design intent. No failure-injection rates, no fleet size, no MTTR numbers, no cost figures, no incident-reduction numbers. The only quantitative details are taxonomic — the names and domains of the eight simians.

Architecture notes

  • Cloud-as-enabler. The drill that "is expensive and time-consuming in the real world ... can be (almost) free and automated in the cloud." Chaos engineering is a cloud-native discipline because the test itself is cheap.
  • Run in business hours, under observation. Explicit design choice — the drill happens when engineers are available to respond. Enables post-drill analysis and architectural fixes, not just alert noise.
  • One-failure-mode-per-agent. Each simian has a narrow domain. Composition is at the fleet level, not in one engine. See patterns/simian-army-shape.
  • Detection-and-correction simians overload the vocabulary. Conformity + Security + Janitor detect drift and act; they are not fault injectors. 10-18 + Doctor detect problems and correct. Chaos Monkey + Latency Monkey + Chaos Gorilla inject faults. The Simian Army unifies fault injection + drift detection under one umbrella.
  • Implementation details deferred. How the monkeys schedule, authenticate against AWS, bound blast radius, or coordinate with each other is not in the post. Later Netflix work (Chaos Monkey v2, the SimianArmy GitHub repo, the FIT / ChAP platform) discloses these.
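
The detection-and-correction side of the vocabulary — Conformity and Security Monkey flagging policy drift rather than injecting faults — can be sketched as a sweep over instance metadata (the specific policy checks and field names are illustrative assumptions):

```python
def conformity_sweep(instances, expiring_cert_days=30):
    """Drift detection, sketched: flag instances outside policy
    (not in an auto-scaling group, world-open security group,
    soon-to-expire certificate) for termination. No failure is
    injected; non-conformance alone is the offense."""
    offenders = []
    for inst in instances:
        reasons = []
        if inst.get("asg") is None:
            reasons.append("not in an auto-scaling group")
        if "0.0.0.0/0" in inst.get("sg_ingress", []):
            reasons.append("security group open to the world")
        if inst.get("cert_days_left", 9999) < expiring_cert_days:
            reasons.append("certificate expiring")
        if reasons:
            offenders.append((inst["id"], reasons))
    return offenders

fleet = [
    {"id": "i-01", "asg": "api-asg", "sg_ingress": ["10.0.0.0/8"], "cert_days_left": 200},
    {"id": "i-02", "asg": None, "sg_ingress": ["0.0.0.0/0"], "cert_days_left": 7},
]
bad = conformity_sweep(fleet)
assert [i for i, _ in bad] == ["i-02"] and len(bad[0][1]) == 3
```
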

Caveats

  • Pre-architecture-diagram announcement post. No code, no architecture diagram, no numbers, no failure-injection rates, no fleet size, no worked example. Foundational for the discipline, thin on implementation.
  • Partly aspirational. The post explicitly says "much remains an aspiration." Not every simian named here was built at publication; the team was recruiting.
  • Business-hours induction has quiet preconditions. Running Chaos Monkey at 2pm on a Wednesday is only safe on a fleet already designed for graceful degradation; otherwise the drill is the outage. The post mentions "graceful degradation on dependency failures" as a prerequisite but doesn't operationalise it.
  • No blast-radius story. The post describes the intent of each simian but doesn't describe how blast radius is bounded, how canary-vs-production is separated, or what the kill-switch is if a monkey is doing more harm than expected.
  • 10-18 Monkey barely defined. The l10n/i18n monkey is named and has a one-line domain description; no detail on what "configuration and run time problems" it actually inspects.
  • Built and unbuilt are not distinguished. The post is written as a present-tense announcement but mixes built + unbuilt simians without marking which are which.
  • Medium republish date confusion. The raw ingest date (2026-01-02) is the Medium republication timestamp; the original was 2011-07-19. This is the oldest full Netflix ingest on the wiki by content date (2011), despite the ingest-file naming date. Claims should be read as 2011 engineering posture unless explicitly qualified.

Cross-references

This post is the wiki's canonical foundational reference for chaos engineering. Later entries on the wiki that rely on chaos-engineering primitives (fault injection, failure testing, graceful-degradation validation, AZ-failure drills) cite back here. The 2011 vocabulary predates the 2016 "chaos engineering" term coined by Netflix engineers (Rosenthal, Hochstein, et al.) — the Simian Army is the practice that the discipline was later named after.
