PATTERN Cited by 1 source

Simian Army shape

The Simian Army shape is the architectural pattern Netflix introduced in the 2011 TechBlog post "The Netflix Simian Army": a fleet of narrowly focused agents, each owning one failure mode or one abnormal-condition domain, composed at the fleet level rather than parameterised inside a single engine. The pattern takes its name from that fleet.

The pattern

Rather than one chaos-engineering engine configured with many failure-mode knobs, build many small single-purpose agents, each of which:

  • Owns exactly one failure mode or exactly one abnormal-condition domain.
  • Has its own operational cadence (Chaos Monkey runs frequently; Chaos Gorilla runs occasionally).
  • Has its own blast-radius discipline (instance / AZ / security group / cost).
  • Can be enabled, disabled, or tuned independently.

The composition is the Simian Army: the set of agents running in parallel against the production environment.
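
The four properties above can be sketched as a small data structure plus a fleet-level loop. This is a minimal illustration, not Netflix's implementation: the `Agent` class, its field names, the tick-based cadence, and the string-returning actions are all assumptions made for the example, and nothing here touches real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    """One single-purpose agent (all field names here are hypothetical)."""
    name: str
    blast_radius: str            # e.g. "instance" or "AZ"
    period_ticks: int            # cadence: act every N scheduler ticks
    action: Callable[[], str]    # the one thing this agent does
    enabled: bool = True         # independent on/off switch

    def due(self, tick: int) -> bool:
        # An agent acts only when it is enabled and its own cadence says so.
        return self.enabled and tick % self.period_ticks == 0


def run_fleet(agents: List[Agent], ticks: int) -> List[Tuple[int, str, str]]:
    """Compose agents at the fleet level: each decides for itself when to act."""
    log = []
    for tick in range(ticks):
        for agent in agents:
            if agent.due(tick):
                log.append((tick, agent.name, agent.action()))
    return log


# Stand-in actions that only return a description of what would happen.
fleet = [
    Agent("chaos-monkey", "instance", 1, lambda: "terminated one instance"),
    Agent("chaos-gorilla", "AZ", 5, lambda: "simulated an AZ outage"),
]

for entry in run_fleet(fleet, ticks=6):
    print(entry)
```

The fleet loop knows nothing about failure modes; it only asks each agent whether it is due. Disabling or retuning one agent (for example `fleet[1].enabled = False`) leaves the others untouched.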

Canonical Netflix instance

Eight simians, each a single-purpose agent:

Agent | Domain | Injects or detects
systems/netflix-chaos-monkey | Random instance termination | Injects
systems/netflix-latency-monkey | RPC-boundary delay | Injects
systems/netflix-chaos-gorilla | Full AZ outage | Injects
systems/netflix-conformity-monkey | Best-practice drift (e.g. not in ASG) | Detects + acts
systems/netflix-security-monkey | Security-posture drift + cert lifecycle | Detects + acts
systems/netflix-doctor-monkey | Unhealthy-instance detection | Detects + acts
systems/netflix-janitor-monkey | Unused-resource waste | Detects + acts
systems/netflix-10-18-monkey | Localization / i18n drift | Detects + acts

Why narrow agents

  • Each rule set is small and understandable. One monkey's codebase is small enough for one team to own.
  • Blast radius is naturally scoped. If one monkey misbehaves, only its failure domain is affected.
  • Opt-in / opt-out / kill-switch is per-agent. Services can accept Chaos Monkey but not yet Chaos Gorilla while building capacity headroom.
  • Cadence is per-agent. Frequent, low-blast agents run continuously; rare, high-blast agents run on a schedule.
  • Team boundaries map cleanly. One monkey per domain means one ownership team per monkey.
  • New failure modes are additive. Adding Latency Monkey after Chaos Monkey doesn't require reworking Chaos Monkey.
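
The per-agent opt-out and kill-switch point can be made concrete with a small policy object. This is a sketch under stated assumptions: `FleetPolicy`, its method names, and the service name `"billing"` are hypothetical and are not drawn from Netflix's tooling.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class FleetPolicy:
    """Hypothetical per-agent switches; not taken from Netflix's tooling."""

    def __init__(self) -> None:
        self._killed: Set[str] = set()
        self._opt_outs: DefaultDict[str, Set[str]] = defaultdict(set)

    def kill(self, agent: str) -> None:
        # Fleet-wide kill-switch for a single agent.
        self._killed.add(agent)

    def opt_out(self, service: str, agent: str) -> None:
        # One service declines one agent; other agents are unaffected.
        self._opt_outs[service].add(agent)

    def allowed(self, agent: str, service: str) -> bool:
        if agent in self._killed:
            return False
        return agent not in self._opt_outs.get(service, set())


policy = FleetPolicy()
policy.opt_out("billing", "chaos-gorilla")
print(policy.allowed("chaos-monkey", "billing"))   # True
print(policy.allowed("chaos-gorilla", "billing"))  # False
```

Because every switch is keyed by agent name, adding a new agent to the fleet needs no change to the policy or to the existing agents.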

Contrast: monolithic chaos engine

A single chaos-engineering engine with parameterised failure modes faces structural problems:

  • Shared kill-switch. Pausing one failure mode pauses all.
  • Shared blast-radius accounting. It is hard to reason about what a drill really does.
  • Shared ownership. One team owns all failure modes, so the bus factor is low.
  • Cross-mode coupling. A bug in one failure mode's code path can affect another.

The Simian Army shape avoids all four by going the other direction: many small agents, each a separable unit of concern.

Vocabulary overload

The Simian Army's unifying abstraction is "automated agent that enforces a property of the production environment." Under this umbrella, Netflix groups two different kinds of agent:

  • Fault injectors (Chaos Monkey, Latency Monkey, Chaos Gorilla) — actively induce failure.
  • Drift detectors (Conformity, Security, Janitor, Doctor, 10-18 Monkey) — detect abnormal conditions and correct them.

Both fit the pattern. The pattern is "narrow autonomous agent per operational invariant," not "fault injector per failure mode."
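
A short sketch shows how a fault injector and a drift detector can sit behind the same interface. Everything named here is an assumption for illustration: the `InvariantAgent` base class, the dictionary model of the environment, the `in_asg` and `flagged` keys, and the choice of "flag the instance" as the corrective action.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict, List

# Hypothetical environment model: instance id -> boolean attributes.
Environment = Dict[str, Dict[str, bool]]


class InvariantAgent(ABC):
    """Common shape: look at the environment, act, report what was done."""

    @abstractmethod
    def run(self, env: Environment) -> List[str]:
        ...


class InstanceTerminator(InvariantAgent):
    """Fault injector: removes one randomly chosen instance."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def run(self, env: Environment) -> List[str]:
        if not env:
            return []
        victim = self._rng.choice(sorted(env))
        del env[victim]
        return [f"terminated {victim}"]


class AsgConformityChecker(InvariantAgent):
    """Drift detector: flags instances that are not in an auto scaling group."""

    def run(self, env: Environment) -> List[str]:
        actions = []
        for instance_id, attrs in sorted(env.items()):
            if not attrs.get("in_asg", False):
                attrs["flagged"] = True
                actions.append(f"flagged {instance_id}: not in an auto scaling group")
        return actions


env: Environment = {
    "i-1": {"in_asg": True},
    "i-2": {"in_asg": True},
    "i-3": {"in_asg": False},
}
for agent in (AsgConformityChecker(), InstanceTerminator()):
    print(type(agent).__name__, agent.run(env))
```

The caller treats both agents identically; whether an agent injects a fault or corrects drift is an internal detail of its `run` method.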

Later variants

Netflix's later work on fault injection, including FIT (Failure Injection Testing), ChAP (Chaos Automation Platform), and the broader Principles of Chaos Engineering movement, added hypothesis-driven, targeted drills alongside the random-and-continuous Simian Army approach. These newer tools are not Simian-Army-shaped (ChAP is centralised, not a fleet); they complement the Simian Army rather than replace it. The Simian Army shape remains the baseline for continuous operational-invariant enforcement; the ChAP shape is the baseline for one-off hypothesis tests.

When to use

  • You have many distinct operational invariants that need continuous enforcement.
  • Each invariant has a clear enforcement action (inject failure, terminate offender, clean up).
  • Blast radius can be bounded per invariant, and drills for different invariants can be kept from running against the same target at the same time.
  • You have graceful-degradation discipline already in place — otherwise the drills cause outages, not learning.
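
One way to keep two agents from drilling the same target at once is a per-target lease that an agent must hold before acting. The sketch below is an assumption-laden illustration: `TargetLeases` and its methods are hypothetical, and the in-memory dictionary only works inside one process, so a real fleet would need a shared store with expiry.

```python
from typing import Dict


class TargetLeases:
    """Hypothetical per-target mutual exclusion between agents."""

    def __init__(self) -> None:
        self._holder: Dict[str, str] = {}  # target -> agent holding the lease

    def acquire(self, target: str, agent: str) -> bool:
        # Refuse if any agent already holds this target.
        if target in self._holder:
            return False
        self._holder[target] = agent
        return True

    def release(self, target: str, agent: str) -> None:
        # Only the current holder can release its lease.
        if self._holder.get(target) == agent:
            del self._holder[target]


leases = TargetLeases()
print(leases.acquire("az-1", "chaos-gorilla"))   # True
print(leases.acquire("az-1", "latency-monkey"))  # False
leases.release("az-1", "chaos-gorilla")
print(leases.acquire("az-1", "latency-monkey"))  # True
```

An agent that fails to acquire a lease simply skips the target for that run, which keeps the combined blast radius of the fleet bounded per target.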

When not to use

  • Only one or two failure modes — the per-agent overhead isn't worth it.
  • The invariants are tightly coupled and can't be drilled independently — a parameterised engine is simpler.
  • Hypothesis-driven drills dominate the workload — a configurable ChAP-shape is a better fit.

Seen in
