PATTERN Cited by 1 source

Simian Army shape

The Simian Army shape is the architectural pattern Netflix introduced in the 2011 TechBlog post "The Netflix Simian Army": a fleet of narrowly-focused agents, each owning one failure mode or one abnormal-condition domain, composed at the fleet level rather than parameterised inside a single engine. The pattern is named after Netflix's Simian Army.

The pattern

Rather than one chaos-engineering engine configured with many failure-mode knobs, build many small single-purpose agents, each of which:

Owns exactly one failure mode or exactly one abnormal-condition domain.
Has its own operational cadence (Chaos Monkey runs frequently; Chaos Gorilla runs occasionally).
Has its own blast-radius discipline (instance / AZ / security-group / cost).
Can be enabled, disabled, or tuned independently.

The composition is the Simian Army: the set of agents running in parallel against the production environment.
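The four properties above can be sketched as a small data structure plus a fleet-level loop. This is a minimal illustration, not Netflix's implementation: the `Agent` fields, the `tick` function, and the cadence values are all hypothetical, chosen only to show that cadence and the on/off switch live on each agent rather than in a shared engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agent:
    """One narrowly-scoped agent. All field names are hypothetical."""
    name: str
    domain: str                    # the single failure mode or condition it owns
    blast_radius: str              # e.g. "instance" or "az"
    period_s: int                  # cadence: minimum seconds between runs
    action: Callable[[], str]      # what the agent does when it runs
    enabled: bool = True           # independent on/off switch
    last_run_s: float = float("-inf")

    def due(self, now_s: float) -> bool:
        # An agent runs only if it is enabled and its own period has elapsed.
        return self.enabled and now_s - self.last_run_s >= self.period_s


def tick(fleet: List[Agent], now_s: float) -> List[str]:
    """Run every agent that is enabled and due; report what each one did."""
    results = []
    for agent in fleet:
        if agent.due(now_s):
            results.append(f"{agent.name}: {agent.action()}")
            agent.last_run_s = now_s
    return results


# Illustrative cadences only; the source says "frequently" and "occasionally".
fleet = [
    Agent("chaos-monkey", "instance termination", "instance", 3600,
          lambda: "terminated one instance"),
    Agent("chaos-gorilla", "AZ outage", "az", 30 * 24 * 3600,
          lambda: "simulated AZ outage"),
]

fleet[1].enabled = False           # pause only Chaos Gorilla
first = tick(fleet, now_s=0)       # Chaos Monkey runs
second = tick(fleet, now_s=60)     # nothing is due yet
```

Disabling one agent leaves the other's schedule untouched, which is the fleet-level composition the pattern describes.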

Canonical Netflix instance

Eight simians, each a single-purpose agent:

Agent	Domain	Injects or detects
systems/netflix-chaos-monkey	Random instance termination	Injects
systems/netflix-latency-monkey	RPC-boundary delay	Injects
systems/netflix-chaos-gorilla	Full AZ outage	Injects
systems/netflix-conformity-monkey	Best-practice drift (e.g. not in ASG)	Detects + acts
systems/netflix-security-monkey	Security-posture drift + cert lifecycle	Detects + acts
systems/netflix-doctor-monkey	Unhealthy-instance detection	Detects + acts
systems/netflix-janitor-monkey	Unused-resource waste	Detects + acts
systems/netflix-10-18-monkey	Localization / i18n drift	Detects + acts

Why narrow agents

Each rule set is small and understandable. One monkey's codebase is small enough for one team to own.
Blast radius is naturally scoped. If one monkey misbehaves, only its failure domain is affected.
Opt-in / opt-out / kill-switch is per-agent. Services can accept Chaos Monkey but not yet Chaos Gorilla while building capacity headroom.
Cadence is per-agent. Frequent-low-blast agents run continuously; rare-high-blast agents run on schedule.
Team boundaries map cleanly. One monkey per domain means one ownership team per monkey.
New failure modes are additive. Adding Latency Monkey after Chaos Monkey doesn't require reworking Chaos Monkey.
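Per-agent opt-in and per-agent kill switches can be sketched as two small lookup structures. Everything here is an assumption for illustration: the `OPT_IN` table, the `KILLED` set, the `eligible_services` helper, and the service names do not come from the source.

```python
from typing import Dict, List, Set

# Hypothetical opt-in table: agent name -> services that have accepted it.
OPT_IN: Dict[str, Set[str]] = {
    "chaos-monkey": {"api", "playback", "billing"},
    "chaos-gorilla": {"api"},      # other services are still building headroom
}

# Hypothetical per-agent kill switches: an agent listed here does nothing.
KILLED: Set[str] = set()


def eligible_services(agent: str, services: List[str]) -> List[str]:
    """Return the services this agent may act on right now."""
    if agent in KILLED:
        return []
    accepted = OPT_IN.get(agent, set())
    return [s for s in services if s in accepted]
```

With this layout, `eligible_services("chaos-gorilla", ["api", "playback"])` yields only `"api"`, and adding `"chaos-gorilla"` to `KILLED` stops that agent without affecting what Chaos Monkey is allowed to touch.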

Contrast: monolithic chaos engine

A single chaos-engineering engine with parameterised failure modes faces structural problems:

Shared kill-switch. Pausing one failure mode pauses all.
Shared blast-radius accounting. It is hard to reason about what a drill really does.
Shared ownership. One team owns all failure modes, so the bus factor is low.
Cross-mode coupling. A bug in one failure mode's code path can affect another.

The Simian Army shape avoids all four by going the other direction: many small agents, each a separable unit of concern.

Vocabulary overload

The Simian Army's unifying abstraction is "automated agent that enforces a property of the production environment." Under this umbrella, Netflix groups two different kinds of agent:

Fault injectors (Chaos Monkey, Latency Monkey, Chaos Gorilla) — actively induce failure.
Drift detectors (Conformity, Security, Janitor, Doctor, 10-18 Monkey) — detect abnormal conditions and correct them.

Both fit the pattern. The pattern is "narrow autonomous agent per operational invariant," not "fault injector per failure mode."
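One way to see why both kinds fit is to give them the same two-step interface: select targets, then act. The sketch below is an assumption-laden illustration; `InvariantAgent`, `RandomTerminator`, `AsgConformity`, the `in_asg` metadata key, and the action strings are hypothetical names, not anything from Netflix's tools.

```python
import random
from typing import Dict, List, Protocol


class InvariantAgent(Protocol):
    """Hypothetical common interface: pick targets, then act on each."""
    def select(self, instances: Dict[str, dict]) -> List[str]: ...
    def act(self, instance_id: str) -> str: ...


class RandomTerminator:
    """Fault injector: picks a victim regardless of its state."""
    def __init__(self, rng: random.Random) -> None:
        self.rng = rng

    def select(self, instances: Dict[str, dict]) -> List[str]:
        return [self.rng.choice(sorted(instances))] if instances else []

    def act(self, instance_id: str) -> str:
        return f"terminate {instance_id}"


class AsgConformity:
    """Drift detector: picks only instances that violate one rule."""
    def select(self, instances: Dict[str, dict]) -> List[str]:
        return [i for i, meta in sorted(instances.items())
                if not meta.get("in_asg")]

    def act(self, instance_id: str) -> str:
        return f"flag {instance_id}: not in an auto scaling group"


def run(agent: InvariantAgent, instances: Dict[str, dict]) -> List[str]:
    """Same driver for both kinds of agent."""
    return [agent.act(i) for i in agent.select(instances)]
```

The only difference between the two classes is how `select` chooses: the injector ignores instance state, while the detector filters on it. The driver does not need to know which kind it is running.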

Later variants

Netflix's later work on fault injection, including FIT (Failure Injection Testing), ChAP (Chaos Automation Platform), and the broader Principles of Chaos Engineering movement, added hypothesis-driven, targeted drills alongside the random-and-continuous Simian Army approach. These newer tools are not Simian-Army-shaped (ChAP is centralised, not a fleet); they complement the Simian Army rather than replace it. The Simian Army shape remains the baseline for continuous operational-invariant enforcement; the ChAP shape is the baseline for one-off hypothesis tests.

When to use

You have many distinct operational invariants that need continuous enforcement.
Each invariant has a clear enforcement action (inject failure, terminate offender, clean up).
Blast radius can be bounded per invariant, which includes keeping drills for different invariants from hitting the same target at the same time.
You have graceful-degradation discipline already in place — otherwise the drills cause outages, not learning.
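Because the agents run independently, something has to stop two of them from drilling the same target at once. A minimal sketch of that idea follows, assuming a hypothetical `TargetClaims` registry held in memory by a single process; a real fleet would need a shared store, and the target and agent names are illustrative.

```python
from typing import Dict, Optional


class TargetClaims:
    """Hypothetical registry: at most one agent may hold a target at a time."""

    def __init__(self) -> None:
        self._holder: Dict[str, str] = {}

    def claim(self, target: str, agent: str) -> bool:
        """Return True if the agent now holds the target, False if another does."""
        holder = self._holder.get(target)
        if holder is not None and holder != agent:
            return False
        self._holder[target] = agent
        return True

    def release(self, target: str, agent: str) -> None:
        """Release the target, but only if this agent is the one holding it."""
        if self._holder.get(target) == agent:
            del self._holder[target]

    def holder(self, target: str) -> Optional[str]:
        return self._holder.get(target)


claims = TargetClaims()
gorilla_ok = claims.claim("us-east-1a", "chaos-gorilla")    # first claim wins
latency_ok = claims.claim("us-east-1a", "latency-monkey")   # refused while held
claims.release("us-east-1a", "chaos-gorilla")
latency_retry = claims.claim("us-east-1a", "latency-monkey")
```

Each agent would claim its target before acting and release it afterwards, so the bound is enforced at the fleet level without the agents knowing about each other.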

When not to use

You have only one or two failure modes, so the per-agent overhead isn't worth it.
The invariants are tightly coupled and can't be drilled independently, so a parameterised engine is simpler.
Hypothesis-driven drills dominate the workload, so a configurable ChAP-shaped platform is a better fit.

Seen in

sources/2026-01-02-netflix-the-netflix-simian-army — the canonical founding reference.