CONCEPT Cited by 1 source
Chaos engineering¶
Chaos engineering is the discipline of continuously inducing controlled failures in a production system to verify that its fault-tolerance design actually works. The practice predates the term: Netflix's 2011 Simian Army is the canonical origin, and the practice was named as a discipline only around 2016.
Definition¶
"The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime … we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. … But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these 'once in a blue moon' failures." — Izrailevsky & Tseitlin, The Netflix Simian Army (2011).
The core claim: fault tolerance is a property of the exercised architecture, not the designed architecture. Without continuous exercise, resilience features atrophy (libraries update, configs drift, new services don't implement the right timeouts, Auto Scaling group (ASG) capacity floors creep down) and the first real failure exposes the rot.
Why chaos engineering is cloud-native¶
The flat-tire analogy from the 2011 post: practising a spare-tire change every Sunday "is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud". Chaos engineering is cheap only when:
- Instance / AZ / region failure is a well-defined primitive the platform exposes.
- The fleet is designed for horizontal redundancy, so that losing a single instance is assumed to have no customer impact.
- Observability is good enough to detect drill-caused degradation before it reaches customers.
All three preconditions exist in a cloud-native architecture. None of the three is cheap in physical-datacenter operations.
Design primitives¶
- concepts/random-instance-failure-injection — the Chaos Monkey primitive; kill a random instance, verify survival.
- concepts/availability-zone-failure-drill — the Chaos Gorilla primitive; fail a whole AZ, verify automatic re-balance.
- Latency injection — inject artificial delays at RPC boundaries to simulate dependency degradation or outage without instance teardown. See systems/netflix-latency-monkey.
- patterns/continuous-fault-injection-in-production — the scheduling discipline: run drills in business hours, with engineers on hand, in production, so gaps surface where they can be analysed.
- patterns/simian-army-shape — the architectural shape: a fleet of narrowly-focused chaos agents, each owning one failure mode, composed at the fleet level rather than parameterised in one engine.
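A minimal sketch of two of the primitives above, random instance termination and latency injection, assuming a toy in-memory fleet. The `Fleet`, `Instance`, `kill_random_instance`, and `with_injected_latency` names are hypothetical and are not taken from any Netflix tool; a real agent would call the cloud provider's termination API and hook into the RPC layer instead.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a horizontally redundant service group."""
    instances: List[Instance] = field(default_factory=list)

    def healthy_instances(self) -> List[Instance]:
        return [i for i in self.instances if i.healthy]

    def serve(self) -> bool:
        # Assumption: the service is available while any healthy instance remains.
        return len(self.healthy_instances()) > 0


def kill_random_instance(fleet: Fleet, rng: random.Random) -> Optional[Instance]:
    """Chaos Monkey-style primitive: mark one randomly chosen healthy instance as dead."""
    candidates = fleet.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def with_injected_latency(
    call: Callable,
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable:
    """Latency-injection primitive: delay a dependency call without tearing anything down."""
    def wrapped(*args, **kwargs):
        sleep(delay_seconds)  # artificial delay at the call boundary
        return call(*args, **kwargs)
    return wrapped


if __name__ == "__main__":
    fleet = Fleet([Instance("i-1"), Instance("i-2"), Instance("i-3")])
    victim = kill_random_instance(fleet, random.Random())
    print("terminated:", victim.instance_id, "| still serving:", fleet.serve())
```

Each primitive is a separate small function, which loosely mirrors the one-agent-per-failure-mode shape: the survival check (`fleet.serve()`) stays the same whichever failure is injected.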
Prerequisites for chaos engineering¶
Chaos engineering in production is safe only if:
- The fleet is designed for graceful degradation (concepts/graceful-degradation) — the failure mode the drill induces is one the architecture has already planned for.
- Blast radius (concepts/blast-radius) is bounded — a misbehaving chaos agent cannot cause a cascading outage.
- Observability is sufficient to distinguish drill-caused degradation from real customer impact, and to abort the drill automatically if customer impact exceeds a threshold.
- Engineers are on-call and observing during drills — the Netflix posture: "in a carefully monitored environment with engineers standing by to address any problems."
Without these, the drill is the outage.
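The blast-radius and automatic-abort prerequisites can be sketched as a guard around the drill loop. This is an illustrative sketch, not a production controller: `run_guarded_drill`, `max_blast_fraction`, and `abort_threshold` are hypothetical names, the default values are arbitrary, and `error_rate` stands in for whatever customer-impact signal the observability stack provides.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DrillResult:
    steps_run: int   # number of injections actually performed
    aborted: bool    # True if the customer-impact threshold was exceeded


def run_guarded_drill(
    fleet_size: int,
    targets: Iterable[str],
    inject: Callable[[str], None],
    error_rate: Callable[[], float],
    max_blast_fraction: float = 0.1,   # assumed cap: at most 10% of the fleet
    abort_threshold: float = 0.01,     # assumed cap: abort above 1% errors
) -> DrillResult:
    """Inject failures one target at a time, bounded in scope and abortable."""
    # Blast-radius bound: never touch more than a fixed fraction of the fleet.
    limit = int(fleet_size * max_blast_fraction)
    steps = 0
    for target in list(targets)[:limit]:
        # Automatic abort: check customer impact before every new injection.
        if error_rate() > abort_threshold:
            return DrillResult(steps_run=steps, aborted=True)
        inject(target)
        steps += 1
    # Final check so impact caused by the last injection is also reported.
    return DrillResult(steps_run=steps, aborted=error_rate() > abort_threshold)


if __name__ == "__main__":
    killed = []
    result = run_guarded_drill(
        fleet_size=20,
        targets=["i-1", "i-2", "i-3", "i-4", "i-5"],
        inject=killed.append,
        error_rate=lambda: 0.0,
    )
    print(result, killed)
```

The sketch only stops further injections when the threshold is crossed; it does not roll anything back, so restoring capacity is left to the platform or to the engineers standing by.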
Related disciplines¶
- Game days — scheduled, manual chaos engineering; the large-surface-area cousin of automated Simian Army drills.
- Failure testing in CI — integration-test-scale fault injection; lower blast radius, lower realism.
- Formal methods (e.g. concepts/lightweight-formal-verification) — prove the architecture can survive failures; chaos engineering validates it empirically. The two are complementary: formal methods catch what you can model; chaos engineering catches what you can't.
- Disaster recovery — operator-driven recovery from failures the architecture did not tolerate. Chaos engineering is a preventive practice; DR is a remediation practice.
Seen in¶
- sources/2026-01-02-netflix-the-netflix-simian-army — the canonical foundational post. Eight named simians (Chaos Monkey, Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla) each implement one slice of the discipline.
Related¶
- systems/netflix-simian-army
- systems/netflix-chaos-monkey
- systems/netflix-chaos-gorilla
- concepts/random-instance-failure-injection
- concepts/availability-zone-failure-drill
- concepts/graceful-degradation
- concepts/blast-radius
- concepts/grey-failure
- patterns/continuous-fault-injection-in-production
- patterns/simian-army-shape