CONCEPT Cited by 4 sources
Chaos engineering
Chaos engineering is the discipline of continuously inducing controlled failures in a production system to verify that its fault-tolerance design actually works. The practice predates the term: Netflix's 2011 Simian Army is the canonical origin, while the discipline itself was named only around 2016.
Definition
"The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime … we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system. … But just designing a fault tolerant architecture is not enough. We have to constantly test our ability to actually survive these 'once in a blue moon' failures." — Izrailevsky & Tseitlin, The Netflix Simian Army (2011).
The core claim: fault tolerance is a property of the exercised architecture, not the designed architecture. Without continuous exercise, resilience features atrophy (libraries update, configs drift, new services don't implement the right timeouts, ASG capacity floors creep down) and the first real failure exposes the rot.
Why chaos engineering is cloud-native
The flat-tire analogy from the 2011 post: practising a spare-tire change every Sunday "is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud". Chaos engineering is cheap only when:
- Instance / AZ / region failure is a well-defined primitive the platform exposes.
- The fleet is designed for horizontal redundancy, so that losing any single instance is assumed to have no customer impact.
- Observability is good enough to detect drill-caused degradation before it reaches customers.
All three preconditions hold in a cloud-native architecture. None of the three is cheap in physical-datacenter operations.
Design primitives
- concepts/random-instance-failure-injection — the Chaos Monkey primitive; kill a random instance, verify survival.
- concepts/availability-zone-failure-drill — the Chaos Gorilla primitive; fail a whole AZ, verify automatic re-balance.
- Latency injection — inject artificial delays at RPC boundaries to simulate dependency degradation or outage without instance teardown. See systems/netflix-latency-monkey.
- patterns/continuous-fault-injection-in-production — the scheduling discipline: run drills in business hours, with engineers on hand, in production, so gaps surface where they can be analysed.
- patterns/simian-army-shape — the architectural shape: a fleet of narrowly-focused chaos agents, each owning one failure mode, composed at the fleet level rather than parameterised in one engine.
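The primitives above can be sketched as a small simulation. This is a minimal illustration only, not Netflix's implementation: `Instance`, `Fleet`, `InstanceKiller`, `LatencyInjector`, and `run_drill` are hypothetical names, and the fleet is an in-memory stand-in rather than a real cloud API. It shows the shape of narrowly-focused agents, each owning one failure mode, composed at the fleet level.

```python
import random
import time
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Instance:
    """Hypothetical stand-in for one running service instance."""
    name: str
    alive: bool = True
    extra_latency_s: float = 0.0  # artificial delay added to every call

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        time.sleep(self.extra_latency_s)
        return f"{self.name} served {request}"


@dataclass
class Fleet:
    instances: list[Instance]

    def healthy(self) -> list[Instance]:
        return [i for i in self.instances if i.alive]

    def serve(self, request: str) -> str:
        # Naive failover: try instances in order until one answers.
        for inst in self.instances:
            try:
                return inst.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance left")


class ChaosAgent(Protocol):
    """Each agent owns exactly one failure mode."""
    def strike(self, fleet: Fleet, rng: random.Random) -> str: ...


class InstanceKiller:
    """Random instance failure: terminate one healthy instance."""
    def strike(self, fleet: Fleet, rng: random.Random) -> str:
        victim = rng.choice(fleet.healthy())
        victim.alive = False
        return f"killed {victim.name}"


class LatencyInjector:
    """Latency injection: slow one instance without tearing it down."""
    def __init__(self, delay_s: float) -> None:
        self.delay_s = delay_s

    def strike(self, fleet: Fleet, rng: random.Random) -> str:
        victim = rng.choice(fleet.healthy())
        victim.extra_latency_s = self.delay_s
        return f"slowed {victim.name} by {self.delay_s}s"


def run_drill(fleet: Fleet, agents: Sequence[ChaosAgent],
              rng: random.Random) -> list[str]:
    """Let each agent strike once, probing the fleet after every strike."""
    log = []
    for agent in agents:
        log.append(agent.strike(fleet, rng))
        fleet.serve("probe")  # raises if the fleet did not survive
    return log


if __name__ == "__main__":
    fleet = Fleet([Instance("i-1"), Instance("i-2"), Instance("i-3")])
    for line in run_drill(fleet, [InstanceKiller(), LatencyInjector(0.01)],
                          random.Random(7)):
        print(line)
```

Adding a new failure mode in this sketch means adding a new agent class rather than a new parameter on an existing one, which is the composition choice the simian-army shape describes.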
Prerequisites for chaos engineering
Chaos engineering in production is safe only if:
- The fleet is designed for graceful degradation (concepts/graceful-degradation) — the failure mode the drill induces is one the architecture has already planned for.
- Blast radius (concepts/blast-radius) is bounded — a misbehaving chaos agent cannot cause a cascading outage.
- Observability is sufficient to distinguish drill-caused degradation from real customer impact, and to abort the drill automatically if customer impact exceeds a threshold.
- Engineers are on-call and observing during drills — the Netflix posture: "in a carefully monitored environment with engineers standing by to address any problems."
Without these, the drill is the outage.
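Two of these prerequisites, a bounded blast radius and an automatic abort on customer impact, can be expressed as guards around the injection loop. The sketch below is an assumption-laden illustration: `run_guarded_drill`, `DrillResult`, the callbacks, and the default thresholds (10% of the fleet, 2% error rate) are all hypothetical and not taken from any source cited here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DrillResult:
    injected: list[str]  # targets that actually received a fault
    aborted: bool
    reason: str


def run_guarded_drill(
    targets: Sequence[str],
    fleet_size: int,
    inject: Callable[[str], None],      # hypothetical: apply a fault to one target
    rollback: Callable[[str], None],    # hypothetical: undo that fault
    error_rate: Callable[[], float],    # hypothetical: current customer error rate
    max_blast_fraction: float = 0.1,    # assumed cap on the share of the fleet touched
    abort_error_rate: float = 0.02,     # assumed customer-impact threshold
) -> DrillResult:
    # Blast-radius guard: refuse to start if the drill would touch too much.
    if len(targets) / fleet_size > max_blast_fraction:
        return DrillResult([], True, "blast radius exceeds limit")

    injected: list[str] = []
    aborted, reason = False, "completed"
    try:
        for target in targets:
            inject(target)
            injected.append(target)
            # Abort guard: stop as soon as customers are measurably affected.
            if error_rate() > abort_error_rate:
                aborted, reason = True, "customer error rate above threshold"
                break
    finally:
        # Always undo injected faults, whether the drill finished or aborted.
        for done in reversed(injected):
            rollback(done)
    return DrillResult(injected, aborted, reason)
```

In this sketch the rollback runs in a `finally` block so that an exception raised by the injector or the metrics callback still leaves the fleet clean.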
Related disciplines
- Game days — scheduled, manual chaos engineering; the large-surface-area cousin of automated Simian Army drills.
- Failure testing in CI — integration-test-scale fault injection; lower blast radius, lower realism.
- Formal methods (e.g. concepts/lightweight-formal-verification) — prove the architecture can survive failures; chaos engineering validates it empirically. The two are complementary: formal methods catch what you can model; chaos engineering catches what you can't.
- Disaster recovery — operator-driven recovery from failures the architecture did not tolerate. Chaos engineering is a preventive practice; DR is a remediation practice.
Seen in
- sources/2026-01-02-netflix-the-netflix-simian-army — the canonical foundational post. Eight named simians (Chaos Monkey, Latency Monkey, Conformity Monkey, Doctor Monkey, Janitor Monkey, Security Monkey, 10-18 Monkey, Chaos Gorilla) each implement one slice of the discipline.
- sources/2026-04-28-expedia-expedias-service-telemetry-analyzer — Expedia's chaos-engineering platform as complement to LLM-based RCA. Expedia's 2018 chaos-engineering platform "lacked a mechanism for the automatic evaluation of experimental results"; the 2026 STAR service fills that gap: STAR analyses the observability signals produced by an injected failure and drafts a structured assessment. Canonical wiki instance of chaos-engineering paired with an LLM-based experiment-result evaluator downstream, rather than relying on humans to interpret the blast-radius signals.
- sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — Canonical wiki instance of chaos-engineering on a managed-Postgres release-gate at three escalating altitudes:
  - (1) Per-component: "We deploy the release to a real cluster, drive it with a mix of agentic and non-agentic OLTP and OLAP workloads at stress-level concurrency, and then start breaking things underneath. We kill processes, shoot down nodes, inject network failures, wipe disk contents, and restart components in loops, all while the workload keeps running."
  - (2) Code-level via failpoints: "We use failpoints liberally in our code to inject hard-to-reproduce errors, such as a crash at the worst possible time. This is driven by an internal fault-injection framework that can target a single process or coordinate cluster-wide faults across an entire cell."
  - (3) Whole-AZ network-partition (in flight): "We're now taking this one level up, from component-level chaos to whole-AZ down simulations… we programmatically disconnect an availability zone's network from the rest of the cluster and observe how the system reacts."

  The drill regime is paired with correctness validators SQLancer + SQLsmith running concurrently with fault-injection: "While failure injection is running, we validate internal data consistency, that no committed transaction is lost, and that every component recovers to a consistent state on its own." First wiki canonicalisation of the chaos + correctness-validator coupling as a release-gate substrate, distinct from Netflix's monitoring-as-validation approach. Three new wiki canonicalisations:
  - (a) failpoints as the in-code injection primitive distinct from Netflix's external Simian Army;
  - (b) whole-AZ network partition (live-but-unreachable) distinct from Chaos Gorilla's AZ-instance-failure (gone);
  - (c) 30-second per-database outage target as the drill's pass criterion; composes with per-database availability attainment as the production SLO substrate.
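
The failpoint idea in the Databricks entry, a named hook in the code that a test can arm to force a crash at a chosen moment, can be illustrated with a toy example. This is a sketch under stated assumptions and not the Databricks framework: `failpoint`, `armed`, and `TinyStore` are hypothetical names, and the "crash" is simulated with an exception. The final check mirrors the quoted validation that no committed write is lost after recovery.

```python
from contextlib import contextmanager
from typing import Iterator

# Registry of currently armed failpoints (hypothetical, process-local).
_ACTIVE: dict[str, Exception] = {}


def failpoint(name: str) -> None:
    """Raise the configured error if this failpoint is armed; otherwise no-op."""
    err = _ACTIVE.get(name)
    if err is not None:
        raise err


@contextmanager
def armed(name: str, err: Exception) -> Iterator[None]:
    """Arm a failpoint for the duration of a with-block."""
    _ACTIVE[name] = err
    try:
        yield
    finally:
        _ACTIVE.pop(name, None)


class TinyStore:
    """Toy store: an append-only log plus a derived in-memory view."""

    def __init__(self) -> None:
        self.log: list[tuple[str, int]] = []   # committed writes
        self.state: dict[str, int] = {}        # view rebuilt from the log

    def put(self, key: str, value: int) -> None:
        self.log.append((key, value))          # commit point
        failpoint("after-log-append")          # simulated crash before apply
        self.state[key] = value

    def recover(self) -> None:
        # Replay the log; later entries for a key override earlier ones.
        self.state = dict(self.log)


if __name__ == "__main__":
    store = TinyStore()
    store.put("a", 1)
    with armed("after-log-append", RuntimeError("injected crash")):
        try:
            store.put("b", 2)
        except RuntimeError:
            pass  # the write was logged but never applied
    store.recover()
    # Correctness validation: every committed write is visible after recovery.
    assert store.state == dict(store.log)
```

Unlike an external agent that kills a process at an arbitrary time, the failpoint in this sketch fires at one exact line, which is what makes a hard-to-reproduce interleaving repeatable in a test.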
Related
- systems/netflix-simian-army
- systems/netflix-chaos-monkey
- systems/netflix-chaos-gorilla
- systems/expedia-star — LLM-based experiment-result evaluator paired with chaos-engineering at Expedia.
- concepts/random-instance-failure-injection
- concepts/availability-zone-failure-drill
- concepts/graceful-degradation
- concepts/blast-radius
- concepts/grey-failure
- concepts/automated-root-cause-analysis — the discipline the Expedia chaos-engineering + STAR composition realises.
- patterns/continuous-fault-injection-in-production
- patterns/simian-army-shape
- patterns/multi-step-rca-workflow — STAR's workflow shape, applied to chaos-experiment results.