SYSTEM Cited by 1 source
Netflix Chaos Gorilla
Chaos Gorilla is the Simian Army member that simulates the outage of an entire Amazon availability zone. It was introduced in Netflix's 2011 TechBlog post (Source: sources/2026-01-02-netflix-the-netflix-simian-army) as the big-brother version of systems/netflix-chaos-monkey.
Purpose
"Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention."
Why a separate tool from Chaos Monkey
Chaos Monkey validates instance-level failure tolerance. That is a necessary but insufficient condition for AZ-level failure tolerance:
- A service that spreads three replicas across three AZs can absorb the loss of any single instance and still look healthy; losing a whole AZ takes out one-third of every service's capacity at once.
- Auto-scaling-group rebalancing under AZ loss depends on the remaining AZs having capacity headroom and correct subnet configuration.
- Load balancers (ELB at the time) must route around the failed AZ without requiring operator intervention.
Chaos Gorilla is the drill that forces these AZ-failure-domain invariants to be true in practice, not just on paper. See concepts/availability-zone-failure-drill for the generalised concept.
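The capacity arithmetic behind the first two points can be made explicit. This is a sketch under a simplifying assumption that the 2011 post does not state: load is spread evenly across N AZs, and each AZ runs at utilisation u of its provisioned capacity. When one AZ is lost, its load is shared among the N − 1 survivors, so their utilisation rises to u':

```latex
u' = u \cdot \frac{N}{N-1},
\qquad
u' \le 1 \iff u \le \frac{N-1}{N}
```

For N = 3 this gives u ≤ 2/3: each AZ has to run at or below roughly 67% utilisation beforehand, because every surviving AZ must absorb 1.5 times its normal load. A fleet that passes instance-level drills while running each AZ at 80% would fail this bound.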
Design goals (from the 2011 post)
- Automatic re-balance — "services automatically re-balance to the functional availability zones".
- No user-visible impact — the drill is a success only if customers don't notice.
- No manual intervention — the fleet-wide response should be entirely automated; if operators have to step in, the drill has found a gap.
These three criteria are the AZ-failure tolerance contract Netflix commits its services to meeting.
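The three criteria can be read as a pass/fail check on a drill. The sketch below is illustrative only: `DrillObservation`, its fields, and the `error_tolerance` threshold are hypothetical names invented here, and the post does not say how Netflix measures any of these criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrillObservation:
    """Hypothetical summary of one AZ-failure drill; every field is illustrative."""
    traffic_rebalanced: bool    # surviving AZs picked up the failed AZ's traffic
    baseline_error_rate: float  # user-facing error rate before the drill
    drill_error_rate: float     # user-facing error rate during the drill
    manual_actions: int         # operator interventions needed during the drill


def failed_criteria(obs: DrillObservation, error_tolerance: float = 0.001) -> List[str]:
    """Return the names of the criteria the drill violated (empty list means pass)."""
    failures = []
    if not obs.traffic_rebalanced:
        failures.append("automatic re-balance")
    # "No user-visible impact" is not operationalised in the post; using an
    # error-rate delta with a fixed tolerance is an assumption of this sketch.
    if obs.drill_error_rate - obs.baseline_error_rate > error_tolerance:
        failures.append("no user-visible impact")
    if obs.manual_actions > 0:
        failures.append("no manual intervention")
    return failures


# A drill where traffic re-balanced and errors stayed flat, but an operator stepped in once.
obs = DrillObservation(
    traffic_rebalanced=True,
    baseline_error_rate=0.002,
    drill_error_rate=0.0025,
    manual_actions=1,
)
print(failed_criteria(obs))
```

Returning the list of violated criteria, rather than a single boolean, keeps the "drill has found a gap" outcome actionable: each entry names the part of the contract that needs work.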
Implementation gaps in the 2011 post
- AZ-outage simulation mechanism undocumented (block network? terminate all instances in AZ? rely on AWS primitive?).
- Cadence undocumented.
- Blast-radius controls undocumented.
- Integration with AWS-side AZ failure signals not mentioned.
- Graceful degradation threshold ("no user-visible impact") not operationalised.
Prerequisites
Chaos Gorilla assumes the fleet already has:
- concepts/availability-zone-balance — replicas spread across AZs with enough capacity headroom to absorb one AZ's loss.
- concepts/graceful-degradation on dependency failures, so that partial AZ loss propagates as degraded-but-serving rather than as a cascading outage.
- patterns/multi-cluster-active-active-redundancy — the active-active AZ topology Chaos Gorilla validates.
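The balance prerequisite can be checked before a drill is ever run. The following is a minimal sketch, not anything described in the post: the function name, the `placement` mapping, and the `required_replicas` parameter are all hypothetical, and it assumes capacity can be counted in interchangeable replicas.

```python
from collections import Counter
from typing import Dict


def survives_any_single_az_loss(placement: Dict[str, str], required_replicas: int) -> bool:
    """Check whether enough replicas remain after losing any one AZ.

    placement maps a replica id to the AZ it runs in; required_replicas is the
    minimum number of replicas assumed necessary to serve peak load.
    """
    per_az = Counter(placement.values())
    if not per_az:
        return required_replicas <= 0
    total = sum(per_az.values())
    # The worst case is losing the AZ that holds the most replicas.
    return total - max(per_az.values()) >= required_replicas


# Six replicas spread evenly: losing any AZ leaves four.
balanced = {
    "r1": "us-east-1a", "r2": "us-east-1a",
    "r3": "us-east-1b", "r4": "us-east-1b",
    "r5": "us-east-1c", "r6": "us-east-1c",
}
# Six replicas skewed towards one AZ: losing it leaves only two.
skewed = {
    "r1": "us-east-1a", "r2": "us-east-1a", "r3": "us-east-1a", "r4": "us-east-1a",
    "r5": "us-east-1b",
    "r6": "us-east-1c",
}
print(survives_any_single_az_loss(balanced, required_replicas=4))
print(survives_any_single_az_loss(skewed, required_replicas=4))
```

Both placements have the same total replica count, which is why a total-capacity check alone is not enough: only the per-AZ distribution shows whether one AZ's loss is survivable.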
Operational numbers
None disclosed.
Seen in
- sources/2026-01-02-netflix-the-netflix-simian-army — the canonical founding reference.
Related
- companies/netflix
- systems/netflix-simian-army
- systems/netflix-chaos-monkey
- concepts/chaos-engineering
- concepts/availability-zone-failure-drill
- concepts/availability-zone-balance
- concepts/graceful-degradation
- patterns/continuous-fault-injection-in-production
- patterns/multi-cluster-active-active-redundancy