SYSTEM Cited by 1 source

Netflix Chaos Gorilla

Chaos Gorilla is the Simian Army member that simulates the outage of an entire Amazon availability zone. It was introduced in Netflix's 2011 TechBlog post (Source: sources/2026-01-02-netflix-the-netflix-simian-army) as the big-brother version of systems/netflix-chaos-monkey.

Purpose

"Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention."

Why a separate tool from Chaos Monkey

Chaos Monkey validates instance-level failure tolerance. That is a necessary but insufficient condition for AZ-level failure tolerance:

  • A service that spreads three replicas across three AZs can absorb the loss of any single instance and still look healthy; losing a whole AZ takes out roughly one-third of every service's capacity at the same moment.
  • Auto-scaling-group rebalancing under AZ loss depends on the remaining AZs having capacity headroom and correct subnet configuration.
  • Load balancers (ELB at the time) must route around the failed AZ without requiring operator intervention.

Chaos Gorilla is the drill that forces these AZ-failure-domain invariants to be true in practice, not just on paper. See concepts/availability-zone-failure-drill for the generalised concept.
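
The capacity-headroom point above comes down to simple arithmetic, sketched below. This is an illustration only, not Netflix's tooling: the function names, the AZ names, and the capacity units are hypothetical, and the sketch assumes load can be spread freely across the surviving AZs with no scale-out during the outage.

```python
def max_safe_utilization(num_azs: int) -> float:
    """Highest steady-state utilization that still fits after losing one AZ.

    Assumes capacity is spread evenly across AZs (an assumption of this sketch).
    """
    if num_azs < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return (num_azs - 1) / num_azs


def survives_az_loss(capacity_by_az: dict[str, float],
                     total_load: float,
                     failed_az: str) -> bool:
    """Return True if the surviving AZs can carry the whole load.

    Capacity and load share one arbitrary unit (e.g. requests per second).
    """
    if failed_az not in capacity_by_az:
        raise KeyError(f"unknown AZ: {failed_az}")
    remaining = sum(cap for az, cap in capacity_by_az.items() if az != failed_az)
    return remaining >= total_load


# Hypothetical fleet: three AZs with equal capacity.
fleet = {"us-east-1a": 100.0, "us-east-1b": 100.0, "us-east-1c": 100.0}
```

With this fleet, a load of 180 fits in the two surviving AZs, while a load of 240 does not; for three evenly sized AZs the steady-state utilization has to stay at or below two-thirds for the drill to succeed without adding capacity.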

Design goals (from the 2011 post)

  • Automatic re-balance: "services automatically re-balance to the functional availability zones".
  • No user-visible impact: the drill is a success only if customers don't notice.
  • No manual intervention: the fleet-wide response should be entirely automated; if operators have to step in, the drill has found a gap.

These three criteria are the AZ-failure tolerance contract Netflix commits its services to meeting.
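
One way to make the three criteria checkable is sketched below. The 2011 post does not define any of these measurements, so every field name and the error-rate threshold are assumptions made for illustration, not a description of how Netflix scores a drill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillObservation:
    """Hypothetical measurements taken once the drill has settled."""
    traffic_share_in_failed_az: float  # fraction of requests still sent to the failed AZ
    baseline_error_rate: float         # user-facing error rate before the drill
    drill_error_rate: float            # user-facing error rate during the drill
    manual_interventions: int          # operator actions needed to recover


def evaluate_drill(obs: DrillObservation,
                   max_error_rate_increase: float = 0.001) -> dict[str, bool]:
    """Map each design goal to a pass/fail flag.

    max_error_rate_increase is an assumed stand-in for "no user-visible
    impact"; the source does not operationalise that threshold.
    """
    increase = obs.drill_error_rate - obs.baseline_error_rate
    return {
        "automatic_rebalance": obs.traffic_share_in_failed_az == 0.0,
        "no_user_visible_impact": increase <= max_error_rate_increase,
        "no_manual_intervention": obs.manual_interventions == 0,
    }


def drill_passed(obs: DrillObservation,
                 max_error_rate_increase: float = 0.001) -> bool:
    """The drill succeeds only if all three criteria hold."""
    return all(evaluate_drill(obs, max_error_rate_increase).values())
```

Keeping the per-criterion flags separate from the overall verdict means a failed drill reports which part of the contract was broken, for example an operator having to step in even though customers saw no errors.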

Implementation gaps in the 2011 post

  • AZ-outage simulation mechanism undocumented (block network traffic? terminate all instances in the AZ? rely on an AWS primitive?).
  • Cadence undocumented.
  • Blast-radius controls undocumented.
  • Integration with AWS-side AZ failure signals not mentioned.
  • Graceful degradation threshold ("no user-visible impact") not operationalised.

Prerequisites

Chaos Gorilla assumes the fleet already has:

Operational numbers

None disclosed.

Seen in
