CONCEPT Cited by 1 source

Availability zone failure drill

An availability zone failure drill is a chaos-engineering exercise that simulates the complete loss of a single cloud availability zone and verifies the fleet re-balances to remaining AZs without customer impact and without manual operator action. Canonical implementation: Netflix's Chaos Gorilla (2011).

The drill

"Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention." — Netflix, The Netflix Simian Army (2011).

Three success criteria:

Automatic re-balance — remaining AZs absorb traffic from the failed AZ.
No user-visible impact — customers do not perceive the drill.
No manual intervention — operators do not have to act; the fleet's designed-in response is sufficient.

If any of the three fails, the drill has found an architecture gap.
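The three criteria can be checked mechanically once a drill has run. The following is a minimal sketch, not a description of Netflix's tooling: the `DrillObservation` fields, the `evaluate_drill` name, and both thresholds are hypothetical and would need to be mapped onto whatever metrics a real fleet exposes.

```python
from dataclasses import dataclass


@dataclass
class DrillObservation:
    """Hypothetical summary of what was measured during one drill."""
    surviving_az_traffic_share: float  # fraction of total traffic served by surviving AZs
    customer_error_rate: float         # fraction of customer requests that failed
    manual_actions: int                # operator interventions recorded during the drill


def evaluate_drill(obs, min_share=0.999, max_error_rate=0.001):
    """Return the names of the failed criteria; an empty list means the drill passed.

    The default thresholds are illustrative assumptions, not recommended values.
    """
    failures = []
    if obs.surviving_az_traffic_share < min_share:
        failures.append("automatic re-balance")
    if obs.customer_error_rate > max_error_rate:
        failures.append("no user-visible impact")
    if obs.manual_actions > 0:
        failures.append("no manual intervention")
    return failures
```

A non-empty result names the criterion that failed, which points at the architecture gap to investigate.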

Why an AZ drill is distinct from an instance drill

Single-instance fault injection (concepts/random-instance-failure-injection) is a necessary but insufficient test for AZ resilience. Several failure modes become visible only under AZ-scale load shift:

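Capacity headroom — when AZ-1 goes down in a three-AZ fleet, AZ-2 and AZ-3 each go from serving one third of total traffic to one half, a 50% increase in their own load. Do they have enough free capacity to absorb it, or does their utilisation climb to 100% and cascade into their own degradation?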
ASG replacement behaviour — does the auto-scaling group try to recover capacity in the failed AZ (where subnets are dead) or correctly shift replacement into surviving AZs?
Load-balancer routing — does the LB detect the AZ-wide unhealthy state and stop routing there, or does it keep trying and add latency to every request via retry?
Shared dependencies in the failed AZ — databases, caches, service-discovery nodes. A service that is "multi-AZ" but depends on a single-AZ backing store is not actually AZ-tolerant.
Asymmetric deployment topology — if one AZ hosts a unique singleton (admin console, control plane), losing it takes that function out even if the serving path is redundant.

An AZ drill surfaces all of these; an instance drill surfaces none of them.
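The capacity-headroom question can be estimated before running a drill. The sketch below assumes the failed AZ's load is spread evenly across the survivors, which is a simplification; real load balancers may distribute it differently. The function name and the AZ labels are hypothetical.

```python
def utilisation_after_az_loss(load_by_az, capacity_by_az, failed_az):
    """Project each surviving AZ's utilisation after failed_az is lost.

    Assumes the failed AZ's load is redistributed evenly across survivors.
    Load and capacity must use the same unit (for example requests per second).
    """
    survivors = [az for az in load_by_az if az != failed_az]
    if not survivors:
        raise ValueError("no surviving AZs")
    extra_per_survivor = load_by_az[failed_az] / len(survivors)
    return {
        az: (load_by_az[az] + extra_per_survivor) / capacity_by_az[az]
        for az in survivors
    }


# Three AZs, each carrying 600 units of load against 1000 units of capacity.
load = {"az-1": 600, "az-2": 600, "az-3": 600}
capacity = {"az-1": 1000, "az-2": 1000, "az-3": 1000}
projected = utilisation_after_az_loss(load, capacity, "az-1")
```

In this example each survivor is projected at 90% utilisation. Raising the per-AZ load to 700 pushes the projection above 100%, which is the cascade scenario described above.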

Prerequisites

An AZ drill is safe only if:

AZ-level redundancy is real. See concepts/availability-zone-balance and patterns/multi-cluster-active-active-redundancy.
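Capacity headroom is sized for AZ loss. With three evenly loaded AZs, steady-state utilisation must stay at or below two thirds, so roughly one third of total fleet capacity sits idle; more generally, with N evenly loaded AZs the idle fraction is 1/N.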
Graceful degradation paths work (concepts/graceful-degradation).
Blast radius of the drill is bounded (concepts/blast-radius) — a misbehaving drill cannot propagate across AZ boundaries.
Observability is sufficient to abort the drill if customer impact exceeds threshold.

Practical implementations

The 2011 post doesn't describe Chaos Gorilla's implementation. In practice, AZ-failure drills are approximated by:

Mass instance termination in one AZ — kill every instance simultaneously in the target AZ.
Network partition — block traffic to / from the target AZ at the VPC boundary (later tools; not 2011-era).
Provider-native AZ failure injection (AWS Fault Injection Service, FIS; post-2020) — uses the cloud provider's own failure-mode primitives.

All three approximate "the AZ is gone", but with different fidelity to real AZ outages, which can also involve partial degradation, slow recovery, and asymmetric connectivity.
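The first approach, mass instance termination, reduces to selecting every instance in the target AZ and terminating it. The sketch below is illustrative only: the inventory format (dicts with `id` and `az` keys) and the `terminate` callback are hypothetical stand-ins for a real cloud inventory and API client, and the source does not say how Chaos Gorilla does this.

```python
def run_mass_termination(instances, target_az, terminate, dry_run=True):
    """Terminate every instance in target_az and return the targeted IDs.

    instances: iterable of dicts with 'id' and 'az' keys (hypothetical format).
    terminate: callable taking one instance ID (hypothetical wrapper around
               the cloud provider's terminate call).
    dry_run:   when True, only report what would be terminated.
    """
    targets = sorted(i["id"] for i in instances if i["az"] == target_az)
    if not dry_run:
        # Calls are issued back to back; a real drill may need batching or
        # parallelism to approximate a simultaneous loss.
        for instance_id in targets:
            terminate(instance_id)
    return targets
```

Defaulting to `dry_run=True` is a deliberate choice here: the plan can be reviewed against the intended blast radius before anything is actually terminated.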

Operational discipline

Cadence — AZ drills have a larger blast radius than instance drills, so they typically run weekly or monthly rather than continuously.
Announce-or-not — announced drills catch architecture gaps; unannounced drills also catch operator-response gaps. Both modes are used in mature chaos-engineering programs.
Kill-switch + auto-abort — AZ drills should auto-abort on any customer-facing SLO violation. Netflix's 2011 post does not describe such a mechanism; it became a norm later.
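An auto-abort guard can be sketched as a polling loop around the drill. Everything here is an assumption for illustration: `start`, `stop`, `read_error_rate`, and `sleep` are hypothetical callbacks, and a single error-rate threshold stands in for whatever customer-facing SLOs a real programme monitors.

```python
def run_drill_with_auto_abort(start, stop, read_error_rate, slo_error_rate,
                              checks, sleep=lambda: None):
    """Run a drill, polling an SLO metric and aborting on the first violation.

    start / stop:     callables that inject and remove the fault (hypothetical).
    read_error_rate:  callable returning the current customer error rate.
    slo_error_rate:   maximum tolerated error rate before aborting.
    checks:           number of polls that make up a full drill.
    sleep:            callable that waits between polls (no-op by default).
    """
    start()
    try:
        for _ in range(checks):
            if read_error_rate() > slo_error_rate:
                return "aborted"
            sleep()
        return "completed"
    finally:
        # The fault is always removed, whether the drill finishes, aborts,
        # or the metric source raises an exception.
        stop()
```

Placing `stop()` in a `finally` block means the fault is lifted even if monitoring itself fails, which keeps the drill's own failure modes from extending the outage.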

Seen in

systems/netflix-chaos-gorilla — the canonical tool.
sources/2026-01-02-netflix-the-netflix-simian-army — the canonical foundational reference.