CONCEPT Cited by 1 source
Whole-AZ network partition simulation¶
Definition¶
Whole-AZ network partition simulation is the chaos-engineering drill that programmatically disconnects an availability zone's network from the rest of the cluster — while a real workload runs on the cluster — and observes the system's recovery dynamics: how quickly storage shifts to surviving replicas, how fast computes are failed over to healthy AZs, how the proxy layer reroutes connections, and how long any individual database sees an outage.
It is the next-level escalation above per-component fault-injection (kill processes, shoot down nodes, inject network failures, wipe disk contents, restart components in loops) and above single-instance / single-component AZ-failure drills. The drill exercises the AZ as a coherent failure unit — all storage + compute + network in one AZ goes away simultaneously from the rest of the cluster's perspective.
Distinction from related drills¶
| Drill | Scope | Source |
|---|---|---|
| Random instance failure injection (Chaos Monkey) | Single VM | Netflix Simian Army |
| AZ failure drill (Chaos Gorilla) | All instances in one AZ go down | Netflix Simian Army |
| Whole-AZ network partition simulation (this concept) | AZ network is disconnected (instances may still be alive) | Lakebase / Neon |
The structural difference between AZ failure drill and AZ network-partition simulation is what's being simulated:
- AZ failure drill — instances in the AZ go down (terminate / vanish). Counterparts in other AZs see them as missing.
- AZ network partition simulation — instances in the AZ are alive but unreachable from the rest of the cluster. Other AZs see them as missing; the partitioned-AZ instances see other AZs as missing. Split-brain potential.
The split-brain potential makes the partition simulation a harder test than the failure drill. Storage replication that tolerates replicas cleanly disappearing may still break when the isolated replicas stay alive and keep acting on their own side of the partition.
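To make the split-brain concern concrete, here is a minimal sketch of a strict-majority check, a common guard against two sides of a partition both accepting writes. The source does not describe how Lakebase's storage tier prevents split-brain, so the majority rule, the function name `sides_with_quorum`, and the AZ labels are all assumptions made for illustration.

```python
from typing import Dict


def sides_with_quorum(replicas_per_az: Dict[str, int], partitioned_az: str) -> Dict[str, bool]:
    """Report which side of a whole-AZ partition still holds a strict majority.

    Assumption: writes are only allowed on a side that can reach more than
    half of all replicas. This is a generic rule, not Lakebase's documented one.
    """
    total = sum(replicas_per_az.values())
    isolated = replicas_per_az[partitioned_az]
    rest = total - isolated
    majority = total // 2 + 1  # strict majority of all replicas
    return {
        "isolated_side": isolated >= majority,
        "rest_of_cluster": rest >= majority,
    }


# Hypothetical layout: one replica in each of three AZs, then az-a is cut off.
print(sides_with_quorum({"az-a": 1, "az-b": 1, "az-c": 1}, "az-a"))
# → {'isolated_side': False, 'rest_of_cluster': True}
```

Under a strict-majority rule at most one side can hold quorum, so the live-but-isolated replicas must refuse writes. A partition drill checks that they actually do.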
The Lakebase framing¶
Verbatim (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures):
"We're now taking this one level up, from component-level chaos to whole-AZ down simulations. In a real cluster with workloads running, we programmatically disconnect an availability zone's network from the rest of the cluster and observe how the system reacts: how quickly storage shifts to surviving replicas, how fast computes are failed over to healthy AZs, how the proxy layer reroutes connections, and how long any individual database sees an outage. Our goal is that no workload should be down for more than 30 seconds."
Three observable axes:
- Storage-shift latency. How quickly does the Pageserver+Safekeeper tier shift read/write traffic to surviving replicas in healthy AZs?
- Compute-failover latency. How fast can the control plane detect the AZ-loss, decide on placement, and bring up the affected Postgres computes in healthy AZs?
- Connection-rerouting latency. How fast does the proxy layer detect the AZ-loss and reroute customer connections to the replacement compute?
The composite metric is the per-database outage window — how long any single database sees an outage. The Lakebase target: no workload down for more than 30 seconds.
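One way the per-database outage window could be measured is from a log of periodic health probes against each database. The sketch below is an assumption about methodology, not something the source describes: the probe format `(timestamp_seconds, succeeded)` and the name `longest_outage_seconds` are hypothetical.

```python
from typing import List, Optional, Tuple

TARGET_SECONDS = 30.0  # the Lakebase goal quoted above


def longest_outage_seconds(probes: List[Tuple[float, bool]]) -> float:
    """Return the longest outage seen in a time-sorted probe log.

    An outage runs from the first failed probe of a streak to the next
    successful probe (or to the last probe if the database never recovers).
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # a new failure streak begins
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # streak closed by a successful probe
    if outage_start is not None:
        # Still down at the end of the log: count up to the last probe.
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


# Hypothetical probe log for one database, one probe every 5 seconds.
probes = [(0, True), (5, True), (10, False), (15, False), (20, False), (25, True), (30, True)]
window = longest_outage_seconds(probes)
print(window, window <= TARGET_SECONDS)  # → 15.0 True
```

The probe interval bounds the measurement resolution, so a 30-second target needs probes spaced well under 30 seconds apart.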
Why "any individual database" not "fleet"¶
The 30-second target is per-database, consistent with the per-database availability attainment measurement shape. A drill that succeeds for 99% of databases but leaves a tail with 5-minute outages still produces real customer impact in that tail, which a fleet-aggregate measurement obscures.
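The arithmetic behind this point can be sketched with made-up numbers. The outage values below are illustrative, not measurements from the source, and `summarize` is a hypothetical helper.

```python
from typing import Dict, List, Union


def summarize(outages_seconds: List[float], target_seconds: float = 30.0) -> Dict[str, Union[float, bool]]:
    """Compare a fleet-aggregate view with the per-database view of one drill."""
    within = sum(1 for s in outages_seconds if s <= target_seconds)
    return {
        "fraction_within_target": within / len(outages_seconds),  # fleet aggregate
        "worst_outage_seconds": max(outages_seconds),              # per-database tail
        "all_within_target": within == len(outages_seconds),       # the actual goal
    }


# Illustrative fleet: 99 databases recover in 12 s, one is down for 5 minutes.
outages = [12.0] * 99 + [300.0]
print(summarize(outages))
# → {'fraction_within_target': 0.99, 'worst_outage_seconds': 300.0, 'all_within_target': False}
```

The fleet-level 99% looks healthy, while the per-database check fails because one workload exceeded the target by a factor of ten.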
Composes with cell-based architecture¶
The whole-AZ-partition drill operates on a single cell, not the whole region. The cell boundary contains the drill's blast radius — the same property that contained the 2026-05-08 us-east-1 thermal-event impact to ~1/8 of the region's databases. See concepts/cell-based-architecture + patterns/cell-based-architecture-for-blast-radius-reduction.
Why it matters¶
- Production reality is whole-AZ events. Real cloud-provider AZ failures (thermal, power, network) take out the whole AZ at once. Per-component drills don't exercise the whole-AZ-coordinated-failure mode where storage / compute / network all go away simultaneously.
- The recovery is composite. A fast recovery in one component does not help if another component recovers slowly; the per-database outage window is set by the slowest-recovering link.
- Static stability of the regional architecture is exercised end to end: the surviving AZs must absorb the shifted load without their own recovery becoming the next critical path. See concepts/static-stability.
Caveats¶
- Methodology not detailed in the source. Lakebase mentions "programmatically disconnect an availability zone's network" but does not name the network-fault-injection mechanism (iptables / SDN partition / VPC-level isolation / bespoke kernel hook). Mature chaos frameworks support multiple methods.
- Pre-conditions for safety. Whole-AZ partition simulation has a much larger blast radius than per-component drills. It requires a rigorous kill switch and auto-abort discipline, operators actively watching observability signals for the duration of the drill, and prior validation that the cell's data-replication invariants are preserved through the drill.
- The 30-second target is aspirational. "Our goal is..." — the source does not disclose actual measured outage windows from the current drill regime.
- Split-brain edge cases. Whole-AZ partition is a cleaner shape than asymmetric partial-network-partition; a partial partition can produce harder-to-detect inconsistencies. The Lakebase post does not detail how partial-partition cases are addressed.
- Cross-cell drills not yet described. The drill operates within a cell; cross-cell partition (where two cells in the same region lose connectivity) is a separate failure shape not covered.
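The kill-switch and auto-abort precondition above can be sketched as a small drill runner. This is a generic pattern under stated assumptions, not the Lakebase tooling: `inject`, `heal`, and `worst_outage_seconds` are hypothetical callbacks standing in for whatever fault-injection mechanism and monitoring a team actually uses, and the thresholds are placeholders.

```python
import time
from typing import Callable


def run_drill(
    inject: Callable[[], None],                 # hypothetical: cut the AZ off
    heal: Callable[[], None],                   # hypothetical: restore connectivity
    worst_outage_seconds: Callable[[], float],  # hypothetical: current worst per-database outage
    abort_after_seconds: float,
    max_duration_seconds: float,
    poll_interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run a partition drill, aborting early if any database is down too long.

    The `finally` clause acts as the kill switch: connectivity is restored
    whether the drill completes, aborts, or raises an exception.
    """
    inject()
    started = clock()
    try:
        while clock() - started < max_duration_seconds:
            if worst_outage_seconds() > abort_after_seconds:
                return "aborted"
            sleep(poll_interval_seconds)
        return "completed"
    finally:
        heal()
```

Injecting `sleep` and `clock` keeps the runner testable without waiting in real time. A production version would also need an out-of-band kill switch, since this loop cannot help if the runner process itself is lost.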
Seen in¶
- sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing. The 30-second per-database outage target. The escalation from component-level chaos to whole-AZ-network drill. The three observable axes (storage / compute / proxy) plus the composite per-database outage window.
Related¶
- concepts/chaos-engineering — parent discipline
- concepts/availability-zone-failure-drill — sibling drill at AZ scope; this concept extends it with the network-partition (live-but-unreachable) shape vs the failure (gone) shape
- concepts/random-instance-failure-injection — finer-grained drill the Lakebase regime continues to use as the lower-tier exercise
- concepts/cell-based-architecture — the unit the drill exercises; cell boundary contains drill blast radius
- concepts/blast-radius — drill's own blast-radius framing
- concepts/static-stability — the property the drill validates
- systems/lakebase / systems/neon — canonical instances
- systems/netflix-chaos-gorilla — historical sibling at AZ-failure altitude
- patterns/whole-az-network-partition-drill — operational pattern
- patterns/continuous-fault-injection-in-production — discipline parent