PATTERN Cited by 1 source

Whole-AZ network partition drill¶

The pattern¶

Run a chaos drill against a single cell in which the network of one availability zone is programmatically disconnected from the rest of the cluster while a real workload mix runs. Observe the recovery dynamics across storage / compute / proxy / per-database axes. Set a per-database outage-window target (Lakebase's: 30 seconds or less for any single database) and treat the drill's success as "no workload exceeded the target".

The drill is the cell-scoped AZ-loss exercise — it sits one level above per-component drills (kill processes, sever single connections, wipe disks) and one level below cross-cell or regional drills.

The five operational steps¶

1. Schedule the drill in a chaos cell that carries a real workload mix at stress-level concurrency. Pre-flight: validate the cell's data-replication invariants and ensure observability coverage of all four recovery axes.
2. Programmatically disconnect the AZ network. The specific mechanism (iptables / SDN / VPC ACL / kernel hook) varies; the structural property is that all in-AZ instances become unreachable from out-of-AZ instances and vice versa.
3. Observe the recovery dynamics on all four axes:
   - Storage shift: how quickly does Pageserver+Safekeeper route reads/writes to surviving replicas?
   - Compute failover: how fast does the control plane detect AZ-loss and bring up affected Postgres compute in healthy AZs?
   - Proxy reroute: how fast does the connection layer detect AZ-loss and reroute customer connections?
   - Per-database outage window: for each database in the cell, how long was it unable to serve queries?
4. Compare against the target. "No workload should be down for more than 30 seconds" (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures). Failures are recovery axes that exceeded their budget; the drill surfaces them.
5. Reconnect the AZ network and validate steady-state recovery. This includes data-consistency validation that no committed transaction was lost during the partition.
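
The comparison step can be sketched as a small calculation over health-probe samples collected during the drill. This is a minimal illustration, not the source's tooling: the `Probe` record shape, the function names, and the sample data are all hypothetical, and only the 30-second target comes from the source.

```python
from dataclasses import dataclass

TARGET_SECONDS = 30.0  # per-database target quoted in the source


@dataclass(frozen=True)
class Probe:
    """One health-probe result (hypothetical record shape)."""
    database: str
    timestamp: float  # seconds since the drill started
    ok: bool


def longest_outage(probes: list[Probe]) -> dict[str, float]:
    """Return the longest continuous failed-probe window per database, in seconds.

    An outage runs from the first failed probe to the next successful one.
    An outage still open at the last sample is measured up to that sample.
    """
    by_db: dict[str, list[Probe]] = {}
    for p in sorted(probes, key=lambda p: p.timestamp):
        by_db.setdefault(p.database, []).append(p)

    result: dict[str, float] = {}
    for db, samples in by_db.items():
        longest = 0.0
        down_since = None
        for s in samples:
            if not s.ok and down_since is None:
                down_since = s.timestamp  # outage begins
            elif s.ok and down_since is not None:
                longest = max(longest, s.timestamp - down_since)
                down_since = None  # outage ends
        if down_since is not None:
            longest = max(longest, samples[-1].timestamp - down_since)
        result[db] = longest
    return result


def drill_failures(probes: list[Probe], target: float = TARGET_SECONDS) -> dict[str, float]:
    """Databases whose longest outage exceeded the target."""
    return {db: w for db, w in longest_outage(probes).items() if w > target}


# Made-up sample data: db-a recovers in 15 s, db-b takes 35 s.
sample = [
    Probe("db-a", 0.0, True), Probe("db-a", 10.0, False),
    Probe("db-a", 20.0, False), Probe("db-a", 25.0, True),
    Probe("db-b", 0.0, True), Probe("db-b", 10.0, False),
    Probe("db-b", 30.0, False), Probe("db-b", 45.0, True),
]
```

With this sample, `drill_failures(sample)` reports only `db-b`, so the drill would count as failed even though `db-a` recovered well inside the target. The measured window can be no finer than the probe interval, so a real harness would need probes spaced well under the target.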

Lakebase canonical framing¶

Verbatim (Source: same):

"We're now taking this one level up, from component-level chaos to whole-AZ down simulations. In a real cluster with workloads running, we programmatically disconnect an availability zone's network from the rest of the cluster and observe how the system reacts: how quickly storage shifts to surviving replicas, how fast computes are failed over to healthy AZs, how the proxy layer reroutes connections, and how long any individual database sees an outage. Our goal is that no workload should be down for more than 30 seconds."

The verbatim "now taking this one level up" framing positions this drill as the next-level escalation above the per-component drill regime — not a replacement.

Per-database target, not fleet target¶

The recovery target is per-database: "how long any individual database sees an outage". This is the operational analogue of per-database availability attainment — fleet-aggregate is not enough because tail-customer impact is the SLA-relevant signal.
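
A small worked example shows why a fleet aggregate can hide the signal. The numbers and function names below are invented for illustration; only the 30-second target comes from the source.

```python
from statistics import mean

TARGET_SECONDS = 30.0  # per-database target quoted in the source

# Hypothetical longest outage per database during one drill, in seconds.
outages = {"db-001": 4.0, "db-002": 6.0, "db-003": 5.0, "db-004": 95.0}


def fleet_passes(outage_s: dict[str, float]) -> bool:
    """Fleet-aggregate check: the mean outage is within the target."""
    return mean(outage_s.values()) <= TARGET_SECONDS


def violators(outage_s: dict[str, float]) -> list[str]:
    """Per-database check: every database over the target, sorted by name."""
    return sorted(db for db, w in outage_s.items() if w > TARGET_SECONDS)
```

Here the mean outage is 27.5 seconds, so `fleet_passes(outages)` is true, while `violators(outages)` still returns `db-004`, which was down for 95 seconds. The per-database framing is what catches that tail.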

Drill safety preconditions¶

A whole-AZ partition has a higher blast radius than per-component drills. Preconditions for safe execution:

Cell-bounded by construction. The drill operates inside one cell; the cell boundary contains the drill's blast radius. Don't run this in a multi-cell-shared environment.
Kill-switch + auto-abort on customer-visible SLO violations beyond the drill's expected envelope.
Observability attendance. Engineers actively watching the drill — see continuous fault injection in production for the broader discipline.
Workload mix is real but bounded. Stress-level concurrency produces signal; production-customer workload exposure is bounded by operating in a chaos cell, not a customer cell.
Data-consistency validators armed. SQLancer / SQLsmith / internal harnesses run during the drill so post-recovery invariants are checked.
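
The kill-switch and auto-abort precondition could look roughly like the guard below. It is a sketch under stated assumptions: the `AbortGuard` class, the `reconnect` hook, and the envelope and margin values are hypothetical, and the source does not describe how Lakebase implements its abort path.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AbortGuard:
    """Heals the partition when impact leaves the drill's expected envelope."""
    expected_envelope_s: float      # outage the drill is expected to cause
    margin_s: float                 # extra tolerance before aborting
    reconnect: Callable[[], None]   # hypothetical hook that restores the AZ network
    aborted: bool = False

    def check(self, observed_outage_s: dict[str, float], kill_switch: bool = False) -> bool:
        """Return True if the drill has been aborted.

        Aborts when an operator pulls the kill switch or when the worst
        observed per-database outage exceeds envelope + margin.
        """
        if self.aborted:
            return True  # already healed; do not reconnect twice
        limit = self.expected_envelope_s + self.margin_s
        worst = max(observed_outage_s.values(), default=0.0)
        if kill_switch or worst > limit:
            self.reconnect()
            self.aborted = True
        return self.aborted
```

A drill loop would call `check` on every observation tick, for example `AbortGuard(30.0, 15.0, reconnect=heal_network).check(current_outages)`, where `heal_network` stands in for whatever mechanism removes the injected fault.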

Composability¶

With concepts/cell-based-architecture — cell boundary is what makes the drill scoped + safe.
With failpoints — failpoints handle the fine-grained per-component edge cases the AZ-partition drill doesn't reach.
With concepts/static-stability — the property the drill validates is whether surviving AZs absorb the load shift without their own recovery becoming the next critical path.
With concepts/database-availability-attainment — the drill's recovery target generalises to the same per-database shape as the production SLO.

Distinction from prior patterns¶

Pattern	Year	Mechanism
Netflix Chaos Gorilla	2011	AZ-instance-failure (instances vanish)
Lakebase whole-AZ partition drill	2026	AZ-network-partition (instances alive but unreachable)

Network-partition is structurally harder than instance-failure because of split-brain potential — see concepts/whole-az-network-partition-simulation for the distinction.
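
The difference between the two mechanisms can be expressed as two reachability predicates. This is a toy model for illustration only; the function names and AZ labels are made up.

```python
def reachable_after_instance_failure(src_az: str, dst_az: str, failed_az: str) -> bool:
    """Instance failure: everything in the failed AZ is gone, so no path
    that starts or ends there exists."""
    return src_az != failed_az and dst_az != failed_az


def reachable_after_partition(src_az: str, dst_az: str, cut_az: str) -> bool:
    """Symmetric network partition: instances stay alive, and traffic flows
    only between endpoints on the same side of the cut."""
    return (src_az == cut_az) == (dst_az == cut_az)
```

For a cut AZ `az-a`, `reachable_after_partition("az-a", "az-a", "az-a")` is true while `reachable_after_instance_failure("az-a", "az-a", "az-a")` is false. The isolated instances can still talk to each other and keep running, which is where the split-brain potential comes from.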

Caveats¶

Aspirational target disclosure. The source says "Our goal is..." and does not disclose actual measured outage windows from the current drill regime. The 30-second target is forward-looking.
Methodology not detailed. Specific network-fault-injection mechanism not named.
Asymmetric / partial-partition not covered. The drill is the symmetric whole-AZ case; partial / asymmetric partitions are a separate harder shape.
Drill cadence not disclosed. The source states "Every Lakebase release goes through failure injection and chaos testing", but the AZ-partition-specific cadence (per-release, weekly, monthly) is not stated.
Multi-cell drill not covered. Within-cell only; cross-cell drill would be a separate (higher-blast-radius) exercise.

Seen in¶

sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — canonical wiki framing. The four recovery axes named verbatim. The 30-second-or-better per-database target. The "now taking this one level up" escalation framing.

Related¶

concepts/whole-az-network-partition-simulation — the concept this pattern operationalises
concepts/chaos-engineering — discipline parent
concepts/availability-zone-failure-drill — sibling drill pattern at the AZ-instance-failure altitude (vs network-partition here)
concepts/cell-based-architecture — the unit the drill operates on
concepts/blast-radius — drill safety framing
concepts/static-stability — what the drill validates
concepts/database-availability-attainment — the production metric the drill's recovery target generalises to
systems/lakebase / systems/neon — canonical instances
patterns/continuous-fault-injection-in-production — parent discipline pattern
patterns/cell-based-architecture-for-blast-radius-reduction — sibling pattern; the cell boundary is what makes the drill safe