
PATTERN Cited by 1 source

Whole-AZ network partition drill

The pattern

Run a chaos drill against a single cell in which the network of one availability zone is programmatically disconnected from the rest of the cluster while a real workload mix runs. Observe the recovery dynamics across storage / compute / proxy / per-database axes. Set a per-database outage-window target (Lakebase's: 30 seconds or less for any single database) and treat the drill's success as "no workload exceeded the target".

The drill is the cell-scoped AZ-loss exercise — it sits one level above per-component drills (kill processes, sever single connections, wipe disks) and one level below cross-cell or regional drills.

The five operational steps

  1. Schedule the drill in a chaos cell that carries a real workload mix at stress-level concurrency. Pre-flight: validate the cell's data-replication invariants and confirm that observability covers all four recovery axes.
  2. Programmatically disconnect the AZ network. The specific mechanism (iptables / SDN / VPC ACL / kernel hook) varies; the structural property is that all in-AZ instances become unreachable from out-of-AZ instances and vice versa.
  3. Observe the recovery dynamics on all four axes:
     • Storage shift: how quickly does Pageserver+Safekeeper route reads/writes to surviving replicas?
     • Compute failover: how fast does the control plane detect AZ-loss + bring up affected Postgres compute in healthy AZs?
     • Proxy reroute: how fast does the connection layer detect AZ-loss + reroute customer connections?
     • Per-database outage window: for each database in the cell, how long was it unable to serve queries?
  4. Compare against the target. "No workload should be down for more than 30 seconds" (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures). Failures are recovery axes that exceeded their budget; the drill surfaces them.
  5. Reconnect the AZ network and validate steady-state recovery. Includes data-consistency validation that no committed transaction was lost during the partition.
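
The "observe" and "compare" steps can be sketched as a small calculation over probe results. This is a minimal sketch, not the source's tooling: it assumes each database is polled with a periodic health query, and the `Probe` record, the database names, and the sample values are all hypothetical. Only the 30-second target comes from the source.

```python
from dataclasses import dataclass

TARGET_SECONDS = 30.0  # per-database goal quoted from the source


@dataclass(frozen=True)
class Probe:
    db: str    # hypothetical database identifier
    t: float   # seconds since the partition was injected
    ok: bool   # True if the probe query succeeded


def longest_outage(probes):
    """Return the longest continuous failed-probe run per database, in seconds.

    A run starts at the first failed probe and ends at the next successful
    one. A run still open at the last sample is closed at that sample's
    time, which underestimates the real outage.
    """
    worst, run_start, last_t = {}, {}, {}
    for p in sorted(probes, key=lambda p: (p.db, p.t)):
        worst.setdefault(p.db, 0.0)
        last_t[p.db] = p.t
        if not p.ok and run_start.get(p.db) is None:
            run_start[p.db] = p.t          # outage begins
        elif p.ok and run_start.get(p.db) is not None:
            worst[p.db] = max(worst[p.db], p.t - run_start[p.db])
            run_start[p.db] = None         # outage ends
    for db, start in run_start.items():
        if start is not None:              # still down at the last sample
            worst[db] = max(worst[db], last_t[db] - start)
    return worst


def over_target(windows, target=TARGET_SECONDS):
    """Databases whose outage window exceeded the target."""
    return sorted(db for db, w in windows.items() if w > target)


# Made-up probe data for two databases.
probes = [
    Probe("db-a", 0, True), Probe("db-a", 5, False),
    Probe("db-a", 10, False), Probe("db-a", 20, True),
    Probe("db-b", 0, True), Probe("db-b", 5, False), Probe("db-b", 40, True),
]
windows = longest_outage(probes)
failed = over_target(windows)
```

The measured window is only as precise as the probe interval, so a real harness would probe much more often than the target it is checking against.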

Lakebase canonical framing

Verbatim (Source: same):

"We're now taking this one level up, from component-level chaos to whole-AZ down simulations. In a real cluster with workloads running, we programmatically disconnect an availability zone's network from the rest of the cluster and observe how the system reacts: how quickly storage shifts to surviving replicas, how fast computes are failed over to healthy AZs, how the proxy layer reroutes connections, and how long any individual database sees an outage. Our goal is that no workload should be down for more than 30 seconds."

The verbatim "now taking this one level up" framing positions this drill as the next-level escalation above the per-component drill regime — not a replacement.

Per-database target, not fleet target

The recovery target is per-database: "how long any individual database sees an outage". This is the operational analogue of per-database availability attainment — fleet-aggregate is not enough because tail-customer impact is the SLA-relevant signal.
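
A small numeric illustration of why a fleet aggregate can hide a miss. The outage windows below are invented for the example, and the two check functions are hypothetical names, not anything the source describes.

```python
from statistics import mean

TARGET_SECONDS = 30.0  # per-database goal quoted from the source


def fleet_mean_passes(windows, target=TARGET_SECONDS):
    """Fleet-aggregate check: the average outage window is within target."""
    return mean(windows.values()) <= target


def per_database_passes(windows, target=TARGET_SECONDS):
    """Per-database check: every single database is within target."""
    return max(windows.values()) <= target


# Invented outage windows in seconds; one tail database recovers slowly.
windows = {"db-001": 4.0, "db-002": 6.0, "db-003": 5.0, "db-004": 61.0}

fleet_ok = fleet_mean_passes(windows)       # mean is 19.0 s, so this passes
per_db_ok = per_database_passes(windows)    # db-004 took 61.0 s, so this fails
```

Here the fleet mean is comfortably under 30 seconds while one database was down for more than twice the target, which is exactly the case the per-database framing is meant to catch.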

Drill safety preconditions

A whole-AZ partition has a higher blast radius than per-component drills. Preconditions for safe execution:

  • Cell-bounded by construction. The drill operates inside one cell; the cell boundary contains the drill's blast radius. Don't run this in a multi-cell-shared environment.
  • Kill-switch + auto-abort on customer-visible SLO violations beyond the drill's expected envelope.
  • Observability attendance. Engineers actively watching the drill — see continuous fault injection in production for the broader discipline.
  • Workload mix is real but bounded. Stress-level concurrency produces signal; production-customer workload exposure is bounded by operating in a chaos cell, not a customer cell.
  • Data-consistency validators armed. SQLancer / SQLsmith / internal harnesses run during the drill so post-recovery invariants are checked.
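
The kill-switch precondition can be sketched as a guard loop around the drill. The source does not describe how Lakebase implements its abort, so everything here is an assumption: the `SloSample` shape, the error-rate envelope, the consecutive-breach grace count, and the `abort` callback (which would reconnect the AZ network) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SloSample:
    t: float           # seconds since the partition was injected
    error_rate: float  # customer-visible error fraction, 0.0 to 1.0


def run_with_auto_abort(
    samples: Iterable[SloSample],
    envelope: float,
    grace_samples: int,
    abort: Callable[[], None],
) -> bool:
    """Watch SLO samples during the drill.

    Calls abort() and returns False once error_rate has exceeded the
    expected envelope for grace_samples consecutive samples. Returns True
    if the sample stream ends without tripping the guard.
    """
    breaches = 0
    for s in samples:
        # Count consecutive breaches; any in-envelope sample resets the count.
        breaches = breaches + 1 if s.error_rate > envelope else 0
        if breaches >= grace_samples:
            abort()
            return False
    return True
```

Requiring several consecutive breaches is a design choice in this sketch: a partition drill is expected to cause a brief error spike, so a single out-of-envelope sample should not end the drill.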

Composability

Distinction from prior patterns

Pattern                           | Year | Mechanism
Netflix Chaos Gorilla             | 2011 | AZ-instance-failure (instances vanish)
Lakebase whole-AZ partition drill | 2026 | AZ-network-partition (instances alive but unreachable)

Network-partition is structurally harder than instance-failure because of split-brain potential — see concepts/whole-az-network-partition-simulation for the distinction.
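
The difference between the two mechanisms can be made concrete with a toy reachability model. This is an illustrative sketch that follows the table's characterization ("instances vanish" versus "alive but unreachable"); the `Node` type, node names, and AZ names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    az: str


def reachable_after_instance_failure(a: Node, b: Node, failed_az: str) -> bool:
    """Instance-failure model: nodes in the failed AZ are gone entirely."""
    return a.az != failed_az and b.az != failed_az


def reachable_after_partition(a: Node, b: Node, cut_az: str) -> bool:
    """Partition model: every node is alive; only links crossing the cut fail.

    Two nodes can talk if they are on the same side of the cut.
    """
    return (a.az == cut_az) == (b.az == cut_az)


pg1 = Node("pg-1", "az-a")
pg2 = Node("pg-2", "az-a")
pg3 = Node("pg-3", "az-b")

# Two nodes inside the affected AZ:
gone = reachable_after_instance_failure(pg1, pg2, "az-a")   # False: both vanished
isolated = reachable_after_partition(pg1, pg2, "az-a")      # True: still see each other
cross = reachable_after_partition(pg1, pg3, "az-a")         # False: link crosses the cut
```

Under the partition model the nodes inside the cut AZ keep running and can still reach one another, so they may continue to act as if they were healthy. That is where the split-brain potential comes from.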

Caveats

  • Aspirational target disclosure. The source says "Our goal is..." and does not disclose actual measured outage windows from the current drill regime. The 30-second target is forward-looking.
  • Methodology not detailed. Specific network-fault-injection mechanism not named.
  • Asymmetric / partial-partition not covered. The drill is the symmetric whole-AZ case; partial / asymmetric partitions are a separate harder shape.
  • Drill cadence not disclosed. The source states that "Every Lakebase release goes through failure injection and chaos testing", but the AZ-partition-specific cadence (per-release, weekly, monthly) is not stated.
  • Multi-cell drill not covered. Within-cell only; cross-cell drill would be a separate (higher-blast-radius) exercise.

Seen in
