Databricks — How the lakebase architecture stays resilient to cloud failures
Summary
Tier-3 Databricks reliability post (Jasraj Dange, Hans Norheim, Stas Kelvich, John Spray; published 2026-05-27) that lays out systems/lakebase's reliability roadmap by reframing serverless-Postgres architecture for the agentic-workload era, where agents "create 4× as many databases as humans do" and Databricks is "starting tens of millions of databases every day." The post's central architectural argument is that the part of the control plane that starts databases is effectively the data plane under agentic / on-demand workloads. The resulting reliability roadmap covers six pillars: (1) HA architecture built from stateless Postgres compute on zone-redundant storage; (2) splitting a dedicated data-plane controller out of the control plane for hot-path start/suspend operations; (3) minimising critical-path dependencies, especially on cloud-provider control planes, by provisioning a buffered pool of large (often bare-metal) instances with an in-house vertically autoscaling virtualization layer rather than provisioning VMs on demand through the cloud-provider control plane; (4) cell-based architecture, in which each region is composed of identically shaped, self-contained cells (Kubernetes + control plane + compute + storage): during the 2026-05-08 us-east-1 AWS thermal-event outage one cell failed over imperfectly while seven cells failed over correctly, bounding impact to ~13% of databases in the region (~1/8), which the post describes as roughly an order-of-magnitude blast-radius reduction; (5) failure simulation and injection via failpoints in code, an internal fault-injection framework, and open-source SQLancer / SQLsmith for correctness validation, escalating from process / node / disk / network faults to whole-AZ network-partition simulations with a target that no workload is down for more than 30 seconds; (6) per-database availability attainment (% of the fleet's databases meeting 99.95% / 99.99% monthly availability) as the SLO measurement substrate: the published 2026 H1 attainment table (January to April) shows 99.93-99.96% of databases meeting the 99.95% bar and 99.75-99.85% meeting the 99.99% bar, with a small dip in April.
Passes scope cleanly on distributed-systems-internals / production-architecture / scaling-trade-offs / cloud-provider-outage grounds despite the customer-facing framing.
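The dependency-minimisation pillar rests on the arithmetic of serial availability: a database start succeeds only if every dependency on its path succeeds. A minimal sketch of that arithmetic, assuming independent failures; the dependency names and the 99.9% figures are hypothetical placeholders, not numbers from the post:

```python
from math import prod


def serial_availability(dependencies: dict[str, float]) -> float:
    """Availability of a flow that succeeds only if every dependency succeeds.

    Assumes failures are independent, which correlated cloud outages can violate.
    """
    return prod(dependencies.values())


# Hypothetical per-dependency availabilities, for illustration only.
vm_per_start_path = {
    "compute_control_plane": 0.999,
    "vm_capacity": 0.999,
    "block_store_control_plane": 0.999,
    "network_control_plane": 0.999,
    "kubernetes_system_services": 0.999,
}

# Hypothetical: the pool is already provisioned, so a start only needs the
# in-house scheduler on the hot path.
pooled_start_path = {
    "in_house_scheduler": 0.999,
}

print(f"VM-per-start path: {serial_availability(vm_per_start_path):.4%}")
print(f"Pooled start path: {serial_availability(pooled_start_path):.4%}")
```

Under these assumed numbers, five chained dependencies at 99.9% each yield roughly 99.5% for the whole flow, which is why removing links from the chain matters more than polishing any single one.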
Key takeaways
-
Agentic workloads turn the control plane into the data plane. Verbatim: "With agentic and on-demand workloads, the part of the control plane that starts databases is effectively the data plane. This has changed how we think about our architecture. Currently, our control plane handles everything from starting databases to billing. The former is clearly more critical." The structural answer: split a data plane controller service out that handles only hot-path start/suspend operations, with "less business logic, a strict, minimal set of external dependencies, and is engineered from the ground up with resilience, graceful degradation, and defense-in-depth top of mind." (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Stateless Postgres compute on zone-redundant storage replaces the hot-standby tax. Verbatim: "Unlike many cloud Postgres database service setups that are monolithic and have stateful compute, Postgres in the lakebase architecture is stateless. All durable data lives in a remote storage service, so the compute process holds no durable state on the local disk. If Postgres or the hardware it runs on fails, it can be instantly replaced without replicating data to a hot standby or running usual Postgres crash recovery. A hot standby in a monolithic setup requires a full copy of the data (not free), while crash recovery must replay the write-ahead log from the last checkpoint, which scales with the write rate at the time of the crash and can take 10s of minutes, depending on configuration." The architectural payoff: "a single-compute Postgres instance in Lakebase has significantly improved availability compared to a single stateful Postgres instance, without the cost of an additional hot standby compute instance." (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Zone redundancy is the default for all storage tiers, regardless of database tier. Verbatim: "Monolithic Postgres setups are usually backed by local block devices that are rarely zone-redundant. This necessitates physical replication and costly hot standby replicas across multiple availability zones. In Lakebase and Neon, all databases, regardless of tier and configuration, are backed by distributed, zone-redundant, highly available storage. Data is stored in highly durable, zone-redundant object storage, and performance is accelerated by NVMe SSD caches across multiple availability zones at no additional cost to you." First-class wiki disclosure that the Pageserver+Safekeeper storage tier is structurally zone-redundant — the "hot standby across AZ for a copy of the data" tax is eliminated at the storage layer for every customer, not just the HA tier. (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Compute session lifetimes are short — the control plane is load-bearing for steady-state. Verbatim: "In Neon, 90% of compute sessions for auto-suspending databases are less than 10 minutes." The implication: control-plane start operations happen on every cold compute session, and 90% of sessions are short enough that every connection-arrival from an idle client triggers a control-plane start — the start path's reliability is on the request-path for the workload, not just an off-path management operation. (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Critical-path dependency minimisation: own the control-plane chain. Verbatim: "Reliability is strongly correlated with the dependency chain and the amount of machinery involved in the flow. In a traditional setup with Postgres in cloud provider VMs, this goes well beyond the data plane: Cloud provider's compute control plane to provision VMs / Available VM capacity / Cloud provider's block store control plane to provision local storage / Cloud provider's networking control plane to allocate IPs, configure firewalls and network routes to the new VM / If using Kubernetes — an additional dependency on the K8s system services." Lakebase's architectural reply: "We allocate a pool of big (often bare metal) instances from the cloud provider. We carry buffers to sustain cloud provider provisioning outages. We built our own vertically autoscaling virtualization layer that schedules multiple Postgres instances onto those cloud instances. We don't rely on cloud block store devices, but instead store data in our own zone-resilient storage that is ultimately backed in object stores like S3 or Azure Blob storage." This is the operational instantiation of availability multiplication — pre-allocated bare-metal pool with provisioning buffer collapses three separate cloud-provider control-plane dependencies (compute / block / network) into one already-completed dependency. (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Cells are the regional composition unit; AWS us-east-1 thermal-event outage 2026-05-08 contained to ~13% of databases. Verbatim: "Rather than running a single monolithic regional deployment, Lakebase composes a region from one or more identically shaped cells. A cell is a complete, self-contained slice of the Neon and Lakebase stack: Kubernetes, control plane, compute, and storage." The 2026-05-08 outage measurement: "During an incident on May 8, 2026, when AWS experienced issues with an Availability Zone in us-east-1, one of the cells had issues failing over to healthy nodes. The impact was contained to that cell. The other seven cells in the region failed over correctly, so the incident affected only ~13% of databases in the region. In this case, the cell-based architecture reduced the impact by roughly an order of magnitude." First wiki canonical instance of cell-as-blast-radius-reduction quantified in a real production AWS-AZ outage. Cells double as the scaling unit: "To grow a region, we add another cell. When an existing Cell approaches scalability limits of Kubernetes and control plane, new project creation is routed to a freshly provisioned Cell." (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Failpoints + chaos testing escalating to whole-AZ network partition; 30-second-or-better target for any single database. Verbatim: "Every Lakebase release goes through failure injection and chaos testing before it goes to production. We deploy the release to a real cluster, drive it with a mix of agentic and non-agentic OLTP and OLAP workloads at stress-level concurrency, and then start breaking things underneath. We kill processes, shoot down nodes, inject network failures, wipe disk contents, and restart components in loops, all while the workload keeps running. We use failpoints liberally in our code to inject hard-to-reproduce errors, such as a crash at the worst possible time. This is driven by an internal fault-injection framework that can target a single process or coordinate cluster-wide faults across an entire cell." And on correctness: "We utilize open source tools like SqlLancer and SqlSmith, along with similar internal tools, to verify correct Postgres behavior. While failure injection is running, we validate internal data consistency, that no committed transaction is lost, and that every component recovers to a consistent state on its own." And on the next-level escalation: "We're now taking this one level up, from component-level chaos to whole-AZ down simulations. In a real cluster with workloads running, we programmatically disconnect an availability zone's network from the rest of the cluster and observe how the system reacts: how quickly storage shifts to surviving replicas, how fast computes are failed over to healthy AZs, how the proxy layer reroutes connections, and how long any individual database sees an outage. Our goal is that no workload should be down for more than 30 seconds." (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
Per-database availability attainment is the SLO measurement shape, not fleet aggregate. Verbatim: "Database Availability: How many percent of the time every individual database is available. We don't just measure aggregate fleet availability, because an individual customer doesn't care if the fleet had great availability if their database was down." And: "Our goal is for every database to exceed 99.99% availability every month. We measure how close we are to that goal with attainment: How many % of the fleet's databases that met the goal." Disclosed 2026 H1 attainment data:
| Month | Met 99.95% | Met 99.99% |
|---|---|---|
| 2026-01 | 99.96% | 99.85% |
| 2026-02 | 99.95% | 99.84% |
| 2026-03 | 99.96% | 99.81% |
| 2026-04 | 99.93% | 99.75% |
First wiki disclosure of attainment as the per-database availability metric for serverless-Postgres SLO measurement. April's small dip is unexplained in the post but visible. (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
-
The five disclosed SLI categories: database availability, database startup time, switchover/failover frequency+latency, storage page-read/durable-write availability+latency, control-plane API success rate+latency. Verbatim list:
- "Database Availability: How many percent of the time every individual database is available."
- "Database Startup Time: How quickly a suspended database becomes available when you connect, or how quickly a brand new database is starts up."
- "Database switchover/failover: Frequency and latency. As infrequent as possible, and as quickly as possible when it does happen."
- "Storage: Availability and latency of page reads and durable writes from Postgres to storage. These tell us whether your workload gets what it needs."
- "Control Plane APIs: Success rates and latency of important operations such as branching."
This is the operational SLI menu specifically tuned for serverless / scale-to-zero Postgres — the "database startup time" SLI is structurally absent from monolithic-Postgres SLO menus because monolithic Postgres is always-on. (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
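The attainment metric described above can be sketched as follows. This is an illustrative reading of the definition, not the post's implementation: the `DatabaseMonth` record, its field names, the sample fleet, and the choice of `>=` for "met the goal" are all assumptions.

```python
from dataclasses import dataclass

SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60


@dataclass
class DatabaseMonth:
    """One database's availability record for one month (hypothetical shape)."""
    database_id: str
    observed_seconds: int  # seconds the database was expected to be available
    down_seconds: int      # seconds it was unavailable

    @property
    def availability(self) -> float:
        return 1.0 - self.down_seconds / self.observed_seconds


def attainment(fleet: list[DatabaseMonth], target: float) -> float:
    """Fraction of databases whose own monthly availability meets the target."""
    met = sum(1 for db in fleet if db.availability >= target)
    return met / len(fleet)


def fleet_aggregate(fleet: list[DatabaseMonth]) -> float:
    """Aggregate availability across the fleet, which can hide individual outages."""
    total = sum(db.observed_seconds for db in fleet)
    down = sum(db.down_seconds for db in fleet)
    return 1.0 - down / total


# Made-up sample fleet: two clean databases, one with 10 minutes down,
# one with about 33 minutes down.
fleet = [
    DatabaseMonth("db-a", SECONDS_IN_30_DAYS, 0),
    DatabaseMonth("db-b", SECONDS_IN_30_DAYS, 0),
    DatabaseMonth("db-c", SECONDS_IN_30_DAYS, 600),
    DatabaseMonth("db-d", SECONDS_IN_30_DAYS, 2000),
]

print(f"aggregate:         {fleet_aggregate(fleet):.4%}")
print(f"attainment 99.95%: {attainment(fleet, 0.9995):.0%}")
print(f"attainment 99.99%: {attainment(fleet, 0.9999):.0%}")
```

In this made-up fleet the aggregate stays above 99.95% even though one database in four missed that bar, which is the gap the per-database framing is meant to expose.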
Operational numbers
- ~13% of databases in us-east-1 affected during the AWS 2026-05-08 thermal-event AZ outage; one cell out of eight is 12.5%, consistent with an inferred total of eight cells in us-east-1 at the time (one failed over imperfectly, seven failed over correctly).
- ~order-of-magnitude blast-radius reduction from the cell architecture vs a hypothetical monolithic regional deployment (verbatim: "the cell-based architecture reduced the impact by roughly an order of magnitude").
- 90% of compute sessions for auto-suspending databases in Neon are <10 minutes.
- 4× more databases created by agents than humans on Lakebase (Source link verbatim — "agents create 4x as many databases as humans do").
- Tens of millions of database starts per day on Lakebase.
- 30 seconds: target maximum outage window for any single database under whole-AZ network partition simulation.
- 99.99%: target monthly per-database availability.
- 99.93-99.96% of databases met the 99.95% bar in 2026 H1.
- 99.75-99.85% of databases met the 99.99% bar in 2026 H1.
- Crash recovery on a stateful monolithic Postgres "can take 10s of minutes, depending on configuration", scaling with the write rate at the time of the crash.
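The availability targets above translate into monthly downtime budgets with simple arithmetic. A short sketch, assuming a 30-day month (the post does not say how a month is measured); the function name is illustrative:

```python
def downtime_budget_seconds(target: float, days: int = 30) -> float:
    """Seconds of downtime per month that still meet the availability target."""
    return (1.0 - target) * days * 24 * 60 * 60


budget_9999 = downtime_budget_seconds(0.9999)  # about 259 s, roughly 4.3 minutes
budget_9995 = downtime_budget_seconds(0.9995)  # about 1296 s, 21.6 minutes

# How many outages at the 30-second target fit inside the 99.99% budget.
outages_within_budget = int(budget_9999 // 30)

# One cell out of eight, as a share of a uniformly loaded region.
affected_share = 1 / 8

print(f"99.99% budget: {budget_9999:.1f} s")
print(f"99.95% budget: {budget_9995:.1f} s")
print(f"30-second outages within the 99.99% budget: {outages_within_budget}")
print(f"one cell of eight: {affected_share:.1%}")
```

On these assumptions a database can absorb only a handful of 30-second failovers in a month before missing 99.99%, so the 30-second target and the failover-frequency SLI constrain each other.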
Architectural diagrams referenced
- Compute session lifetime histogram (Neon, auto-suspending databases): 90% < 10 min — the load-bearing chart for the "control plane is the new data plane" argument.
- Cell composition diagram: a region as N identically-shaped cells, each containing a complete Kubernetes + control plane + compute + storage stack.
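The cell composition can be modelled in a few lines to show how blast radius depends on how databases are spread across cells. This is a sketch under stated assumptions: the `Cell` fields, the capacity ceiling, the most-headroom routing rule, and all counts are hypothetical, since the post does not disclose cell sizing or router design.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    databases: int
    capacity: int         # hypothetical ceiling on databases per cell
    healthy: bool = True


def affected_share(cells: list[Cell]) -> float:
    """Share of the region's databases that live in unhealthy cells."""
    total = sum(c.databases for c in cells)
    affected = sum(c.databases for c in cells if not c.healthy)
    return affected / total


def route_new_project(cells: list[Cell]) -> Cell:
    """Pick the healthy cell with the most headroom (an assumed policy)."""
    candidates = [c for c in cells if c.healthy and c.databases < c.capacity]
    if not candidates:
        raise RuntimeError("no cell has headroom; provision a new cell")
    return max(candidates, key=lambda c: c.capacity - c.databases)


# Eight uniformly loaded cells, one of which fails to fail over.
uniform = [Cell(f"cell-{i}", databases=1000, capacity=1500) for i in range(8)]
uniform[0].healthy = False

# Same shape, but the failing cell hosts twice the share of the others.
skewed = [Cell("cell-0", 2000, 2500, healthy=False)] + [
    Cell(f"cell-{i}", 1000, 1500) for i in range(1, 8)
]

print(f"uniform: {affected_share(uniform):.1%}")
print(f"skewed:  {affected_share(skewed):.1%}")
print(f"new projects go to: {route_new_project(uniform).name}")
```

The uniform case reproduces the one-in-eight share; the skewed case shows why a reported percentage alone does not pin down the cell count.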
Caveats
- Tier-3 source. Databricks Engineering blog; passes scope on production-architecture-internals / cloud-provider-outage / serverless-Postgres-SLO grounds, but framing carries customer-facing "we're hard at work building your trust" load.
- Some pillars are aspirational. Verbatim disclaimer: "Some items are already in production, others are in flight." The data-plane-controller separation is described as work in progress ("currently hard at work"), and the whole-AZ network-partition simulations are only beginning ("now taking this one level up"). Distinguish landed from in-flight.
- No diagram of the data-plane-controller boundary. The post describes the split (hot-path vs business-logic) but does not name the existing services being separated or the migration mechanism.
- Cell count for us-east-1 is implicit, not stated. "~13% of databases in the region" with "the other seven cells failed over correctly" implies eight cells, but the post does not state the total directly; if the affected cell hosted more than its uniform share, the cell ratio could differ.
- Cell sizing policy and routing-during-failure not disclosed. "Cells are spun up quickly as demand grows" and "new project creation is routed to a freshly provisioned Cell" — but the cell router design (stateless? per-cell DNS? versioned?), capacity ceilings (Kubernetes cluster size? control-plane RPS?), and cross-cell migration policy are not described.
- Vertically-autoscaling virtualization layer named but not detailed. Linked separately to a Neon docs page; this post does not describe the scheduling algorithm, isolation primitives, or noisy-neighbour mitigations of the in-house virtualization layer.
- Provisioning-buffer sizing policy not disclosed. "We carry buffers to sustain cloud provider provisioning outages" — buffer size, replenishment cadence, and the cloud-provider-outage-duration assumption that calibrates the buffer are not stated.
- Failpoint mechanism not detailed. Failpoints are named but the language-level instrumentation (compile-time? feature flag? runtime-injectable?), the granularity (per-call? per-line?), and the production-vs-test scoping are not described. Standard term in the Postgres / Rust ecosystem (failpoints crate / INJECTION_POINT macro), but the specific implementation is unstated.
- Whole-AZ partition methodology not detailed. "We programmatically disconnect an availability zone's network from the rest of the cluster": the network-fault-injection mechanism (iptables? per-host kernel hook? virtualised SDN-level partition?) and the cluster-state observability during the drill are not described.
- April attainment dip unexplained. Met-99.95% drops from 99.96% (Mar) to 99.93% (Apr); Met-99.99% drops from 99.81% to 99.75%. No incident disclosure, no reasoning, no impact analysis.
- Static stability not named verbatim. The architectural pattern is described in operational language ("buffers to sustain provisioning outages", "controls who gets it") but the named principle static stability (Max Englander framing) is not invoked. Likely deliberate per Databricks' blog-style register; cross-references in this wiki are retrofitted.
- No quantitative comparison vs single-AZ Postgres. "Significantly improved availability" is the verbatim claim; specific percentage-point comparison vs a single-AZ stateful-Postgres baseline is not given.
- No customer impact distribution during 2026-05-08. "Affected ~13% of databases" — but no breakdown of how many of those were HA-tier (multi-AZ compute) vs single-compute, no duration-per-database, no recovery-from-customer-perspective data (zero data loss is implicit but not stated).
Source
- Original: https://www.databricks.com/blog/how-lakebase-architecture-stays-resilient-cloud-failures
- Raw markdown: raw/databricks/2026-05-27-how-the-lakebase-architecture-stays-resilient-to-cloud-failu-e6056161.md
Related
- systems/lakebase — the system this article reorients toward an agentic-workload reliability roadmap
- systems/neon — the underlying database (architecture-identical to Lakebase per the 2026-04-29 Stripe-Projects ingest)
- systems/pageserver-safekeeper — the zone-redundant storage tier that makes stateless-Postgres-compute viable
- concepts/control-plane-as-the-new-data-plane — central new concept canonicalised here
- concepts/control-plane-data-plane-separation — the architectural parent
- concepts/cell-based-architecture — the regional-composition primitive; canonical Lakebase production-AZ-outage instance
- concepts/blast-radius — quantified ~1/8 = ~13% in the 2026-05-08 us-east-1 outage
- concepts/critical-path-dependency-minimization — replace cloud-provider-control-plane chain with bare-metal pool + in-house vertical-autoscaling virtualization
- concepts/static-stability — buffer-pool reasoning is a static-stability instantiation
- concepts/whole-az-network-partition-simulation — the next-level chaos drill escalation
- concepts/failpoint — the in-code injection primitive
- concepts/database-availability-attainment — the per-database-monthly-SLO-fleet-attainment metric
- concepts/database-startup-time-sli — the serverless-specific startup-latency SLI
- patterns/preallocated-bare-metal-pool-with-virtualization — the cloud-provider-control-plane-bypass primitive
- patterns/separate-data-plane-controller-for-hot-path — the control-plane decomposition pattern
- patterns/whole-az-network-partition-drill — the chaos drill
- patterns/per-database-availability-attainment — the SLO measurement pattern
- patterns/cell-based-architecture-for-blast-radius-reduction — the parent pattern; Lakebase is the new canonical production-test instance
- patterns/continuous-fault-injection-in-production — the parent pattern; Lakebase escalates the per-component shape with cell-wide and whole-AZ fault injection
- companies/databricks