Lakebase¶
Lakebase is Databricks' serverless, Postgres-compatible OLTP database offering. It descends architecturally from Neon, the separated-storage-compute Postgres that Databricks acquired in 2025, and is positioned as the transactional companion to the analytical side of Databricks' Data Intelligence Platform.
Minimum viable framing for this wiki: Lakebase is a managed Postgres where the persistent state (pages + WAL) lives in systems/pageserver-safekeeper on object storage + local caches, and the Postgres compute VMs on top are ephemeral — they scale up, down, or to zero based on demand (concepts/stateless-compute). Each ingested source surfaces a different slice of its design; fill in here as they land.
Architecture slice — storage/compute separation¶
From the CMK launch post (the first Lakebase post ingested):
- Persistence layer (storage). Long-lived state in object storage and local caches, served by the Pageserver and Safekeeper components. The Pageserver owns page-level durable state; the Safekeeper owns the durable WAL. Both are independent of compute and persist across compute instance lifetimes.
- Compute layer. Independent Postgres VMs that can scale up, down, or to zero based on demand. These are ephemeral — they hold only scratch state (buffer pool, WAL artifacts in transit, temp files, performance caches).
- Lakebase Manager. The control-plane component that starts, stops, and terminates compute instances (e.g. on CMK revocation).
This shape is the same conceptual split as Neon's public architecture and is the forcing function behind Lakebase's two-tier encryption story. (Source: sources/2026-04-20-databricks-take-control-customer-managed-keys-for-lakebase-postgres)
Capabilities surfaced so far¶
Evolutionary database development substrate (2026-05-29)¶
Databricks' 2026-05-29 post (Part 1 of a three-part Evolutionary Database Development series) frames Lakebase's copy-on-write branching as the substrate that finally makes Martin Fowler's 2003 evolutionary-database-design methodology operationally default rather than aspirational. The load-bearing methodological argument: Practice #4 — "everybody gets their own database instance" has stayed aspirational for twenty years because per-developer production-shaped databases cost time, money, and DBA cycles; the compensating layer that emerged (mock objects, in-memory DB substitutes like H2 / SQLite, shared staging environments, DBA ticket queues) "became foundational methodology by default, not by design." Lakebase's copy-on-write branching at terabyte scale lifts that constraint: "In 2026, copy-on-write database branching arrives in Databricks Lakebase. A one-second, zero-storage-at-creation branch of a terabyte-scale production database is now an O(1) operation." (Source: sources/2026-05-29-databricks-enabling-evolutionary-database-development-database-branching-with-lakebase)
The post canonicalises three load-bearing properties of the
developer's branch — fast (created when needed), realistic
(same engine, same governance, production-shaped data), and
isolated (experiments don't interrupt others) — and argues
that all three holding simultaneously is the regime change. Each
historical compensating-layer component violates at least one:
mocks (not realistic), H2 / SQLite (not realistic), shared staging
(not isolated), local stale pg_dump (only partially realistic).
New named system in this post: Lakebase SCM Extension — the public open-source VS Code / Cursor IDE extension at github.com/databricks-solutions/lakebase-scm-extension — synchronises a developer's git branch with a matching Lakebase database branch and surfaces the Branch Diff Summary view that shows schema differences inline. The extension automates the per-developer paired-branch pattern and is the canonical artefact format for the schema-diff comment posted by CI in the CI-ephemeral-branch pattern.
The CI flow is canonicalised verbatim: "CI does what Jen just
did, but for the team: it creates its own temporary Lakebase
branch, applies the migration, runs the application test suite,
runs database tests against the migrated schema, validates the
migration itself (applies cleanly, idempotent, reversible), and
posts a schema-diff comment on the PR showing exactly which
database objects changed." This four-validation bundle (applies
cleanly + idempotent + reversible + application tests) is the
substrate change that absorbs the breakage-class question
previously held by the DBA, freeing the DBA for design
collaboration (concepts/dba-as-design-collaborator).
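The three migration checks in that bundle (applies cleanly, idempotent, reversible) can be sketched as a small harness. This is a minimal illustration, not the CI workflow from the post: an in-memory SQLite database stands in for a temporary Lakebase branch, and the `UP` / `DOWN` scripts, the `customer_name` table, and all function names are hypothetical.

```python
import sqlite3

# Hypothetical migration pair. SQLite is only a stand-in for a database branch.
UP = """
CREATE TABLE IF NOT EXISTS customer_name (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT
);
"""
DOWN = "DROP TABLE IF EXISTS customer_name;"


def fresh_branch():
    """Create a throwaway database with a pre-migration schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


def schema_snapshot(conn):
    """Return the schema as a list of (type, name, sql) rows."""
    return conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY type, name"
    ).fetchall()


def runs(conn, sql):
    """Run a script and report whether it completed without an error."""
    try:
        conn.executescript(sql)
        return True
    except sqlite3.Error:
        return False


def validate_migration(conn, up_sql, down_sql):
    """Check the three migration properties on one throwaway database."""
    before = schema_snapshot(conn)
    applies_cleanly = runs(conn, up_sql)
    after_up = schema_snapshot(conn)
    # Idempotent: a second run succeeds and leaves the schema unchanged.
    idempotent = runs(conn, up_sql) and schema_snapshot(conn) == after_up
    # Reversible: the down script restores the pre-migration schema.
    reversible = runs(conn, down_sql) and schema_snapshot(conn) == before
    return {
        "applies_cleanly": applies_cleanly,
        "idempotent": idempotent,
        "reversible": reversible,
    }


report = validate_migration(fresh_branch(), UP, DOWN)
```

A migration written without `IF NOT EXISTS` fails the idempotency check on its second run, which is the class of bug the per-branch workflow is meant to surface before merge.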
The evolutionary-database-design methodology lineage is canonicalised explicitly: Fowler 2003 essay → Sadalage 2006 Refactoring Databases with 70+ named refactorings (the Split Column refactoring is the worked example) → Humble & Farley 2010 Continuous Delivery Chapter 12 ("Managing Data") → 2026 Lakebase substrate change. The post's protagonist Jen, the same developer character Fowler used in the 2003 essay, applies the Split Column refactoring twenty years later: "Same Jen. Same refactoring. What changed is the capability." The migration tools cited as platform-agnostic (Flyway, Liquibase, Alembic, Knex, Prisma) are all compatible with the substrate; the substrate is what changed, not the tool ecosystem.
Forward references: Part 2 — Jen's New Playbook (architecture deep-dive on copy-on-write internals + methodology optimisations); Part 3 — Jen's Team at Scale (50-developer governance + DBA re-deployment + agent-in-the-loop); Companion: Plugin Walkthrough (Lakebase SCM Extension end-to-end); Lakebase App Dev Kit for agents with companion ebook.
Evolutionary database development — new playbook (Part 2, 2026-06-05)¶
Part 2 expands the seven practices to eleven, adding four new practices enabled by copy-on-write branching:
- Practice #8: Destructive testing as default. Blast radius is zero on a branch; reset costs one second. Chaos-style destructive tests (kill migration midway, simulate partial DR, corrupt edge-case data) become routine development workflow, with no ops calendar required. See patterns/destructive-testing-on-ephemeral-branch.
- Practice #9: A/B variant prototyping at database level. Two competing schema designs on parallel branches; measure against production-shaped data; keep the winner; document the decision. See patterns/a-b-variant-prototyping-at-database-level.
- Practice #10: Governance inherited by branches (deferred to Part 3). Unity Catalog policies follow each branch automatically.
- Practice #11: Agent-as-practitioner (deferred to Part 3). Agents get branches, not production.
The post also strengthens existing practices:
- Practice #3 gains the idempotency authorship rule. Because migrations run against many branches, non-idempotent scripts are treated as bugs. See concepts/idempotent-migration.
- Expand-and-contract named as the canonical schema migration strategy — split irreversible work across migrations. See patterns/expand-and-contract-schema-migration.
- CI workflow fully templated with GitHub Actions: `pr.yml` (per-PR branch creation + migration + test + schema-diff comment) and `merge.yml` (branch cleanup on merge).
- Four branches per feature validated as the working norm: per-developer, per-CI, and two A/B exploration branches, all in seconds, all isolated.
(Source: sources/2026-06-05-databricks-enabling-evolutionary-database-development-database-branchin)
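The expand-and-contract strategy and the idempotency rule can be illustrated with the Split Column refactoring. The sketch below is an assumption-laden toy: SQLite stands in for Postgres, the `customer` table and its rows are invented, and the contract step uses a table rebuild only so the example runs on any SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customer (name) VALUES ('Jen Smith'), ('Ada Lovelace');
""")


def column_names(conn, table):
    """List the column names of a table."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def expand(conn):
    """Migration 1 (reversible): add the new columns and backfill them."""
    existing = column_names(conn, "customer")
    if "name" not in existing:
        return                                # already contracted, nothing to do
    for col in ("first_name", "last_name"):
        if col not in existing:               # guard keeps the step idempotent
            conn.execute(f"ALTER TABLE customer ADD COLUMN {col} TEXT")
    conn.execute("""
        UPDATE customer
        SET first_name = substr(name, 1, instr(name, ' ') - 1),
            last_name  = substr(name, instr(name, ' ') + 1)
        WHERE first_name IS NULL
    """)
    conn.commit()


def contract(conn):
    """Migration 2 (irreversible): remove the old column once readers have moved."""
    if "name" not in column_names(conn, "customer"):
        return                                # guard keeps the step idempotent
    conn.executescript("""
        CREATE TABLE customer_new (
            id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
        );
        INSERT INTO customer_new SELECT id, first_name, last_name FROM customer;
        DROP TABLE customer;
        ALTER TABLE customer_new RENAME TO customer;
    """)


expand(conn)
expand(conn)      # safe to repeat on another branch or a retried CI run
contract(conn)
contract(conn)
rows = conn.execute(
    "SELECT first_name, last_name FROM customer ORDER BY id"
).fetchall()
```

Keeping the destructive step in its own migration means the expand step can be rolled back freely while old and new columns coexist.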
Reliability roadmap for the agentic-workload era (2026-05-27)¶
The Databricks reliability post (Jasraj Dange, Hans Norheim, Stas Kelvich, John Spray) lays out six pillars of how Lakebase is re-engineered for reliability under agentic / on-demand / scale-to-zero workloads — "agents create 4× as many databases as humans do"; Databricks is "starting tens of millions of databases every day." (Source: sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures)
Pillar 1 — High-availability architecture from stateless Postgres compute + zone-redundant storage default-on for all tiers. Verbatim: "Unlike many cloud Postgres database service setups that are monolithic and have stateful compute, Postgres in the lakebase architecture is stateless. All durable data lives in a remote storage service, so the compute process holds no durable state on the local disk. If Postgres or the hardware it runs on fails, it can be instantly replaced without replicating data to a hot standby or running usual Postgres crash recovery." And: "In Lakebase and Neon, all databases, regardless of tier and configuration, are backed by distributed, zone-redundant, highly available storage. Data is stored in highly durable, zone-redundant object storage, and performance is accelerated by NVMe SSD caches across multiple availability zones at no additional cost to you." The architectural payoff: "a single-compute Postgres instance in Lakebase has significantly improved availability compared to a single stateful Postgres instance, without the cost of an additional hot standby compute instance." HA-tier customers additionally get "dedicated computes across multiple availability zones … ensuring that your database remains available even if the cloud provider runs out of capacity during (or as a result of) the failure event"; these computes can also scale reads. First-class wiki canonicalisation that the Pageserver+Safekeeper storage tier is structurally zone-redundant for every customer. Crash-recovery tax eliminated: "crash recovery must replay the write-ahead log from the last checkpoint, which scales with the write rate at the time of the crash and can take 10s of minutes, depending on configuration" — Lakebase's stateless-compute design replaces this with a near-instant compute swap.
Pillar 2 — Control plane is the new data plane; carve out a hot-path data-plane controller. Verbatim: "In monolithic cloud database service architecture, the data plane is the critical part of the service. It's designed for 99.99+% availability and static stability. The control plane matters 'only' for management operations. With agentic and on-demand workloads, the part of the control plane that starts databases is effectively the data plane. … We've had outages where background maintenance operations resource-starved on-demand database startups - that's clearly not ok. We're currently hard at work separating the critical parts of the control plane into a data plane controller service that handles only hot-path operations (start/suspend). This service has less business logic, a strict, minimal set of external dependencies, and is engineered from the ground up with resilience, graceful degradation, and defense-in-depth top of mind." Empirical signal: 90% of compute sessions for auto-suspending databases in Neon are <10 minutes — the start verb is on the synchronous request path of the median connection. Canonicalised pattern: patterns/separate-data-plane-controller-for-hot-path. The data-plane controller is in flight, not landed.
Pillar 3 — Critical-path dependency minimisation: bare-metal pool + own vertical-autoscaling virtualisation layer + own zone-resilient storage replace the cloud-provider control-plane chain. The post enumerates the five-link cold-start dependency chain in a traditional cloud-VM-hosted Postgres setup: cloud-provider compute control plane / VM-capacity policy / block-store control plane / networking control plane / Kubernetes system services. Lakebase's reply: "We allocate a pool of big (often bare metal) instances from the cloud provider. We carry buffers to sustain cloud provider provisioning outages. We built our own vertically autoscaling virtualization layer that schedules multiple Postgres instances onto those cloud instances. We don't rely on cloud block store devices, but instead store data in our own zone-resilient storage that is ultimately backed in object stores like S3 or Azure Blob storage." Canonicalised pattern: patterns/preallocated-bare-metal-pool-with-virtualization. The buffer-of-bare-metal-instances primitive is the statically stable realisation — collapse the cloud-provider control-plane chain to one already-completed dependency.
Pillar 4 — Cell-based architecture composes a region from N identically-shaped cells; canonical production-AZ-outage instance. Verbatim: "Rather than running a single monolithic regional deployment, Lakebase composes a region from one or more identically shaped cells. A cell is a complete, self-contained slice of the Neon and Lakebase stack: Kubernetes, control plane, compute, and storage." Cells double as the scaling unit: "To grow a region, we add another cell. When an existing Cell approaches scalability limits of Kubernetes and control plane, new project creation is routed to a freshly provisioned Cell." And the canonical production AZ-outage test on 2026-05-08: "During an incident on May 8, 2026, when AWS experienced issues with an Availability Zone in us-east-1, one of the cells had issues failing over to healthy nodes. The impact was contained to that cell. The other seven cells in the region failed over correctly, so the incident affected only ~13% of databases in the region. In this case, the cell-based architecture reduced the impact by roughly an order of magnitude." First quantified production-instance of cell-as-blast-radius-reduction in this wiki: ~13% of databases ≈ 1/8 of the fleet (eight cells in us-east-1, one of which failed over imperfectly), roughly an order-of-magnitude blast-radius reduction vs the monolithic-regional alternative. See concepts/blast-radius for the wiki canonicalisation.
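The arithmetic behind the ~13% figure is short enough to write out. The sketch assumes databases are spread evenly across cells, which the post does not state; the function name is illustrative.

```python
def blast_radius(total_cells: int, failed_cells: int) -> float:
    """Fraction of databases affected, assuming equally sized cells."""
    return failed_cells / total_cells


cell_based = blast_radius(total_cells=8, failed_cells=1)   # one of eight cells
monolithic = blast_radius(total_cells=1, failed_cells=1)   # whole region at once
reduction = monolithic / cell_based
```

Under the even-spread assumption one failed cell out of eight affects 12.5% of databases, close to the disclosed ~13%, and the reduction factor is 8, which the post rounds to "roughly an order of magnitude".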
Pillar 5 — Failure simulation + injection: failpoints + chaos escalating to whole-AZ network partition. Verbatim on the component-level regime: "Every Lakebase release goes through failure injection and chaos testing before it goes to production. We deploy the release to a real cluster, drive it with a mix of agentic and non-agentic OLTP and OLAP workloads at stress-level concurrency, and then start breaking things underneath. We kill processes, shoot down nodes, inject network failures, wipe disk contents, and restart components in loops, all while the workload keeps running. We use failpoints liberally in our code to inject hard-to-reproduce errors, such as a crash at the worst possible time. This is driven by an internal fault-injection framework that can target a single process or coordinate cluster-wide faults across an entire cell." Correctness validated by SQLancer + SQLsmith + internal tools — "While failure injection is running, we validate internal data consistency, that no committed transaction is lost, and that every component recovers to a consistent state on its own." Next-level escalation (in-flight): "We're now taking this one level up, from component-level chaos to whole-AZ down simulations. In a real cluster with workloads running, we programmatically disconnect an availability zone's network from the rest of the cluster and observe how the system reacts: how quickly storage shifts to surviving replicas, how fast computes are failed over to healthy AZs, how the proxy layer reroutes connections, and how long any individual database sees an outage. Our goal is that no workload should be down for more than 30 seconds." See concepts/whole-az-network-partition-simulation + patterns/whole-az-network-partition-drill for the wiki canonicalisation. The 30-second per-database outage target is aspirational (in-flight regime).
Pillar 6 — Per-database availability attainment as the SLO measurement substrate. Verbatim: "Database Availability: How many percent of the time every individual database is available. We don't just measure aggregate fleet availability, because an individual customer doesn't care if the fleet had great availability if their database was down." And: "Our goal is for every database to exceed 99.99% availability every month. We measure how close we are to that goal with attainment: How many % of the fleet's databases that met the goal." Disclosed 2026 H1 attainment table:
| Month | Met 99.95% | Met 99.99% |
|---|---|---|
| 2026-01 | 99.96% | 99.85% |
| 2026-02 | 99.95% | 99.84% |
| 2026-03 | 99.96% | 99.81% |
| 2026-04 | 99.93% | 99.75% |
Attainment is one of five disclosed Lakebase SLIs (verbatim list): database availability / database startup time (the serverless-specific SLI; see concepts/database-startup-time-sli) / database switchover/failover frequency + latency / storage page-read + durable-write availability + latency / control-plane API success rate + latency. The startup-time SLI is structurally absent from monolithic-Postgres SLO menus because monolithic Postgres is always-on; under scale-to-zero every connection arrival hits it. Public status page: https://neonstatus.com/ (high-level); internal attainment is higher-resolution. Canonicalised pattern: patterns/per-database-availability-attainment.
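To make the attainment metric concrete, the sketch below computes the downtime budget implied by a monthly target and the share of databases that met it. The four-database fleet and its availability figures are invented for illustration, and treating "met the goal" as `>=` is an assumption.

```python
from datetime import timedelta


def downtime_budget(target: float, period: timedelta) -> timedelta:
    """Maximum downtime a single database may accrue and still meet the target."""
    return period * (1 - target)


def attainment(availability_by_db: dict, target: float) -> float:
    """Share of databases whose own availability met the target."""
    met = sum(1 for a in availability_by_db.values() if a >= target)
    return met / len(availability_by_db)


# Hypothetical fleet of four databases for one month.
fleet = {"db-a": 1.0, "db-b": 0.99995, "db-c": 0.9997, "db-d": 0.9990}

budget_9999 = downtime_budget(0.9999, timedelta(days=30))   # about 4.3 minutes
fleet_mean = sum(fleet.values()) / len(fleet)               # aggregate view
met_9995 = attainment(fleet, 0.9995)                        # per-database view
met_9999 = attainment(fleet, 0.9999)
```

In this toy fleet the aggregate mean stays above 99.95% while only half of the databases meet 99.99%, which is the gap the per-database framing is designed to expose.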
Author lineage. The authors are "people who have spent careers building and operating relational databases" — Jasraj Dange (prev. Azure SQL Database Performance, Scalability), Hans Norheim (13y at Microsoft on SQL Server / Azure SQL Database, including hot-patching + upgrade orchestration that holds Azure SQL Database to its 99.995% uptime SLA), Stas Kelvich (co-founded Neon; before Neon at Postgres Professional on Postgres internals — multi-master replication with quorum commit, cross-node snapshot isolation with loosely-synchronised clocks, two-phase commit + logical replication), John Spray (leading Lakebase Storage; previously Redpanda + Red Hat (Ceph) + Intel for distributed storage).
Customer-Managed Keys (CMK) for Lakebase (2026-04-20)¶
- Three-level concepts/envelope-encryption hierarchy: CMK in customer's cloud KMS → KEK in Databricks Key Manager Service → DEK per data segment.
- Supported KMSes: systems/aws-kms, systems/azure-key-vault, systems/google-cloud-kms (identified by ARN, Key Vault URL, Key ID respectively).
- Persistent layer encryption via the envelope hierarchy over Pageserver + Safekeeper data and WAL segments.
- Ephemeral layer encryption via per-instance per-boot ephemeral keys for Postgres-VM-local state; CMK revocation triggers Lakebase Manager to terminate the compute instance.
- concepts/cryptographic-shredding is the revocation semantics across both layers.
- Seamless key rotation (no bulk re-encryption) is a property of the envelope hierarchy.
- Account↔Workspace delegation: Account Admin creates Key Configuration, binds to Workspace; new Lakebase projects in the workspace inherit the CMK.
- Availability: Enterprise tier customers.
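The three-level hierarchy, and why rotation needs no bulk re-encryption, can be sketched with a toy cipher. Everything here is illustrative: `toy_wrap` is a hash-based XOR stream written only so the example runs on the standard library, it is not secure and is not what Lakebase or any KMS uses, and modelling rotation as "re-wrap the KEK under a new CMK" is an assumption about how the property is achieved.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_wrap(key: bytes, plaintext: bytes) -> bytes:
    """TOY encryption for illustration only. Use a real KMS / AEAD in practice."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_unwrap(key: bytes, blob: bytes) -> bytes:
    """Inverse of toy_wrap."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def read_segment(cmk: bytes, wrapped_kek: bytes, wrapped_dek: bytes, segment: bytes) -> bytes:
    """Walk the hierarchy: CMK unwraps KEK, KEK unwraps DEK, DEK decrypts data."""
    kek = toy_unwrap(cmk, wrapped_kek)
    dek = toy_unwrap(kek, wrapped_dek)
    return toy_unwrap(dek, segment)


# CMK (customer KMS) -> KEK (key manager service) -> DEK (per data segment)
cmk, kek, dek = (secrets.token_bytes(32) for _ in range(3))
plaintext = b"page and WAL bytes"
segment = toy_wrap(dek, plaintext)
wrapped_dek = toy_wrap(kek, dek)
wrapped_kek = toy_wrap(cmk, kek)

# Rotation (assumed mechanism): only the small wrapped KEK is rewritten.
new_cmk = secrets.token_bytes(32)
rotated_kek = toy_wrap(new_cmk, toy_unwrap(cmk, wrapped_kek))

after_rotation = read_segment(new_cmk, rotated_kek, wrapped_dek, segment)
without_key = read_segment(cmk, rotated_kek, wrapped_dek, segment)
```

The data segment and the wrapped DEK are untouched by rotation, and once the key that wraps the KEK is unavailable the stored bytes decode to noise, which is the intuition behind cryptographic shredding.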
Bursty agentic-workload serverless OLTP substrate (2026-04-27)¶
LangGuard — a startup building runtime governance for enterprise agentic workflows — is profiled as one of the first production deployments of Lakebase. The case study is architecturally valuable because it names three Lakebase properties that together justified the choice over a coupled-compute/storage Postgres:
- Serverless autoscaling + scale-to-zero between bursts. Agent workflows are "notoriously bursty" — dormant for hours, then hundreds of trace writes + enforcement reads in seconds (see concepts/bursty-query-pattern for the workload shape). Lakebase provisions compute on burst arrival and shuts down completely when activity stops; operational costs stay aligned with actual workload rather than provisioned for peak.
- Compute-instance attach to existing storage with no data movement. "Because durable state lives in the storage layer, not in the compute node, spinning up a new compute instance requires no data movement. It simply attaches to the existing database history and begins serving queries immediately." This is the operational payoff of concepts/compute-storage-separation specifically for burst workloads — no cold-start penalty on the data side.
- Millisecond reads via compute-local cache. A caching layer between compute and storage keeps hot data close to compute; a working set of "tight indexed lookups against GRAIL context and policy tables" fits in compute-local memory, making governance decisions fast enough to stay off the agent's latency critical path.
- Instant database branching via copy-on-write. "When we create a branch, no data is physically copied. The branch diverges from the current database state using copy-on-write semantics, consuming storage only for new or modified data." LangGuard uses this to clone production trace data in seconds and test new governance policies against real agent behavior — see concepts/database-branching, concepts/copy-on-write-storage-fork, and the dedicated pattern patterns/policy-testing-via-database-branching.
- Postgres compatibility reduces migration risk. The LangGuard team cites "full compatibility with the tools, libraries, and extensions our team already knows" as a reason they could move straight onto Lakebase without a rewrite — the Neon-lineage design keeps upstream Postgres semantics while rewriting the storage tier.
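The copy-on-write behaviour quoted above can be modelled in a few lines. This is a conceptual sketch only: pages are held in Python dictionaries, the `Layer` and `Branch` names are invented, and the real storage tier works on WAL and page versions rather than in-memory maps.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Layer:
    """An immutable set of page versions, shared by every branch above it."""
    pages: Dict[int, bytes]
    parent: Optional["Layer"]


class Branch:
    def __init__(self, base: Optional[Layer] = None):
        self._base = base                    # frozen history, shared
        self._delta: Dict[int, bytes] = {}   # pages written on this branch only

    def write(self, page_no: int, data: bytes) -> None:
        self._delta[page_no] = data

    def read(self, page_no: int) -> bytes:
        if page_no in self._delta:
            return self._delta[page_no]
        layer = self._base
        while layer is not None:
            if page_no in layer.pages:
                return layer.pages[page_no]
            layer = layer.parent
        raise KeyError(page_no)

    def branch(self) -> "Branch":
        """Fork at the current state without copying any page."""
        if self._delta:
            # Freeze pending writes into a shared layer; the dict is moved, not copied.
            self._base = Layer(self._delta, self._base)
            self._delta = {}
        return Branch(base=self._base)

    def pages_owned(self) -> int:
        """Pages stored for this branch alone since its last fork."""
        return len(self._delta)


prod = Branch()
prod.write(1, b"v1")
prod.write(2, b"v1")
dev = prod.branch()
owned_at_creation = dev.pages_owned()
dev.write(1, b"dev-change")      # visible on dev only
prod.write(2, b"prod-change")    # written after the fork, invisible to dev
```

Creating the branch touches no page data, and each side pays storage only for what it writes afterwards.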
The LangGuard team comes out of IBM QRadar (SIEM at petabytes/day) and explicitly frames their choice as the answer to a QRadar-era constraint: "Traditional databases that couple compute and storage force you to provision for peak load and pay for that capacity around the clock. Lakebase's serverless model … was the answer we had always needed but didn't have access to when we were building QRadar." This is the first canonical wiki articulation of the bursty security-telemetry → serverless-OLTP fit from an operational prior. See systems/langguard and systems/grail-data-fabric for the full product + architecture framing.
(Source: sources/2026-04-27-databricks-inside-one-of-the-first-production-deployments-of-lakebase-langguard)
Agent-provisioned database via Stripe Projects (2026-04-29)¶
Databricks is named as a launch partner for Stripe Projects — the agent-provisioning CLI Stripe launched 2026-04-30. The Databricks side of the integration exposes Neon Postgres databases ("Lakebase architecture, developed by Neon") as provisionable resources through the Stripe Projects catalog, making Lakebase the second launch-side provider class in the agent-provisioning protocol after Cloudflare's domain / Workers resources.
Load-bearing disclosures from the launch:
- <350 ms agent-driven provisioning time. Verbatim: "Agents can now get a production-ready Neon Postgres database in under 350ms, without any human interaction." First wiki operational datum on Lakebase's compute-lifecycle latency at sub-second resolution; prior ingested sources (CMK 2026-04-20, LangGuard 2026-04-27) disclosed the architectural separation but not a compute-spin-up time number.
- Neon ≡ Lakebase at architecture altitude. The launch post collapses the two names in one paragraph: "Lakebase architecture, developed by Neon, is the first serverless Postgres database designed for the AI era" + "bringing Neon databases seamlessly to every AI development environment". Prior Lakebase posts treated Neon as the inherited-lineage component family (Pageserver + Safekeeper); this source makes the productised identity explicit.
- Three architectural pillars re-stated. The post enumerates the reason agent-driven infrastructure is viable on this substrate as a numbered list: serverless scaling + scale-to-zero ("automatically scale to zero when idle"); instant database branching via zero-copy cloning; Postgres compatibility ("agents understand Postgres better than any other OLTP database"). All three are already canonicalised on this page; this is the first wiki instance of them being named together as the agent-provisioning substrate contract.
- Compute-storage separation as agent-lifecycle enabler. Verbatim: "By decoupling compute from storage, agents can create, build, and tear down OLTP databases in seconds." Third canonical cross-source confirmation of concepts/compute-storage-separation as Lakebase's load-bearing property after CMK (two-tier encryption forcing function) and LangGuard (no-data-movement on compute spin-up for bursty workloads). New axis: per-request compute lifecycle at agent-initiated cadence — the separation is what makes every agent speculatively-creating-then-dropping databases structurally cheap, not just operationally tolerable.
New canonical concept introduced: concepts/agent-provisioned-database; Lakebase/Neon via Stripe Projects is its first canonical instance. The concept specialises concepts/agent-provisioned-account to the database-resource tier and articulates the three-substrate-pillar contract (sub-second provisioning + scale-to-zero + compute-storage separation) databases must satisfy to play the role.
Not disclosed:
- Spend-cap default for agent-provisioned Lakebase/Neon instances through Stripe Projects.
- Rate-limit / fraud policy for churn-for-harvesting abuse patterns.
- Whether the <350 ms number holds under concurrent-burst load or is typical-case single-shot.
- Orphan-database cleanup / TTL behaviour when the agent session ends before teardown.
(Source: sources/2026-04-29-databricks-and-stripe-projects-infrastructure-built-for-agents)
State-heavy-application Postgres replacement + branching + PITR (2026-04-30)¶
Thoughtworks ran a proof-of-concept ripping Backstage (Spotify's state-heavy Internal Developer Portal) off its standard Postgres database and pointing it at Lakebase. Fourth canonical wiki source on Lakebase after CMK (2026-04-20), LangGuard (2026-04-27), and Stripe Projects (2026-04-29). The POC is architecturally interesting because:
- Wire-protocol-Postgres compatibility is the integration property. Verbatim: "Because it speaks wire-protocol Postgres, Backstage doesn't know or care that it isn't talking to RDS." Backstage's Knex migrations ran cleanly against Lakebase; the only integration change was swapping `PgSearchEngine` for Backstage's default in-memory search.
- Auth was the friction point, not the protocol. Lakebase rejects classic Databricks Personal Access Tokens and expects an OAuth JWT instead. The `databricks postgres generate-database-credential` command mints a scoped, short-lived JWT for a specific endpoint: "the intended approach for apps and CI." Thoughtworks wrapped the command in a 50-minute cron rewriting `DATABRICKS_TOKEN` in `.env` (canonical patterns/credential-refresh-cron-as-auth-compat-shim).
- First wiki disclosure of branching throughput at MB-scale dataset granularity. The 63 MB Backstage catalog branch landed in 1.09 seconds data plane (control-plane ack was instant). Prior sources disclosed branching latency only as "seconds" (LangGuard) or "sub-350 ms" for cold Postgres provisioning (Stripe Projects). This source separates control-plane acknowledgement from data-plane clone, confirming the copy-on-write architecture predicts near-constant time at this scale.
- First wiki disclosure of Point-in-Time Recovery at Lakebase altitude. Canonical concepts/point-in-time-recovery instance: wipe of `final_entities` (32 rows → 0), then recovery branch from a pre-wipe timestamp, end-to-end in 3.78 seconds. Production still at zero during recovery (branches fully isolated).
- WAL-record granularity disclosed. Requested 22:56:02Z, got 22:55:50Z (12 seconds earlier): PITR snaps backward to the nearest WAL record. First wiki instance of concepts/wal-record-granularity as a first-class property with disclosed concrete snap-back window. Flagged as "an important caveat for time-sensitive recovery workflows."
- Architectural unification: branching ≡ PITR-with-time-now. Canonical statement: "Branching and Point-in-Time Recovery (PITR) are essentially the same primitive: branching is just PITR with source_branch_time = now." Canonicalises patterns/branching-is-pitr-with-time-now: same control-plane call, same storage substrate, same compute-attach step; only the time parameter differs. Latency envelopes confirm (1.09 s vs 3.78 s, same order of magnitude).
- Branch API requires a `spec`-nested body with explicit lifetime declaration. Undocumented gotcha surfaced by the POC: "the request body must nest everything inside a spec object, and you must specify `ttl`, `expire_time`, or `no_expiry`. Without that, the API returns 'Expiration must be specified.'" Branches are short-lived by default; long-lived-ness requires explicit opt-in.
- Developer-cycle transformation thesis. Paired with the numbers, the POC argues that cheap branching deprecates 20-30% of test code (mock objects: "not test coverage, that's test infrastructure"). See concepts/mock-object-maintenance-cost + concepts/integration-tests-against-real-database + patterns/database-branch-per-test-over-mocking. The before state's "discover at staging deploy that schema migration doesn't work against real data" pain point is the load-bearing cost cheap branching eliminates.
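Two points from the list above lend themselves to a short sketch: the `spec`-nested request body with a mandatory lifetime field, and PITR snapping back to the nearest WAL record. Only `spec`, `ttl`, `expire_time`, `no_expiry`, and `source_branch_time` come from the source; the `source_branch` field name, the value formats (`"3600s"`, ISO timestamps), the calendar date, and the WAL record times are assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timezone


def build_branch_body(source_branch, source_branch_time=None, *,
                      ttl=None, expire_time=None, no_expiry=None):
    """Build a spec-nested body; at least one lifetime field is required."""
    candidates = (("ttl", ttl), ("expire_time", expire_time), ("no_expiry", no_expiry))
    lifetime = {name: value for name, value in candidates if value is not None}
    if not lifetime:
        # Mirrors the API error quoted in the POC.
        raise ValueError("Expiration must be specified.")
    # A plain branch is PITR with the time set to "now".
    when = source_branch_time or datetime.now(timezone.utc)
    spec = {
        "source_branch": source_branch,            # hypothetical field name
        "source_branch_time": when.isoformat(),    # serialisation is assumed
        **lifetime,
    }
    return {"spec": spec}


def snap_to_wal_record(wal_record_times, requested):
    """Return the latest WAL record time at or before `requested` (sorted input)."""
    i = bisect_right(wal_record_times, requested)
    if i == 0:
        raise ValueError("requested time precedes the retained WAL")
    return wal_record_times[i - 1]


def at(hour, minute, second):
    """Helper: a UTC timestamp on a hypothetical date."""
    return datetime(2026, 4, 30, hour, minute, second, tzinfo=timezone.utc)


wal_records = [at(22, 55, 30), at(22, 55, 50), at(22, 56, 10)]   # invented
requested = at(22, 56, 2)
recovered = snap_to_wal_record(wal_records, requested)
snap_back_seconds = (requested - recovered).total_seconds()
body = build_branch_body("production", recovered, ttl="3600s")
```

With these invented record times the request for 22:56:02Z resolves to 22:55:50Z, reproducing the 12-second snap-back the POC observed; a recovery workflow that needs a tighter bound has to account for that window.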
Part 1 of a three-part Thoughtworks series; Parts 2 (Governance) + 3 (FinOps) forthcoming. Backstage as the state-heavy IDP application makes the POC representative — not a toy workload.
(Source: sources/2026-04-30-databricks-backstage-with-lakebase)
Image-generation pushdown: 5× writes, 94% WAL reduction (2026-05-07)¶
Fifth canonical Lakebase ingest + first mechanism-level disclosure of the pageserver's internals beyond the name-level framing prior Lakebase sources provided. The Databricks Engineering team eliminated classical Postgres's Full Page Write (FPW) tax on the compute side by exploiting concepts/compute-storage-separation and moving image generation into the distributed storage layer.
The architectural insight. Classical Postgres uses FPW to protect against torn pages — crash-mid-write leaving a page partially new and partially old on local disk. Recovery trusts WAL-resident full page images rather than the possibly-torn on-disk copy. On Lakebase:
> In the lakebase architecture, your compute is stateless. It does not rely on a local data directory. Instead, it streams WAL to a Paxos-based quorum of safekeepers. Because there is no local-disk page to tear, the failure mode FPW was designed to prevent simply does not exist.
First wiki-explicit framing of the Paxos-based safekeeper quorum as the durability substrate underneath Neon-lineage Postgres. The torn-page failure mode becomes architecturally impossible (not merely mitigated), so compute-side FPW is structurally redundant.
The catch. FPW had an incidental load-bearing role in concepts/delta-chain-replay on the read path: periodic full page images in the WAL stream acted as reset points for the pageserver's delta-chain replay. Disabling FPW without remediation would let delta chains grow unboundedly, regressing read latency.
The solution — image-generation pushdown. The pageserver now generates images locally when a page accumulates more delta records than a configured threshold without an intervening image. Compute sends only compact deltas; storage materialises images in the background on its own cadence based on actual page-change rate (not the unrelated checkpoint cadence).
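The threshold rule described above can be sketched with a toy page model. This is not the pageserver's implementation: a page is reduced to an integer, a delta to an increment, image generation runs inline rather than in the background, and the class name and default threshold are invented.

```python
class PageHistory:
    """Toy model of one page: a base image plus the deltas recorded since."""

    def __init__(self, image: int = 0, threshold: int = 3):
        self.image = image          # last materialised page image
        self.deltas = []            # delta records since that image
        self.threshold = threshold  # configured delta-chain limit
        self.images_generated = 0

    def ingest(self, delta: int) -> None:
        """Accept a compact delta from compute; no full page image is sent."""
        self.deltas.append(delta)
        if len(self.deltas) > self.threshold:
            self.materialise()

    def materialise(self) -> None:
        """Storage-side image generation: fold the chain into a new image."""
        self.image = self.read()
        self.deltas.clear()
        self.images_generated += 1

    def read(self) -> int:
        """Reconstruct the page by replaying deltas on top of the last image."""
        value = self.image
        for delta in self.deltas:
            value += delta
        return value


page = PageHistory(threshold=3)
longest_chain = 0
for _ in range(10):
    page.ingest(1)
    longest_chain = max(longest_chain, len(page.deltas))
```

However many deltas arrive, a read never replays more than `threshold` records, which is the role periodic full page images used to play by accident.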
Three structural benefits named in the post:
- Network efficiency: compute WAL traffic −94%.
- Scalability: "Image generation for a project branch is now shared across multiple pageservers in the background" (work moved from a single Postgres writer to a horizontally scalable storage fleet).
- Optimal reads: image cadence per-page-change-rate, not per-checkpoint.
Quantified impact. HammerDB TPROC-C OLTP benchmark:
| Compute size | Before (NOPM) | After (NOPM) | Gain |
|---|---|---|---|
| 4 vCPU | 78,876 | 94,891 | +20% |
| 16 vCPU | 95,832 | 269,189 | 2.8× |
| 32 vCPU | 95,686 | 439,300 | 4.5×+ |
Pre-change, 16 → 32 vCPU was flat (95,832 → 95,686 NOPM): FPW was the bottleneck and the extra compute went unused. Post-change, throughput keeps rising with compute size (269,189 → 439,300 NOPM from 16 to 32 vCPU, about 1.6×). WAL volume per transaction: 58 KB → <4 KB (94% reduction).
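The gains in the table can be recomputed directly from the NOPM figures. The numbers below are copied from the table; the 4 KB value is used as the upper bound implied by "<4 KB", so the WAL figure is a lower bound on the reduction.

```python
# (before, after) NOPM per compute size, taken from the benchmark table.
results = {4: (78_876, 94_891), 16: (95_832, 269_189), 32: (95_686, 439_300)}

speedup = {vcpu: after / before for vcpu, (before, after) in results.items()}

# Scaling from 16 to 32 vCPU, before and after the change.
scaling_before = results[32][0] / results[16][0]
scaling_after = results[32][1] / results[16][1]

# WAL per transaction: 58 KB down to at most 4 KB.
wal_reduction_lower_bound = 1 - 4 / 58
```

The ratios come out at roughly 1.20×, 2.81×, and 4.59×, matching the table's rounded gains, and the WAL reduction is at least 93%, consistent with the quoted 94% for a value somewhat under 4 KB.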
Production customer (56 vCPU workload): WAL rate 30 MB/s → 1 MB/s (30× reduction). Read-path dividend: p99 read latency −30% to −50%; p50 −~30% (because bounded delta chains are now cadence-decoupled from checkpoints). Regional fleet-level: WAL down up to 4×, p99 storage-engine reads up to 3× better and "much more stable." Synced Tables ingestion on one customer: 17k → 62k rows/sec (3.6×).
Seamless rollout. "Since late March" (2026) → active globally 2026-05-07 — ~6-week rollout window across all Lakebase Serverless + Neon databases via the existing Postgres XLOG_FPW_CHANGE WAL record mechanism. No customer restarts or interruptions. Canonicalised as the patterns/live-wal-protocol-switch-via-xlog-fpw-change pattern — use a pre-existing Postgres control record as an in-log feature flag to switch the compute-storage protocol contract atomically per-compute.
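The "in-log feature flag" idea can be illustrated with a toy log consumer. This is a hedged sketch, not Postgres's or the pageserver's actual record handling: the `Record` type, the `"fpw_change"` kind (a stand-in for XLOG_FPW_CHANGE), and the two mode labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Toy WAL record; real WAL records carry far more structure."""
    lsn: int
    kind: str                 # "data" or "fpw_change" (stand-in for XLOG_FPW_CHANGE)
    fpw_enabled: bool = True  # only meaningful when kind == "fpw_change"


def replay(records, fpw_enabled: bool = True):
    """Return (lsn, mode) per data record, switching mode at each control record.

    Because the flag travels inside the log, every consumer flips at the same
    log position, with no out-of-band coordination or restart.
    """
    modes = []
    for rec in sorted(records, key=lambda r: r.lsn):
        if rec.kind == "fpw_change":
            fpw_enabled = rec.fpw_enabled  # the switch takes effect from this LSN on
            continue
        mode = "expect_full_page_images" if fpw_enabled else "generate_images_in_storage"
        modes.append((rec.lsn, mode))
    return modes
```

Records before the control record are interpreted under the old contract and records after it under the new one, which is what makes the switch atomic from the point of view of a single compute's log.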
New wiki canonicalisations from this source: four concepts (concepts/postgres-full-page-write + concepts/torn-page + concepts/postgres-checkpoint + concepts/delta-chain-replay) + two patterns (patterns/image-generation-pushdown-to-storage + patterns/live-wal-protocol-switch-via-xlog-fpw-change) + one system (systems/hammerdb). The pageserver+safekeeper page is extended with a new explicit responsibility (image generation on storage-side thresholds) and the Paxos-quorum framing for the safekeeper.
The post is part of a broader arc: "Pushing down full page writes is part of a systematic effort to harvest the benefits of storage and compute separation." Sibling to cache prewarming for zero-downtime patching (earlier Lakebase post) — the common thread is moving heavy-lifting tasks away from per-transaction hot paths into scalable background storage-tier processing.
(Source: sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes)
In-workspace app state store (2026-05-13)¶
The 2026-05-13 Clinical operations intelligence belongs on the Lakehouse post surfaces a new face: Lakebase as the app-tier state store inside a Databricks App, not as a standalone OLTP service. Three design properties carry weight:
- Provisioned + credentialed by the workspace identity system. "Where a traditional application would require a separately managed RDS instance with its own schema drift, sync jobs, and credential rotation, Lakebase is in the same platform where the data and models live." The app's service principal is also the Lakebase identity — no separate secrets store, no rotation cron, no cross-system credential surface. Composes with the single-platform application architecture thesis.
- Scale-to-zero when idle. "Managed PostgreSQL that scales to zero when idle" is positioned as an app-tier-state-store property (not just an analytical-companion property): when the decision-support app has no users, the app-state DB costs nothing.
- The "operational state" that doesn't belong in UC analytical tables. Site Feasibility Workbench uses Lakebase to persist "saved shortlists for team sharing" — per-user/per-team mutable state with no analytical semantics. UC remains the substrate for the analytical tables the app reads and the SHAP attribution audit tables the app writes.
The architectural shift this face surfaces: Lakebase is no longer just "the OLTP companion to the Lakehouse" (the framing of the prior four ingests); it's also "the app-tier state store that lets a decision-support app live inside the workspace without a separate RDS + sync + credential-rotation surface." Sixth canonical face for Lakebase on the wiki (after CMK / LangGuard agentic-OLTP / Stripe-Projects agent-provisioning / Backstage state-heavy-app / FPW image-generation-pushdown).
(Source: sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse)
Bidirectional governed-data path + LFC observability (2026-05-20)¶
The 2026-05-20 marketing-campaigns post is the first wiki source to surface three new architectural elements:
- Synced Tables with three sync modes. Snapshot / triggered / continuous, with the load-bearing operational disclosure that "when more than 10% of the data is updated, we recommend snapshot mode, which delivers 10x better performance than triggered mode." The 10% / 10× rule of thumb is canonicalised as the patterns/snapshot-sync-mode-for-batch-rebuild pattern. The decision variable is the proportion of data changed per cycle, not the cadence: a nightly recompute that replaces 80% of segments still belongs on the snapshot side. Synced Tables are the Delta → Postgres half of Lakebase's bidirectional governed-data path.
- Lakehouse Sync as a continuous CDC pipeline. The other half: "a native, continuous CDC-based pipeline from Lakebase Postgres to Unity Catalog Delta tables that makes operational data available for richer analytics and AI." Together with Synced Tables, this closes the loop — application writes into Lakebase are automatically available in the analytical lakehouse without hand-maintained CDC pipelines or dual-write coordination in the application. Instance of concepts/change-data-capture applied to operational-to-analytical flow.
- Local File Cache (LFC) + PREFETCH + FILECACHE metrics. First wiki disclosure of LFC as Lakebase's compute-VM-local cache layer that softens the storage-compute-separation latency penalty, alongside the two Lakebase-specific Postgres query statistics: PREFETCH (prefetch requests issued/hit/wasted) and FILECACHE (LFC hits/misses). These are the load-bearing observability layer for diagnosing query problems specifically attributable to the storage-compute boundary — distinct from the standard Postgres tuning surface (pg_stat_statements, work_mem, autovacuum_vacuum_scale_factor), which the post confirms is unchanged.
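The 10% rule of thumb for Synced Tables can be written as a one-line decision. This is a minimal sketch under stated assumptions: `choose_sync_mode` is a hypothetical helper, not a Databricks API, and it only arbitrates between snapshot and triggered mode because the quoted guidance says nothing about when to pick continuous mode.

```python
SNAPSHOT_THRESHOLD = 0.10  # "more than 10% of the data is updated", per the post


def choose_sync_mode(changed_rows: int, total_rows: int) -> str:
    """Pick snapshot vs. triggered from the share of rows changed in one sync cycle.

    The input is the proportion changed per cycle, not how often the job runs.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 <= changed_rows <= total_rows:
        raise ValueError("changed_rows must be between 0 and total_rows")
    fraction = changed_rows / total_rows
    return "snapshot" if fraction > SNAPSHOT_THRESHOLD else "triggered"
```

A nightly rebuild that rewrites 800,000 of 1,000,000 rows lands on snapshot mode, while an hourly job touching 50,000 of the same rows stays on triggered mode, regardless of which one runs more often.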
The post also discloses a concrete production sizing for a bursty marketing-campaign workload: scale-to-0 minimum, 16 CU (~32 GB RAM) maximum on Lakebase Autoscaling, with the architectural justification that "Lakebase autoscaling speed and reactivity eliminate the risk of resource underutilization" — sub-second scale-down makes generous max-cap sizing safe. Marketing-campaign customer segments are positioned as the canonical instance of concepts/bursty-query-pattern applied to OLTP (rather than to observability databases as in the Pyroscope 2.0 framing) — "there is a spike in database requests, but otherwise, database utilization is low."
A secondary disclosure: OAuth's hourly token rotation is incompatible with non-Databricks-aware partner systems like SAP Engagement Cloud, forcing a fall-back to native Postgres password roles with operator-managed rotation. Canonicalised as patterns/native-postgres-roles-for-non-databricks-aware-partners — the explicit escape hatch for integrating partners that expect long-lived database credentials. Lakebase TLS uses Let's Encrypt; partner systems must trust ISRG Root X1.
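To make the incompatibility concrete, here is roughly what a Databricks-aware client has to do that a long-lived-password partner system cannot. It is a sketch under assumptions: `RefreshingCredential` and the injected `fetch` callable are hypothetical, the one-hour lifetime comes from the "hourly" rotation above, and the ten-minute margin is an arbitrary choice that happens to match the 50-minute refresh cron mentioned under "Seen in".

```python
import time
from typing import Callable, Optional


class RefreshingCredential:
    """Caches a short-lived token and re-fetches it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], str],              # hypothetical: returns a fresh token
        ttl_seconds: float = 3600.0,           # assumed one-hour token lifetime
        refresh_margin_seconds: float = 600.0, # refresh 10 minutes early
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._margin = refresh_margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a token that is valid for at least the refresh margin."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

A partner system that stores one password in a connection profile has no equivalent of `get()` to call before each connection, which is why the native-role escape hatch exists.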
Seventh canonical face for Lakebase on the wiki (after CMK / LangGuard / Stripe-Projects / Backstage / FPW / in-workspace-app / governance-substrate-unification): the bidirectional governed-data plane between operational and analytical tiers, with LFC as the compute-side observability layer for the storage-compute boundary itself.
(Source: sources/2026-05-20-databricks-marketing-campaigns-with-lakebase)
Relationship to other Databricks / wiki systems¶
- systems/postgresql — Lakebase is Postgres-compatible; its posture is comparable to systems/aurora-dsql's "extend, don't fork" idiom, though the internal architecture differs (Aurora DSQL swaps concurrency/durability/storage via Postgres extensions; Neon-lineage Lakebase splits page/WAL storage off as Pageserver + Safekeeper services).
- systems/pageserver-safekeeper — the storage-tier components Lakebase inherits from Neon.
- systems/unity-catalog — Databricks' governance substrate; the Account-Console key-configuration flow is the same Account ↔ Workspace shape UC is administered through.
Caveats¶
- Every statement here is sourced from one launch post; Lakebase's own internals (replication, HA, scale-to-zero cold-start times, commit protocol, compute autoscaling policy) have not been documented in the ingested corpus yet. Future ingests should expand this page.
- Neon-lineage is inferred from the Pageserver / Safekeeper terminology used in the CMK post; Databricks' Neon acquisition (2025) was previously logged as skipped in log.md (pure PR) and is not a formally ingested source.
Seen in¶
- sources/2026-05-20-databricks-marketing-campaigns-with-lakebase — Bidirectional governed-data-path face + LFC observability face. First wiki disclosure of three new architectural elements: Synced Tables with three sync modes (snapshot / triggered / continuous) and the load-bearing 10% / 10× rule of thumb canonicalised as the patterns/snapshot-sync-mode-for-batch-rebuild pattern; Lakehouse Sync as a continuous CDC pipeline closing the operational → analytical loop; Local File Cache (LFC) with PREFETCH and FILECACHE query metrics as the storage-compute-boundary observability layer. Concrete production sizing for a bursty marketing-campaign workload (scale-to-0 → 16 CU, ~32 GB RAM) positions Lakebase Autoscaling as the canonical fit for the bursty OLTP shape — "there is a spike in database requests, but otherwise, database utilization is low." Secondary disclosure: OAuth hourly token rotation incompatibility with non-Databricks-aware partners (SAP Engagement Cloud) forces the native-Postgres-password-roles escape hatch. Postgres tuning surface (pg_stat_statements, work_mem 256 MB on larger compute, autovacuum_vacuum_scale_factor for high-churn tables) confirmed unchanged. Customer: Deichmann (European footwear retailer). Tier-3 product post; ingest justified by genuinely-new disclosures (sync modes / LFC / Lakehouse Sync) distinct from prior Lakebase-architecture ingests.
- sources/2026-05-15-databricks-backstage-with-lakebase-part-2 — Eighth Lakebase face: governance-substrate-unified operational DB inside Unity Catalog. Part 2 of the Thoughtworks Backstage-with-Lakebase series. Verbatim: "Because Lakebase is natively embedded inside Databricks, Unity Catalog extends directly over the operational Postgres database … We didn't just change where the data lived; we changed where the access policy lived." The mechanism: Lakehouse Federation exposes the Backstage catalog as foreign catalog lakebase_bs in UC; standard UC GRANTs replace Postgres-level role management; every control-plane Lakebase action lands in system.access.audit; compute costs break down automatically by (project_id, branch_id, endpoint_id) against UC system billing tables (production branch 31.6130 DBU, transient test branch 0.0107 DBU); UC attribute-level masking policies propagate automatically to every Lakebase branch at creation time (concepts/branch-level-governance-propagation). Two open-source Thoughtworks tools deployed as Databricks Apps complete the role-shift: LakebaseOps (three agents replacing 51 DBA tickets, seven scheduled Databricks Jobs replacing pg_cron, 9-KPI adoption dashboard, ten-engine migration wizard) and Lakebase MCP (46-tool MCP server with dual-layer governance — SQL-statement guard + per-tool access guard across four profiles read_only/analyst/developer/admin, plus per-statement tool-tag attribution). Canonical instances of concepts/operational-analytical-governance-unification + patterns/foreign-catalog-federation-for-operational-db-governance. "One SQL query instead of three services." Borderline-include Tier-3 with ~70% architecture density; second of a three-part series (Part 3 = FinOps).
- sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library — Entity-Resolution-state-store + HITL-feedback-store face. Seventh Lakebase face on the wiki: not just agent-provisioned-database / branching-PITR-substrate / langguard-runtime-enforcement-store / 5x-Postgres-writes / app-tier-state-store-without-its-own-credential-surface (the prior six faces) but the transactional asset-mapping store for an Entity Resolution catalog at 17M+ asset scale, with Postgres constraints explicitly load-bearing for ER data integrity. "For the 'Library' to work, the data must be consistent and highly available. Claroty integrates Lakebase, a fully managed transactional data layer on Databricks. Lakebase is built on Postgres and provides the low-latency performance required for real-time queries while maintaining a seamless link to the broader Lakehouse for analytical processing, allowing strict constraints to make sure our data keeps its high quality and ensuring that asset mappings remain accurate even as configurations drift." The HITL composition: "With the Databricks App and Lakebase, we enable a transparent view and a seamless 'human-in-the-loop' feedback cycle. This intuitive interface allows domain experts to review classifications, correct and enrich entities, and feed high-fidelity, validated data back into our MLflow pipelines and R&D migration." Two structural contributions: (a) Postgres FK / unique constraints prevent duplicate CPS-IDs and orphan classifier-version references, not as advisory checks but as enforced data-integrity guarantees; (b) Lakebase + systems/databricks-apps is the end-to-end HITL substrate where SME corrections land transactionally and immediately update the canonical identity. Composes with patterns/orchestrated-multi-agent-entity-resolution (the HITL leg of the multi-agent ER pattern). Canonical wiki instance: systems/claroty-cps-library.
- sources/2026-05-13-databricks-clinical-operations-intelligence-belongs-on-the-lakehouse — Sixth canonical Lakebase ingest; first wiki source naming Lakebase as the app-tier state store inside a Databricks App. New face: not just the OLTP companion to the Lakehouse but the operational-state layer for in-workspace decision-support apps where "a traditional application would require a separately managed RDS instance with its own schema drift, sync jobs, and credential rotation." Three property guarantees the post leans on: (a) provisioned + credentialed by the workspace identity system (the app's service principal is the Lakebase identity, no separate secrets store), (b) scale-to-zero when idle (app-tier state-store costs zero when no users), (c) app-state lives here, analytical state lives in UC (saved shortlists in Lakebase; site features + predictions + SHAP attributions in UC governed Delta tables). Canonical reference implementation: systems/site-feasibility-workbench. Composes with concepts/single-platform-application-architecture thesis — Lakebase is one of the four primitives (with Apps + UC + Genie) that "eliminates the integration layers, not by abstracting them away but by making them unnecessary."
- sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes — Fifth canonical Lakebase ingest + first mechanism-level disclosure of the pageserver's internals. The Databricks Engineering team disabled compute-side Full Page Writes across the global Lakebase + Neon fleet by exploiting concepts/compute-storage-separation — stateless compute streaming WAL to a Paxos-based safekeeper quorum means torn pages structurally don't exist, so the FPW primitive is redundant. To preserve bounded delta-chain replay on reads, image generation was pushed down from the compute's WAL stream into the pageserver (per-page-threshold cadence replacing checkpoint-scoped FPW cadence). Canonicalises two new patterns: patterns/image-generation-pushdown-to-storage (architectural) + patterns/live-wal-protocol-switch-via-xlog-fpw-change (rollout). Quantified outcomes: 94% compute WAL volume reduction (58 KB/txn → <4 KB); 5× write throughput at 32 vCPU on HammerDB TPROC-C (95,686 → 439,300 NOPM); linear compute-size scaling (previously flat due to FPW bottleneck); 30 MB/s → 1 MB/s WAL rate on a 56-vCPU production customer (30× reduction); p99 read latency −30% to −50%; p50 read latency ~−30%; regional fleet WAL down up to 4×; Synced Tables ingestion 17k → 62k rows/sec (3.6×). Rollout: since late March 2026 → globally active 2026-05-07 (~6-week window) via the existing Postgres XLOG_FPW_CHANGE WAL record mechanism with zero customer restarts. New canonical HammerDB wiki reference; pageserver gains image-generation responsibility; Postgres checkpoint cadence canonicalised as distinct from delta-chain-reset cadence for the first time.
- sources/2026-04-30-databricks-backstage-with-lakebase — Fourth canonical wiki source on Lakebase; state-heavy-application Postgres replacement at developer-cycle altitude. Thoughtworks (guest post on Databricks Blog) migrates a Backstage POC off standard Postgres onto Lakebase to demonstrate what happens to database development cycles when branching + PITR become effectively free. Introduces three new load-bearing operational datums on Lakebase: (a) 1.09-second data-plane branch creation for a 63 MB dataset — first wiki branching-throughput measurement at MB-scale granularity; (b) 3.78-second end-to-end PITR with a 12-second WAL-snap-back window — first wiki PITR instance at Lakebase altitude; (c) classic Databricks PATs are rejected; OAuth JWTs are required via databricks postgres generate-database-credential — the intended auth path for apps + CI, bridged in the POC with a 50-minute credential-refresh cron. Canonicalises three new concepts (concepts/point-in-time-recovery, concepts/wal-record-granularity, concepts/mock-object-maintenance-cost), three new patterns (patterns/branching-is-pitr-with-time-now, patterns/database-branch-per-test-over-mocking, patterns/credential-refresh-cron-as-auth-compat-shim), and two new systems (systems/backstage, systems/databricks-postgres-cli). Architectural unification: "branching is just PITR with source_branch_time = now" — same primitive, different time parameter, same 1-4-second latency envelope. Developer-cycle thesis: cheap branching deprecates 20-30% of test code (mock objects), shifts schema-migration validation from staging-deploy to development, removes staging-environment contention as a QA bottleneck. Part 1 of a three-part Thoughtworks series; Parts 2 (Governance) + 3 (FinOps) forthcoming.
- sources/2026-04-29-databricks-and-stripe-projects-infrastructure-built-for-agents — Launch-partner status for Stripe Projects. Third canonical wiki source on Lakebase. Adds three load-bearing datums to the system's wiki canon: (a) the <350 ms agent-provisioning latency for a production-ready Neon/Lakebase Postgres through Stripe Projects (first operational datum on Lakebase's compute-lifecycle at sub-second resolution); (b) the Neon ≡ Lakebase architectural-identity collapse (prior sources framed Neon as inherited-lineage components — Pageserver + Safekeeper — this one names both interchangeably); (c) the three-pillar agent-provisioning substrate contract (scale-to-zero + instant copy-on-write branching + Postgres compatibility), naming them together for the first time as the reason agent-driven infrastructure is viable on this substrate. Compute-storage separation is named as the load-bearing forcing function under all three pillars — "By decoupling compute from storage, agents can create, build, and tear down OLTP databases in seconds." Canonicalises a new concept concepts/agent-provisioned-database (Lakebase/Neon via Stripe Projects as its first canonical instance) and puts Lakebase into two agent-provisioning-protocol-adjacent patterns at provider-side altitude: patterns/agent-provisioning-protocol (as the second launch-side provider class after Cloudflare's domain / Workers resources) and patterns/partner-managed-service-as-native-binding (Stripe-Projects-provisions-Databricks/Neon as a new instance of the agent-symmetric generalisation).
- sources/2026-04-27-databricks-inside-one-of-the-first-production-deployments-of-lakebase-langguard — One of the first production deployments of Lakebase: LangGuard's agentic workflow governance engine. Canonicalises the three-property fit for bursty agentic workloads (scale-to-zero + compute-local millisecond cache + instant copy-on-write branching) and introduces the first concrete production use case for Lakebase's database branching at governance-policy-testing altitude (distinct from the schema-change-testing use case branching is usually associated with). Names QRadar as the empirical prior: "database architecture is destiny" for bursty telemetry-shaped workloads.
- sources/2026-04-20-databricks-take-control-customer-managed-keys-for-lakebase-postgres — Customer-Managed Keys rollout; articulates Lakebase's storage/compute separation as the forcing function for a two-tier envelope-encryption design + per-boot ephemeral compute keys.
- sources/2026-06-12-databricks-enabling-evolutionary-database-development-database-branchin-part3 — Part 3 of the evolutionary-database-development series. Introduces the tier topology (six environments collapse to one parent with long-running branches), promotion-is-merge deployment model, DBA-to-platform-engineer role evolution, and the Lakebase App Dev Kit with its five-state SCM state machine for agent + human governance. Neon data: ~500K branches/day, 80%+ agent-created.