Databricks: How Lakebase architecture delivers 5x faster Postgres writes
Databricks Engineering post (2026-05-07) on the Lakebase / Neon team eliminating the Full Page Write tax in Postgres by moving full page image generation out of the compute's WAL stream and into the distributed storage layer. This is the fifth canonical Lakebase ingest on the wiki after CMK (2026-04-20), LangGuard (2026-04-27), Stripe Projects (2026-04-29), and Backstage (2026-04-30) — and the first mechanism-level disclosure of the pageserver's internals beyond the name-level framing that prior sources established.
Architectural payoff quantified: up to 5× write throughput (4.6× NOPM measured at 32 vCPU),
94% WAL-traffic reduction, p99 read latency down 30–50%,
p50 read latency down ~30%, Synced Tables ingestion 17k → 62k
rows/sec (3.6×). Rolled out across the global Lakebase + Neon
fleet "since late March" via the existing Postgres
XLOG_FPW_CHANGE WAL record with no customer restarts.
One-paragraph summary
Classical Postgres's durability design has a hidden tax: after every
checkpoint, the first modification
to any 8 KB page writes the entire page into the WAL — a Full
Page Write (FPW) — so that recovery can repair a
torn page (partial disk write across a crash
boundary) without needing to trust the on-disk copy. On
write-heavy workloads FPW can inflate log volume by up to 15×
and becomes the system's biggest bottleneck. In Lakebase's
compute-storage-separated architecture, compute is stateless and
streams WAL to a Paxos-based quorum of safekeepers, so there is
no local-disk page that can tear — the failure mode FPW was designed
for does not exist. Naively disabling FPW, however, creates an
unbounded-delta-chain problem on reads: without periodic full page
images in the log, the pageserver has to replay an ever-longer
chain of small deltas to reconstruct a page. The team resolved this
by pushing image generation down from the compute's WAL stream
into the pageserver: when a page accumulates more than a
configurable threshold of delta records without an intervening
image, the pageserver generates one itself. Because the decision is
based on actual changes to a page rather than the unrelated
checkpoint process, image generation is both better-targeted
and shared across multiple pageservers in the background.
Compute now sends only compact deltas (58 KB/transaction →
under 4 KB — 94% reduction). Rolled out fleet-wide via the existing
Postgres XLOG_FPW_CHANGE WAL record mechanism with zero customer
restarts.
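The read path and the threshold rule described above can be sketched as a toy model. This is a minimal illustration, not the pageserver's implementation: `ToyPageserver`, `PageHistory`, the byte-overwrite delta format, and the threshold value of 4 are all assumptions (the post does not disclose the real threshold), and the real system generates images in the background rather than inline on ingest.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192
IMAGE_THRESHOLD = 4  # hypothetical value; the post does not disclose the real one


@dataclass
class PageHistory:
    """Toy per-page state: one base image plus the deltas logged since it."""
    image: bytes = bytes(PAGE_SIZE)
    deltas: list = field(default_factory=list)  # (offset, data) pairs


def apply_delta(page: bytes, delta) -> bytes:
    """Apply a simplified delta: overwrite `data` at `offset`."""
    offset, data = delta
    return page[:offset] + data + page[offset + len(data):]


class ToyPageserver:
    def __init__(self, threshold: int = IMAGE_THRESHOLD):
        self.threshold = threshold
        self.pages = {}
        self.images_generated = 0

    def ingest(self, page_id: int, offset: int, data: bytes) -> None:
        """Receive a compact delta from compute (no full page images in the stream)."""
        hist = self.pages.setdefault(page_id, PageHistory())
        hist.deltas.append((offset, data))
        # Storage-side rule: too many deltas without an image, so make one.
        # Done inline here for simplicity; the post describes background work.
        if len(hist.deltas) > self.threshold:
            self._materialize(hist)

    def _materialize(self, hist: PageHistory) -> None:
        page = hist.image
        for delta in hist.deltas:
            page = apply_delta(page, delta)
        hist.image, hist.deltas = page, []
        self.images_generated += 1

    def read(self, page_id: int):
        """Reconstruct a page: latest image plus replay of the remaining deltas."""
        hist = self.pages.get(page_id, PageHistory())
        page = hist.image
        for delta in hist.deltas:
            page = apply_delta(page, delta)
        return page, len(hist.deltas)


ps = ToyPageserver()
for i in range(11):
    ps.ingest(page_id=1, offset=i, data=bytes([i + 1]))
page, replayed = ps.read(1)
```

However many deltas arrive, a read in this model never replays more than `threshold` of them, which is the bounded-read property the pushdown is meant to preserve.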
Key takeaways
- FPW is a torn-page insurance premium compute no longer needs. Verbatim: "In the lakebase architecture, your compute is stateless. It does not rely on a local data directory. Instead, it streams WAL to a Paxos-based quorum of safekeepers. Because there is no local-disk page to tear, the failure mode FPW was designed to prevent simply does not exist." This is the load-bearing architectural consequence of concepts/compute-storage-separation specific to Postgres's durability design, one that neither Neon's academic paper nor prior Lakebase posts had foregrounded as a performance primitive. (Source: sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes)
- FPW inflates WAL by up to 15× in write-heavy workloads. Verbatim: "on write-heavy applications, logging entire 8KB pages can inflate log volume by up to 15x, often becoming the system's biggest performance bottleneck." First wiki quantification of the FPW write-amplification factor as a ceiling number. The corresponding post-image-pushdown measurement: 58 KB/transaction → <4 KB/transaction, a 94% reduction in compute-emitted WAL volume.
- Turning FPW off without remediation creates unbounded delta chains. Verbatim: "Without those periodic full page images in the log, the storage layer would have to replay an infinitely long chain of small deltas to reconstruct a page for a read request. What was once a bounded O(checkpoint frequency) replay becomes an unbounded chain, leading to a spike in read latency and resource consumption." Canonicalises the concepts/delta-chain-replay bounded-vs-unbounded property: FPW doubles as a periodic reset point for the delta chain on the read path, even though its stated purpose is torn-page recovery on the write path. Any architecture that removes FPW must provide a replacement reset-point mechanism or read latency regresses.
- Image generation pushdown: the central architectural innovation. Verbatim: "We solved this by moving the intelligence from the compute node to the storage layer. We call this image generation pushdown. When Postgres compute requests a page from storage, the pageserver (a component of the Lakebase distributed storage system) reconstructs it by finding the most recent materialized image of that page and replaying any WAL deltas on top. … To avoid this problem we pushed down the image-generation responsibility from the compute's WAL stream into the storage layer, preserving the bounded read behavior of storage while still eliminating the WAL overhead on the compute. The pageserver now generates full page images when a page has accumulated more delta records than a configured threshold without an intervening image." Canonicalises patterns/image-generation-pushdown-to-storage as a first-class pattern: work that was previously embedded in the compute's WAL stream (for torn-page reasons) is structurally better placed on the storage side (which is where reads materialise pages anyway). Three named benefits: (a) network efficiency: 94% WAL reduction; (b) scalability: image generation shared across multiple pageservers in the background; (c) optimal reads: image cadence tied to actual page-change rate, not unrelated checkpoint cadence.
- Compute-size scaling of throughput becomes linear (and the old architecture's didn't). HammerDB TPROC-C (TPC-C-derived OLTP benchmark) New Orders Per Minute (NOPM):
| Compute size | Before (NOPM) | After (NOPM) | Gain (After / Before) |
|---|---|---|---|
| 4 vCPU | 78,876 | 94,891 | 1.2× |
| 16 vCPU | 95,832 | 269,189 | 2.8× |
| 32 vCPU | 95,686 | 439,300 | 4.6× |
Notice the flat pre-change curve between 16 and 32 vCPU (95,832 → 95,686) — compute resources were not being used. The FPW bottleneck capped throughput before CPU did. Verbatim: "On a 32 vCPU compute, the improvement exceeded 450%. … By removing Postgres's FPW bottleneck, we allowed throughput to scale linearly with compute resources. This is something monolithic Postgres struggles to do under heavy write load."
- Production validation on a 56 vCPU customer: WAL 30 MB/s → 1 MB/s. Verbatim: "In a production environment for a high-profile 56 vCPU project, enabling image pushdown reduced steady-state WAL generation from 30 MB/s to just 1 MB/s." 30× WAL-rate reduction on a single customer's real workload. Directly correlates to increased transaction throughput during daily peaks.
- Read-path dividend. Verbatim: "By optimizing the delta chains, the number of WAL records that must be applied per read dropped significantly. We saw p99 read latencies drop by 30% to 50% and p50 latencies drop by approximately 30%." At regional fleet level: "total amount of WAL generated by computes drop by up to 4x. P99 latency of reads from the storage engine improved by up to 3x and became much more stable." Both the write-tax elimination (via image-generation pushdown into a cadence tied to actual page change) and the read-tax elimination (via better delta-chain reset cadence) come from the same architectural change.
- Synced Tables ingestion 17k → 62k rows/sec (3.6×). Verbatim: "For data-intensive Synced Tables, the impact was immediate. One customer saw ingestion throughput jump from 17k rows per second to 62k rows per second, which is a 3x increase, simply by enabling image pushdown." Canonical worked example of the same architectural change paying off at a higher-level product primitive.
- Seamless zero-downtime rollout via existing Postgres XLOG_FPW_CHANGE WAL record. Verbatim: "The change was applied to running computes via our control plane and storage system, which coordinated the transition automatically. This was achieved using the existing Postgres XLOG_FPW_CHANGE WAL record mechanism, meaning no restarts or interruptions were required for our customers." Canonicalises patterns/live-wal-protocol-switch-via-xlog-fpw-change: rolling out a breaking change to the WAL protocol contract between compute and storage by piggybacking on an existing Postgres control record that both sides already understand. "Since late March" = ~6-week rollout window across the entire Lakebase Serverless + Neon global fleet.
- Neon-lineage open-source reference. The post links to "Deep dive into Neon storage engine" as the deeper mechanism reference for pageserver page reconstruction, and to "recent storage performance improvements at Neon" as the broader arc this post is one step of. Companion post: "Zero-downtime patching: Lakebase Part 1 — prewarming". Image-generation pushdown is one of several "move heavy-lifting tasks away from your transactions and into our scalable background storage stack" efforts.
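The FPW rule behind these takeaways (the first modification of a page after a checkpoint logs the whole 8 KB page, later ones log only a delta) can be modelled in a few lines. This is a simplified sketch under stated assumptions: checkpoints are triggered every fixed number of writes rather than by time or WAL volume, a full page image costs exactly 8192 bytes, and the workload and the 100-byte delta size are invented for illustration. It is not meant to reproduce the post's 15× or 58 KB figures.

```python
PAGE_SIZE = 8192  # Postgres page size discussed in the post


def wal_bytes(writes, checkpoint_every, delta_size, full_page_writes):
    """Estimate WAL volume for a sequence of page modifications.

    writes: page ids in modification order (hypothetical workload).
    checkpoint_every: a checkpoint is assumed to complete every N writes.
    """
    touched_since_checkpoint = set()
    total = 0
    for i, page_id in enumerate(writes):
        if i % checkpoint_every == 0:
            touched_since_checkpoint.clear()  # new checkpoint interval starts
        if full_page_writes and page_id not in touched_since_checkpoint:
            total += PAGE_SIZE   # first touch after a checkpoint: full page image
        else:
            total += delta_size  # later touches: compact delta only
        touched_since_checkpoint.add(page_id)
    return total


# Invented workload: 10,000 updates cycling over 1,000 distinct pages.
writes = [i % 1000 for i in range(10_000)]
with_fpw = wal_bytes(writes, checkpoint_every=2_000, delta_size=100, full_page_writes=True)
without_fpw = wal_bytes(writes, checkpoint_every=2_000, delta_size=100, full_page_writes=False)
print(with_fpw, without_fpw)
```

The model makes the dependency visible: the FPW cost is driven by how many distinct pages are touched per checkpoint interval, not by how much each page actually changes.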
Systems / concepts / patterns extracted
Systems
- Lakebase (extended) — the managed serverless Postgres product that inherits this architectural improvement from the Neon-lineage storage engine. First wiki disclosure of a specific performance-engineering axis on Lakebase beyond the agent-provisioning / encryption / branching axes already canonicalised from earlier sources.
- Pageserver + Safekeeper (extended) — the pageserver gains a new canonical responsibility (image generation pushdown) distinct from its pre-existing page-reconstruction-on-read role. The safekeeper's Paxos-based quorum durability is also named explicitly for the first time in the wiki corpus.
- PostgreSQL (extended): the FPW design + checkpoint cadence + XLOG_FPW_CHANGE control record are the Postgres primitives Lakebase is working with and against. New wiki disclosure of the 15×-WAL-inflation ceiling and the XLOG_FPW_CHANGE record as a live-rollout vehicle.
- HammerDB (new): the TPC-C-derived OLTP benchmark tool used for the scaling measurement; first wiki-canonical reference.
Concepts
- Postgres Full Page Write (FPW) (new) — the durability primitive: first modification of a page after a checkpoint writes the entire 8 KB page into WAL as a torn-page-recovery backup. Up to 15× WAL inflation on write-heavy workloads is canonicalised as the cost; the "architecturally-unnecessary when compute has no local disk" framing is the architectural innovation of this source.
- Torn page (new): the failure mode FPW was designed to prevent. A crash partway through writing an 8 KB page leaves an on-disk page that is partly old and partly new, and replaying WAL deltas over such a page would corrupt it permanently. The "doesn't exist in compute-storage-separated architectures where compute has no local disk" framing is load-bearing.
- Postgres checkpoint (new): "a milestone marker in the log" (Databricks' framing), as distinct from a snapshot. During a checkpoint, Postgres flushes modified pages from memory to disk up to a specific log point. FPW is scoped per checkpoint interval: once a page has had its post-checkpoint FPW, subsequent modifications within the same interval log only the delta. Canonicalises the asymmetry between the stated purpose of checkpoints (bounding WAL replay on recovery) and their incidental side effect of setting the FPW cadence.
- Delta chain replay (new) — the read-path reconstruction mechanism: find the most recent materialised image of a page, then apply accumulated WAL deltas on top. The bounded vs unbounded property is the load-bearing one: with periodic images, replay cost is bounded by checkpoint cadence; without them, it's unbounded. Canonicalises image-generation threshold as the architectural knob.
- concepts/compute-storage-separation (extended) — new axis named: separation enables structural elimination of durability primitives that existed to handle local-disk failure modes. The FPW primitive is the first canonical wiki instance where compute-storage separation deletes (not just relocates) work from the compute side.
- concepts/wal-record-granularity (extended): the XLOG_FPW_CHANGE WAL record is a first-class Postgres control record canonical to this page; extends the wiki's existing WAL-record-granularity coverage with the observation that control records (not just data records) participate in WAL.
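The torn-page concept above can be made concrete with a small simulation. This is an illustrative sketch only: the 512-byte atomic-write unit, the page layout (item count in byte 0, fixed-size items starting at offset 4096), and the helper names are all assumptions, not Postgres's real page format.

```python
PAGE_SIZE = 8192
SECTOR_SIZE = 512   # assumed atomic-write unit, for illustration only
ITEM_AREA = 4096    # hypothetical layout: item count in byte 0, items from here
ITEM_SIZE = 16


def make_page(items):
    """Build a toy page holding fixed-size items (each at most ITEM_SIZE bytes)."""
    page = bytearray(PAGE_SIZE)
    page[0] = len(items)
    for i, item in enumerate(items):
        start = ITEM_AREA + i * ITEM_SIZE
        page[start:start + ITEM_SIZE] = item.ljust(ITEM_SIZE, b"\0")
    return bytes(page)


def torn_write(old_page, new_page, sectors_written):
    """On-disk result if a crash hits after only `sectors_written` sectors land."""
    cut = sectors_written * SECTOR_SIZE
    return new_page[:cut] + old_page[cut:]


old_page = make_page([b"alpha", b"beta"])
new_page = make_page([b"alpha", b"beta", b"gamma"])

# Crash after the first sector: the header is new, the item area is still old.
on_disk = torn_write(old_page, new_page, sectors_written=1)

claimed_items = on_disk[0]
third_slot = on_disk[ITEM_AREA + 2 * ITEM_SIZE: ITEM_AREA + 3 * ITEM_SIZE]
print(claimed_items, third_slot == bytes(ITEM_SIZE))
```

The torn page claims three items but the third slot is empty, so it matches neither the old nor the new version. A full page image in the log lets recovery overwrite such a page wholesale instead of trusting it, which is the insurance FPW provides on a local disk.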
Patterns
- patterns/image-generation-pushdown-to-storage (new) — move periodic image-materialisation work from the compute's WAL stream into the storage tier's background processing. Canonical reusable architectural shape: anywhere a write-side primitive exists to bound a read-side replay cost, on a separated-storage-compute substrate that primitive belongs on the storage side. When-fits / when-doesn't criteria + three-benefit analysis (network / scalability / read-cadence) + failure modes canonicalised.
- patterns/live-wal-protocol-switch-via-xlog-fpw-change (new): roll out a breaking change to the WAL protocol contract between compute and storage by piggybacking on an existing Postgres control record (XLOG_FPW_CHANGE) that both sides already understand. No customer restart required because the control record is a pre-existing Postgres concept; the control-plane and storage-system coordinate the flip atomically per-compute via this record.
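One way to picture the live-switch pattern is a storage-side consumer that changes mode at the exact log position where the control record appears. The sketch below is hypothetical throughout: the post does not describe how the pageserver reacts to the record, and `WalRecord`, `ToyStorageConsumer`, and the record kinds are invented for illustration. The only grounded detail is that the upstream XLOG_FPW_CHANGE record carries the new full_page_writes setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WalRecord:
    kind: str                           # "delta", "fpi", or "fpw_change"
    page_id: Optional[int] = None
    fpw_enabled: Optional[bool] = None  # payload carried by an fpw_change record


class ToyStorageConsumer:
    """Hypothetical storage-side WAL consumer; not the actual pageserver logic."""

    def __init__(self):
        self.compute_sends_images = True  # classical behaviour until told otherwise
        self.modes_seen = []              # (record kind, who owns image generation)

    def consume(self, rec: WalRecord) -> None:
        if rec.kind == "fpw_change":
            # The control record marks an exact point in the log where the
            # contract flips, so no restart or out-of-band signal is needed.
            self.compute_sends_images = bool(rec.fpw_enabled)
            return
        owner = "compute" if self.compute_sends_images else "storage"
        self.modes_seen.append((rec.kind, owner))


stream = [
    WalRecord("fpi", page_id=1),
    WalRecord("delta", page_id=1),
    WalRecord("fpw_change", fpw_enabled=False),
    WalRecord("delta", page_id=1),
    WalRecord("delta", page_id=2),
]
consumer = ToyStorageConsumer()
for rec in stream:
    consumer.consume(rec)
```

Because the switch is itself a log record, every record before it is interpreted under the old contract and every record after it under the new one, which is what makes an in-place rollout on running computes plausible.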
Operational numbers
HammerDB TPROC-C benchmark (New Orders Per Minute)
| Compute size | Before (NOPM) | After (NOPM) | Gain (After / Before) | WAL/txn before | WAL/txn after |
|---|---|---|---|---|---|
| 4 vCPU | 78,876 | 94,891 | 1.2× | n/a | n/a |
| 16 vCPU | 95,832 | 269,189 | 2.8× | n/a | n/a |
| 32 vCPU | 95,686 | 439,300 | 4.6× | 58 KB | <4 KB |
- WAL per transaction: 58 KB → <4 KB (94% reduction).
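- Pre-change 16v → 32v scaling: flat (95,832 → 95,686 NOPM; the added compute went unused).
- Post-change 16v → 32v scaling: 269,189 → 439,300 NOPM, a 1.63× gain for 2× the vCPUs, which the post describes as linear scaling with compute.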
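The ratios in this section can be re-derived directly from the published NOPM figures. The snippet below only does arithmetic on numbers quoted above; the variable names are ad hoc, and the WAL reduction is computed as a lower bound because the post gives the after figure only as "under 4 KB".

```python
# NOPM figures from the table above: vCPU -> (before, after)
benchmark = {4: (78_876, 94_891), 16: (95_832, 269_189), 32: (95_686, 439_300)}

# Gain per compute size (after / before)
gain = {vcpu: round(after / before, 2) for vcpu, (before, after) in benchmark.items()}

# Throughput ratio when doubling compute from 16 to 32 vCPU
scaling_before = round(benchmark[32][0] / benchmark[16][0], 2)
scaling_after = round(benchmark[32][1] / benchmark[16][1], 2)

# WAL per transaction: 58 KB before, strictly under 4 KB after,
# so the reduction is at least this fraction.
min_wal_reduction = round(1 - 4 / 58, 3)

print(gain, scaling_before, scaling_after, min_wal_reduction)
```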
Production datapoints
- 56 vCPU production customer: WAL 30 MB/s → 1 MB/s (30× reduction).
- p99 read latency: −30% to −50% at per-customer altitude.
- p50 read latency: ~−30% at per-customer altitude.
- Regional fleet WAL generation: up to 4× drop.
- Regional fleet p99 storage-engine read latency: up to 3× improvement, much more stable.
- Synced Tables ingestion throughput (one customer): 17k rows/sec → 62k rows/sec (3.6×).
Rollout
- Since late March 2026 → active across all Lakebase Serverless + Neon databases globally by 2026-05-07 (~6-week rollout window).
- Mechanism: existing Postgres XLOG_FPW_CHANGE WAL record.
- Zero customer restarts or interruptions.
Caveats
- Tier-3 + vendor-performance-framing. Databricks Blog is Tier-3 per AGENTS.md; the post's numbers are all directionally favourable to the product and the framing is explicitly performance-marketing-adjacent ("the Postgres write tax is officially a thing of the past"). The mechanism disclosure is substantive enough to pass scope (canonicalises two reusable patterns + four new concepts + one new system + extends five existing pages), but absolute numbers should be read with vendor-benchmark discount.
- Image-generation threshold is not disclosed. The post says only "more delta records than a configured threshold without an intervening image"; the specific threshold value is not shared, nor is whether it is fleet-global or tunable per workload. Future Neon engineering posts may disclose it.
- Write-path side effects on the pageserver not quantified. Image-generation pushdown moves CPU + IO work onto the pageserver tier. The post notes "Image generation for a project branch is now shared across multiple pageservers in the background" but does not quantify pageserver CPU / IO / object-storage write amplification cost. "Bottleneck shifted from compute's WAL path to storage's background work" — and the storage-tier cost-model is not disclosed here.
- Read latency numbers are for "storage-engine reads" — i.e. reads that hit the pageserver (missed the compute-local cache). Client-visible query latency depends on cache hit rate; post doesn't disclose what fraction of queries go through pageserver vs serve from compute-local cache.
- HammerDB TPROC-C is a synthetic OLTP benchmark. It is derived from TPC-C, which is a warehouse-ordering workload — write-heavy with some read. Not every workload's WAL-FPW profile matches TPC-C's; some workloads (e.g. append-only ingestion, read-heavy OLAP) would see different or no gains.
- Postgres-internal WAL mechanism details are simplified for a product-marketing audience. Readers wanting the full mechanism should follow the Deep dive into Neon storage engine link + upstream Postgres source for XLOG_FPW_CHANGE semantics.
- Scope of FPW-disable. The post does not discuss whether FPW is disabled entirely on compute or only conditionally; whether any workload still needs compute-side FPW; or how Lakebase's classical (non-serverless) fleet is affected if there is one.
Source
- Original: https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes
- Raw markdown: raw/databricks/2026-05-07-how-lakebase-architecture-delivers-5x-faster-postgres-writes-14329493.md
- Linked "Deep dive into Neon storage engine": https://neon.com/blog/get-page-at-lsn
- Linked "recent storage performance improvements at Neon": https://neon.com/blog/recent-storage-performance-improvements-at-neon
- Companion "Zero-downtime patching: Lakebase Part 1 — prewarming": https://www.databricks.com/blog/zero-downtime-patching-lakebase-part-1-prewarming
- Related "What is a lakebase": https://www.databricks.com/blog/what-is-a-lakebase
- Neon architecture overview: https://neon.com/docs/introduction/architecture-overview
- HammerDB project: https://www.hammerdb.com/
Related
- companies/databricks — Databricks Engineering blog hub.
- systems/lakebase — the product that inherits this improvement.
- systems/pageserver-safekeeper — the storage-tier components that now own image generation.
- systems/postgresql: upstream substrate whose FPW + checkpoint + XLOG_FPW_CHANGE primitives are the ones being worked with and around.
- sources/2026-04-30-databricks-backstage-with-lakebase: sibling Lakebase ingest seven days earlier; that one introduced concepts/point-in-time-recovery and concepts/wal-record-granularity on Lakebase at the operational-disclosure altitude, while this one is the performance-engineering-axis sibling at the mechanism-disclosure altitude.
- sources/2026-04-29-databricks-and-stripe-projects-infrastructure-built-for-agents — Stripe Projects / agent-provisioning axis of Lakebase.
- sources/2026-04-27-databricks-inside-one-of-the-first-production-deployments-of-lakebase-langguard — LangGuard bursty-workload axis of Lakebase.
- sources/2026-04-20-databricks-take-control-customer-managed-keys-for-lakebase-postgres — CMK / two-tier-encryption axis of Lakebase.
- sources/2026-05-06-databricks-rethinking-distributed-systems-for-serverless-performance — sibling same-week Databricks architectural post on the serverless Spark side of the Databricks platform; together these two posts demonstrate Databricks is publishing mechanism-level disclosures at the Lakebase (OLTP) and Serverless Compute (OLAP) altitudes in the same week.