PATTERN Cited by 1 source
Image-generation pushdown to storage
Intent
Move periodic full page image generation from the compute node's WAL stream into the distributed storage layer's background processing. The compute side then sends only compact WAL deltas; the storage side decides when to materialise a new image based on actual page-change rate rather than an unrelated compute-side cadence (e.g. Postgres checkpoints).
Canonical instance: Lakebase / Neon, 2026-05-07
When Postgres compute requests a page from storage, the pageserver (a component of the Lakebase distributed storage system) reconstructs it by finding the most recent materialized image of that page and replaying any WAL deltas on top. … To avoid this problem we pushed down the image-generation responsibility from the compute's WAL stream into the storage layer, preserving the bounded read behavior of storage while still eliminating the WAL overhead on the compute. The pageserver now generates full page images when a page has accumulated more delta records than a configured threshold without an intervening image.
(Source: sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes)
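The read path and the threshold rule quoted above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the pageserver's implementation: `PageStore`, `PageHistory`, `set_key`, and `image_threshold` are hypothetical names, pages are plain dicts, and deltas are functions that return an updated page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration: a page is a dict, a delta maps page -> page.
Page = dict
Delta = Callable[[Page], Page]


@dataclass
class PageHistory:
    """Per-page storage state: the latest image plus the deltas recorded since."""
    image: Page = field(default_factory=dict)
    deltas: List[Delta] = field(default_factory=list)


class PageStore:
    def __init__(self, image_threshold: int):
        # Assumed tuning knob; the source only says "a configured threshold".
        self.image_threshold = image_threshold
        self.pages: Dict[int, PageHistory] = {}
        self.images_generated = 0

    def ingest_delta(self, page_id: int, delta: Delta) -> None:
        """Write path: append a compact delta; compute sends no page image."""
        self.pages.setdefault(page_id, PageHistory()).deltas.append(delta)

    def read_page(self, page_id: int) -> Page:
        """Read path: start from the latest image and replay deltas on top."""
        hist = self.pages.get(page_id, PageHistory())
        page = dict(hist.image)
        for delta in hist.deltas:
            page = delta(page)
        return page

    def background_pass(self) -> None:
        """Materialise an image for every page whose chain exceeds the threshold."""
        for page_id, hist in self.pages.items():
            if len(hist.deltas) > self.image_threshold:
                hist.image = self.read_page(page_id)
                hist.deltas = []
                self.images_generated += 1


def set_key(key: str, value: int) -> Delta:
    """Hypothetical delta: set one key on the page."""
    return lambda page: {**page, key: value}


store = PageStore(image_threshold=3)
for i in range(5):
    store.ingest_delta(7, set_key(f"k{i}", i))   # hot page: 5 deltas
store.ingest_delta(8, set_key("x", 1))           # cold page: 1 delta

before = store.read_page(7)
store.background_pass()
after = store.read_page(7)
```

In this model the page contents are identical before and after the background pass; only the amount of replay needed on the next read changes.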
Measured outcome on HammerDB TPROC-C:
- 94% reduction in WAL volume emitted by compute (58 KB/txn → <4 KB/txn).
- Roughly 4.6× write throughput at 32 vCPU (95,686 → 439,300 NOPM), which the source headline rounds to 5×.
- Write throughput scales linearly with compute size (previously flat from 16 to 32 vCPU).
- p99 read latency down 30–50%; p50 down roughly 30%.
- Production customer datum: WAL rate 30 MB/s → 1 MB/s (30× reduction on a 56 vCPU workload).
- Synced Tables ingestion: 17k → 62k rows/sec (3.6×).
- Rolled out across the global Lakebase + Neon fleet in ~6 weeks (late-March 2026 → 2026-05-07) with zero customer restarts via the patterns/live-wal-protocol-switch-via-xlog-fpw-change pattern.
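As a quick sanity check, the ratios in the list above can be recomputed from the quoted endpoints. The only assumption is that "<4 KB/txn" is treated as an upper bound of exactly 4 KB, so the computed WAL reduction is a floor rather than the measured value.

```python
# Endpoints as quoted in the list above.
wal_before_kb, wal_after_kb = 58.0, 4.0          # "<4 KB" taken as 4 (upper bound)
nopm_before, nopm_after = 95_686, 439_300
prod_before_mb_s, prod_after_mb_s = 30.0, 1.0
rows_before, rows_after = 17_000, 62_000

# Reduction is at least this much, since the "after" figure is an upper bound.
wal_reduction_floor = 1 - wal_after_kb / wal_before_kb

# Throughput, production WAL rate, and ingestion ratios.
throughput_ratio = nopm_after / nopm_before
prod_wal_ratio = prod_before_mb_s / prod_after_mb_s
ingest_ratio = rows_after / rows_before

# WAL per transaction implied by an exact 94% reduction from 58 KB.
implied_wal_after_kb = wal_before_kb * (1 - 0.94)

print(f"WAL reduction >= {wal_reduction_floor:.1%}")
print(f"throughput ratio = {throughput_ratio:.2f}x")
print(f"production WAL ratio = {prod_wal_ratio:.0f}x")
print(f"ingestion ratio = {ingest_ratio:.2f}x")
print(f"implied WAL after = {implied_wal_after_kb:.2f} KB/txn")
```

The quoted 94% figure corresponds to roughly 3.5 KB/txn, which is consistent with the "<4 KB/txn" endpoint.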
Three named benefits
The 2026-05-07 post names three structural benefits of the pushdown:
- Network efficiency. "The compute sends only the compact deltas, which are the actual changes, leading to a 94% reduction in traffic in our benchmarks." On compute-storage-separated architectures every WAL byte crosses the network, so dropping the full-page component cuts bandwidth directly.
- Scalability. "Work is moved from the single Postgres writer to the distributed, independently scalable storage layer. Image generation for a project branch is now shared across multiple pageservers in the background." Work that previously sat on the write path of the single Postgres writer now runs in parallel across the storage fleet.
- Optimal reads. "When images are generated is now based on actual changes to a page rather than the unrelated Postgres checkpoint process." Per-page-threshold decisions match work to workload instead of applying checkpoint-scoped cadence to every page uniformly.
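The "optimal reads" point can be illustrated with a toy comparison of the two image-generation policies on a skewed workload. This is a deliberately simplified sketch: the workload shape, the function names, and the threshold of 10 are all assumptions, and the threshold policy runs synchronously here whereas the real pageserver works in the background.

```python
from collections import defaultdict


def build_workload(intervals=4, hot_writes=100, cold_pages=50):
    """One list of written page ids per checkpoint interval.

    Page 0 is hot (many writes); pages 1..cold_pages get one write each.
    """
    return [[0] * hot_writes + list(range(1, cold_pages + 1))
            for _ in range(intervals)]


def checkpoint_policy(workload):
    """FPW-style: the first write to a page after each checkpoint carries an image."""
    images, max_chain = 0, 0
    for writes in workload:
        chain = {}  # pages seen since the last checkpoint -> deltas since image
        for page in writes:
            if page not in chain:
                images += 1
                chain[page] = 0
            else:
                chain[page] += 1
                max_chain = max(max_chain, chain[page])
    return images, max_chain


def threshold_policy(workload, threshold):
    """Per-page: materialise an image once a page exceeds `threshold` deltas."""
    images, max_chain = 0, 0
    chain = defaultdict(int)
    for writes in workload:
        for page in writes:
            chain[page] += 1
            if chain[page] > threshold:
                images += 1
                chain[page] = 0
            max_chain = max(max_chain, chain[page])
    return images, max_chain


workload = build_workload()
cp_images, cp_max_chain = checkpoint_policy(workload)
th_images, th_max_chain = threshold_policy(workload, threshold=10)
print("checkpoint cadence:", cp_images, "images, longest chain", cp_max_chain)
print("per-page threshold:", th_images, "images, longest chain", th_max_chain)
```

In this toy workload the checkpoint cadence spends most of its images on cold pages that change once per interval while still leaving a long chain on the hot page; the per-page threshold generates fewer images and keeps the hot page's chain bounded.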
Structure
Classical Postgres / Neon before pushdown:
┌─────────┐ WAL (58 KB/txn with FPW)                  ┌──────────────┐
│ compute │ ─────────────────────────────────────────▶│  safekeeper  │
│         │ includes: deltas + full page images       │ + pageserver │
└─────────┘ triggered by CHECKPOINT cadence           └──────────────┘
Lakebase after image-generation pushdown:
┌─────────┐ WAL (<4 KB/txn, deltas only)              ┌──────────────┐
│ compute │ ─────────────────────────────────────────▶│  safekeeper  │
│         │ includes: deltas only                     │              │
└─────────┘                                           └──────┬───────┘
                                                             │
                                                             ▼
                                                      ┌──────────────┐
                                                      │  pageserver  │
                                                      │  generates   │
                                                      │  images in   │
                                                      │  background  │
                                                      │  when a page │
                                                      │  accumulates │
                                                      │  N deltas    │
                                                      └──────────────┘
Underlying architectural principle
The pattern is a specific case of a broader principle:
Work that was embedded in the compute side to handle a failure mode that no longer exists on compute-storage-separated architectures should be relocated to the storage side or eliminated entirely.
Here, compute-side full-page writes (FPW) were designed to tolerate torn pages on the local-disk page heap. On stateless compute with WAL streamed to a Paxos-based safekeeper quorum, there is no local-disk page to tear: the failure mode doesn't exist, and compute-side FPW can be disabled. But FPW had an incidental read-path role (bounding the delta-chain length that the pageserver must replay when it reconstructs a page at read time), which the pushdown preserves by moving image generation to the storage tier.
When it fits
- Compute-storage-separated databases where compute emits a write log (WAL / redo log / binlog) consumed by a separate storage tier.
- Where the write log contains both delta records and larger periodic-reset records (FPW / FPI / full-record snapshots).
- Where the reset records exist to bound read-path replay cost on the storage side (even if their stated purpose is write-path recovery).
- Where the storage tier can generate reset records out-of-band from the write stream, based on its own observation of per-object change rate.
- Where the storage tier has horizontal scalability — image-generation can be parallelised across multiple nodes.
When it doesn't fit
- Classical monolithic databases where compute and storage are the same process — pushing work to the "storage layer" has no effect because there isn't a separate process.
- Write logs that must be self-contained for regulatory reasons — e.g. forensic-replay requirements where the WAL alone must fully describe state at every point. Pushdown splits the state-reconstruction surface across two tiers.
- Architectures where the storage-tier workload is already the bottleneck — pushdown adds work to storage; if storage is saturated, this moves the bottleneck rather than relieves it.
- When the reset-record cadence on compute is fundamentally tied to a write-path property that isn't moving. E.g. if a reset record is also the cross-region sync primitive and other tiers depend on its compute-side-emitted timing.
- During the rollout window itself, unless a switch mechanism exists: changing the WAL protocol contract between compute and storage is dangerous, and it requires either a live-switch mechanism (see patterns/live-wal-protocol-switch-via-xlog-fpw-change) or downtime.
Failure modes
- Image-generation falls behind write rate. Storage tier accumulates long delta chains because image generation isn't keeping up. Read latency regresses. Mitigations: raise image generation parallelism, lower the threshold, add back-pressure on write rate.
- Image-generation over-runs. Too-aggressive image-generation wastes storage-tier CPU and object-storage write bandwidth. Mitigation: tune threshold up; add adaptive rate-limiting.
- Per-page-threshold tuning fleet-wide vs per-workload. A single global threshold may not fit both write-heavy hot-page workloads and write-light cold-page workloads. Mitigation: a per-workload / per-project tuning surface (Databricks does not disclose whether this is available).
- Rollout-window split-brain. During rollout, some computes have FPW on and some have FPW off; pageserver must handle both. Mitigation: use the patterns/live-wal-protocol-switch-via-xlog-fpw-change control record to make each compute's switch atomic and visible to the storage side.
- Compute-local cache invalidation. If the compute-local cache held a page that was materialised differently after a pushdown image was generated, cache coherence requires care. Not addressed in the 2026-05-07 post; likely handled by the existing Neon cache-invalidation machinery.
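The first two failure modes above both show up in per-page delta-chain lengths, so a storage-side health check could be sketched roughly as follows. Everything here is hypothetical: `ChainHealth`, `assess`, the `lag_factor` cut-off, and the action labels are illustrative names, not anything the 2026-05-07 post describes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ChainHealth:
    pages_over_threshold: int   # pages still waiting for an image
    longest_chain: int          # worst-case replay length on the read path
    action: str                 # hypothetical mitigation label


def assess(chain_lengths: Dict[int, int], threshold: int,
           lag_factor: int = 4) -> ChainHealth:
    """Classify image-generation lag from per-page delta-chain lengths.

    `lag_factor` is an assumed cut-off: a chain longer than
    threshold * lag_factor is treated as severe lag.
    """
    over = [n for n in chain_lengths.values() if n > threshold]
    longest = max(chain_lengths.values(), default=0)
    if longest > threshold * lag_factor:
        action = "backpressure"          # image generation is far behind writes
    elif over:
        action = "add-image-workers"     # mild lag: raise parallelism
    else:
        action = "ok"
    return ChainHealth(len(over), longest, action)


healthy = assess({1: 3, 2: 5}, threshold=10)
lagging = assess({1: 12, 2: 5}, threshold=10)
severe = assess({1: 50, 2: 5}, threshold=10)
```

A symmetric check for the over-run case would track images generated per delta ingested and raise the threshold when that ratio climbs; it is omitted here to keep the sketch short.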
Relationship to adjacent patterns
- Sibling to patterns/storage-forwarded-redo-log-replication (Aurora's write-path pattern) — Aurora forwards redo records to storage for replay; Lakebase additionally has the storage tier generate its own periodic images from delta-chain observation. Same design philosophy (work belongs on storage) at different granularities.
- Composes with patterns/live-wal-protocol-switch-via-xlog-fpw-change — the rollout mechanism that makes this pattern deployable without customer downtime on a live fleet.
- Generalises the classical database-systems observation "compaction belongs on the storage side" from LSM trees (RocksDB / Cassandra / LevelDB) to B-tree-page-addressed Postgres on separated storage.
- Parallel to patterns/background-reconciler-for-read-path-optimization (RocksDB background compaction; LSM pattern) — both run read-path-bounding work out-of-band from the write path.
Seen in
- sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes
— canonical first-class wiki pattern page. Image-generation
pushdown eliminates compute-side FPW-inflicted WAL inflation
(94% reduction, 58 KB/txn → <4 KB) while preserving bounded
delta-chain replay on the read path. 5× write throughput at
32 vCPU on HammerDB TPROC-C; linear compute-size scaling
(previously flat due to FPW bottleneck); 30-MB/s-to-1-MB/s
WAL-rate reduction on a production 56-vCPU customer; p99
read latency down 30–50%; Synced Tables ingestion 3.6×.
Rolled out across the global Lakebase + Neon fleet in ~6 weeks
with zero customer restarts via the
XLOG_FPW_CHANGE live-switch mechanism.
Related
- concepts/delta-chain-replay — the read-path primitive this pattern keeps bounded.
- concepts/postgres-full-page-write — the classical primitive being replaced on compute.
- concepts/compute-storage-separation — the architectural precondition enabling the pattern.
- concepts/torn-page — the failure mode FPW existed for; absent on stateless compute.
- concepts/postgres-checkpoint — the classical cadence primitive the pattern decouples from.
- systems/pageserver-safekeeper — the storage-tier components that absorb the image-generation work.
- systems/lakebase — canonical production instance.
- systems/postgresql — the upstream DB engine whose FPW semantics motivated the pattern.
- patterns/live-wal-protocol-switch-via-xlog-fpw-change — the deployment pattern that lets this roll out without downtime.