CONCEPT Cited by 1 source
Torn page¶
Definition¶
A torn page is a disk page (typically 8 KB in Postgres) that was partially written when a crash interrupted the write operation — the page on disk now contains a mix of old and new bytes that does not correspond to any valid state the database ever held.
If the database trusted a torn page and replayed a WAL delta (which encodes "change bytes 40-44 from X to Y") on top of that hybrid content, the result would be permanently corrupt: surrounding bytes the delta assumed were "pre-write state" are actually "post-partial-write half-new state", and the replay produces nonsense.
Why it happens¶
Writing a page to disk is not atomic at the hardware level. A Postgres page is 8 KB; disk sectors are typically 512 B or 4 KB. The OS sends a write; the disk controller breaks it into multiple sector-level writes; if power fails or the operating system crashes between them, some sectors hold new data and some still hold old data. The result is a torn page.
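The sector-level mechanism can be sketched as a small simulation. This is a minimal illustration, not a model of any particular disk: the helper name `torn_write` is hypothetical, the 512 B sector size is one of the two sizes mentioned above, and sectors are assumed to land in order (real devices may reorder them).

```python
PAGE_SIZE = 8192    # Postgres default page size in bytes
SECTOR_SIZE = 512   # assumed sector size; 4096 is also common


def torn_write(old_page: bytes, new_page: bytes, sectors_written: int,
               sector_size: int = SECTOR_SIZE) -> bytes:
    """Return the on-disk bytes if a crash hits after `sectors_written` sectors.

    Toy assumption: sectors are written strictly in order, front to back.
    """
    if len(old_page) != len(new_page):
        raise ValueError("old and new page must have the same length")
    if len(old_page) % sector_size != 0:
        raise ValueError("page length must be a multiple of the sector size")
    total_sectors = len(old_page) // sector_size
    if not 0 <= sectors_written <= total_sectors:
        raise ValueError("sectors_written is out of range")
    cut = sectors_written * sector_size
    # New bytes up to the crash point, stale bytes after it.
    return new_page[:cut] + old_page[cut:]


old = b"A" * PAGE_SIZE   # page contents before the write
new = b"B" * PAGE_SIZE   # page contents the database intended to write
torn = torn_write(old, new, sectors_written=5)

# The torn page matches neither the old state nor the new state.
is_hybrid = torn != old and torn != new
```

With 16 sectors per page, any crash point from 1 to 15 sectors leaves a hybrid page; only 0 (nothing written) or 16 (everything written) yields a state the database actually held.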
Modern storage stacks (filesystem journals, SSD write caches, RAID controllers, cloud block-storage replication) mitigate but rarely fully eliminate the risk, so Postgres has historically "assumed the worst" — any page in active write during a crash could be torn, and the database must be resilient to this.
Postgres's remediation: full page writes¶
Postgres handles torn pages via Full Page Writes (FPW): the first modification of each page after a checkpoint copies the entire 8 KB page into the WAL. During crash recovery, Postgres overwrites the possibly-torn on-disk page with that known-good WAL-resident copy and replays later WAL records on top of it. This guarantees recovery correctness, but the source reports that it can inflate WAL volume by up to 15× on write-heavy workloads (Source: sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes).
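The trade-off can be illustrated with a toy recovery loop. This is a sketch under stated assumptions, not Postgres's WAL format: `ToyWal`, `apply_change`, `recover`, and `run` are hypothetical names, the page header is reduced to an 8-byte LSN, the page is assumed to tear at the 4 KB boundary, and replay skips any record whose LSN the page header claims to already contain.

```python
import struct

PAGE_SIZE = 8192
HALF = PAGE_SIZE // 2   # toy assumption: the page tears at the 4 KB boundary
LSN_FMT = ">Q"          # toy header: 8-byte big-endian LSN at offset 0


def page_lsn(page: bytes) -> int:
    return struct.unpack_from(LSN_FMT, page, 0)[0]


def apply_change(page: bytes, lsn: int, offset: int, data: bytes) -> bytes:
    """Overwrite bytes at `offset` and stamp the page header with `lsn`."""
    buf = bytearray(page)
    buf[offset:offset + len(data)] = data
    struct.pack_into(LSN_FMT, buf, 0, lsn)
    return bytes(buf)


class ToyWal:
    def __init__(self, full_page_writes: bool):
        self.full_page_writes = full_page_writes
        self.records = []     # WAL written since the last checkpoint
        self.imaged = set()   # pages that already have a full image logged
        self.next_lsn = 1

    def modify(self, page_id: int, page: bytes, offset: int, data: bytes) -> bytes:
        lsn = self.next_lsn
        self.next_lsn += 1
        new_page = apply_change(page, lsn, offset, data)
        if self.full_page_writes and page_id not in self.imaged:
            # First touch since the checkpoint: log the whole page.
            self.records.append(("image", lsn, page_id, new_page))
            self.imaged.add(page_id)
        else:
            self.records.append(("delta", lsn, page_id, offset, data))
        return new_page

    def wal_bytes(self) -> int:
        # Payload bytes only; record headers are ignored in this toy.
        return sum(len(rec[-1]) for rec in self.records)


def recover(on_disk: bytes, page_id: int, wal: ToyWal) -> bytes:
    page = on_disk
    for rec in wal.records:
        kind, lsn, pid = rec[0], rec[1], rec[2]
        if pid != page_id:
            continue
        if kind == "image":
            page = rec[3]             # replace whatever is on disk
        elif page_lsn(page) < lsn:    # skip changes the page claims to contain
            page = apply_change(page, lsn, rec[3], rec[4])
    return page


def run(full_page_writes: bool):
    checkpointed = bytes(PAGE_SIZE)   # page as of the last checkpoint (LSN 0)
    wal = ToyWal(full_page_writes)
    page = wal.modify(0, checkpointed, 40, b"YYYY")
    page = wal.modify(0, page, 6000, b"ZZZZ")
    torn = page[:HALF] + checkpointed[HALF:]   # crash after the first half landed
    return recover(torn, 0, wal) == page, wal.wal_bytes()


with_fpw = run(True)       # (recovered correctly?, WAL payload bytes)
without_fpw = run(False)
```

In this toy run, recovery with full page writes restores the correct page at a cost of 8196 payload bytes, while recovery without them logs only 8 bytes but trusts the torn page's new header and silently loses the change at offset 6000. The byte counts illustrate the direction of the trade-off only; they say nothing about the 15× figure quoted above.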
The architectural insight: no local disk → no torn pages¶
In compute-storage-separated architectures like Lakebase / Neon, compute has no local data directory. It streams WAL to a Paxos-based quorum of safekeepers; durable page storage lives in a distinct storage tier backed by object storage + local caches. There is no compute-local 8 KB page being written to disk that could tear.
Verbatim from the 2026-05-07 Databricks post:
In the lakebase architecture, your compute is stateless. It does not rely on a local data directory. Instead, it streams WAL to a Paxos-based quorum of safekeepers. Because there is no local-disk page to tear, the failure mode FPW was designed to prevent simply does not exist.
The torn-page failure mode becomes architecturally impossible, not merely mitigated. The FPW primitive that existed to handle it can be structurally eliminated from the compute side. This is the first canonical wiki instance of a durability primitive being deleted (not just relocated) as a consequence of concepts/compute-storage-separation.
Where torn pages can still appear in Lakebase¶
- Storage-tier writes. The pageserver still writes materialised page images to object storage, but the risk of a torn image is structurally different: object-storage writes are typically atomic at the object level (S3 writes are all-or-nothing), and local caches on pageservers are rebuilt from object storage after a crash.
- Safekeeper WAL writes. WAL segments streamed by compute land on safekeeper disks; a partial write on one safekeeper is covered by the Paxos-based quorum, because the same WAL is durably replicated on the other members.
- Compute-local ephemeral state. Postgres compute VMs still hold scratch state (the buffer pool in RAM, possibly temp-table spill files on local disk), but none of it is load-bearing for durability or recovery: after a crash, a compute VM re-attaches to the safekeepers and pageserver from a clean state.
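The safekeeper case can be pictured with a toy majority check. This is only a sketch of the replicated-durability idea, not the safekeeper protocol, which the source does not describe: the helpers `make_record`, `is_intact`, `majority`, and `survives` are hypothetical, the three-member quorum is an assumption, and zlib's CRC-32 stands in for whatever integrity check a real WAL format uses.

```python
import zlib


def make_record(payload: bytes) -> bytes:
    """Append a 4-byte CRC so a partial write can be detected later."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(record: bytes) -> bool:
    if len(record) < 4:
        return False
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def majority(members: int) -> int:
    return members // 2 + 1


def survives(copies: list) -> bool:
    """True if a majority of replicas still hold an intact copy."""
    intact = sum(1 for copy in copies if is_intact(copy))
    return intact >= majority(len(copies))


record = make_record(b"toy WAL record for page 7")

# One replica crashed mid-write: only the first 8 bytes landed,
# the rest of the slot still holds zeros.
partial = record[:8] + bytes(len(record) - 8)

copies = [record, partial, record]   # assumed 3-member quorum
```

Here one damaged copy out of three leaves a majority of intact copies, so the record is still recoverable; two damaged copies out of three would not.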
Storage-stack mitigations in classical Postgres¶
Even in classical (non-separated) Postgres deployments, several layers in the storage stack can partially mitigate torn pages:
- ext4 data=journal mode — journals data not just metadata; costly but eliminates torn pages at the filesystem layer.
- ZFS / Btrfs copy-on-write semantics — new writes go to new blocks; old block remains intact until write completes, so partial writes can't tear a page.
- Filesystems with a 4 KB atomic-write guarantee: torn pages become impossible at this layer only when the whole database page fits inside one atomic-write unit, so a default 8 KB Postgres page that spans two 4 KB units can still tear.
- RAID controllers with battery-backed cache — buffer writes in persistent cache before committing to disk, eliminating the mid-write-crash window.
Each of these adds cost or covers only part of the torn-page protection that FPW provides. The compute-storage-separation architectural answer is more radical: remove the local disk entirely from the compute side.
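The atomic-write-unit condition in the list above reduces to a short check. This is a minimal sketch: `page_can_tear` is a hypothetical helper, and it assumes the storage layer's guarantee applies to units aligned on multiples of the unit size.

```python
def page_can_tear(page_size: int, atomic_write_unit: int,
                  page_offset: int = 0) -> bool:
    """Return True if a page write spans more than one atomic-write unit.

    Assumes atomicity is guaranteed per unit, with units aligned to
    multiples of `atomic_write_unit`. `page_offset` is the byte offset
    of the page within the file or device.
    """
    if page_size <= 0 or atomic_write_unit <= 0 or page_offset < 0:
        raise ValueError("sizes must be positive and offset non-negative")
    first_unit = page_offset // atomic_write_unit
    last_unit = (page_offset + page_size - 1) // atomic_write_unit
    return first_unit != last_unit


default_page_on_4k = page_can_tear(8192, 4096)        # spans two units
small_page_on_4k = page_can_tear(4096, 4096)          # fits in one unit
misaligned_page = page_can_tear(8192, 8192, 4096)     # straddles a boundary
```

The misaligned case shows that a matching unit size is not enough on its own; the page must also start on a unit boundary.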
Seen in¶
- sources/2026-05-07-databricks-how-lakebase-architecture-delivers-5x-faster-postgres-writes — canonical first-class wiki page on torn pages. The Databricks post frames torn pages as "the failure mode FPW was designed to prevent" and makes the structural-elimination argument: stateless compute + WAL-to-safekeeper-quorum eliminates the local-disk page that could tear. Compute-side FPW is therefore safe to disable in Lakebase / Neon; image-generation pushdown onto the storage tier preserves the read-path bounded-replay property FPW implicitly provided.
Related¶
- concepts/postgres-full-page-write — the durability primitive that exists to tolerate torn pages; redundant on separated compute.
- concepts/postgres-checkpoint — the interval primitive that scopes FPW cadence.
- concepts/compute-storage-separation — the architectural property that eliminates the torn-page failure mode.
- systems/postgresql — upstream DB engine whose page + WAL-replay design necessitated the FPW remedy.
- systems/lakebase — canonical instance of architectural elimination of torn pages via compute-storage separation.
- systems/pageserver-safekeeper — the storage-tier components that absorb durability work compute used to handle locally.
- patterns/image-generation-pushdown-to-storage — the read-path remedy that makes FPW-disable practical.