CONCEPT Cited by 7 sources
Write-Ahead Logging (WAL)¶
Definition¶
Write-Ahead Logging (WAL) is the durability primitive under nearly every production-grade relational database, distributed log, and many object stores. All state-changing operations — inserting rows, modifying indexes, taking row locks, committing transactions — are first written to an append-only log, and the log is fsync'd to stable storage before the transaction is reported as durable. The underlying heap/index pages can be written back later (lazily, by a background writer or at checkpoints); on crash, replaying the WAL from the last checkpoint restores the system to a consistent committed state.
Two canonical invariants:
- Write-ahead: any modification visible to future readers was first recorded in the log.
- Commit-before-ack: a transaction is not acknowledged to the client until its commit record is on stable storage.
Why it matters for system design¶
WAL collapses a lot of concurrency + durability concerns into a single sequential log, which has two large second-order consequences:
-
Every committed transaction costs one fsync on the WAL stream. A Postgres cluster has a single writer for its WAL, so the fsync rate of the underlying disk is a hard global ceiling on commit rate. On a gp3 EBS volume,
pg_test_fsyncmeasured ~1,000 8-KiB fsyncs/sec at Datadog before Postgres began batching commits and adding per-transaction latency (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). This budget is shared across the whole cluster — a single misbehaving query that emits an unexpected COMMIT record per call can consume the entire cluster's sync headroom. -
Non-obvious operations still emit WAL records. Anything that mutates a page's metadata — including taking a row lock under MVCC — is a durable change and is logged. Under Postgres
INSERT ... ON CONFLICT DO UPDATE, the row lock is taken before theWHEREpredicate is evaluated; the lock assigns a transaction ID, the xid forces aTransaction COMMITrecord, and the COMMIT forces an fsync even though no row data changed (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). "No-op" upserts at 25,000/sec therefore would have overwhelmed the single-writer sync budget.
WAL as the observability surface¶
WAL is also the most precise observability surface for "what actually changed". In
Postgres 15+, the pg_walinspect extension exposes pg_get_wal_records_info(lsn,
lsn) and pg_get_wal_stats(lsn, lsn), returning the resource manager, record type,
and relation/block reference for each WAL record between two log sequence numbers.
Datadog used it to prove which WAL records their no-op upsert was actually
emitting:
| Scenario | WAL records |
|---|---|
Upsert on existing row, WHERE false |
Heap LOCK_ONLY/EXCL_LOCK; Transaction COMMIT |
| CTE-emulated insert of new row | Heap INSERT+INIT; Btree NEWROOT; Btree INSERT_LEAF; Heap HEAP_CONFIRM; Transaction COMMIT |
| CTE-emulated update of existing row (HOT-qualifying) | Heap HOT_UPDATE; Transaction COMMIT |
CTE-emulated upsert on recent row, WHERE false |
(none) |
The empty-rows outcome is the design goal — no-op transactions should emit zero WAL records — and WAL introspection is how you verify it rather than inferring from application-level metrics that, by construction, can't see locks that didn't touch user-visible data (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal).
pg_get_wal_stats is the quantitative cousin, used to measure the cost of index
maintenance or full-page images (FPI)
against the same WAL budget.
Relation to other wiki concepts¶
- concepts/postgres-mvcc-hot-updates — MVCC, HOT updates and
fillfactorare about reducing the volume and fan-out (indexes!) of WAL records a single update emits. WAL is the budget; HOT is a conservation tactic. - systems/pageserver-safekeeper — the Neon/Lakebase answer externalises WAL (into Safekeeper) and page-addressable durable state (into Pageserver) so Postgres compute can scale to zero. The write-ahead invariant is preserved across the network boundary.
- patterns/cte-emulated-upsert — the Datadog case is a worked example of designing queries to not emit WAL records on the common path, by avoiding implicit row locks.
Seen in¶
- sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Datadog diagnoses a
2× IOPS / 4× WAL-sync regression from an
ON CONFLICT DO UPDATEupsert that locks conflicting rows even when theWHEREfilters them out, usingpg_walinspect+ anlldbbreakpoint onWALInsertLockAcquireto trace the call path fromExecOnConflictUpdatethroughheap_lock_tupleto the WAL insert. - sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform
— the Postgres WAL as the carrier of an async CDC replication
stream. With
wal_level=logical, Postgres's WAL becomes both the durability primitive and the source of truth that Debezium tails to emit row-level change events into Kafka topics. The operational consequence is heartbeat tables: a replication slot pins WAL until its consumer advances LSN, so quiet periods risk WAL growing unboundedly; heartbeat tables produce artificial traffic that keeps slots moving. Canonical wiki reference for concepts/logical-replication as a CDC source on top of WAL. - sources/2025-05-20-flyio-litestream-revamped — SQLite WAL as hijacked replication source instance. The original 2020 Litestream opened a long-lived read transaction against a SQLite database to arrest checkpointing, then copied WAL frames out to a shadow WAL staging area before uploading them to object storage. Structurally similar to Postgres CDC (the WAL is repurposed as a replication source) but operated externally rather than via a replication slot. The 2025-05-20 redesign retires this approach — Litestream now ships transaction-aware LTX files instead of raw WAL frames — so the WAL returns to being just the local durability primitive, with LTX as the replication substrate on top.
- sources/2025-10-02-flyio-litestream-v050-is-here —
WAL-as-replication-source retirement confirmed shipped.
Litestream v0.5.0 removes the shadow-WAL mechanism entirely
and renames the user-facing CLI from
litestream waltolitestream ltx. SQLite's WAL remains the substrate's internal durability primitive; it is no longer what Litestream ships over the wire. v0.5.0 cannot restore from old v0.3.x WAL segment files — the file-format break is one-way (old WAL files are left on disk for rollback, not for cross-format PITR). Instructive distinction: the same three-letter acronym now means two different things at two different levels of the stack —WALis a SQLite-internal log on-disk,ltxis the replication surface on-wire.
WAL commit order ≠ visibility order (Postgres)¶
WAL establishes a durable total order on commits — every committed
transaction has a unique LSN, and replaying WAL in LSN order reproduces
the primary's state. But in Postgres, WAL commit order is not the
same as the order in which concurrent transactions become visible to
new snapshots on the primary: the commit path writes the WAL record,
then asynchronously removes the xid from the in-memory ProcArray,
and new snapshots scan ProcArray at acquisition time. Two concurrent
non-conflicting commits can flip ProcArray removal order relative to
their WAL LSNs. (Source:
sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas.)
This admits the Long Fork anomaly — primary + replica observers see the same pair of commits in different orders — a violation of concepts/snapshot-isolation's atomic-visibility property. See concepts/visibility-order-vs-commit-order for the generalized framing, and concepts/commit-sequence-number for the proposed upstream fix that re-aligns visibility with WAL commit order.
The structural reminder: durability and visibility are orthogonal properties; binding them together has a cost the commit path pays, and Postgres has historically chosen not to. Systems that do bind them — systems/aurora-dsql and systems/aurora-limitless via time-based MVCC — replace the visibility substrate entirely, not the WAL.
Raft log as WAL¶
A notable variant: the Redpanda Cloud Topics Metastore uses its Raft consensus log as the WAL for an embedded LSM-tree, eliminating the separate file-based WAL entirely. Writes are Raft-proposed; majority acknowledgment provides durability. Manifest commits mark Raft-log truncation points. See concepts/raft-replicated-wal-as-lsm-wal for the concept and patterns/raft-log-as-lsm-wal for the pattern (Source: sources/2026-06-09-redpanda-cloud-topics-the-metastore).
WAL as streaming durability handoff¶
Zerobus Ingest implements a latency-optimized WAL as the durability layer between ingestion and lakehouse publish. The WAL provides the commit-before-ack guarantee at the service boundary: messages are durable in the WAL before the server returns the highest committed offset (async ack loop via gRPC bidirectional streaming). Delta Kernel Rust then asynchronously commits from WAL to Delta tables. This decouples the durability SLO (milliseconds) from the queryability SLO (seconds) — a different shape than database WAL where durability and visibility are on the same commit path. See patterns/wal-before-lakehouse-publish for the pattern (Source: sources/2026-06-11-databricks-ingesting-the-milky-way-petabyte-scale-with-zerobus-ingest).