Skip to content

CONCEPT Cited by 3 sources

Write-Ahead Logging (WAL)

Definition

Write-Ahead Logging (WAL) is the durability primitive under nearly every production-grade relational database, distributed log, and many object stores. All state-changing operations — inserting rows, modifying indexes, taking row locks, committing transactions — are first written to an append-only log, and the log is fsync'd to stable storage before the transaction is reported as durable. The underlying heap/index pages can be written back later (lazily, by a background writer or at checkpoints); on crash, replaying the WAL from the last checkpoint restores the system to a consistent committed state.

Two canonical invariants:

  1. Write-ahead: any modification visible to future readers was first recorded in the log.
  2. Commit-before-ack: a transaction is not acknowledged to the client until its commit record is on stable storage.

Why it matters for system design

WAL collapses a lot of concurrency + durability concerns into a single sequential log, which has two large second-order consequences:

  • Every committed transaction costs one fsync on the WAL stream. A Postgres cluster has a single writer for its WAL, so the fsync rate of the underlying disk is a hard global ceiling on commit rate. On a gp3 EBS volume, pg_test_fsync measured ~1,000 8-KiB fsyncs/sec at Datadog before Postgres began batching commits and adding per-transaction latency (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). This budget is shared across the whole cluster — a single misbehaving query that emits an unexpected COMMIT record per call can consume the entire cluster's sync headroom.

  • Non-obvious operations still emit WAL records. Anything that mutates a page's metadata — including taking a row lock under MVCC — is a durable change and is logged. Under Postgres INSERT ... ON CONFLICT DO UPDATE, the row lock is taken before the WHERE predicate is evaluated; the lock assigns a transaction ID, the xid forces a Transaction COMMIT record, and the COMMIT forces an fsync even though no row data changed (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). "No-op" upserts at 25,000/sec therefore would have overwhelmed the single-writer sync budget.

WAL as the observability surface

WAL is also the most precise observability surface for "what actually changed". In Postgres 15+, the pg_walinspect extension exposes pg_get_wal_records_info(lsn, lsn) and pg_get_wal_stats(lsn, lsn), returning the resource manager, record type, and relation/block reference for each WAL record between two log sequence numbers. Datadog used it to prove which WAL records their no-op upsert was actually emitting:

Scenario WAL records
Upsert on existing row, WHERE false Heap LOCK_ONLY/EXCL_LOCK; Transaction COMMIT
CTE-emulated insert of new row Heap INSERT+INIT; Btree NEWROOT; Btree INSERT_LEAF; Heap HEAP_CONFIRM; Transaction COMMIT
CTE-emulated update of existing row (HOT-qualifying) Heap HOT_UPDATE; Transaction COMMIT
CTE-emulated upsert on recent row, WHERE false (none)

The empty-rows outcome is the design goal — no-op transactions should emit zero WAL records — and WAL introspection is how you verify it rather than inferring from application-level metrics that, by construction, can't see locks that didn't touch user-visible data (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal).

pg_get_wal_stats is the quantitative cousin, used to measure the cost of index maintenance or full-page images (FPI) against the same WAL budget.

Relation to other wiki concepts

  • concepts/postgres-mvcc-hot-updates — MVCC, HOT updates and fillfactor are about reducing the volume and fan-out (indexes!) of WAL records a single update emits. WAL is the budget; HOT is a conservation tactic.
  • systems/pageserver-safekeeper — the Neon/Lakebase answer externalises WAL (into Safekeeper) and page-addressable durable state (into Pageserver) so Postgres compute can scale to zero. The write-ahead invariant is preserved across the network boundary.
  • patterns/cte-emulated-upsert — the Datadog case is a worked example of designing queries to not emit WAL records on the common path, by avoiding implicit row locks.

Seen in

  • sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Datadog diagnoses a 2× IOPS / 4× WAL-sync regression from an ON CONFLICT DO UPDATE upsert that locks conflicting rows even when the WHERE filters them out, using pg_walinspect + an lldb breakpoint on WALInsertLockAcquire to trace the call path from ExecOnConflictUpdate through heap_lock_tuple to the WAL insert.
  • sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — the Postgres WAL as the carrier of an async CDC replication stream. With wal_level=logical, Postgres's WAL becomes both the durability primitive and the source of truth that Debezium tails to emit row-level change events into Kafka topics. The operational consequence is heartbeat tables: a replication slot pins WAL until its consumer advances LSN, so quiet periods risk WAL growing unboundedly; heartbeat tables produce artificial traffic that keeps slots moving. Canonical wiki reference for concepts/logical-replication as a CDC source on top of WAL.

WAL commit order ≠ visibility order (Postgres)

WAL establishes a durable total order on commits — every committed transaction has a unique LSN, and replaying WAL in LSN order reproduces the primary's state. But in Postgres, WAL commit order is not the same as the order in which concurrent transactions become visible to new snapshots on the primary: the commit path writes the WAL record, then asynchronously removes the xid from the in-memory ProcArray, and new snapshots scan ProcArray at acquisition time. Two concurrent non-conflicting commits can flip ProcArray removal order relative to their WAL LSNs. (Source: sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas.)

This admits the Long Fork anomaly — primary + replica observers see the same pair of commits in different orders — a violation of concepts/snapshot-isolation's atomic-visibility property. See concepts/visibility-order-vs-commit-order for the generalized framing, and concepts/commit-sequence-number for the proposed upstream fix that re-aligns visibility with WAL commit order.

The structural reminder: durability and visibility are orthogonal properties; binding them together has a cost the commit path pays, and Postgres has historically chosen not to. Systems that do bind them — systems/aurora-dsql and systems/aurora-limitless via time-based MVCC — replace the visibility substrate entirely, not the WAL.

Last updated · 178 distilled / 1,178 read