Skip to content

CONCEPT Cited by 2 sources

Postgres MVCC and Heap-Only Tuple (HOT) updates

Definition

MultiVersion Concurrency Control (MVCC) in Postgres is implemented by writing a new row version ("tuple") for every UPDATE rather than modifying the existing row in place. Old versions remain visible to transactions whose snapshots predate the update, and are cleaned up later by VACUUM. This makes readers lock-free at the cost of write amplification: a single logical row update produces at least one new heap tuple plus one new index entry per index on the table.

Heap-Only Tuple (HOT) updates are the optimization that short-circuits the index-update cost when two conditions both hold:

  1. The update does not modify any indexed column.
  2. There is free space on the same heap page as the original tuple to hold the new tuple.

Under those conditions, Postgres writes the new tuple on the same page, chains it to the old tuple via a HOT chain, and skips writing any index entries at all — existing index entries still point at the old row head, which now redirects through the HOT chain to the current live tuple. Since index writes are frequently the dominant component of both WAL volume and IO cost, HOT is a large optimization for update-heavy workloads on lightly-indexed hot tables.

Why the knobs matter together

HOT's free-space precondition makes fillfactor a load-bearing parameter. A Postgres table stores rows in 8 KiB pages; fillfactor=100 (the default) fills pages completely on insert. Dropping fillfactor to e.g. 80% means inserts leave 20% of each page free, giving subsequent HOT updates somewhere to go without spilling to a new page. Datadog set fillfactor=80 on their host_last_ingested table specifically so their once-per-day-per-host timestamp updates would qualify for HOT (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). They also kept last_ingested unindexed — in combination with the fillfactor, this guarantees updates touch only the heap page.

Three knobs, one goal — minimize WAL volume per update:

Knob Mechanism Datadog value
Move mutating columns to a dedicated narrow table Keep unchanged columns out of the new-tuple copy host_last_ingested split out from main hosts table
Keep the mutating column unindexed Make updates index-neutral so HOT is applicable No index on last_ingested
fillfactor < 100 Leave free space on each page so HOT stays on-page fillfactor=80

What HOT does not save you from

HOT reduces the cost of updates that actually happen. It does not help with:

  • Row locks taken without an update. INSERT ... ON CONFLICT DO UPDATE locks the conflicting row before evaluating the WHERE predicate; the lock mutates tuple metadata (the xid) and emits a Heap LOCK_ONLY WAL record + a COMMIT record, even when WHERE filters the row out and no HOT update ever happens (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). HOT composes with the update cost, not the lock cost. This is the subtle failure mode that cost Datadog a 2× IOPS / 4× WAL-sync regression on a table specifically designed to maximize HOT applicability.
  • VACUUM / bloat. Old tuple versions still accumulate, still need to be vacuumed. HOT chains have their own traversal cost.
  • Primary-key updates. Primary key is always indexed; an update to it forces index writes.

Relation to other wiki concepts

  • concepts/wal-write-ahead-logging — MVCC + HOT are WAL-volume conservation tactics. WAL is the budget; MVCC+HOT minimize records per logical operation.
  • patterns/cte-emulated-upsert — pattern-level answer to the "HOT doesn't help with lock-only WAL records" failure mode — restructure the query so the common path emits no WAL records at all.

Seen in

  • sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Datadog's host_last_ingested table design (dedicated narrow table + unindexed last_ingested + fillfactor=80) for once-per-day HOT updates; the ensuing discovery that HOT qualification is irrelevant when the query path emits a pre-update lock WAL record per call. After switching to a patterns/cte-emulated-upsert, pg_walinspect confirmed HOT updates firing as designed (Heap HOT_UPDATE + Transaction COMMIT, no Btree records) on rows older than 1 day.

The visibility half: ProcArray and snapshots

Write amplification (above) is one half of Postgres MVCC — the per-UPDATE cost of producing a new tuple version. The other half is how readers decide which versions are visible to them: every snapshot records a list of currently-pending xids (the in-memory ProcArray) and permanently excludes those xids from the snapshot even after they commit. A committing transaction (a) writes its WAL commit record (durable), then (b) asynchronously removes its xid from ProcArray (visible).

That decoupling is where concepts/visibility-order-vs-commit-order can skew and admit the Long Fork anomaly across primary + replica (Source: sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas). The upstream fix under discussion is Commit Sequence Numbers — stamp a monotonic CSN on each commit and snapshot by CSN watermark instead of ProcArray scan. systems/aurora-dsql and systems/aurora-limitless sidestep via time-based MVCC — snapshot = clock read, visibility = commit-time comparison — replacing the ProcArray substrate wholesale via Postgres-extension surgery (patterns/postgres-extension-over-fork).

Orthogonality: HOT updates + fillfactor + narrow tables are about minimizing per-UPDATE cost within the existing visibility model. CSN / time-based MVCC change the visibility model itself. Both halves compose — you can imagine a future Postgres that is both HOT-optimized and CSN-visible.

Last updated · 200 distilled / 1,178 read