CONCEPT Cited by 2 sources

Postgres MVCC and Heap-Only Tuple (HOT) updates¶

Definition¶

MultiVersion Concurrency Control (MVCC) in Postgres is implemented by writing a new row version ("tuple") for every UPDATE rather than modifying the existing row in place. Old versions remain visible to transactions whose snapshots predate the update, and are cleaned up later by VACUUM. This makes readers lock-free at the cost of write amplification: a single logical row update produces at least one new heap tuple plus one new index entry per index on the table.

Heap-Only Tuple (HOT) updates are the optimization that short-circuits the index-update cost when two conditions both hold:

The update does not modify any indexed column.
There is free space on the same heap page as the original tuple to hold the new tuple.

Under those conditions, Postgres writes the new tuple on the same page, chains it to the old tuple via a HOT chain, and skips writing any index entries at all — existing index entries still point at the old row head, which now redirects through the HOT chain to the current live tuple. Since index writes are frequently the dominant component of both WAL volume and IO cost, HOT is a large optimization for update-heavy workloads on lightly-indexed hot tables.

Why the knobs matter together¶

HOT's free-space precondition makes fillfactor a load-bearing parameter. A Postgres table stores rows in 8 KiB pages; fillfactor=100 (the default) fills pages completely on insert. Dropping fillfactor to e.g. 80% means inserts leave 20% of each page free, giving subsequent HOT updates somewhere to go without spilling to a new page. Datadog set fillfactor=80 on their host_last_ingested table specifically so their once-per-day-per-host timestamp updates would qualify for HOT (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). They also kept last_ingested unindexed — in combination with the fillfactor, this guarantees updates touch only the heap page.

Three knobs, one goal — minimize WAL volume per update:

Knob	Mechanism	Datadog value
Move mutating columns to a dedicated narrow table	Keep unchanged columns out of the new-tuple copy	`host_last_ingested` split out from main `hosts` table
Keep the mutating column unindexed	Make updates index-neutral so HOT is applicable	No index on `last_ingested`
`fillfactor` < 100	Leave free space on each page so HOT stays on-page	`fillfactor=80`

What HOT does not save you from¶

HOT reduces the cost of updates that actually happen. It does not help with:

Row locks taken without an update. INSERT ... ON CONFLICT DO UPDATE locks the conflicting row before evaluating the WHERE predicate; the lock mutates tuple metadata (the xid) and emits a Heap LOCK_ONLY WAL record + a COMMIT record, even when WHERE filters the row out and no HOT update ever happens (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). HOT composes with the update cost, not the lock cost. This is the subtle failure mode that cost Datadog a 2× IOPS / 4× WAL-sync regression on a table specifically designed to maximize HOT applicability.
VACUUM / bloat. Old tuple versions still accumulate, still need to be vacuumed. HOT chains have their own traversal cost.
Primary-key updates. Primary key is always indexed; an update to it forces index writes.

Relation to other wiki concepts¶

concepts/wal-write-ahead-logging — MVCC + HOT are WAL-volume conservation tactics. WAL is the budget; MVCC+HOT minimize records per logical operation.
patterns/cte-emulated-upsert — pattern-level answer to the "HOT doesn't help with lock-only WAL records" failure mode — restructure the query so the common path emits no WAL records at all.

Seen in¶

sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Datadog's host_last_ingested table design (dedicated narrow table + unindexed last_ingested + fillfactor=80) for once-per-day HOT updates; the ensuing discovery that HOT qualification is irrelevant when the query path emits a pre-update lock WAL record per call. After switching to a patterns/cte-emulated-upsert, pg_walinspect confirmed HOT updates firing as designed (Heap HOT_UPDATE + Transaction COMMIT, no Btree records) on rows older than 1 day.

The visibility half: `ProcArray` and snapshots¶

Write amplification (above) is one half of Postgres MVCC — the per-UPDATE cost of producing a new tuple version. The other half is how readers decide which versions are visible to them: every snapshot records a list of currently-pending xids (the in-memory ProcArray) and permanently excludes those xids from the snapshot even after they commit. A committing transaction (a) writes its WAL commit record (durable), then (b) asynchronously removes its xid from ProcArray (visible).

That decoupling is where concepts/visibility-order-vs-commit-order can skew and admit the Long Fork anomaly across primary + replica (Source: sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas). The upstream fix under discussion is Commit Sequence Numbers — stamp a monotonic CSN on each commit and snapshot by CSN watermark instead of ProcArray scan. systems/aurora-dsql and systems/aurora-limitless sidestep via time-based MVCC — snapshot = clock read, visibility = commit-time comparison — replacing the ProcArray substrate wholesale via Postgres-extension surgery (patterns/postgres-extension-over-fork).

Orthogonality: HOT updates + fillfactor + narrow tables are about minimizing per-UPDATE cost within the existing visibility model. CSN / time-based MVCC change the visibility model itself. Both halves compose — you can imagine a future Postgres that is both HOT-optimized and CSN-visible.

The third half: horizon, vacuum, and queue-table workloads¶

Write amplification (HOT updates) and visibility (ProcArray / CSN) are two axes; the third is what stops dead tuples from being reclaimed. The MVCC horizon — set by the oldest active transaction anywhere on the cluster — defines the point below which autovacuum can clean up dead tuples. A single long-running transaction or a chain of continuously overlapping shorter ones pins the horizon and prevents reclamation regardless of HOT-qualification or visibility substrate.

This matters most for high-write-churn workloads at steady-state size — Postgres queue tables are the canonical instance. Even with SKIP LOCKED (which eliminates claim serialization) and Postgres 14+ B-tree bottom-up deletion (which opportunistically cleans up dead index tuples during inserts), the horizon-pinned failure mode remains unchanged from the 2015 Brandur demonstration. Upstream Postgres timeouts target single-query duration; the structural fix is a workload-class resource budget (exemplar: PlanetScale's Traffic Control) that limits concurrency of low-priority workload classes so horizon windows open up for vacuum to work in (Source: ).

Orthogonality again: HOT updates make each mutation cheap; CSN changes the visibility model; horizon management is what lets accumulated debris actually go away. All three halves compose.