CONCEPT Cited by 2 sources
Postgres MVCC and Heap-Only Tuple (HOT) updates¶
Definition¶
MultiVersion Concurrency Control (MVCC) in Postgres is implemented by writing a
new row version ("tuple") for every UPDATE rather than modifying the existing
row in place. Old versions remain visible to transactions whose snapshots predate
the update, and are cleaned up later by VACUUM. This makes readers lock-free at
the cost of write amplification: a single logical row update produces at least
one new heap tuple plus one new index entry per index on the table.
Heap-Only Tuple (HOT) updates are the optimization that short-circuits the index-update cost when two conditions both hold:
- The update does not modify any indexed column.
- There is free space on the same heap page as the original tuple to hold the new tuple.
Under those conditions, Postgres writes the new tuple on the same page, chains it to the old tuple via a HOT chain, and skips writing any index entries at all — existing index entries still point at the old row head, which now redirects through the HOT chain to the current live tuple. Since index writes are frequently the dominant component of both WAL volume and IO cost, HOT is a large optimization for update-heavy workloads on lightly-indexed hot tables.
Why the knobs matter together¶
HOT's free-space precondition makes fillfactor a load-bearing parameter. A Postgres
table stores rows in 8 KiB pages; fillfactor=100 (the default) fills pages
completely on insert. Dropping fillfactor to e.g. 80% means inserts leave 20%
of each page free, giving subsequent HOT updates somewhere to go without spilling
to a new page. Datadog set fillfactor=80 on their host_last_ingested table
specifically so their once-per-day-per-host timestamp updates would qualify for HOT
(Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). They also
kept last_ingested unindexed — in combination with the fillfactor, this
guarantees updates touch only the heap page.
Three knobs, one goal — minimize WAL volume per update:
| Knob | Mechanism | Datadog value |
|---|---|---|
| Move mutating columns to a dedicated narrow table | Keep unchanged columns out of the new-tuple copy | host_last_ingested split out from main hosts table |
| Keep the mutating column unindexed | Make updates index-neutral so HOT is applicable | No index on last_ingested |
fillfactor < 100 |
Leave free space on each page so HOT stays on-page | fillfactor=80 |
What HOT does not save you from¶
HOT reduces the cost of updates that actually happen. It does not help with:
- Row locks taken without an update.
INSERT ... ON CONFLICT DO UPDATElocks the conflicting row before evaluating theWHEREpredicate; the lock mutates tuple metadata (the xid) and emits aHeap LOCK_ONLYWAL record + a COMMIT record, even whenWHEREfilters the row out and no HOT update ever happens (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal). HOT composes with the update cost, not the lock cost. This is the subtle failure mode that cost Datadog a 2× IOPS / 4× WAL-sync regression on a table specifically designed to maximize HOT applicability. - VACUUM / bloat. Old tuple versions still accumulate, still need to be vacuumed. HOT chains have their own traversal cost.
- Primary-key updates. Primary key is always indexed; an update to it forces index writes.
Relation to other wiki concepts¶
- concepts/wal-write-ahead-logging — MVCC + HOT are WAL-volume conservation tactics. WAL is the budget; MVCC+HOT minimize records per logical operation.
- patterns/cte-emulated-upsert — pattern-level answer to the "HOT doesn't help with lock-only WAL records" failure mode — restructure the query so the common path emits no WAL records at all.
Seen in¶
- sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Datadog's
host_last_ingestedtable design (dedicated narrow table + unindexedlast_ingested+fillfactor=80) for once-per-day HOT updates; the ensuing discovery that HOT qualification is irrelevant when the query path emits a pre-update lock WAL record per call. After switching to a patterns/cte-emulated-upsert,pg_walinspectconfirmed HOT updates firing as designed (Heap HOT_UPDATE+Transaction COMMIT, no Btree records) on rows older than 1 day.
The visibility half: ProcArray and snapshots¶
Write amplification (above) is one half of Postgres MVCC — the per-UPDATE
cost of producing a new tuple version. The other half is how readers
decide which versions are visible to them: every snapshot records a list
of currently-pending xids (the in-memory ProcArray) and permanently
excludes those xids from the snapshot even after they commit. A committing
transaction (a) writes its WAL commit
record (durable), then (b) asynchronously removes its xid from
ProcArray (visible).
That decoupling is where concepts/visibility-order-vs-commit-order can
skew and admit the Long Fork anomaly across
primary + replica (Source:
sources/2025-05-03-aws-postgresql-transaction-visibility-read-replicas).
The upstream fix under discussion is
Commit Sequence Numbers — stamp a
monotonic CSN on each commit and snapshot by CSN watermark instead of
ProcArray scan. systems/aurora-dsql and systems/aurora-limitless
sidestep via time-based MVCC — snapshot = clock read, visibility =
commit-time comparison — replacing the ProcArray substrate wholesale via
Postgres-extension surgery (patterns/postgres-extension-over-fork).
Orthogonality: HOT updates + fillfactor + narrow tables are about minimizing per-UPDATE cost within the existing visibility model. CSN / time-based MVCC change the visibility model itself. Both halves compose — you can imagine a future Postgres that is both HOT-optimized and CSN-visible.