Datadog — When upserts don't update but still write: debugging Postgres performance at scale¶
Summary¶
Datadog's host-metadata team added an INSERT ... ON CONFLICT DO UPDATE upsert to track
last_ingested per host on a dedicated, unindexed, fillfactor=80 table — engineered
specifically to qualify for Postgres Heap-Only Tuple (HOT) updates and to behave as
a no-op on the ~99.9% of calls where the row already exists and was updated within the
last day. After rollout, disk write IOPS more than doubled and WAL syncs quadrupled
on a small data center running only ~500 upserts/sec, even though almost no rows were
actually being updated. Using the pg_walinspect extension (Postgres 15+) plus an
lldb breakpoint on WALInsertLockAcquire, they proved that ON CONFLICT DO UPDATE
locks every conflicting row before evaluating the WHERE clause, each lock mutates
tuple metadata to store a transaction ID, the xid forces a COMMIT record, and each
COMMIT forces a WAL fsync. At the planned 25,000 upserts/sec scale this would have
exceeded the single-writer sync budget of their gp3 EBS-backed cluster (pg_test_fsync
measured ~1,000 8-KiB fsyncs/sec before Postgres starts batching commits and adding
latency). Fix: rewrite the query as a data-modifying CTE — INSERT ... ON CONFLICT
DO NOTHING as the CTE, then a separate UPDATE gated on NOT EXISTS (SELECT FROM
insert_attempt) — which trades the implicit row lock for a small tolerable race
(concurrent deletion between the insert attempt and the update) acceptable because
host-liveness tracking is inherently imprecise. Re-rolled: WAL syncs now scale with
real update rate, not call rate.
Key takeaways¶
INSERT ... ON CONFLICT DO UPDATElocks the conflicting row before evaluating theWHEREcondition — even if theWHEREultimately returns false and no row is modified. The PostgresINSERTdocs state this explicitly: "all rows will be locked when theON CONFLICT DO UPDATEaction is taken". In the backtrace, the upsert path goesExecInsert → ExecOnConflictUpdate → table_tuple_lock → heapam_tuple_lock → heap_lock_tuple → XLogInsert → WALInsertLockAcquire; the lock is written before the update predicate even runs. No-op upserts are therefore not free; each one costs one WAL sync.- Taking a row lock assigns a transaction ID, which forces a COMMIT record, which
forces an fsync. Each no-op upsert produced two WAL records: a
Heap LOCK_ONLY / EXCL_LOCKrecord for the implicit row lock, and aTransaction COMMITrecord because the xid assigned by the lock must either commit or abort. The COMMIT forces the WAL to disk. 500 upserts/sec therefore = 500 WAL syncs/sec = 500 IOPS minimum, regardless of how many rows actually changed. - A Postgres cluster has a hard single-writer fsync budget; at 25,000 upserts/sec
that budget dominates design.
pg_test_fsyncon a gp3 EBS volume measured ~1,000 fsyncs/sec on 8-KiB writes; beyond that Postgres starts batching commits, which adds latency for every transaction on the cluster, not just the misbehaving one. "No-op" queries that each cost an fsync are therefore not tail-latency-safe at scale. - Postgres 15's
pg_walinspectextension makes WAL behavior first-class debuggable.CREATE EXTENSION pg_walinspect;plus two~/.psqlrcshortcuts (pg_current_wal_lsncheckpoint +pg_get_wal_records_info(:lsn, 'FFFFFFFF/FFFFFFFF')) let them reproduce the upsert in a psql session and see the individual Heap/Btree/Transaction WAL records generated, with resource-manager type and xid. Together with anlldbbreakpoint onWALInsertLockAcquireon the backend PID, this gave both what WAL records were emitted and why (the C-level call stack that emitted them). This is a clean instance of patterns/bisect-driven-regression-hunt step 4 — "drop one observability layer below where the problem is invisible": application-level metrics said the upsert was a no-op, WAL-level introspection revealed the hidden writes. - The table design was already aggressive on write-amplification reduction. The
team had already: (a) moved
last_ingestedout of the mainhoststable into a dedicatedhost_last_ingestedtable so unchanged host metadata wouldn't be copied on every MVCC new-tuple write; (b) leftlast_ingestedunindexed so updates could qualify for HOT (Heap-Only Tuple) updates that skip index-entry writes; (c) set the tablefillfactor=80so each 8-KiB page has free space for subsequent HOT updates to stay on the same page. None of these mitigations helped against theON CONFLICT DO UPDATElock path, because the fsync cost was paid before the HOT path was even considered. HOT optimizations compose with the update cost, not the lock cost (concepts/postgres-mvcc-hot-updates). - The fix is a data-modifying CTE that splits INSERT-or-nothing from conditional
UPDATE.
WITH insert_attempt AS (INSERT ... ON CONFLICT DO NOTHING RETURNING ...) UPDATE ... WHERE host_id=:host_id AND last_ingested < now() - '1 day'::interval AND NOT EXISTS (SELECT FROM insert_attempt);.ON CONFLICT DO NOTHINGdoes not acquire a row lock on the existing conflicting row; the separateUPDATEonly touches rows that actually match the time-gated predicate. For rows more recent than 1 day old, neither statement writes anything — no Heap LOCK, no xid, no COMMIT record, no fsync. patterns/cte-emulated-upsert generalizes the shape. - The CTE-emulated upsert introduces a small race window, accepted by analysis.
Because there is no implicit row lock, a concurrent
DELETEbetween theINSERT ON CONFLICT DO NOTHING(which sees the row exists, does nothing) and the outerUPDATE(which now finds no row to update) can cause the operation to silently skip. Datadog's analysis: host-liveness tracking is inherently imprecise and returning-hosts after 7 days of inactivity is improbable, so the race is tolerable — an explicit correctness-vs-throughput trade-off with the trade quantified against the workload's semantics rather than simply hidden. - WAL-record verification closes the loop on the fix. Running
pg_walinspectagainst the CTE-emulated upsert confirmed three distinct outcomes: row absent → 5 records (Heap INSERT+INIT,Btree NEWROOT,Btree INSERT_LEAF,Heap HEAP_CONFIRM— speculative-insert confirmation underON CONFLICT DO NOTHING— andTransaction COMMIT); row present and older than 1 day → 2 records (Heap HOT_UPDATE+Transaction COMMIT, with HOT confirmed working — no index-write records); row present and within 1 day → zero WAL records. Post-rollout on the small data center, WAL syncs tracked actual update rate with no baseline elevation.
Numbers / concrete measurements¶
- Initial rollout scale: ~500 upserts/sec on a small data center.
- Target scale for largest data centers: 25,000 upserts/sec.
- Disk write IOPS impact of the buggy query: more than doubled on ~500 upserts/sec.
- WAL syncs impact: ~4× increase (quadrupled) on the same workload.
- Fsync capacity of a gp3 EBS disk (measured with
pg_test_fsync, 8-KiB writes): ~1,000 fsyncs/sec before Postgres begins batching commits + adding latency. - Upsert cost in WAL records when the row already exists (buggy query, no update
applied): 2 records —
Heap LOCK_ONLY/EXCL_LOCK+Transaction COMMIT— each query triggers one WAL sync. - Freshness granularity: 1 day (sufficient because deletion threshold is 7 days of inactivity).
- Table design constants:
fillfactor=80, no index onlast_ingested, primary key onhost_idonly.
WAL record taxonomy observed¶
| Scenario | WAL records (in order) |
|---|---|
| Buggy upsert on existing row (no update applied) | Heap LOCK_ONLY/EXCL_LOCK; Transaction COMMIT |
| CTE-emulated upsert on new row | Heap INSERT+INIT; Btree NEWROOT; Btree INSERT_LEAF; Heap HEAP_CONFIRM; Transaction COMMIT |
| CTE-emulated upsert on existing row, ≥1 day old | Heap HOT_UPDATE; Transaction COMMIT |
| CTE-emulated upsert on existing row, <1 day old | (none) |
Caveats / scope¶
- Postgres-specific; the exact mechanism (
ON CONFLICT DO UPDATElocks under MVCC) does not generalize to other upsert-offering engines (MySQLINSERT ... ON DUPLICATE KEY UPDATE, SQL ServerMERGE, etc.) which have different locking and redo-log semantics. - The CTE-emulated upsert's race window is workload-specific. For use cases where
missing an update can cause correctness bugs (billing, counters, idempotency state
machines), the race is not acceptable and the
ON CONFLICT DO UPDATEfsync cost may have to be paid — or a different design chosen (batch + dedup, or a queue- backed consumer). pg_walinspectrequires Postgres 15+. Earlier versions required reading WAL dumps withpg_waldumpoffline, much less ergonomic.pg_test_fsyncnumbers depend on device + filesystem + mount options; ~1000/sec on gp3 is Datadog's measurement, not a universal Postgres-on-EBS number.- No claim about the largest-cluster rollout outcome in the post — the rollout at 25,000/sec scale is described as motivation, not as completed rollout.
Connections into the wiki¶
- systems/postgresql — adds the upsert-locks-before-WHERE observation + the
pg_walinspect+ HOT-update + fillfactor vocabulary to the Postgres system page. - concepts/wal-write-ahead-logging — new page; canonical definition of WAL as
the durability primitive + its cost structure (every committed transaction = one
fsync) + the toolchain for inspecting it (
pg_walinspect,pg_get_wal_stats). - concepts/postgres-mvcc-hot-updates — new page; MVCC new-tuple-per-update
model, the HOT path that avoids index writes when no indexed column changed, and
fillfactoras the knob that buys free page space for HOT to stay on-page. - patterns/cte-emulated-upsert — new pattern; split
INSERT ... ON CONFLICT DO NOTHING+ conditionalUPDATEin one data-modifying CTE to avoid the conflict-path row lock; trades strict atomicity for a known-small concurrent-delete race, only correct when the workload tolerates it. - patterns/bisect-driven-regression-hunt — reinforces the "drop one
observability layer below where the problem is invisible" step;
pg_walinspect+lldbbreakpoint onWALInsertLockAcquireis the concrete move that proved which code path emitted the unexpected writes. - companies/datadog — adds another Postgres-internals production-debugging case alongside the Go 1.24 memory regression and Husky compaction internals.
Direct links¶
- Raw:
raw/datadog/2026-03-23-when-upserts-dont-update-but-still-write-debugging-postgres-30ddd4cd.md - Original: https://www.datadoghq.com/blog/engineering/debugging-postgres-performance/
- Postgres
INSERTdocs: https://www.postgresql.org/docs/current/sql-insert.html pg_walinspect: https://www.postgresql.org/docs/current/pgwalinspect.htmlpg_test_fsync: https://www.postgresql.org/docs/current/pgtestfsync.html- HOT storage: https://www.postgresql.org/docs/current/storage-hot.html