DATADOG 2026-03-23 Tier 3

Datadog — When upserts don't update but still write: debugging Postgres performance at scale

Summary

Datadog's host-metadata team added an INSERT ... ON CONFLICT DO UPDATE upsert to track last_ingested per host on a dedicated, unindexed, fillfactor=80 table — engineered specifically to qualify for Postgres Heap-Only Tuple (HOT) updates and to behave as a no-op on the ~99.9% of calls where the row already exists and was updated within the last day. After rollout, disk write IOPS more than doubled and WAL syncs quadrupled on a small data center running only ~500 upserts/sec, even though almost no rows were actually being updated. Using the pg_walinspect extension (Postgres 15+) plus an lldb breakpoint on WALInsertLockAcquire, they proved that ON CONFLICT DO UPDATE locks every conflicting row before evaluating the WHERE clause, each lock mutates tuple metadata to store a transaction ID, the xid forces a COMMIT record, and each COMMIT forces a WAL fsync. At the planned 25,000 upserts/sec scale this would have exceeded the single-writer sync budget of their gp3 EBS-backed cluster (pg_test_fsync measured ~1,000 8-KiB fsyncs/sec before Postgres starts batching commits and adding latency). Fix: rewrite the query as a data-modifying CTE: INSERT ... ON CONFLICT DO NOTHING as the CTE, then a separate UPDATE gated on NOT EXISTS (SELECT FROM insert_attempt) — which trades the implicit row lock for a small tolerable race (concurrent deletion between the insert attempt and the update) acceptable because host-liveness tracking is inherently imprecise. Re-rolled: WAL syncs now scale with real update rate, not call rate.
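The table design described above can be sketched as DDL. This is a reconstruction from the summary's stated constants (dedicated table, fillfactor=80, primary key on host_id only, last_ingested unindexed); the column types are assumptions, not from the original post.

```sql
-- Sketch of the dedicated tracking table, assuming bigint/timestamptz types.
CREATE TABLE host_last_ingested (
    host_id        bigint PRIMARY KEY,      -- the only index on the table
    last_ingested  timestamptz NOT NULL
) WITH (fillfactor = 80);  -- leave ~20% free space per 8-KiB page so
                           -- updates can stay on-page as HOT updates
-- last_ingested is deliberately unindexed: an update that changes only
-- non-indexed columns qualifies for a Heap-Only Tuple (HOT) update and
-- skips all index-entry writes.
```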

Key takeaways

  1. INSERT ... ON CONFLICT DO UPDATE locks the conflicting row before evaluating the WHERE condition — even if the WHERE ultimately returns false and no row is modified. The Postgres INSERT docs state this explicitly: "all rows will be locked when the ON CONFLICT DO UPDATE action is taken". In the backtrace, the upsert path goes ExecInsert → ExecOnConflictUpdate → table_tuple_lock → heapam_tuple_lock → heap_lock_tuple → XLogInsert → WALInsertLockAcquire; the lock is written before the update predicate even runs. No-op upserts are therefore not free; each one costs one WAL sync.
  2. Taking a row lock assigns a transaction ID, which forces a COMMIT record, which forces an fsync. Each no-op upsert produced two WAL records: a Heap LOCK_ONLY / EXCL_LOCK record for the implicit row lock, and a Transaction COMMIT record because the xid assigned by the lock must either commit or abort. The COMMIT forces the WAL to disk. 500 upserts/sec therefore = 500 WAL syncs/sec = 500 IOPS minimum, regardless of how many rows actually changed.
  3. A Postgres cluster has a hard single-writer fsync budget; at 25,000 upserts/sec that budget dominates design. pg_test_fsync on a gp3 EBS volume measured ~1,000 fsyncs/sec on 8-KiB writes; beyond that Postgres starts batching commits, which adds latency for every transaction on the cluster, not just the misbehaving one. "No-op" queries that each cost an fsync are therefore not tail-latency-safe at scale.
  4. Postgres 15's pg_walinspect extension makes WAL behavior first-class debuggable. CREATE EXTENSION pg_walinspect; plus two ~/.psqlrc shortcuts (pg_current_wal_lsn checkpoint + pg_get_wal_records_info(:lsn, 'FFFFFFFF/FFFFFFFF')) let them reproduce the upsert in a psql session and see the individual Heap/Btree/Transaction WAL records generated, with resource-manager type and xid. Together with an lldb breakpoint on WALInsertLockAcquire on the backend PID, this gave both what WAL records were emitted and why (the C-level call stack that emitted them). This is a clean instance of patterns/bisect-driven-regression-hunt step 4 — "drop one observability layer below where the problem is invisible": application-level metrics said the upsert was a no-op, WAL-level introspection revealed the hidden writes.
  5. The table design was already aggressive on write-amplification reduction. The team had already: (a) moved last_ingested out of the main hosts table into a dedicated host_last_ingested table so unchanged host metadata wouldn't be copied on every MVCC new-tuple write; (b) left last_ingested unindexed so updates could qualify for HOT (Heap-Only Tuple) updates that skip index-entry writes; (c) set the table fillfactor=80 so each 8-KiB page has free space for subsequent HOT updates to stay on the same page. None of these mitigations helped against the ON CONFLICT DO UPDATE lock path, because the fsync cost was paid before the HOT path was even considered. HOT optimizations compose with the update cost, not the lock cost (concepts/postgres-mvcc-hot-updates).
  6. The fix is a data-modifying CTE that splits INSERT-or-nothing from conditional UPDATE. WITH insert_attempt AS (INSERT ... ON CONFLICT DO NOTHING RETURNING ...) UPDATE ... WHERE host_id=:host_id AND last_ingested < now() - '1 day'::interval AND NOT EXISTS (SELECT FROM insert_attempt);. ON CONFLICT DO NOTHING does not acquire a row lock on the existing conflicting row; the separate UPDATE only touches rows that actually match the time-gated predicate. For rows more recent than 1 day old, neither statement writes anything — no Heap LOCK, no xid, no COMMIT record, no fsync. patterns/cte-emulated-upsert generalizes the shape.
  7. The CTE-emulated upsert introduces a small race window, accepted by analysis. Because there is no implicit row lock, a concurrent DELETE between the INSERT ON CONFLICT DO NOTHING (which sees the row exists, does nothing) and the outer UPDATE (which now finds no row to update) can cause the operation to silently skip. Datadog's analysis: host-liveness tracking is inherently imprecise and returning-hosts after 7 days of inactivity is improbable, so the race is tolerable — an explicit correctness-vs-throughput trade-off with the trade quantified against the workload's semantics rather than simply hidden.
  8. WAL-record verification closes the loop on the fix. Running pg_walinspect against the CTE-emulated upsert confirmed three distinct outcomes: row absent → 5 records (Heap INSERT+INIT, Btree NEWROOT, Btree INSERT_LEAF, Heap HEAP_CONFIRM — speculative-insert confirmation under ON CONFLICT DO NOTHING — and Transaction COMMIT); row present and older than 1 day → 2 records (Heap HOT_UPDATE + Transaction COMMIT, with HOT confirmed working — no index-write records); row present and within 1 day → zero WAL records. Post-rollout on the small data center, WAL syncs tracked actual update rate with no baseline elevation.
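The two query shapes discussed in takeaways 1 and 6 can be written out side by side. This is a reconstruction from the fragments quoted in the summary; the table and column names match the summary, but the exact column list and the `now()` freshness expression are assumptions.

```sql
-- Buggy shape: ON CONFLICT DO UPDATE locks the conflicting row BEFORE
-- evaluating this WHERE, so even a no-op call emits a Heap lock record,
-- assigns an xid, and forces a COMMIT-driven WAL fsync.
INSERT INTO host_last_ingested (host_id, last_ingested)
VALUES (:host_id, now())
ON CONFLICT (host_id) DO UPDATE
    SET last_ingested = now()
    WHERE host_last_ingested.last_ingested < now() - '1 day'::interval;

-- Fixed shape: data-modifying CTE. ON CONFLICT DO NOTHING takes no row
-- lock on the existing row; the outer UPDATE only writes when the
-- time-gated predicate actually matches.
WITH insert_attempt AS (
    INSERT INTO host_last_ingested (host_id, last_ingested)
    VALUES (:host_id, now())
    ON CONFLICT DO NOTHING
    RETURNING host_id
)
UPDATE host_last_ingested
    SET last_ingested = now()
    WHERE host_id = :host_id
      AND last_ingested < now() - '1 day'::interval
      AND NOT EXISTS (SELECT FROM insert_attempt);
```

In the common case (row exists, fresher than 1 day) the CTE version touches nothing: no Heap LOCK record, no xid, no COMMIT, no fsync.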

Numbers / concrete measurements

  • Initial rollout scale: ~500 upserts/sec on a small data center.
  • Target scale for largest data centers: 25,000 upserts/sec.
  • Disk write IOPS impact of the buggy query: more than doubled on ~500 upserts/sec.
  • WAL syncs impact: ~4× increase (quadrupled) on the same workload.
  • Fsync capacity of a gp3 EBS disk (measured with pg_test_fsync, 8-KiB writes): ~1,000 fsyncs/sec before Postgres begins batching commits + adding latency.
  • Upsert cost in WAL records when the row already exists (buggy query, no update applied): 2 records — Heap LOCK_ONLY/EXCL_LOCK + Transaction COMMIT — each query triggers one WAL sync.
  • Freshness granularity: 1 day (sufficient because deletion threshold is 7 days of inactivity).
  • Table design constants: fillfactor=80, no index on last_ingested, primary key on host_id only.
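The budget math above is worth making explicit: with the buggy query, every upsert costs one WAL fsync regardless of whether a row changed, so call rate maps directly onto the measured gp3 fsync budget. A trivial check, using only the article's numbers:

```python
# Fsync-budget arithmetic from the article's measurements.
fsync_budget = 1_000   # pg_test_fsync on gp3 EBS, 8-KiB writes, fsyncs/sec
small_dc     = 500     # upserts/sec at initial rollout
target       = 25_000  # upserts/sec planned for the largest data centers

# Buggy ON CONFLICT DO UPDATE: one WAL sync per call, even for no-ops.
print(small_dc / fsync_budget)  # 0.5  -> half the budget on the small DC alone
print(target / fsync_budget)    # 25.0 -> 25x over budget at target scale
```

Which is why the design had to decouple WAL syncs from call rate before the large-data-center rollout.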

WAL record taxonomy observed

  • Buggy upsert on existing row (no update applied): Heap LOCK_ONLY/EXCL_LOCK; Transaction COMMIT
  • CTE-emulated upsert on new row: Heap INSERT+INIT; Btree NEWROOT; Btree INSERT_LEAF; Heap HEAP_CONFIRM; Transaction COMMIT
  • CTE-emulated upsert on existing row, ≥1 day old: Heap HOT_UPDATE; Transaction COMMIT
  • CTE-emulated upsert on existing row, <1 day old: (none)
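The records above were observed with the pg_walinspect workflow from takeaway 4. A minimal sketch of that session, assuming a Postgres 15+ superuser psql connection (the `\gset` variable name is illustrative; the article stores similar shortcuts in ~/.psqlrc):

```sql
CREATE EXTENSION IF NOT EXISTS pg_walinspect;

-- Mark the current WAL position before running the statement under test.
SELECT pg_current_wal_lsn() AS lsn \gset

-- ... run the upsert being investigated here ...

-- List every WAL record emitted since the mark, with its resource
-- manager, record type, and xid.
SELECT start_lsn, resource_manager, record_type, xid
FROM pg_get_wal_records_info(:'lsn', 'FFFFFFFF/FFFFFFFF');
```

Running this around each of the three scenarios reproduces the record counts in the taxonomy above.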

Caveats / scope

  • Postgres-specific; the exact mechanism (ON CONFLICT DO UPDATE locks under MVCC) does not generalize to other upsert-offering engines (MySQL INSERT ... ON DUPLICATE KEY UPDATE, SQL Server MERGE, etc.) which have different locking and redo-log semantics.
  • The CTE-emulated upsert's race window is workload-specific. For use cases where missing an update can cause correctness bugs (billing, counters, idempotency state machines), the race is not acceptable and the ON CONFLICT DO UPDATE fsync cost may have to be paid — or a different design chosen (batch + dedup, or a queue-backed consumer).
  • pg_walinspect requires Postgres 15+. Earlier versions required reading WAL dumps with pg_waldump offline, much less ergonomic.
  • pg_test_fsync numbers depend on device + filesystem + mount options; ~1000/sec on gp3 is Datadog's measurement, not a universal Postgres-on-EBS number.
  • No claim about the largest-cluster rollout outcome in the post — the rollout at 25,000/sec scale is described as motivation, not as completed rollout.

Connections into the wiki

  • systems/postgresql — adds the upsert-locks-before-WHERE observation + the pg_walinspect + HOT-update + fillfactor vocabulary to the Postgres system page.
  • concepts/wal-write-ahead-logging — new page; canonical definition of WAL as the durability primitive + its cost structure (every committed transaction = one fsync) + the toolchain for inspecting it (pg_walinspect, pg_get_wal_stats).
  • concepts/postgres-mvcc-hot-updates — new page; MVCC new-tuple-per-update model, the HOT path that avoids index writes when no indexed column changed, and fillfactor as the knob that buys free page space for HOT to stay on-page.
  • patterns/cte-emulated-upsert — new pattern; split INSERT ... ON CONFLICT DO NOTHING + conditional UPDATE in one data-modifying CTE to avoid the conflict-path row lock; trades strict atomicity for a known-small concurrent-delete race, only correct when the workload tolerates it.
  • patterns/bisect-driven-regression-hunt — reinforces the "drop one observability layer below where the problem is invisible" step; pg_walinspect + lldb breakpoint on WALInsertLockAcquire is the concrete move that proved which code path emitted the unexpected writes.
  • companies/datadog — adds another Postgres-internals production-debugging case alongside the Go 1.24 memory regression and Husky compaction internals.