PATTERN Cited by 1 source
CTE-emulated upsert¶
Definition¶
CTE-emulated upsert is a Postgres query pattern for "insert-or-update" semantics
that replaces the built-in INSERT ... ON CONFLICT DO UPDATE with a data-modifying
Common Table Expression (CTE) composing two separate statements: an
INSERT ... ON CONFLICT DO NOTHING inside the CTE, followed by a conditional
UPDATE gated on whether the CTE inserted anything.
The motivation is a specific Postgres behavior: ON CONFLICT DO UPDATE locks the
conflicting row before evaluating the WHERE condition, even when the WHERE
would filter the row out and no column would be modified. That row lock mutates
tuple metadata with a transaction ID, which forces a COMMIT record, which forces a
WAL fsync. At scale (Datadog: 25,000 upserts/sec targeted) the fsync cost per no-op
call can exceed the cluster's single-writer sync budget (Source:
sources/2026-03-23-datadog-debugging-postgres-upsert-wal).
Template¶
WITH insert_attempt AS (
INSERT INTO host_last_ingested (host_id)
VALUES (:host_id)
ON CONFLICT DO NOTHING
RETURNING host_id
)
UPDATE host_last_ingested
SET last_ingested = now()
WHERE host_id = :host_id
AND last_ingested < now() - '1 day'::interval
AND NOT EXISTS (SELECT FROM insert_attempt);
Three cases, observed directly via pg_walinspect:
| Row state | What executes | WAL records |
|---|---|---|
| Row absent | CTE inserts; outer UPDATE short-circuits via NOT EXISTS |
INSERT+INIT + Btree NEWROOT + INSERT_LEAF + HEAP_CONFIRM + COMMIT |
| Row present, stale (≥1 day old) | CTE does nothing; outer UPDATE runs | HOT_UPDATE + COMMIT |
| Row present, fresh (<1 day old) | CTE does nothing; outer UPDATE predicate false | (none) |
The empty-records case is the whole point — the common path writes zero WAL records (Source: sources/2026-03-23-datadog-debugging-postgres-upsert-wal).
Why this avoids the lock-WAL cost¶
ON CONFLICT DO NOTHINGdoes not take a row lock on the conflicting row. If the row exists, the statement simply becomes a no-op without mutating tuple metadata, and therefore emits no WAL record.- The outer
UPDATEonly takes a row lock when itsWHEREpredicate actually matches a live row that will be modified. Unmatched rows are not locked. - No
xidis assigned in the fresh-row case; no COMMIT record is required; no fsync is forced.
Trade-off: a small, workload-dependent race¶
Because no row is locked across the CTE → UPDATE boundary, a concurrent DELETE
between the two statements can cause the update to miss a row that existed at the
start of the operation:
- Tx1 runs
INSERT ... ON CONFLICT DO NOTHING: row exists, CTE returns empty. - Tx2
DELETEs the row. - Tx1's outer
UPDATEfinds no matching row, and theNOT EXISTSguard does not trigger a re-insert.
INSERT ... ON CONFLICT DO UPDATE's implicit row lock would have serialized this.
The CTE version does not. The pattern is only correct when the workload tolerates
a low-probability dropped update, and the trade-off should be stated explicitly:
- Datadog's host-liveness tracker tolerates it because host-liveness is inherently imprecise (the deletion threshold is 7 days of inactivity and hosts rarely return after that).
- Billing counters, idempotency state machines, leader-election state, and other correctness-critical stores do not tolerate it; they should keep the implicit lock and pay the fsync cost (or move to a different design entirely — batch + dedup, queue-backed consumer).
Generalizations¶
- Any upsert where the conditional predicate is expected to be false on the common path is a candidate — the goal is always to avoid the per-call lock-then-skip WAL cost.
- The pattern is Postgres-specific; equivalent reasoning does not transfer to MySQL
INSERT ... ON DUPLICATE KEY UPDATEor SQL ServerMERGE, which have different locking and redo-log semantics. - For multi-row upserts, the same shape applies with a
RETURNINGset and a correlated outer UPDATE (WHERE ... AND NOT EXISTS (SELECT FROM insert_attempt WHERE host_id = t.host_id)).
Relation to other wiki concepts¶
- concepts/wal-write-ahead-logging — the budget the pattern protects.
- concepts/postgres-mvcc-hot-updates — the pattern composes with HOT: when
the outer
UPDATEdoes fire, it still qualifies for HOT (unindexed updated column +fillfactor<100), so the common write path is a singleHOT_UPDATEWAL record.
Seen in¶
- sources/2026-03-23-datadog-debugging-postgres-upsert-wal — Datadog replaces
their
INSERT ... ON CONFLICT DO UPDATE WHERE last_ingested < ... - '1 day'query (2 WAL records per no-op call, 500 fsyncs/sec at 500 upserts/sec, untenable at 25,000 upserts/sec target) with the CTE-emulated form. WAL syncs post-rollout tracked actual update rate instead of call rate; the race window is accepted because host-liveness tracking is inherently imprecise.