Skip to content

CONCEPT Cited by 1 source

Transaction commit delay

Definition

Transaction commit delay — also called commit queue delay or commit latency — is the time a transaction spends in the commit/flush queue waiting for its changes to be durably written to the transaction log (redo log in InnoDB, WAL in Postgres) before returning success to the client.

It is the queue wait time for the commit queue, analogous to how replication lag is the queue wait time for the changelog queue. See concepts/queueing-theory for the broader framing.

Why it's a useful throttling signal

Unlike threads_running (which conflates commit-queue backup with lock-wait backup with page-cache-miss backup), commit delay isolates the disk-flush queue:

  • High commit delay ⇒ disk / fsync / replication-ack is bottlenecked.
  • Low commit delay ⇒ even if threads_running is high, the cluster is keeping up on the write path.

Noach frames this as an improvement on threads_running:

"A major contributor to a spike in concurrent queries is their inability to complete. Normally, this means they're held back at commit time, i.e. they wait to be written to the transaction (redo) log. And that means they're in the transaction queue, and we can hence measure the transaction queue latency (aka queue delay)."

Anatomy of a Throttler, part 1

Threshold setting is hardware-derived

Unlike threads_running (no stable threshold) and unlike replication lag (business-derived threshold), commit delay has a hardware-derived threshold:

Substrate Typical commit latency
Local NVMe with innodb_flush_log_at_trx_commit=1 Sub-millisecond to low ms
AWS EBS via virtio Few ms
Aurora with 6-node quorum commit Reported to differ significantly from plain RDS

"Transaction commit delay is typically caused by disk write/flush time, and that changes dramatically across hardware. It is a matter of knowing your metrics."

This hardware dependency is good for a throttler: once the operator has calibrated their system's baseline, the threshold is stable against query-mix fluctuations that would throw off threads_running.

Group commit / batched fsync

Modern databases batch multiple concurrent commits into a single fsync to amortise disk-flush cost (InnoDB group commit, Postgres group commit). This decouples commit delay from transaction arrival rate at low load: a commit can wait up to innodb_flush_log_at_timeout (or similar) to be bundled, adding a fixed jitter floor. Throttlers that read commit delay must account for this.

Why it's still a symptom, not a cause

Even commit-delay isolation has limits: a spike can be caused by disk degradation or by a co-tenant workload saturating shared storage or by replication-ack slowdown in synchronous setups. The metric remains a symptom of one specific queue rather than a cause — but a narrower one than threads_running.

Seen in

Last updated · 319 distilled / 1,201 read