CONCEPT Cited by 1 source

Replication lag

Definition

Replication lag is the time between a write being applied on the primary server of a replicated database cluster and that same write becoming visible on a replica (or secondary / follower) server.

Formally, for a write W committed at time t_primary on the primary and first readable at time t_replica on replica R:

lag(R, W) = t_replica(W) - t_primary(W)

Because a single replica consumes the primary's changelog in order, lag is often reported as a scalar per replica: the delay of the most recent write that replica has applied relative to wall-clock time.
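The two definitions above can be written down directly. This is a minimal sketch, not taken from any replication implementation: the `WriteTiming` type and both function names are hypothetical, and timestamps are assumed to be seconds on a clock shared by primary and replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteTiming:
    """Timestamps for one write W, in seconds on a shared clock (assumed)."""
    t_primary: float  # when W committed on the primary
    t_replica: float  # when W first became readable on the replica


def lag(w: WriteTiming) -> float:
    """Per-write lag: lag(R, W) = t_replica(W) - t_primary(W)."""
    return w.t_replica - w.t_primary


def replica_lag_now(last_applied: WriteTiming, now: float) -> float:
    """Scalar per-replica lag: age of the most recently applied write
    relative to wall-clock time."""
    return now - last_applied.t_primary
```

The per-write form needs both timestamps; the scalar form needs only the commit time of the last applied write plus the current time, which is why it is the one usually reported.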

Why replication lag matters

  1. Failover time. When the primary fails and a replica is promoted, the replica can accept writes only after it has applied all of the primary's outstanding changelog. Replication lag is a lower bound on the promotion-delay component of RTO.
  2. Read-your-writes consistency. If a client writes on the primary and then reads from a replica, the read sees the write only if lag is smaller than the delay between the write and the read (lag < read_delay). Low lag makes it tractable to serve read traffic off replicas without violating per-user read-your-writes.
  3. Throttling signal. Lag is the most-used database-throttler signal in the MySQL world: easy to measure, clear business impact, and directly reflects replica ability to keep up with primary load.
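The read-your-writes condition in item 2 can be sketched as a routing decision. This is an illustration only: `choose_read_target` is a hypothetical helper, and it assumes the supplied `replica_lag` is an upper bound on the lag of the client's own last write.

```python
from typing import Optional


def choose_read_target(last_write_at: Optional[float],
                       now: float,
                       replica_lag: float) -> str:
    """Return "replica" only when the client's last write must already be
    visible there, i.e. when lag < (now - last_write_at)."""
    if last_write_at is None:
        # The client has not written, so there is nothing it could miss.
        return "replica"
    read_delay = now - last_write_at
    return "replica" if replica_lag < read_delay else "primary"
```

For example, with 0.5 s of lag a read issued 0.1 s after a write goes to the primary, while a read issued 2 s after the write can safely go to a replica.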

Where the lag comes from

Replication lag summarises an entire chain of queues (see concepts/queueing-theory). A single write event traverses that chain:

primary commit
  → write to binlog (local disk queue)
  → ship across network to replica (network queue)
  → wait in replica's relay log (local disk queue)
  → wait for SQL apply thread (apply queue)
  → apply to replica storage (storage queue)
  → visible to readers

Each of these queues can be the bottleneck. A sudden replication-lag spike does not, on its own, tell you which one; it says only that the chain is backing up somewhere. This is the symptom-vs-cause-metric property: the metric is useful because it summarises the chain, not because it instruments any single queue.
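A toy model makes the symptom-vs-cause point concrete. The stage names below mirror the chain above, but the delay values are invented for illustration and the model assumes the per-stage delays simply add up.

```python
from typing import Mapping

# Stages a write passes through between primary commit and reader visibility.
STAGES = (
    "binlog_write",
    "network_ship",
    "relay_log_wait",
    "apply_wait",
    "storage_apply",
)


def end_to_end_lag(stage_delays: Mapping[str, float]) -> float:
    """The observed lag is the sum of the delays in every stage (assumed)."""
    return sum(stage_delays[stage] for stage in STAGES)


def bottleneck(stage_delays: Mapping[str, float]) -> str:
    """The stage contributing the most delay; invisible from the lag alone."""
    return max(STAGES, key=lambda stage: stage_delays[stage])


# Two hypothetical incidents, delays in seconds.
network_bound = {"binlog_write": 0.5, "network_ship": 4.0,
                 "relay_log_wait": 0.25, "apply_wait": 0.25,
                 "storage_apply": 0.0}
apply_bound = {"binlog_write": 0.5, "network_ship": 0.25,
               "relay_log_wait": 0.25, "apply_wait": 4.0,
               "storage_apply": 0.0}
```

Both incidents report the same end-to-end lag, yet `bottleneck` returns a different stage for each. Telling them apart requires per-stage instrumentation, not the lag metric.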

Measurement: heartbeat injection

Because lag is a gap, not an observable on any single machine, the standard measurement technique is heartbeat injection:

  1. A controller deliberately writes a timestamped heartbeat row on the primary at a known interval (e.g. 1 per second).
  2. Replicas apply the heartbeat row through the normal replication path.
  3. At read time, a monitor on the replica reads the most recent heartbeat timestamp and subtracts it from wall-clock: lag ≈ now() - max(heartbeat_ts).

Granularity is bounded by the heartbeat interval: a 1-second interval cannot distinguish 0.2 s lag from 0.8 s lag on any given sample, though averaging over many samples narrows the estimate.
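The measurement and its granularity limit can be simulated without a database. This is a sketch under simplifying assumptions: heartbeats are written at exact multiples of the interval, the true lag is constant, and the monitor's clock matches the primary's. All function names are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def latest_visible_heartbeat(now: float, true_lag: float, interval: float) -> float:
    """Timestamp of the newest heartbeat the replica has applied at `now`.

    A heartbeat written at time t becomes visible at t + true_lag, so the
    newest visible one is the last multiple of `interval` <= now - true_lag.
    """
    return math.floor((now - true_lag) / interval) * interval


def measured_lag(now: float, true_lag: float, interval: float) -> float:
    """What the monitor reports: lag ~= now() - max(heartbeat_ts)."""
    return now - latest_visible_heartbeat(now, true_lag, interval)


def estimate_true_lag(samples: Sequence[float], interval: float) -> float:
    """Average many samples and remove the mean staleness of the newest
    heartbeat (interval / 2 when sample times are spread evenly)."""
    return mean(samples) - interval / 2
```

With a 1-second interval, a sample taken at `now = 10.9` reads the same value whether the true lag is 0.2 s or 0.8 s, because in both cases the newest visible heartbeat is the one written at t = 10. Averaging samples spread across the interval separates the two cases again.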

Alternatives such as seconds-behind-master derived from the primary's binlog position read the event position directly but are subject to clock-skew between primary and replica and to the lag in executing the position-reporting command itself.

Threshold setting

For a throttler, acceptable replication lag is a business-derived threshold, set by:

  • Failover RTO tolerance: lag above the RTO budget means promotion alone takes longer than the budget allows.
  • Read-your-writes requirement: lag must stay below the staleness that clients reading off replicas can tolerate.
  • Downstream tooling tolerance — some consumers (analytics, CDC pipelines) assume bounded lag.
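One way to combine the three inputs is to take the tightest of them as the throttling threshold. That choice is an assumption made for this sketch, not something the sources prescribe, and the function and parameter names are hypothetical.

```python
def lag_threshold(rto_budget_s: float,
                  ryw_tolerance_s: float,
                  downstream_tolerance_s: float) -> float:
    """Acceptable lag in seconds: the strictest of the three tolerances
    (assumption: every requirement must hold at once)."""
    return min(rto_budget_s, ryw_tolerance_s, downstream_tolerance_s)


def should_throttle(current_lag_s: float, threshold_s: float) -> bool:
    """Push back on writers once measured lag exceeds the threshold."""
    return current_lag_s > threshold_s
```

For instance, a 30 s RTO budget, a 5 s read-your-writes tolerance, and a 60 s downstream tolerance give a 5 s threshold, so a replica measured at 7 s of lag would trigger throttling.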

Typical operational ranges named in the wiki corpus:

Seen in

  • — Brian Morrison II (PlanetScale, 2023-11-15) canonicalises the monitoring-is-mandatory framing: "All infrastructure requires monitoring to catch issues proactively, and replication is no exception. If left unmonitored, you'd have no idea whether or not your data is actually being replicated once it's configured." Load-bearing observation: replication has no intrinsic loud-failure signal. A replica that stops pulling binlog silently drifts into unbounded lag while writes on the primary still succeed, so active lag monitoring is the only way to notice before a failover exposes the "failover candidate" as a replica that is actually hours behind. The post also canonicalises cross-region lag amplification: "Replication in itself has a bit of a delta between the time that data is written to the source and the time it is written to a replica, known as replication lag. This is exacerbated when replicating across longer distances." Cross-AZ (single-digit ms) vs cross-region (60ms+ us-east-1 to us-west-1 per cloudping.co) is the canonical denominator for understanding why cross-region read replicas cannot serve read-your-writes workflows without sticky routing. Names two monitoring stacks: SolarWinds Database Performance (formerly VividCortex) and Prometheus; Prometheus is PlanetScale's fleet-metric tier underneath systems/planetscale-insights (query-tier) and systems/vitess-throttler (replication-lag-driven admission control).

  • sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1: "replication lag is probably the single most used throttling indicator" in the MySQL world. Noach frames it as the canonical throttling signal because it is easy to measure and has clear business impact (failover time, read-your-writes). Also the canonical worked example of the threshold-as-standard dynamic: "what we consider as the throttling threshold (say, a 5 sec replication lag) becomes the actual standard" for workloads large enough to push against it.

  • — Matt Lord's petabyte-scale migration post uses replication-lag and binlog-replication mechanisms as the underlying substrate for Vitess VReplication's copy + catch-up interleaving. Migration throttling must respect replication-lag budget to stay inside binlog retention horizon.
  • — Berquist canonicalises replication lag as the leading indicator of the write-throughput ceiling: "When the primary is maxed on IOPS, writes will become less performant. Usually before that, however, replication lag becomes a problem." The lag symptom chain (lag → stale reads → read-your-writes errors) is the early-warning signal that a single-primary topology is approaching capacity; it fires before IOPS saturation at the primary itself.
  • — Taylor Barnett's Portals launch uses replication-lag as the load-bearing failure mode to argue for the session-cookie read-your-writes window: "After each write, it will set a cookie that will send all reads to the primary for 2 seconds, allowing users to read their own writes … This protects our users from ever reading stale data due to replication lag." The 2 s window must exceed cross-region replication-lag p99 to preserve RYW under regional read replica routing.