CONCEPT Cited by 2 sources
Replication lag¶
Definition¶
Replication lag is the time between a write being applied on the primary server of a replicated database cluster and that same write becoming visible on a replica (or secondary / follower) server.
Formally, for a write W committed at time t_primary on the primary and first readable at time t_replica on replica R:
lag(W, R) = t_replica - t_primary
Because a single replica consumes the primary's changelog in order, lag is often reported as a scalar per replica: the delay of the most recent write that replica has applied relative to wall-clock time.
Why replication lag matters¶
- Failover time. When the primary fails and a replica is promoted, the replica can accept writes only after it has applied all of the primary's outstanding changelog. Replication lag is a lower bound on the promotion-delay component of RTO.
- Read-your-writes consistency. If a client writes on the primary and immediately reads from a replica, the read sees the write only if lag < read_delay, where read_delay is the time elapsed between the write and the read. Low lag makes it tractable to serve read traffic off replicas without violating per-user read-your-writes.
- Throttling signal. Lag is the most-used database-throttler signal in the MySQL world: it is easy to measure, has clear business impact, and directly reflects the replicas' ability to keep up with primary load.
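The read-your-writes condition above can be sketched as a routing check. This is a minimal illustration, not a production router: the function name `can_read_from_replica` and its parameters are hypothetical, and it assumes the caller already has a trustworthy lag estimate for the replica.

```python
import time
from typing import Optional


def can_read_from_replica(
    last_write_ts: float,
    replica_lag_s: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if a replica lagging by replica_lag_s seconds should
    already expose a write committed on the primary at last_write_ts.

    All names here are illustrative assumptions, not part of any library.
    """
    if now is None:
        now = time.time()
    # read_delay: how long ago this client wrote on the primary.
    read_delay = now - last_write_ts
    # The write is visible on the replica only if lag < read_delay.
    return replica_lag_s < read_delay
```

A caller would fall back to the primary whenever this returns False, so lower lag means fewer reads are forced onto the primary.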
Where the lag comes from¶
Replication lag summarises an entire chain of queues (see concepts/queueing-theory). A single write event traverses the chain in order:
primary commit
→ write to binlog (local disk queue)
→ ship across network to replica (network queue)
→ wait in replica's relay log (local disk queue)
→ wait for SQL apply thread (apply queue)
→ apply to replica storage (storage queue)
→ visible to readers
Each of these queues can be the bottleneck. A sudden replication-lag spike does not, on its own, tell you which one — only that the chain is backing up somewhere. This is the symptom-vs-cause-metric property: the metric is useful because it summarises the chain, not because it instruments any single queue.
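To make the symptom-vs-cause point concrete, the sketch below models the chain as per-stage wait times and shows two scenarios with identical total lag but different bottlenecks. The stage names and all numbers are made up for illustration; real systems do not necessarily expose these waits individually.

```python
from typing import Dict

# Stages of the chain, in traversal order (names are illustrative).
STAGES = [
    "binlog_write",
    "network_ship",
    "relay_log_wait",
    "apply_queue_wait",
    "storage_apply",
]


def total_lag(stage_waits: Dict[str, float]) -> float:
    """Lag seen by readers: the sum of the waits in every stage."""
    return sum(stage_waits[stage] for stage in STAGES)


def bottleneck(stage_waits: Dict[str, float]) -> str:
    """The stage contributing the largest wait."""
    return max(STAGES, key=lambda stage: stage_waits[stage])


# Two hypothetical scenarios (seconds per stage).
network_bound = {
    "binlog_write": 0.25, "network_ship": 4.0, "relay_log_wait": 0.25,
    "apply_queue_wait": 0.25, "storage_apply": 0.25,
}
apply_bound = {
    "binlog_write": 0.25, "network_ship": 0.25, "relay_log_wait": 0.25,
    "apply_queue_wait": 4.0, "storage_apply": 0.25,
}
```

Both scenarios report the same total lag, yet `bottleneck` names a different stage in each, which is why the lag metric alone cannot identify the cause.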
Measurement: heartbeat injection¶
Because lag is a gap, not an observable on any single machine, the standard measurement technique is heartbeat injection:
- A controller deliberately writes a timestamped heartbeat row on the primary at a known interval (e.g. 1 per second).
- Replicas apply the heartbeat row through the normal replication path.
- At read time, a monitor on the replica reads the most recent heartbeat timestamp and subtracts it from wall-clock time: lag ≈ now() - max(heartbeat_ts).
Granularity is bounded by the heartbeat interval: a 1-second interval cannot distinguish 0.2 s lag from 0.8 s lag on any given sample, though averaging over many samples narrows the estimate.
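The heartbeat estimate and its granularity limit can be sketched with a small simulation. This is a simplified model under stated assumptions: heartbeats are written exactly once per second, every heartbeat becomes visible on the replica after a constant `true_lag_s`, and clocks are perfectly synchronised. The function names are hypothetical.

```python
from typing import List


def visible_heartbeats(
    heartbeat_ts: List[float], true_lag_s: float, now: float
) -> List[float]:
    """Heartbeats the replica has applied by time `now`, assuming each one
    becomes visible exactly true_lag_s seconds after it was written."""
    return [ts for ts in heartbeat_ts if ts + true_lag_s <= now]


def estimate_lag(visible: List[float], now: float) -> float:
    """lag ≈ now() - max(heartbeat_ts), as computed by the monitor."""
    if not visible:
        raise ValueError("no heartbeat has been applied yet")
    return now - max(visible)


# Heartbeats written on the primary at t = 0, 1, ..., 10 (1 s interval).
heartbeats = [float(t) for t in range(11)]

# The monitor samples at t = 10.75 under two different true lags.
now = 10.75
estimate_small = estimate_lag(visible_heartbeats(heartbeats, 0.25, now), now)
estimate_large = estimate_lag(visible_heartbeats(heartbeats, 0.75, now), now)
```

In this model a true lag of 0.25 s and a true lag of 0.75 s both produce an estimate of 0.75 s at the same sampling instant, because the newest visible heartbeat is the same in both cases. A single sample overestimates by up to one heartbeat interval.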
Alternatives such as seconds-behind-master derived from the primary's binlog position read the event position directly but are subject to clock-skew between primary and replica and to the lag in executing the position-reporting command itself.
Threshold setting¶
For a throttler, acceptable replication lag is a business-derived threshold based on:
- Failover RTO tolerance — lag > RTO means longer promotions.
- Read-your-writes requirement — lag < read-latency tolerance for reads off replicas.
- Downstream tooling tolerance — some consumers (analytics, CDC pipelines) assume bounded lag.
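One way to combine these three inputs is to take the tightest tolerance as the throttling threshold. That choice is an assumption of this sketch, not something the sources prescribe, and the class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LagTolerances:
    """Business-derived lag tolerances, in seconds (illustrative fields)."""
    failover_rto_s: float       # lag the failover RTO can absorb
    read_your_writes_s: float   # lag acceptable for reads off replicas
    downstream_s: float         # lag that analytics / CDC consumers assume


def throttle_threshold(tolerances: LagTolerances) -> float:
    """Assumed policy: the strictest requirement sets the threshold."""
    return min(
        tolerances.failover_rto_s,
        tolerances.read_your_writes_s,
        tolerances.downstream_s,
    )


def should_throttle(lag_s: float, threshold_s: float) -> bool:
    """Push back on writers once measured lag exceeds the threshold."""
    return lag_s > threshold_s


# Example values only; real tolerances come from the business.
example = LagTolerances(
    failover_rto_s=30.0, read_your_writes_s=5.0, downstream_s=60.0
)
threshold = throttle_threshold(example)
```

With these example values the read-your-writes tolerance is the binding constraint, so the throttler would start rejecting or delaying work once lag passes it.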
Typical operational ranges named in the wiki corpus:
- < 1 s — normal steady state for well-tuned MySQL clusters.
- 5 s — example throttler threshold in Noach's worked example (sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1).
- > 30 s — deep degradation; promotion is delayed and read-your-writes guarantees break.
Seen in¶
- sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1 — "replication lag is probably the single most used throttling indicator" in the MySQL world. Noach frames it as the canonical throttling signal because it is easy to measure and has clear business impact (failover time, read-your-writes). Also the canonical worked example of the threshold-as-standard dynamic: "what we consider as the throttling threshold (say, a 5 sec replication lag) becomes the actual standard" for workloads large enough to push against it.
- sources/2026-02-16-planetscale-zero-downtime-migrations-at-petabyte-scale — Matt Lord's petabyte-scale migration post uses replication-lag and binlog-replication mechanisms as the underlying substrate for Vitess VReplication's copy + catch-up interleaving. Migration throttling must respect the replication-lag budget to stay inside the binlog retention horizon.
Related¶
- concepts/binlog-replication — the MySQL mechanism from which replication lag is derived.
- concepts/gtid-position — replication-position primitive in GTID-based deployments; lag can be expressed in transactions rather than seconds.
- concepts/asynchronous-replication — lag is expected and non-zero under async replication; synchronous replication eliminates it at a latency cost.
- concepts/symptom-vs-cause-metric — why lag is a useful summary of a queue chain, not a cause.
- patterns/heartbeat-based-replication-lag-measurement — the canonical measurement mechanism.