PATTERN Cited by 1 source
Heartbeat-based replication lag measurement¶
Problem¶
Replication lag is the time between a write landing on the primary and that write becoming visible on a replica. It is a gap, not an observable on any single server. Naive measurements are fragile:
- Byte-position (e.g. SHOW SLAVE STATUS → Seconds_Behind_Master) is subject to clock-skew between primary and replica and to reporting staleness.
- Last-replayed timestamp assumes the changelog carries wall-clock time; often the replica reapplies the primary's event timestamp, which may not reflect network delay.
- Direct primary-replica comparison requires cross-machine time synchronisation to tighter tolerances than the lag threshold itself.
Solution¶
Inject deliberate heartbeat rows on the primary at a known
interval, and on the replica compute lag as
(now - heartbeat_timestamp):
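-- on the primary, run by a heartbeat injector process
-- every heartbeat_interval (e.g. 1 s):
REPLACE INTO heartbeat(id, ts) VALUES (1, NOW(6));
-- on the replica, run by the monitor / throttler;
-- TIMESTAMPDIFF is used because subtracting DATETIME values directly
-- in MySQL does not yield a duration in seconds:
SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1000000 AS lag_seconds
FROM heartbeat WHERE id = 1;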
The heartbeat row flows through the normal replication path:
primary INSERT
→ binlog event
→ ship across network to replica
→ relay log
→ SQL apply thread
→ replica storage
→ visible to monitor query
The gap between injection time and observation time is
replication lag, measured end-to-end across every stage above.
The measurement is not fully independent of clock sync, though:
ts is stamped by the primary's wall clock and reaches the replica
as a plain row, while NOW(6) in the monitor query is evaluated
against the replica's clock. Any skew between the two clocks is
therefore added to (or subtracted from) the reported lag; it
would cancel only if both timestamps came from the same clock.
In practice the hosts are kept synchronised (e.g. via NTP) to
well within the heartbeat interval, or the monitor applies a
separate correction. Tools that ship heartbeats (e.g.
pt-heartbeat) expect synchronised clocks for the same reason.
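The lag computation, including the clock-skew correction mentioned above, can be sketched in a few lines. This is a minimal illustration rather than any tool's actual implementation: the function name `heartbeat_lag_seconds` and the `clock_skew` parameter are hypothetical, and the skew estimate is assumed to come from elsewhere (for example, an NTP offset report).

```python
from datetime import datetime, timedelta, timezone


def heartbeat_lag_seconds(heartbeat_ts, replica_now, clock_skew=timedelta(0)):
    """Return replication lag in seconds as seen from the replica.

    heartbeat_ts: value of heartbeat.ts, stamped by the primary's clock.
    replica_now:  the replica's current time.
    clock_skew:   hypothetical estimate of (replica clock - primary clock).
    """
    lag = replica_now - heartbeat_ts - clock_skew
    # A negative result means the skew estimate exceeded the raw gap;
    # clamp to zero rather than report negative lag.
    return max(lag.total_seconds(), 0.0)


injected = datetime(2026, 4, 21, 12, 0, 0, tzinfo=timezone.utc)
observed = injected + timedelta(milliseconds=750)

print(heartbeat_lag_seconds(injected, observed))  # 0.75
print(heartbeat_lag_seconds(injected, observed,
                            clock_skew=timedelta(milliseconds=200)))  # 0.55
```

Without a skew estimate the function simply reports the raw gap, so a replica whose clock runs 200 ms ahead of the primary would overstate lag by 200 ms.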
Shlomi Noach's framing¶
"Replication lag can be measured in different methods, and the most common one is by deliberate injection of heartbeat events on the primary, and by capturing them on a replica."
— Shlomi Noach, Anatomy of a Throttler, part 1
Granularity = heartbeat interval¶
The effective granularity of the lag signal is the heartbeat-injection interval. A heartbeat fired once per second gives at best ~1-second measurement granularity, plus sample-phase jitter (see concepts/metric-sampling-interval):
- Heartbeat at 12:00:00.000, next at 12:00:01.000.
- Monitor sample at 12:00:00.995 reads heartbeat timestamp 12:00:00.000, so lag ≈ 0.995s.
- Monitor sample at 12:00:01.001 reads heartbeat timestamp 12:00:01.000, so lag ≈ 0.001s.
The apparent lag can swing by nearly the full heartbeat interval across consecutive samples even if actual lag is constant.
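The worked example can be reproduced with a small model. The sketch below assumes heartbeats are injected at exact multiples of the interval, that primary and replica clocks agree, and that every heartbeat becomes visible a fixed `actual_lag_ms` after injection; the function name is hypothetical and times are integer milliseconds to avoid floating-point noise.

```python
def apparent_lag_ms(sample_ms, heartbeat_interval_ms, actual_lag_ms=0):
    """Lag a monitor would report when sampling at sample_ms.

    Heartbeats are injected at 0, interval, 2*interval, ... and each one
    becomes visible on the replica actual_lag_ms after injection.
    """
    visible_until = sample_ms - actual_lag_ms
    if visible_until < 0:
        raise ValueError("no heartbeat has reached the replica yet")
    # Timestamp of the newest heartbeat already applied on the replica.
    latest_ts = (visible_until // heartbeat_interval_ms) * heartbeat_interval_ms
    return sample_ms - latest_ts


# The two samples from the example above (1 s heartbeat, zero actual lag).
for sample in (995, 1001):
    print(sample, apparent_lag_ms(sample, 1000))
# 995 995
# 1001 1
```

With a constant actual lag of 300 ms the same model reports 1299 ms at sample time 1299 and 300 ms at sample time 1300, so under these assumptions the reading always falls between the actual lag and the actual lag plus one heartbeat interval.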
Tuning: oversample the threshold range¶
For a throttler with threshold T, Noach recommends heartbeat and
sampling intervals of roughly T / 5 to T / 2.5:
"If the acceptable replication lag is at 5 seconds, then it's best to have a heartbeat/sampling interval of 1–2 seconds."
See concepts/oversampling-metric-interval for the rule of thumb.
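As a quick check of the arithmetic, the rule of thumb can be written as a helper that maps a threshold to an interval range. The function name is hypothetical, and the T / 5 and T / 2.5 bounds are derived from the 5-second example in the quote rather than stated by Noach as a formula.

```python
def heartbeat_interval_range(threshold_s):
    """Return (shortest, longest) suggested interval in seconds.

    Applies the T / 5 to T / 2.5 rule of thumb to a lag threshold T.
    """
    if threshold_s <= 0:
        raise ValueError("threshold must be positive")
    return threshold_s / 5, threshold_s / 2.5


print(heartbeat_interval_range(5))  # (1.0, 2.0)
```

A 5-second threshold therefore maps to the 1 to 2 second range given in the quote.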
Cost¶
- Primary-side write load. Every heartbeat is one extra row through the binlog. At 1 heartbeat/s this is negligible; at 10 heartbeats/s on a 100-shard fleet it adds up to 1,000 extra writes per second fleet-wide, assuming one primary per shard.
- Replica-side read load. Every monitor query is one extra read; can be amortised by caching the last lag reading.
- Operational overhead. Heartbeat injector must be HA (fails over with the primary), must start up cleanly, must not collide with application use of the heartbeat table.
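The read-side amortisation can be sketched as a small cache in front of the monitor query. This is an illustrative outline, not how any particular throttler is built: `CachedLagReader`, `read_lag`, and `ttl_s` are hypothetical names, and `read_lag` stands in for whatever function actually queries the heartbeat table on the replica.

```python
import time


class CachedLagReader:
    """Serve the last lag reading for up to ttl_s seconds before re-querying."""

    def __init__(self, read_lag, ttl_s=1.0, clock=time.monotonic):
        self._read_lag = read_lag  # hypothetical callable that queries the replica
        self._ttl_s = ttl_s
        self._clock = clock        # injectable so the cache can be tested
        self._value = None
        self._read_at = None

    def lag(self):
        now = self._clock()
        # Re-query only on first use or once the cached reading has expired.
        if self._read_at is None or now - self._read_at >= self._ttl_s:
            self._value = self._read_lag()
            self._read_at = now
        return self._value


reader = CachedLagReader(read_lag=lambda: 0.4, ttl_s=1.0)
print(reader.lag())  # first call queries the replica
print(reader.lag())  # second call within 1 s is served from the cache
```

The cached value ages while it is being served, so a TTL longer than the sampling interval would add its own staleness on top of the heartbeat granularity.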
Implementation references¶
- pt-heartbeat (part of Percona Toolkit): the canonical external tool; runs as a daemon, maintains the heartbeat table and injection loop.
- Vitess built-in heartbeat: VTTablet's --heartbeat_enable and --heartbeat_interval flags provide in-process heartbeat injection managed alongside the tablet lifecycle.
- Application-level heartbeat: some deployments inject heartbeats from application code (e.g. one per minute from a cron), trading latency resolution for simpler operational shape.
Seen in¶
- sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1 — canonical wiki framing. Noach introduces the pattern in the course of discussing sampling interval, uses the 1-s heartbeat worked example, and connects it to the oversampling recommendation.
Related¶
- concepts/replication-lag — the metric this pattern measures.
- concepts/metric-sampling-interval — the parent concept governing interval choice.
- concepts/oversampling-metric-interval — the rule-of-thumb for interval selection.
- concepts/database-throttler — the primary consumer of heartbeat-derived lag in the MySQL ecosystem.
- systems/vitess-throttler — canonical implementation that ships with its own heartbeat injector.
- concepts/binlog-replication / concepts/gtid-position — the substrate that heartbeats ride on.