
PATTERN Cited by 1 source

Heartbeat-based replication lag measurement

Problem

Replication lag is the time between a write landing on the primary and that write becoming visible on a replica. It is a gap, not an observable on any single server. Naive measurements are fragile:

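  • Replica-reported status (e.g. SHOW SLAVE STATUS → Seconds_Behind_Master) is derived from binlog event timestamps rather than byte positions, and is not measured end to end; it is subject to clock skew between primary and replica and to reporting staleness.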
  • Last-replayed timestamp assumes the changelog carries wall-clock time; often the replica reapplies the primary's event timestamp, which may not reflect network delay.
  • Direct primary-replica comparison requires cross-machine time synchronisation to tighter tolerances than the lag threshold itself.

Solution

Inject deliberate heartbeat rows on the primary at a known interval, and on the replica compute lag as (now - heartbeat_timestamp):

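-- on the primary, run by a heartbeat injector process
-- every heartbeat_interval (e.g. 1 s):
REPLACE INTO heartbeat (id, ts) VALUES (1, NOW(6));

-- on the replica, run by the monitor / throttler.
-- Assumes ts is DATETIME(6) and both servers use the same time_zone.
-- Plain "NOW(6) - ts" subtracts the values as numbers, not as times,
-- so use TIMESTAMPDIFF to get real elapsed time:
SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1000000 AS lag_seconds
FROM heartbeat WHERE id = 1;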

The heartbeat row flows through the normal replication path:

primary write (REPLACE)
  → binlog event
  → ship across network to replica
  → relay log
  → SQL apply thread
  → replica storage
  → visible to monitor query

The gap between injection time and observation time is replication lag, measured end to end along the same path that application writes take. One caveat: ts is stamped by the primary's clock and reaches the replica as plain row data, while NOW() in the monitor query is evaluated on the replica's clock. Clock skew between the two hosts therefore does not cancel out; it is added to (or subtracted from) every reading. The hosts must be kept synchronised (e.g. via NTP) to well within the lag threshold, or the monitor must correct for a known offset. Tools that ship heartbeats (e.g. pt-heartbeat) document this requirement.
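The monitor-side arithmetic, including a correction for clock offset, can be sketched in Python. This is a minimal illustration and not the code of pt-heartbeat or Vitess: the function name observed_lag and the clock_offset parameter are hypothetical, and the offset is assumed to be known from an external source such as NTP statistics.

```python
from datetime import datetime, timedelta, timezone


def observed_lag(heartbeat_ts: datetime,
                 replica_now: datetime,
                 clock_offset: timedelta = timedelta(0)) -> float:
    """Return replication lag in seconds as seen from the replica.

    heartbeat_ts: value of heartbeat.ts, stamped by the primary's clock.
    replica_now:  the replica's clock at the moment of the monitor query.
    clock_offset: assumed (replica clock - primary clock); a hypothetical
                  input. Zero means the clocks are trusted as-is.
    """
    raw = replica_now - heartbeat_ts      # includes any clock skew
    corrected = raw - clock_offset        # remove the known skew
    # A negative result can only come from clock error, so clamp to zero.
    return max(corrected.total_seconds(), 0.0)


ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
now = ts + timedelta(milliseconds=750)
print(observed_lag(ts, now))                               # 0.75
print(observed_lag(ts, now, timedelta(milliseconds=250)))  # 0.5
```

With a replica clock running 250 ms ahead of the primary, the uncorrected reading overstates lag by exactly that amount, which is why the offset matters once thresholds get close to the skew.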

Shlomi Noach's framing

"Replication lag can be measured in different methods, and the most common one is by deliberate injection of heartbeat events on the primary, and by capturing them on a replica."

— Shlomi Noach, Anatomy of a Throttler, part 1

Granularity = heartbeat interval

The effective granularity of the lag signal is the heartbeat-injection interval. A heartbeat fired once per second gives at best ~1-second measurement granularity, plus sample-phase jitter (see concepts/metric-sampling-interval):

  • Heartbeat at 12:00:00.000, next at 12:00:01.000.
  • Monitor sample at 12:00:00.995 reads heartbeat timestamp 12:00:00.000, so lag ≈ 0.995s.
  • Monitor sample at 12:00:01.001 reads heartbeat timestamp 12:00:01.000, so lag ≈ 0.001s.

The apparent lag can swing by nearly the full heartbeat interval across consecutive samples even if actual lag is constant (in the example above, effectively zero).
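The sample-phase effect can be reproduced with a small simulation. The sketch below is illustrative only: apparent_lag_ms is a hypothetical helper, heartbeats are assumed to fire exactly on multiples of the interval, and true lag is held constant so that any variation in the readings comes purely from when the monitor samples.

```python
def apparent_lag_ms(sample_ms: int, interval_ms: int, true_lag_ms: int) -> int:
    """Lag a monitor would read at time sample_ms (all values in ms).

    Heartbeats are injected at 0, interval_ms, 2*interval_ms, ... and each
    one becomes visible on the replica true_lag_ms after injection.
    """
    visible_until = sample_ms - true_lag_ms
    if visible_until < 0:
        raise ValueError("no heartbeat has reached the replica yet")
    # Timestamp of the newest heartbeat already applied on the replica.
    newest_ts = (visible_until // interval_ms) * interval_ms
    return sample_ms - newest_ts


# True lag fixed at 0 ms, heartbeat every 1000 ms (the example above).
print(apparent_lag_ms(995, 1000, 0))    # 995
print(apparent_lag_ms(1001, 1000, 0))   # 1

# True lag fixed at 200 ms: every reading falls in [200, 1200) ms.
readings = [apparent_lag_ms(t, 1000, 200) for t in range(200, 5200, 7)]
print(min(readings), max(readings))
```

The reading never drops below the true lag, but it can exceed it by almost one full heartbeat interval, so a throttler comparing single samples against a threshold sees noise of that size.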

Tuning: oversample the threshold range

For a throttler with threshold T, Noach recommends heartbeat and sampling intervals of roughly T / 5 to T / 2.5:

"If the acceptable replication lag is at 5 seconds, then it's best to have a heartbeat/sampling interval of 1–2 seconds."

See concepts/oversampling-metric-interval for the rule of thumb.
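As a worked instance of the rule of thumb, the following sketch turns a lag threshold into a suggested interval range. The function name heartbeat_interval_range is hypothetical, and the divisors 5 and 2.5 are simply read off the quoted 5-second example, not taken from any tool's configuration.

```python
from typing import Tuple


def heartbeat_interval_range(threshold_s: float) -> Tuple[float, float]:
    """Return (shortest, longest) suggested interval for a lag threshold.

    Applies the T/5 .. T/2.5 rule of thumb; both bounds are in seconds.
    """
    if threshold_s <= 0:
        raise ValueError("threshold must be positive")
    return threshold_s / 5, threshold_s / 2.5


print(heartbeat_interval_range(5))   # (1.0, 2.0)
```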

Cost

  • Primary-side write load. Every heartbeat is one extra row through the binlog. At 1 heartbeat/s this is negligible; at 10 heartbeats/s on a 100-shard fleet it is 1,000 extra writes per second fleet-wide, each of which is also shipped to every replica.
  • Replica-side read load. Every monitor query is one extra read; can be amortised by caching the last lag reading.
  • Operational overhead. The heartbeat injector must be highly available (failing over with the primary), must start up cleanly, and must not collide with application use of the heartbeat table.
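The write-load figures can be checked with simple arithmetic. The sketch below assumes one heartbeat injector per shard primary and counts only the primary-side writes; heartbeat_writes_per_day is a hypothetical helper, not part of any tool.

```python
def heartbeat_writes_per_day(rate_per_s: float, shards: int) -> int:
    """Heartbeat writes per day across a fleet (one injector per shard)."""
    seconds_per_day = 24 * 60 * 60
    return round(rate_per_s * shards * seconds_per_day)


print(heartbeat_writes_per_day(1, 1))      # 86400
print(heartbeat_writes_per_day(10, 100))   # 86400000
```

Each of those writes is also a binlog event that every replica of the shard must fetch and apply, so replica count multiplies the replication traffic accordingly.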

Implementation references

  • pt-heartbeat (part of Percona Toolkit) — the canonical external tool; runs as a daemon, maintains the heartbeat table and injection loop.
  • Vitess built-in heartbeat — VTTablet's --heartbeat_enable and --heartbeat_interval flags provide in-process heartbeat injection managed alongside the tablet lifecycle.
  • Application-level heartbeat — some deployments inject heartbeats from application code (e.g. one per minute from a cron), trading latency resolution for simpler operational shape.

Seen in
