

Replication heartbeat

Definition

A replication heartbeat is a timestamp row periodically written to a dedicated table on the replication primary; the same row arrives on replicas via normal replication, and the lag for each replica is computed as now() - heartbeat_ts on that replica.

"The most reliable way to evaluate replication lag is by injecting timestamps on a dedicated table on the Primary server, then reading the replicated value on a replica, comparing it with the system time on said replica."

— Shlomi Noach, Source: sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-2
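
The definition reduces to one subtraction on the replica. A minimal sketch (simulated timestamps, no database client; one caveat it inherits from the real technique is that clock skew between primary and replica skews the result):

```python
from datetime import datetime, timedelta, timezone

def replica_lag(heartbeat_ts: datetime, now: datetime) -> timedelta:
    """Lag on a replica: the replica's wall clock minus the last
    heartbeat timestamp that arrived via normal replication."""
    return now - heartbeat_ts

# Simulated: the primary wrote a heartbeat at 12:00:00 and that row has
# replicated; the replica's clock now reads 12:00:03.
heartbeat_ts = datetime(2026, 4, 21, 12, 0, 0, tzinfo=timezone.utc)
now = datetime(2026, 4, 21, 12, 0, 3, tzinfo=timezone.utc)
print(replica_lag(heartbeat_ts, now).total_seconds())  # → 3.0
```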

Why this technique dominates

Heartbeats measure lag correctly in every failure mode that breaks competing lag-measurement techniques:

  • Replica working well — heartbeat timestamp stays close to now.
  • Replica lagging — timestamp gap grows exactly to the lag magnitude.
  • Replication stopped — timestamp gap grows unbounded, directly surfacing the outage.
  • Replication broken / misconfigured — same as stopped from the lag-measurement angle.

Alternative techniques (e.g. SHOW REPLICA STATUS / Seconds_Behind_Source) fail or mislead in several of these cases: the counter reads NULL when the SQL thread is stopped, and can read 0 when the replica has applied everything it has received even though the I/O thread has silently fallen behind.
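
A toy contrast of the two approaches (hypothetical function names, not a MySQL client): heartbeat-derived lag grows without bound whenever heartbeats stop arriving, for any reason, while a Seconds_Behind_Source-style counter can go NULL or stick at zero:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def heartbeat_lag(last_replicated_ts: datetime, now: datetime) -> timedelta:
    # Grows unbounded whenever fresh heartbeats stop arriving,
    # regardless of *why* replication stalled.
    return now - last_replicated_ts

def seconds_behind_source(sql_thread_running: bool,
                          relay_log_caught_up: bool) -> Optional[int]:
    # Mimics the misleading cases: NULL when the SQL thread is down;
    # 0 when the replica applied everything it *received*, even if the
    # I/O thread silently stopped fetching new events.
    if not sql_thread_running:
        return None
    return 0 if relay_log_caught_up else 42  # 42: some positive lag

now = datetime(2026, 4, 21, 12, 5, 0, tzinfo=timezone.utc)
stalled_since = datetime(2026, 4, 21, 12, 0, 0, tzinfo=timezone.utc)

print(heartbeat_lag(stalled_since, now).total_seconds())  # → 300.0: outage surfaced
print(seconds_behind_source(sql_thread_running=False, relay_log_caught_up=True))  # → None
print(seconds_behind_source(sql_thread_running=True, relay_log_caught_up=True))   # → 0
```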

Canonical tool: pt-heartbeat

pt-heartbeat is the Percona Toolkit daemon that performs both heartbeat injection and measurement. Deployment requirements:

  • Writes happen on the primary only. Running pt-heartbeat in write mode on a replica corrupts the measurement for every downstream replica.
  • Failover must move the writer. On primary promotion, the heartbeat writer must follow — automated in the failover orchestration.
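
The two requirements can be sketched in one loop. Everything here is an assumption for illustration: `is_primary` stands in for topology discovery, `execute` for a DB connection, and the heartbeat schema is hypothetical (pt-heartbeat's actual table and SQL differ):

```python
import time
from datetime import datetime, timezone

# Hypothetical schema: a single-row table keyed by id.
HEARTBEAT_SQL = "REPLACE INTO heartbeat (id, ts) VALUES (1, %s)"

def heartbeat_writer(is_primary, execute,
                     interval_s: float = 1.0, ticks: int = 3) -> int:
    """Inject heartbeats only while this node is the primary.

    On failover, is_primary() flips to False and this loop stops
    writing; the failover orchestration starts a writer on the newly
    promoted primary. Returns the number of heartbeats written.
    """
    written = 0
    for _ in range(ticks):  # a real daemon would loop forever
        if not is_primary():
            break  # demoted: never write heartbeats from a replica
        execute(HEARTBEAT_SQL, datetime.now(timezone.utc))
        written += 1
        time.sleep(interval_s)
    return written

rows = []
n = heartbeat_writer(lambda: True, lambda sql, ts: rows.append(ts),
                     interval_s=0.0, ticks=3)
print(n)  # → 3
```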

The injection-interval trade-off

The interval between heartbeat writes sets the granularity of the lag metric. Write every 100 ms → you can measure sub-second lag. Write every 10 s → you cannot see lag below 10 s.

Finer granularity costs more:

  1. Write rate on the primary — linear in 1/interval.
  2. Binlog volume. This is the dominant cost axis at scale: "The heartbeat events are persisted in the binary logs, which are then re-written on the replicas. For some users, the introduction of heartbeats causes a significant increase in binlog generation."
  3. Storage. "With more binlog events having to be persisted, more binary log files are generated per given period of time. These consume more disk space. It is not uncommon to see MySQL deployments where the total size of binary logs is larger than the actual data set."
  4. Backup and retention cost. Binlogs are typically retained + backed up for recovery / audit — the heartbeat tax compounds across the retention window.
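
Back-of-envelope arithmetic for costs 1–3 (the per-event size is an illustrative assumption, not a measured figure; real binlog event sizes vary):

```python
def binlog_overhead_per_day(interval_s: float,
                            event_bytes: int = 200) -> tuple[int, float]:
    """Heartbeat events per day and their binlog volume in MiB.

    200 bytes/event is an assumed figure for illustration only.
    """
    events = round(86_400 / interval_s)  # write rate is linear in 1/interval
    return events, events * event_bytes / 2**20

print(binlog_overhead_per_day(0.1))   # 100 ms interval → 864,000 events/day, ≈165 MiB
print(binlog_overhead_per_day(10.0))  # 10 s interval → 8,640 events/day, ≈1.6 MiB
```

Whatever the volume is, it is then re-written into the binlogs of every replica and retained across the backup window, so the tax multiplies well beyond the primary's disk.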

Hibernation fits naturally

Because the cost is high, it makes sense to generate heartbeats only when lag measurement is needed — i.e. when a throttler is actively serving requests. Throttler hibernation extends to the heartbeat generator: during idle periods, stop or slow heartbeat injection; re-ignite on first client request.

The cost of re-ignition is a short window where heartbeats are stale and the throttler will conservatively reject — see patterns/idle-state-throttler-hibernation.
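
A sketch of that lifecycle, with hypothetical names and thresholds (timestamps are plain floats for brevity): the writer only ticks while awake, the first client request re-ignites it, and lag checks conservatively reject until a fresh heartbeat lands.

```python
class HibernatingHeartbeater:
    """Hibernation wrapped around a heartbeat generator (sketch)."""

    def __init__(self, stale_after_s: float = 5.0):
        self.stale_after_s = stale_after_s
        self.awake = False
        self.last_heartbeat: float | None = None

    def tick(self, now: float) -> None:
        # Called by the heartbeat loop; injects only while awake.
        if self.awake:
            self.last_heartbeat = now

    def hibernate(self) -> None:
        # Called after an idle period with no client requests.
        self.awake = False

    def check_lag(self, now: float) -> bool:
        """Throttler check: True = proceed, False = reject."""
        if not self.awake:
            self.awake = True  # re-ignite on first client request
        fresh = (self.last_heartbeat is not None
                 and now - self.last_heartbeat <= self.stale_after_s)
        return fresh  # stale heartbeat ⇒ conservatively reject

hb = HibernatingHeartbeater()
print(hb.check_lag(0.0))  # → False: stale, rejected — but now re-ignited
hb.tick(0.1)              # fresh heartbeat arrives
print(hb.check_lag(0.2))  # → True: fresh heartbeat, proceed
```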

Seen in

  • sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-2 — canonical wiki introduction. Shlomi Noach frames the technique as the de-facto lag-measurement primitive in the MySQL world, names pt-heartbeat as the canonical tool, and highlights binlog-size growth as the principal production cost.