
Replication lag in message count

Definition

Replication lag in message count is the measurement dimension where cross-cluster replication lag is reported as the number of source-cluster messages not yet replicated to the destination cluster — rather than (or alongside) wall-clock time lag. Converting to wall-clock RPO is a division:

RPO (seconds) = lag (messages) / message-throughput (messages/second)

Message-count lag is the native unit inside the broker: replication tasks see offsets and message counts, not wall-clocks. Wall-clock RPO is a derived quantity that requires the current production rate to compute.
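The conversion above is a single division; a minimal sketch (the function name and the guard against zero throughput are illustrative, not from the post):

```python
def rpo_seconds(lag_messages: int, throughput_msgs_per_sec: float) -> float:
    """Derive wall-clock RPO from message-count lag at the current production rate."""
    if throughput_msgs_per_sec <= 0:
        # With no production, a message-count lag has no meaningful time equivalent.
        raise ValueError("throughput must be positive to derive a wall-clock RPO")
    return lag_messages / throughput_msgs_per_sec

# The scale-test numbers from the deep-dive: 10,000 messages behind at 2.5 M msg/s.
print(rpo_seconds(10_000, 2_500_000))  # 0.004 s, i.e. 4 ms
```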

Canonical wiki source

Canonicalised by the 2026-04-21 Redpanda Shadow Linking deep-dive:

"I recently scale-tested shadowing, driving the source cluster at 2.5 GiB/s. During that experiment, I was able to replicate with a total lag (across all topics) that was consistently lower than 10,000 messages — on a workload producing 2.5 million messages per second — giving us an effective RPO of around 4 milliseconds on average."

This passage is the canonical wiki example of the lag-as-messages → RPO-as-time conversion: 10,000 msg / 2,500,000 msg/s = 4 ms.

Why the unit matters

Message-count is the native broker quantity

Inside the broker, every record has a monotonic offset. Lag is the offset delta between source's latest and destination's latest replicated offset — a count, not a duration. The broker can report this without needing to reason about wall-clock time at all. Kafka's consumer_lag metric family is historically in message-count units for the same reason: it's what the broker actually knows.
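The offset-delta arithmetic can be sketched directly; the function and the per-partition dictionary shape below are illustrative assumptions, not a real broker API:

```python
def partition_lag(source_end_offset: int, dest_replicated_offset: int) -> int:
    """Lag as an offset delta: a pure count, no wall-clock time involved."""
    return max(0, source_end_offset - dest_replicated_offset)

def total_lag(source_end_offsets: dict, dest_offsets: dict) -> int:
    """Sum per-partition deltas into the total lag across all topics."""
    return sum(
        partition_lag(end, dest_offsets.get(tp, 0))
        for tp, end in source_end_offsets.items()
    )

print(partition_lag(1_500_000, 1_492_000))  # 8000
print(total_lag({"a/0": 100, "a/1": 250}, {"a/0": 90, "a/1": 250}))  # 10
```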

Wall-clock RPO is production-rate-dependent

A 10,000-message lag means:

  • 4 ms at 2.5 M msg/s — the Shadow Linking scale-test result.
  • 40 ms at 250 K msg/s.
  • 10 seconds at 1 K msg/s.

The same underlying replication state produces very different wall-clock RPO depending on the workload. Steady-state RPO requires both the lag number and the current production rate.

Burst traffic breaks the naive time-lag metric

If the broker reports "replication lag: 2 seconds", this is the gap between the timestamp of the source's latest record and the timestamp of the destination's latest replicated record. During a burst, both timestamps are close but the message count between them can be huge — the time-lag metric stays low while the actual backlog grows. Message-count lag tracks this correctly: it grows with the backlog regardless of the timestamp gap.

Conversely, during a quiet period, a single old un-replicated message keeps the time-lag high while the actual backlog is one message.
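A toy simulation makes the burst case concrete (the record shape and all numbers are invented for illustration): 100,000 records land within a 10 ms window, and the destination has replicated only the first 5,000. The timestamp gap looks healthy while the backlog is enormous.

```python
from dataclasses import dataclass

@dataclass
class Record:
    offset: int
    timestamp_ms: int

# Burst: 100,000 records produced within a 10 ms window (10,000 records per ms).
source = [Record(i, 1_000_000 + i // 10_000) for i in range(100_000)]
replicated_upto = 5_000  # destination has only the first 5,000 records

# Time-lag: source's latest timestamp minus the last replicated record's timestamp.
time_lag_ms = source[-1].timestamp_ms - source[replicated_upto - 1].timestamp_ms
# Message-count lag: the real backlog.
count_lag = len(source) - replicated_upto

print(time_lag_ms)  # 9 (ms) — looks healthy
print(count_lag)    # 95000 messages behind
```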

Alert thresholds must be chosen carefully

For SLA reporting, wall-clock RPO is what business stakeholders care about. For operational alerting, message-count lag is what the broker natively exposes and what stays stable across throughput changes. Best practice is to alert on message-count lag ("more than N messages behind") and report SLA in wall-clock RPO computed from the message-count lag and the current throughput.
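The split between alerting and SLA reporting can be sketched as follows; the threshold value and function names are illustrative assumptions, not taken from any broker's configuration:

```python
LAG_ALERT_THRESHOLD = 50_000  # alert unit: messages — stable across throughput changes

def check_and_report(lag_messages: int, throughput_msgs_per_sec: float):
    """Alert on message-count lag; report SLA as derived wall-clock RPO."""
    alert = lag_messages > LAG_ALERT_THRESHOLD
    rpo_seconds = (
        lag_messages / throughput_msgs_per_sec
        if throughput_msgs_per_sec > 0
        else float("inf")
    )
    return alert, rpo_seconds

print(check_and_report(8_000, 2_500_000))   # (False, 0.0032) — 3.2 ms RPO, no alert
print(check_and_report(80_000, 2_500_000))  # (True, 0.032)  — 32 ms RPO, alert fires
```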

Observability surface (Redpanda Shadowing)

The 2026-04-21 post names three observability surfaces that expose replication lag:

  1. Prometheus-compatible metrics. "Prometheus-compatible metrics to see the link status, including replication lag, are published by the broker."
  2. Redpanda Console — interactive UI.
  3. rpk + REST — scripting.

The metric unit is not explicitly named in the post, but the 4 ms / 10,000 msg / 2.5 M msg/s worked example implies message-count is the primitive with RPO derived from it.

Relationship to RPO / RTO

  • RPO is the wall-clock-time view: how much data (measured in time) could be lost at failover.
  • Replication-lag-in-messages is the broker-native view of the same property: how many messages could be lost.
  • The two are related through throughput.

A hot-standby DR pattern whose SLA is "RPO < 5 seconds" translates to different message-count lag targets at different throughputs:

  • At 2.5 M msg/s: lag must stay under ~12.5 M messages.
  • At 100 K msg/s: lag must stay under ~500 K messages.
  • At 1 K msg/s: lag must stay under 5 K messages.
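The reverse conversion — an RPO SLA into a message-count lag budget — is a multiplication; a minimal sketch (the function name is illustrative):

```python
def lag_budget_messages(rpo_sla_seconds: float, throughput_msgs_per_sec: float) -> int:
    """Translate a wall-clock RPO SLA into a message-count lag target at a given rate."""
    return int(rpo_sla_seconds * throughput_msgs_per_sec)

# The "RPO < 5 seconds" SLA at the three throughputs above:
for rate in (2_500_000, 100_000, 1_000):
    print(rate, lag_budget_messages(5.0, rate))
# 2500000 12500000
# 100000 500000
# 1000 5000
```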

For capacity planning of cross-region bandwidth, the peak message-count lag during burst periods is a more stable planning quantity than the wall-clock RPO at any specific moment.

Contrast with time-based replication lag

The sister concept replication lag is typically measured in wall-clock time (seconds behind primary). Both measurements are useful:

  • Time-lag is portable across throughputs — 5 seconds means 5 seconds regardless of rate.
  • Message-count lag is stable within a workload and portable across clusters with different throughput characteristics.

For replication SLAs on streaming systems, reporting both — "lag: 8,000 messages (3.2 ms at current rate)" — is the best practice. The Shadow Linking deep-dive takes this shape implicitly: it reports the message count and the derived wall-clock RPO together.
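A formatter for that combined report is a one-liner; the function name and format string are illustrative:

```python
def lag_report(lag_messages: int, throughput_msgs_per_sec: float) -> str:
    """Report both units together: broker-native count plus derived wall-clock RPO."""
    rpo_ms = 1000.0 * lag_messages / throughput_msgs_per_sec
    return f"lag: {lag_messages:,} messages ({rpo_ms:.1f} ms at current rate)"

print(lag_report(8_000, 2_500_000))  # lag: 8,000 messages (3.2 ms at current rate)
```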
