
Replication lag in message count

Definition

Replication lag in message count is the measurement dimension where cross-cluster replication lag is reported as the number of source-cluster messages not yet replicated to the destination cluster — rather than (or alongside) wall-clock time lag. Converting to wall-clock RPO is a division:

RPO (seconds) = lag (messages) / message-throughput (messages/second)

Message-count lag is the native unit inside the broker: replication tasks see offsets and message counts, not wall-clocks. Wall-clock RPO is a derived quantity that requires the current production rate to compute.
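The conversion above is a single division; a minimal sketch (the function name and the guard against zero throughput are illustrative, not from the post):

```python
def rpo_seconds(lag_messages: int, throughput_msgs_per_sec: float) -> float:
    """Derive wall-clock RPO from message-count lag at the current production rate."""
    if throughput_msgs_per_sec <= 0:
        # With no production, a message-count lag has no meaningful time equivalent.
        raise ValueError("throughput must be positive to derive a wall-clock RPO")
    return lag_messages / throughput_msgs_per_sec

# The scale-test numbers from the deep-dive: 10,000 messages behind at 2.5 M msg/s.
print(rpo_seconds(10_000, 2_500_000))  # 0.004 s, i.e. 4 ms
```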

Canonical wiki source

Canonicalised by the 2026-04-21 Redpanda Shadow Linking deep-dive:

"I recently scale-tested shadowing, driving the source cluster at 2.5 GiB/s. During that experiment, I was able to replicate with a total lag (across all topics) that was consistently lower than 10,000 messages — on a workload producing 2.5 million messages per second — giving us an effective RPO of around 4 milliseconds on average."

This passage is the canonical wiki example of the lag-as-messages → RPO-as-time conversion: 10,000 msg / 2,500,000 msg/s = 4 ms.

Why the unit matters

Message-count is the native broker quantity

Inside the broker, every record has a monotonic offset. Lag is the offset delta between source's latest and destination's latest replicated offset — a count, not a duration. The broker can report this without needing to reason about wall-clock time at all. Kafka's consumer_lag metric family is historically in message-count units for the same reason: it's what the broker actually knows.
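The offset-delta arithmetic can be sketched directly; the function and the per-partition dictionary shape below are illustrative assumptions, not a real broker API:

```python
def partition_lag(source_end_offset: int, dest_replicated_offset: int) -> int:
    """Lag as an offset delta: a pure count, no wall-clock time involved."""
    return max(0, source_end_offset - dest_replicated_offset)

def total_lag(source_end_offsets: dict, dest_offsets: dict) -> int:
    """Sum per-partition deltas into the total lag across all topics."""
    return sum(
        partition_lag(end, dest_offsets.get(tp, 0))
        for tp, end in source_end_offsets.items()
    )

print(partition_lag(1_500_000, 1_492_000))  # 8000
print(total_lag({"a/0": 100, "a/1": 250}, {"a/0": 90, "a/1": 250}))  # 10
```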

Wall-clock RPO is production-rate-dependent

A 10,000-message lag means:

  • 4 ms at 2.5 M msg/s — the Shadow Linking scale-test result.
  • 40 ms at 250 K msg/s.
  • 10 seconds at 1 K msg/s.

The same underlying replication state produces very different wall-clock RPO depending on the workload. Steady-state RPO requires both the lag number and the current production rate.

Burst traffic breaks the naive time-lag metric

If the broker reports "replication lag: 2 seconds", this is the gap between the timestamp of the source's latest record and the timestamp of the destination's latest replicated record. During a burst, both timestamps are close but the message count between them can be huge — the time-lag metric stays low while the actual backlog grows. Message-count lag tracks this correctly: it grows with the backlog regardless of the timestamp gap.

Conversely, during a quiet period, a single old un-replicated message keeps the time-lag high while the actual backlog is one message.
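A toy simulation makes the burst case concrete (the record shape and all numbers are invented for illustration): 100,000 records land within a 10 ms window, and the destination has replicated only the first 5,000. The timestamp gap looks healthy while the backlog is enormous.

```python
from dataclasses import dataclass

@dataclass
class Record:
    offset: int
    timestamp_ms: int

# Burst: 100,000 records produced within a 10 ms window (10,000 records per ms).
source = [Record(i, 1_000_000 + i // 10_000) for i in range(100_000)]
replicated_upto = 5_000  # destination has only the first 5,000 records

# Time-lag: source's latest timestamp minus the last replicated record's timestamp.
time_lag_ms = source[-1].timestamp_ms - source[replicated_upto - 1].timestamp_ms
# Message-count lag: the real backlog.
count_lag = len(source) - replicated_upto

print(time_lag_ms)  # 9 (ms) — looks healthy
print(count_lag)    # 95000 messages behind
```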

Alert thresholds must be chosen carefully

For SLA reporting, wall-clock RPO is what business stakeholders care about. For operational alerting, message-count lag is what the broker natively exposes and what stays stable across throughput changes. Best practice is to alert on message-count lag ("more than N messages behind") and report SLA in wall-clock RPO computed from the message-count lag and the current throughput.
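The split between alerting and SLA reporting can be sketched as follows; the threshold value and function names are illustrative assumptions, not taken from any broker's configuration:

```python
LAG_ALERT_THRESHOLD = 50_000  # alert unit: messages — stable across throughput changes

def check_and_report(lag_messages: int, throughput_msgs_per_sec: float):
    """Alert on message-count lag; report SLA as derived wall-clock RPO."""
    alert = lag_messages > LAG_ALERT_THRESHOLD
    rpo_seconds = (
        lag_messages / throughput_msgs_per_sec
        if throughput_msgs_per_sec > 0
        else float("inf")
    )
    return alert, rpo_seconds

print(check_and_report(8_000, 2_500_000))   # (False, 0.0032) — 3.2 ms RPO, no alert
print(check_and_report(80_000, 2_500_000))  # (True, 0.032)  — 32 ms RPO, alert fires
```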

Observability surface (Redpanda Shadowing)

The 2026-04-21 post names three observability surfaces that expose replication lag:

  1. Prometheus-compatible metrics. "Prometheus-compatible metrics to see the link status, including replication lag, are published by the broker."
  2. Redpanda Console — interactive UI.
  3. rpk + REST — scripting.

The metric unit is not explicitly named in the post, but the 4 ms / 10,000 msg / 2.5 M msg/s worked example implies message-count is the primitive with RPO derived from it.

Relationship to RPO / RTO

  • RPO is the wall-clock-time view: how much data (measured in time) could be lost at failover.
  • Replication-lag-in-messages is the broker-native view of the same property: how many messages could be lost.
  • The two are related through throughput.

A hot-standby DR pattern whose SLA is "RPO < 5 seconds" translates to different message-count lag targets at different throughputs:

  • At 2.5 M msg/s: lag must stay under ~12.5 M messages.
  • At 100 K msg/s: lag must stay under ~500 K messages.
  • At 1 K msg/s: lag must stay under 5 K messages.
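The reverse conversion — an RPO SLA into a message-count lag budget — is a multiplication; a minimal sketch (the function name is illustrative):

```python
def lag_budget_messages(rpo_sla_seconds: float, throughput_msgs_per_sec: float) -> int:
    """Translate a wall-clock RPO SLA into a message-count lag target at a given rate."""
    return int(rpo_sla_seconds * throughput_msgs_per_sec)

# The "RPO < 5 seconds" SLA at the three throughputs above:
for rate in (2_500_000, 100_000, 1_000):
    print(rate, lag_budget_messages(5.0, rate))
# 2500000 12500000
# 100000 500000
# 1000 5000
```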

For capacity planning of cross-region bandwidth, the peak message-count lag during burst periods is a more stable planning quantity than the wall-clock RPO at any specific moment.

Contrast with time-based replication lag

The sister concept replication lag is typically measured in wall-clock time (seconds behind primary). Both measurements are useful:

  • Time-lag is portable across throughputs — 5 seconds means 5 seconds regardless of rate.
  • Message-count lag is stable within a workload and portable across clusters with different throughput characteristics.

For replication SLAs on streaming systems, reporting both — "lag: 8,000 messages (3.2 ms at current rate)" — is the best practice. The Shadow Linking deep-dive takes this shape implicitly: it reports the message count and the derived wall-clock RPO together.
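A formatter for that combined report is a one-liner; the function name and format string are illustrative:

```python
def lag_report(lag_messages: int, throughput_msgs_per_sec: float) -> str:
    """Report both units together: broker-native count plus derived wall-clock RPO."""
    rpo_ms = 1000.0 * lag_messages / throughput_msgs_per_sec
    return f"lag: {lag_messages:,} messages ({rpo_ms:.1f} ms at current rate)"

print(lag_report(8_000, 2_500_000))  # lag: 8,000 messages (3.2 ms at current rate)
```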
