CONCEPT Cited by 1 source
Replication lag in message count¶
Definition¶
Replication lag in message count is the measurement dimension where cross-cluster replication lag is reported as the number of source-cluster messages not yet replicated to the destination cluster — rather than (or alongside) wall-clock time lag. Converting to wall-clock RPO is a division:
Message-count lag is the native unit inside the broker: replication tasks see offsets and message counts, not wall-clocks. Wall-clock RPO is a derived quantity that requires the current production rate to compute.
Canonical wiki source¶
Canonicalised by the 2026-04-21 Redpanda Shadow Linking deep-dive:
"I recently scale-tested shadowing, driving the source cluster at 2.5 GiB/s. During that experiment, I was able to replicate with a total lag (across all topics) that was consistently lower than 10,000 messages — on a workload producing 2.5 million messages per second — giving us an effective RPO of around 4 milliseconds on average."
The post disclosure is the canonical wiki lag-as-messages → RPO-as-time computation: 10,000 msg / 2,500,000 msg/s = 4 ms.
Why the unit matters¶
Message-count is the native broker quantity¶
Inside the broker, every record has a monotonic offset. Lag is
the offset delta between source's latest and destination's
latest replicated offset — a count, not a duration. The broker
can report this without needing to reason about wall-clock time
at all. Kafka's consumer_lag metric family is historically in
message-count units for the same reason: it's what the broker
actually knows.
Wall-clock RPO is production-rate-dependent¶
A 10,000-message lag means:
- 4 ms at 2.5 M msg/s — the Shadow Linking scale-test result.
- 40 ms at 250 K msg/s.
- 10 seconds at 1000 msg/s.
The same underlying replication state produces very different wall-clock RPO depending on the workload. Steady-state RPO requires both the lag number and the current production rate.
Burst traffic breaks the naive time-lag metric¶
If the broker reports "replication lag: 2 seconds", this is the time between the source's latest record's timestamp and the destination's latest record's timestamp. During a burst, both timestamps are close but the message count between them can be huge — the time-lag metric stays low while the actual backlog grows. Message-count lag tracks this correctly: it grows with the backlog regardless of the timestamp gap.
Conversely, during a quiet period, a single old un-replicated message keeps the time-lag high while the actual backlog is one message.
Alert thresholds must be chosen carefully¶
For SLA reporting, wall-clock RPO is what business stakeholders care about. For operational alerting, message-count lag is what the broker natively exposes and what stays stable across throughput changes. Best practice is to alert on message-count lag ("more than N messages behind") and report SLA in wall-clock RPO computed from the message-count lag and the current throughput.
Observability surface (Redpanda Shadowing)¶
The 2026-04-21 post names three observability surfaces that expose replication lag:
- Prometheus-compatible metrics. "Prometheus-compatible metrics to see the link status, including replication lag, are published by the broker."
- Redpanda Console — interactive UI.
- rpk + REST — scripting.
The metric unit is not explicitly named in the post, but the 4 ms / 10,000 msg / 2.5 M msg/s worked example implies message- count is the primitive with RPO derived from it.
Relationship to RPO / RTO¶
- RPO is the wall-clock-time view: how much data (measured in time) could be lost at failover.
- Replication-lag-in-messages is the broker-native view of the same property: how many messages could be lost.
- The two are related through throughput.
A hot-standby DR shape (pattern) whose SLA is "RPO < 5 seconds" translates to different message-count lag targets at different throughputs:
- At 2.5 M msg/s: lag must stay under ~12.5 M messages.
- At 100 K msg/s: lag must stay under ~500 K messages.
- At 1 K msg/s: lag must stay under 5 K messages.
For capacity planning of cross-region bandwidth, the peak message-count lag during burst periods is a more stable planning quantity than the wall-clock RPO at any specific moment.
Contrast with time-based replication lag¶
The sister concept replication lag is typically measured in wall-clock time (seconds behind primary). Both measurements are useful:
- Time-lag is portable across throughputs — 5 seconds means 5 seconds regardless of rate.
- Message-count lag is stable within a workload and portable across clusters with different throughput characteristics.
For replication SLAs on streaming systems, reporting both — "lag: 8,000 messages (3.2 ms at current rate)" — is the best practice. The Shadow Linking deep-dive takes this shape implicitly: it reports the message count and the derived wall- clock RPO together.
Seen in¶
- sources/2026-04-21-redpanda-me-and-my-shadow-link-disaster-recovery-replication-made-easy — canonical wiki source. Canonicalises the lag-in-messages → RPO-in-time computation via the 2.5 GiB/s / 2.5 M msg/s / 10k msg / 4 ms scale-test. First wiki disclosure of per-feature RPO for Redpanda Shadow Linking.
Related¶
- systems/redpanda-shadowing — the canonical wiki instance publishing the measurement.
- systems/redpanda — the broker reporting the metric.
- systems/kafka — the upstream project whose
consumer_lag-family metrics natively use message-count units. - concepts/rpo-rto — the wall-clock-time view message-count lag converts into.
- concepts/replication-lag — the time-based sibling measurement.
- concepts/offset-preserving-replication — the Shadowing property that makes source and destination offset numbers directly comparable for lag computation.
- concepts/kafka-consumer-lag-metric — the Kafka-ecosystem precedent for message-count-lag measurement on the consume side.
- concepts/asynchronous-replication — the replication mode this lag measures.
- patterns/offset-preserving-async-cross-region-replication — the pattern the measurement describes.
- patterns/hot-standby-cluster-for-dr — the DR shape whose RPO SLA derives from this measurement.
- companies/redpanda — the company.