Kafka consumer group lag metric¶
Consumer group lag is the count of unconsumed messages in a Kafka partition: the difference between the log's latest offset (the last produced record) and the consumer group's committed offset (the last record a member of the group has processed and acknowledged). It is the foundational observability signal for any streaming pipeline, the answer to "is the consumer keeping up?".
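The definition above reduces to a per-partition subtraction summed across the group. A minimal sketch (plain Python, not Redpanda or Kafka client code; the function names are illustrative):

```python
# Lag per partition = latest produced offset - committed offset,
# summed across partitions to get the group's total lag.

def partition_lag(latest_offset: int, committed_offset: int) -> int:
    """Lag for one partition; clamp at 0 to absorb sampling skew."""
    return max(latest_offset - committed_offset, 0)

def group_lag(offsets: dict) -> int:
    """offsets maps partition -> (latest_offset, committed_offset)."""
    return sum(partition_lag(latest, committed)
               for latest, committed in offsets.values())

# A group caught up on partition 0 but 150 records behind on partition 1:
lag = group_lag({0: (1000, 1000), 1: (2400, 2250)})  # -> 150
```

The clamp at zero matters in practice: the two offsets are sampled at different instants, so a naive subtraction can transiently go negative.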
Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — the Redpanda 25.1 GA release canonicalises native consumer group lag as a first-class Prometheus metric, replacing a previously documented compute-from-primitives PromQL query.
Why it matters¶
Consumer lag is the earliest signal of three structural problems:
- Producer-outpacing-consumer imbalance — the consumer's per-partition throughput is below the producer's rate. Lag grows monotonically until the imbalance is corrected (more consumer CPU, more partitions, faster consumer processing).
- Stuck consumers — a consumer-group member is wedged (GC pause, deadlock, permanent downstream block). Its assigned partitions stop making progress; their lag grows unbounded while other partitions in the group remain healthy.
- Downstream system back-pressure — the consumer is processing records correctly but a downstream system (DB, cache, third party) is slow; latency accumulates as lag even though throughput is unchanged.
Without a lag metric, SREs observe the symptom (end-to-end latency balloon, freshness SLA miss, downstream data staleness) but cannot distinguish between the root causes — each of which has a different remediation.
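The triage logic the lag metric enables can be sketched as a simple decision function. Everything here is an illustrative assumption (the signal names and ordering are not from the source); it only shows how lag trend plus throughput separates the three causes:

```python
# Illustrative triage: lag_growing and the rate signals are assumed to come
# from whatever monitoring stack is in place; thresholds omitted for brevity.

def triage(lag_growing: bool, consume_rate: float, produce_rate: float,
           downstream_latency_rising: bool) -> str:
    if not lag_growing:
        return "healthy"                      # lag flat or shrinking
    if consume_rate == 0:
        return "stuck consumer"               # partition makes no progress
    if downstream_latency_rising:
        return "downstream back-pressure"     # consuming, but the sink is slow
    if consume_rate < produce_rate:
        return "producer outpacing consumer"  # structural throughput gap
    return "transient burst"                  # growing lag, no clear cause
```

Each branch maps to a different remediation: restart or replace the wedged member, relieve the downstream system, or add consumer capacity/partitions.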
Why this needed a GA feature on Redpanda¶
Before 25.1, Redpanda exposed the primitives — the high-watermark offset and the committed offset — and operators computed consumer group lag manually via PromQL:
"With 25.1, Redpanda introduces native consumer group lag metrics, bringing observability in line with what modern Kafka users expect. This feature is a new native (pre-calculated) metric that replaces the previously documented query, rounding out Redpanda's observability story for enterprises that require transparent monitoring of consumer health and throughput." (Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available)
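A compute-from-primitives query has roughly the following shape. The metric names below are placeholders, not the names from Redpanda's documentation (the source does not reproduce the original query); only the join-and-subtract structure is the point:

```promql
# Illustrative only: lag = high watermark - committed offset,
# joined on topic/partition, then summed per consumer group.
sum by (group) (
    max by (topic, partition) (broker_partition_max_offset)
  - on (topic, partition) group_right
    max by (group, topic, partition) (consumer_group_committed_offset)
)
```

Every dashboard panel evaluating this pays the join and subtraction at query time, which is exactly the cost the native pre-calculated metric removes.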
The PromQL-compute approach worked but had three costs:
- Cost of query-time computation — every Grafana / Datadog dashboard querying lag recomputed high-watermark minus committed-offset at query time; at scale this became a non-trivial PromQL workload.
- Staleness windows during partition rebalance — committed-offset metrics and high-watermark metrics could be sampled at different instants, producing transient negative lag or inflated lag numbers during rebalance.
- Consumer-group-membership loss — PromQL-computed lag could not distinguish "consumer group is making no progress because it is caught up" (healthy, lag=0) from "consumer group is making no progress because all members left the group" (broken, lag=∞ but not captured by the offset diff). Native lag metrics canonicalise group membership as part of the signal.
Three derived signals¶
Given a native lag metric, SRE / platform teams build:
- Per-consumer-group lag alerts — alert when lag exceeds a threshold sustained for N minutes. Named use case from the source: "Monitor lag per consumer group."
- Stuck-consumer alerts — alert when any partition's lag grows monotonically for M minutes without bounded throughput. Named use case: "Alert on stuck or underperforming consumers."
- Lag-spike correlation — overlay lag time series with downstream-processing and ingestion-burst time series to localise the cause. Named use case: "Correlate lag spikes with downstream processing delays or ingestion bursts."
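The first two signals are typically expressed as Prometheus alerting rules. A hedged sketch of the threshold alert (the metric name, threshold, and durations are assumptions, not from the source):

```yaml
# Illustrative alerting rule: fire when a group's lag stays above
# 10,000 records for 5 minutes. Metric name is a placeholder.
groups:
  - name: consumer-lag
    rules:
      - alert: ConsumerGroupLagHigh
        expr: consumer_group_lag > 10000
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Consumer group {{ $labels.group }} lag above 10k for 5m"
```

The `for: 5m` clause implements the "sustained for N minutes" requirement, suppressing alerts on transient ingestion bursts.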
Export surface¶
The 25.1 native metric is exported as:
- Prometheus — canonical scrape endpoint.
- Visible in Grafana, Datadog, and Redpanda Console (the operator UI).
Relationship to other lag-family metrics¶
Lag is part of a broader observability surface on streaming brokers:
- broker effective batch size observability — a producer-side signal (are batches amortising the per-request cost).
- Kafka consumer backlog — a related concept; generally a synonym for consumer group lag in upstream Kafka vocabulary.
Native lag sits alongside these as the consumer-side backbone signal.
Seen in¶
- sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — canonical wiki source. Redpanda 25.1 GA release promotes consumer group lag from a PromQL-computed derived metric to a Prometheus-native first-class metric, naming three operational use cases (monitor, alert, correlate) and three export surfaces (Console, Grafana, Datadog).
Related¶
- systems/kafka — the protocol the metric derives from.
- systems/redpanda — the 25.1 release that canonicalises the native metric.
- concepts/observability — foundational framing.
- concepts/kafka-consumer-backlog — sibling concept on the same axis.
- concepts/broker-effective-batch-size-observability — the producer-side counterpart signal; together they provide two-sided streaming-pipeline visibility.
- systems/prometheus · systems/grafana · systems/datadog — canonical export surfaces.