CONCEPT Cited by 1 source

Kafka consumer group lag metric¶

Consumer group lag is the count of unconsumed messages in a Kafka partition: the difference between the log's latest offset (last produced record) and the consumer group's committed offset (last record that a member of the group has processed and acknowledged). It is the foundational observability signal for any streaming pipeline, because it answers the question "is the consumer keeping up?".
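As a minimal sketch of the definition above, the snippet below computes per-partition lag and a group total from two offset maps. The function names, the `orders` topic, and the offset values are hypothetical, and treating a partition with no committed offset as unconsumed from offset 0 is an assumption made for illustration, not documented broker behaviour.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition index)


def partition_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return lag per partition: latest log offset minus committed offset."""
    lag = {}
    for tp, end_offset in end_offsets.items():
        # Assumption: no committed offset means nothing has been consumed,
        # so the partition is counted from offset 0.
        committed = committed_offsets.get(tp, 0)
        # Clamp at zero so a commit that is ahead of a stale end-offset
        # reading never reports negative lag.
        lag[tp] = max(end_offset - committed, 0)
    return lag


def group_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> int:
    """Return the group's total lag summed over all partitions."""
    return sum(partition_lag(end_offsets, committed_offsets).values())


# Hypothetical offsets for a three-partition topic.
end_offsets = {("orders", 0): 120, ("orders", 1): 80, ("orders", 2): 50}
committed_offsets = {("orders", 0): 100, ("orders", 1): 80}

per_partition = partition_lag(end_offsets, committed_offsets)
total = group_lag(end_offsets, committed_offsets)
```

Here partition 0 is 20 records behind, partition 1 is caught up, and partition 2 (no commit yet) counts its whole log as lag.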

Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available. The Redpanda 25.1 GA release introduces native consumer group lag as a first-class Prometheus metric, replacing a previously documented PromQL query that computed lag from primitives.

Why it matters¶

Consumer lag is the earliest signal of three structural problems:

Producer-outpacing-consumer imbalance: the consumer's per-partition throughput is below the producer's rate. Lag grows monotonically until the imbalance is corrected (more consumer CPU, more partitions, faster consumer processing).
Stuck consumers: a consumer-group member is wedged (GC pause, deadlock, permanent downstream block). Its assigned partitions stop making progress; their lag grows unbounded while other partitions in the group remain healthy.
Downstream system back-pressure: the consumer is processing records correctly but a downstream system (DB, cache, third party) is slow; the downstream delay surfaces as lag even though the consumer logic itself is healthy.

Without a lag metric, SREs observe the symptom (end-to-end latency balloon, freshness SLA miss, downstream data staleness) but cannot distinguish between the root causes — each of which has a different remediation.

Why this needed a GA feature on Redpanda¶

Before 25.1, Redpanda exposed only the primitives (the high-watermark offset and the committed offset), and operators computed consumer group lag manually via PromQL. The release announcement describes the change:

"With 25.1, Redpanda introduces native consumer group lag metrics, bringing observability in line with what modern Kafka users expect. This feature is a new native (pre-calculated) metric that replaces the previously documented query, rounding out Redpanda's observability story for enterprises that require transparent monitoring of consumer health and throughput." (Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available)

The PromQL-compute approach worked but had three costs:

Cost of query-time computation: every Grafana / Datadog dashboard querying lag paid the cost of subtracting the committed offset from the high watermark at query time; at scale this became a non-trivial PromQL workload.
Staleness windows during partition rebalance: committed-offset metrics and high-watermark metrics could be sampled at different instants, producing transient negative lag or inflated lag numbers during rebalance.
Consumer-group-membership loss: PromQL-computed lag could not distinguish "consumer group is making no progress because it is caught up" (healthy, lag=0) from "consumer group is making no progress because all members left the group" (broken, lag=∞ but not captured by the offset diff). Native lag metrics make group membership part of the signal.
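The sampling-skew cost can be illustrated with a toy timeline. This is a sketch under stated assumptions: the `Sample` type, the two instants, and the offset values are all hypothetical, and it models only the effect of reading the two primitives at different instants, not rebalance itself.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    high_watermark: int  # latest offset in the partition log
    committed: int       # consumer group's committed offset


# Hypothetical ground truth at two scrape instants.
timeline = {
    0: Sample(high_watermark=1000, committed=990),   # true lag 10
    1: Sample(high_watermark=1050, committed=1045),  # true lag 5
}


def lag_from_samples(hw_time: int, committed_time: int) -> int:
    """Subtract the two primitives, each read at its own instant."""
    return timeline[hw_time].high_watermark - timeline[committed_time].committed


# Both primitives read at the same instant: the true lag.
consistent = [lag_from_samples(t, t) for t in sorted(timeline)]

# High watermark is stale, committed offset is fresh: negative lag.
negative = lag_from_samples(hw_time=0, committed_time=1)

# High watermark is fresh, committed offset is stale: inflated lag.
inflated = lag_from_samples(hw_time=1, committed_time=0)
```

With these numbers the skewed reads give -45 and 60, while the true lag never exceeds 10. A metric calculated once from a single consistent view of both offsets avoids this kind of artefact.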

Three derived signals¶

Given a native lag metric, SRE / platform teams build:

Per-consumer-group lag alerts: alert when lag exceeds a threshold sustained for N minutes. Named use case from the source: "Monitor lag per consumer group."
Stuck-consumer alerts: alert when any partition's lag grows monotonically for M minutes with no corresponding consumer throughput. Named use case: "Alert on stuck or underperforming consumers."
Lag-spike correlation: overlay lag time series with downstream-processing and ingestion-burst time series to localise the cause. Named use case: "Correlate lag spikes with downstream processing delays or ingestion bursts."
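The first two alert rules above could be sketched as plain predicates over recent samples. This is an illustrative sketch, not Redpanda's or any alerting tool's implementation: the function names are hypothetical, one sample is assumed per evaluation interval (so `window` stands in for N or M minutes), and "stuck" is assumed to mean the committed offset did not advance over the window.

```python
from typing import Sequence


def sustained_lag_alert(
    lag_samples: Sequence[int], threshold: int, window: int
) -> bool:
    """Fire when each of the last `window` samples exceeds `threshold`."""
    if window <= 0 or len(lag_samples) < window:
        return False
    return all(lag > threshold for lag in lag_samples[-window:])


def stuck_consumer_alert(
    lag_samples: Sequence[int], committed_samples: Sequence[int], window: int
) -> bool:
    """Fire when lag rose at every step of the window and the committed
    offset did not advance (assumed definition of a stuck consumer)."""
    if window < 2 or len(lag_samples) < window or len(committed_samples) < window:
        return False
    lags = lag_samples[-window:]
    commits = committed_samples[-window:]
    lag_growing = all(later > earlier for earlier, later in zip(lags, lags[1:]))
    no_progress = commits[-1] == commits[0]
    return lag_growing and no_progress


# A brief dip below the threshold resets the sustained-lag condition.
sustained = sustained_lag_alert([10, 600, 700, 800], threshold=500, window=3)
interrupted = sustained_lag_alert([10, 600, 400, 800], threshold=500, window=3)

# Growing lag with a frozen commit is "stuck"; growing lag with an
# advancing commit is merely a slow consumer.
stuck = stuck_consumer_alert([10, 20, 30], [100, 100, 100], window=3)
slow = stuck_consumer_alert([10, 20, 30], [100, 110, 120], window=3)
```

Separating the two predicates keeps "behind but progressing" distinct from "not progressing at all", which matters because the remediations differ.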

Export surface¶

The 25.1 native metric is exported as:

Prometheus — canonical scrape endpoint.
Visible in Grafana, Datadog, and Redpanda Console (the operator UI).
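For readers who want to inspect a scrape directly, the sketch below extracts one metric from Prometheus text exposition output. The metric name `example_consumer_group_lag`, its labels, and the sample payload are hypothetical placeholders rather than Redpanda's actual metric names, and the parser handles only simple `name{labels} value` lines.

```python
import re
from typing import Dict, List, Tuple

# Matches "name{labels} value" or "name value" in the text exposition format.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches one key="value" pair inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric(text: str, metric_name: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for every sample of `metric_name`."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        results.append((labels, float(match.group("value"))))
    return results


# Hypothetical scrape payload; names and labels are placeholders.
SCRAPE = """
# HELP example_consumer_group_lag Hypothetical lag gauge.
# TYPE example_consumer_group_lag gauge
example_consumer_group_lag{group="billing",topic="orders"} 42
example_consumer_group_lag{group="search",topic="orders"} 0
example_other_metric 7
"""

lag_samples = parse_metric(SCRAPE, "example_consumer_group_lag")
```

In practice a Prometheus server or an agent does this parsing; the sketch only shows what a native, pre-calculated lag sample looks like on the wire compared with two separate offset series.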

Relationship to other lag-family metrics¶

Lag is part of a broader observability surface on streaming brokers:

broker effective batch size observability — a producer-side signal (are batches amortising the per-request cost).
Kafka consumer backlog — a related concept; generally a synonym for consumer group lag in upstream Kafka vocabulary.

Native lag sits alongside these as the consumer-side backbone signal.

Seen in¶

sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — canonical wiki source. Redpanda 25.1 GA release promotes consumer group lag from a PromQL-computed derived metric to a Prometheus-native first-class metric, naming three operational use cases (monitor, alert, correlate) and three export surfaces (Console, Grafana, Datadog).

systems/kafka — the protocol the metric derives from.
systems/redpanda — the 25.1 release that canonicalises the native metric.
concepts/observability — foundational framing.
concepts/kafka-consumer-backlog — sibling concept on the same axis.
concepts/broker-effective-batch-size-observability — the producer-side counterpart signal; together they provide two-sided streaming-pipeline visibility.
systems/prometheus · systems/grafana · systems/datadog — canonical export surfaces.