
CONCEPT Cited by 1 source

Kafka consumer group lag metric

Consumer group lag is the count of unconsumed messages in a Kafka partition: the difference between the log's latest offset (last produced record) and the consumer group's committed offset (last record that a member of the group has processed and acknowledged). It is the foundational observability signal for any streaming pipeline, because it answers the question "is the consumer keeping up?".
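The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a client implementation: the function names and the sample offsets are hypothetical, and treating a partition with no committed offset as starting from 0 is an assumption made here for simplicity.

```python
from typing import Dict


def partition_lag(latest_offset: int, committed_offset: int) -> int:
    """Lag for one partition, clamped at zero so skewed samples never go negative."""
    return max(0, latest_offset - committed_offset)


def group_lag(latest: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag for one consumer group.

    Assumption: a partition with no committed offset is counted from offset 0.
    """
    return {p: partition_lag(end, committed.get(p, 0)) for p, end in latest.items()}


# Hypothetical offsets for a three-partition topic.
latest = {0: 1500, 1: 980, 2: 2200}
committed = {0: 1500, 1: 900, 2: 1700}

lags = group_lag(latest, committed)
total = sum(lags.values())
print(lags, total)  # {0: 0, 1: 80, 2: 500} 580
```

Keeping the per-partition breakdown alongside the total matters in practice: a single wedged partition can hide inside a group-level sum that still looks acceptable.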

Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available. The Redpanda 25.1 GA release ships consumer group lag as a native, first-class Prometheus metric, replacing a previously documented PromQL query that computed it from primitives.

Why it matters

Consumer lag is the earliest signal of three structural problems:

  1. Producer-outpacing-consumer imbalance: the consumer's per-partition throughput is below the producer's rate. Lag grows monotonically until the imbalance is corrected (more consumer CPU, more partitions, faster consumer processing).
  2. Stuck consumers: a consumer-group member is wedged (GC pause, deadlock, permanent downstream block). Its assigned partitions stop making progress; their lag grows unbounded while other partitions in the group remain healthy.
  3. Downstream system back-pressure: the consumer is processing records correctly but a downstream system (DB, cache, third party) is slow; latency accumulates as lag even though throughput is unchanged.

Without a lag metric, SREs observe the symptom (end-to-end latency balloon, freshness SLA miss, downstream data staleness) but cannot distinguish between the root causes — each of which has a different remediation.

Why this needed a GA feature on Redpanda

Before 25.1, Redpanda exposed only the primitives (high-watermark offset and committed offset), and operators computed consumer group lag manually via PromQL. The release announcement describes the change:

"With 25.1, Redpanda introduces native consumer group lag metrics, bringing observability in line with what modern Kafka users expect. This feature is a new native (pre-calculated) metric that replaces the previously documented query, rounding out Redpanda's observability story for enterprises that require transparent monitoring of consumer health and throughput." (Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available)

The PromQL-compute approach worked but had three costs:

  • Cost of query-time computation: every Grafana / Datadog dashboard querying lag paid for a high-watermark-minus-committed-offset subtraction at query time; at scale this became a non-trivial PromQL workload.
  • Staleness windows during partition rebalance: committed-offset metrics and high-watermark metrics could be sampled at different instants, producing transient negative or inflated lag numbers during rebalance.
  • Consumer-group-membership loss: PromQL-computed lag could not distinguish "consumer group is making no progress because it is caught up" (healthy, lag=0) from "consumer group is making no progress because all members left the group" (broken, lag=∞ but not captured by the offset diff). Native lag metrics fold group membership into the signal.
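The sampling-skew problem in the second bullet is easy to reproduce on paper. The sketch below uses a hypothetical timeline for one partition and a hypothetical `naive_lag` helper; it only illustrates the arithmetic and does not model how Prometheus actually scrapes or aligns samples.

```python
def naive_lag(high_watermark_sample: int, committed_sample: int) -> int:
    """Query-time subtraction with no clamping, as a computed-from-primitives query would do."""
    return high_watermark_sample - committed_sample


# Hypothetical timeline for one partition: (time_s, high_watermark, committed_offset).
timeline = [
    (0, 1000, 1000),
    (15, 1200, 1190),
    (30, 1400, 1400),
]

# Both series sampled at t=15: the true lag at that instant.
aligned = naive_lag(timeline[1][1], timeline[1][2])

# Stale high watermark (t=0) paired with a fresh committed offset (t=15): negative lag.
negative = naive_lag(timeline[0][1], timeline[1][2])

# Fresh high watermark (t=15) paired with a stale committed offset (t=0): inflated lag.
inflated = naive_lag(timeline[1][1], timeline[0][2])

print(aligned, negative, inflated)  # 10 -190 200
```

A metric computed in one place from a consistent view of both offsets avoids this class of artifact, which is the argument for a pre-calculated value.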

Three derived signals

Given a native lag metric, SRE / platform teams build:

  1. Per-consumer-group lag alerts — alert when lag exceeds a threshold sustained for N minutes. Named use case from the source: "Monitor lag per consumer group."
  2. Stuck-consumer alerts — alert when any partition's lag grows monotonically for M minutes without bounded throughput. Named use case: "Alert on stuck or underperforming consumers."
  3. Lag-spike correlation — overlay lag time series with downstream-processing and ingestion-burst time series to localise the cause. Named use case: "Correlate lag spikes with downstream processing delays or ingestion bursts."
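The first two signals can be sketched as predicates over a window of lag samples. This is an illustrative sketch under stated assumptions: the function names are hypothetical, samples are assumed to arrive at a fixed interval (for example one per minute), and "stuck" is interpreted here as lag strictly increasing while the committed offset does not advance, which is one reasonable reading rather than the source's definition.

```python
from typing import Sequence


def sustained_above(lags: Sequence[int], threshold: int, n: int) -> bool:
    """True if the most recent n lag samples all exceed the threshold."""
    return n > 0 and len(lags) >= n and all(v > threshold for v in lags[-n:])


def looks_stuck(lags: Sequence[int], committed: Sequence[int], m: int) -> bool:
    """True if, over the last m samples, lag strictly grew and the committed offset did not move.

    Assumption: no commit progress plus growing lag is treated as a stuck consumer.
    """
    if m < 2 or len(lags) < m or len(committed) < m:
        return False
    recent_lag = lags[-m:]
    recent_commit = committed[-m:]
    growing = all(b > a for a, b in zip(recent_lag, recent_lag[1:]))
    no_progress = recent_commit[-1] == recent_commit[0]
    return growing and no_progress


# Hypothetical per-minute samples for two partitions.
wedged_lag, wedged_commit = [10, 40, 90, 150], [500, 500, 500, 500]
healthy_lag, healthy_commit = [10, 40, 20, 5], [500, 560, 640, 700]

print(sustained_above(wedged_lag, threshold=30, n=3))   # True
print(looks_stuck(wedged_lag, wedged_commit, m=3))      # True
print(looks_stuck(healthy_lag, healthy_commit, m=3))    # False
```

In a Prometheus-based setup the same logic would normally live in alerting rules rather than application code; the Python form is used here only to make the two conditions explicit.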

Export surface

The 25.1 native metric is exported as a Prometheus metric; the source names three surfaces for it: Console, Grafana, and Datadog.

Relationship to other lag-family metrics

Lag is part of a broader observability surface on streaming brokers:

Native lag sits alongside these as the consumer-side backbone signal.

Seen in

  • sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available — canonical wiki source. Redpanda 25.1 GA release promotes consumer group lag from a PromQL-computed derived metric to a Prometheus-native first-class metric, naming three operational use cases (monitor, alert, correlate) and three export surfaces (Console, Grafana, Datadog).