PATTERN

Dead-letter queue for invalid records

Dead-letter queue for invalid records is a validation pattern where a data pipeline's producer-side validator redirects records that fail validation to a separate queue — the dead-letter queue (DLQ) — instead of dropping them or failing the batch. The DLQ becomes an out-of-band stream of rejected records available for offline re-processing, schema-evolution-aware replay, or root-cause investigation.

Source: sources/2025-04-07-redpanda-251-iceberg-topics-now-generally-available canonicalises this as a broker-level feature of Redpanda Iceberg Topics at GA:

"Built-in dead-letter queues to redirect and re-process invalid records, improving data quality, reliability, and end-user trust in data." (Source)

Problem this solves

A streaming pipeline that projects records into a schema-enforced downstream format (typed Iceberg table, relational database, typed Parquet files) must decide what to do with records that violate the schema:

  • Record is malformed — wrong type for a column, unparseable bytes, truncated payload.
  • Schema mismatch — the producer is still writing v1 records but the downstream schema is v2 with incompatible changes.
  • Value constraint violation — record's values violate a check constraint (negative ID, out-of-range timestamp).

Three naive strategies all have structural problems:

  1. Drop the bad record silently. Loses data; no signal to operators; schema bug propagates unnoticed.
  2. Fail the entire batch. Blocks the good records behind the bad one; production incident for a transient data-quality issue.
  3. Auto-fix (coerce types, null out unknown fields). Corrupts the downstream data model in a way that's impossible to recover from (you can't distinguish "value was null in source" from "value was coerced from garbage").

The DLQ answer

Route the bad record to a separate queue. The good records continue flowing; the bad records accumulate in a dedicated place where operators can:

  • Inspect them to understand the data-quality issue (is it one buggy producer, a schema-version skew, or a semantic bug?).
  • Replay them after fixing the schema or the producer (many DLQ messages are recoverable once a downstream schema update lands).
  • Quantify the problem via DLQ-depth / DLQ-rate metrics — a feedback signal that the upstream schema contract is being violated.

Where the validator lives

There are three structural places to put the validator:

  1. Producer-side — the producing application validates before writing to the topic. Pros: catches bad data before it touches the broker. Cons: duplicated across every producer; no broker-level data-quality guarantee.
  2. Consumer-side — each consumer validates as it reads. Pros: no broker changes. Cons: every consumer reimplements validation; bad data lives in the topic forever.
  3. Broker-side / platform-managed — the broker validates on write using a schema registry or schema definition (e.g. the Iceberg table schema). Pros: single point of enforcement; DLQ is a platform primitive. Cons: broker does more work; schema must be known to the broker.

The Redpanda 25.1 Iceberg Topics case is the third shape — the broker validates against the Iceberg table schema during the row-to-Parquet projection, and the DLQ is a standard Kafka topic the operator can configure.

DLQ as a first-class operational surface

A well-designed DLQ pattern includes:

  • Replay tooling — one-click re-publish from DLQ back to primary topic after root cause is fixed.
  • DLQ-depth monitoring — alert when DLQ grows unexpectedly fast (signal of upstream breakage).
  • DLQ TTL / retention — bounded retention so DLQ doesn't grow unboundedly; retention must be longer than the schema-evolution / upstream-fix SLA.
  • Error metadata — each DLQ record carries the rejection reason (field name, schema version, specific constraint) so operators can triage without re-validating.
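A minimal sketch of the envelope and replay pieces from the list above. The field names and the schema-version replay filter are assumptions — the Redpanda post does not specify the metadata shape (see below):

```python
from dataclasses import dataclass

@dataclass
class DlqEnvelope:
    payload: bytes        # original record bytes, untouched
    reason: str           # rejection reason, e.g. "'id': expected int"
    source_topic: str
    source_offset: int
    schema_version: int   # schema version the record was validated against

def replay(dlq_records, produce, current_schema_version: int):
    """Re-publish records rejected under an older schema after a fix lands.

    `produce(topic, payload)` is an assumed callback; replayed records
    re-enter the source topic at a later offset than the originals.
    """
    replayed = []
    for env in dlq_records:
        if env.schema_version < current_schema_version:
            produce(env.source_topic, env.payload)
            replayed.append(env)
    return replayed
```

Carrying `reason` and `schema_version` in the envelope is what lets replay tooling be selective: only records rejected under the old schema are re-published, without re-validating each one.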

The Redpanda post asserts the DLQ primitive but does not specify the metadata shape, retention default, or replay tooling — operational specifics deferred to the product documentation.

Trade-offs

  • DLQ itself has a schema problem. DLQ records need some schema for tooling to read them — typically an envelope schema (original payload bytes + rejection reason + source topic + original offset). Envelope schemas evolve too, at which point the DLQ's DLQ becomes a real question.
  • DLQ depth is a lagging indicator. By the time operators notice a bad producer via DLQ depth alerts, many bad records may already be rejected. For contract-breaking schema changes upstream, stage-gated deploys + schema registry enforcement (prevention) is stronger than DLQ (detection).
  • Replay ordering hazard. Records replayed from DLQ after a schema fix re-enter the topic at a later offset than the original — downstream consumers that care about record ordering within a partition will see the replayed records as "arrived late", which may violate application-level invariants.
