REDPANDA 2026-06-25

Kafka's log compaction corrupts data. Here's how we fixed it

Summary

Redpanda discloses a correctness bug in Apache Kafka's log compaction (reproducible on Kafka 3.9–4.2) in which independent per-broker compaction races with replication. If a tombstone or transaction control batch is written while a replica is offline and is then compacted away before that replica catches up, the replica permanently disagrees with the leader about what data exists. The post describes four manifestations of the bug (deleted data reappears, aborted data served as committed, committed data hidden, partition frozen) and presents Redpanda's fix: a coordinated compaction protocol in which each replica reports its compaction progress (MCCO, MXFO) and the leader aggregates those reports into replica-set watermarks (MTRO, MXRO), so that no metadata record is removed until every replica has compacted past the data it governs.

Key takeaways

  1. Kafka's log compaction is per-broker and uncoordinated — each broker compacts its own log independently. The only safeguard against premature deletion is delete.retention.ms (default 24 h), a pure time-based heuristic with no awareness of replica state (Source: raw article, "How Kafka log compaction works" section).

  2. Four distinct data-corruption manifestations from the same root cause (compaction–replication race): (a) tombstone divergence — deleted data reappears; (b) aborted-to-committed — aborted transaction data served as valid; (c) committed-to-aborted — committed data reclassified and hidden; (d) stuck partition — read_committed consumers frozen at a stale Last Stable Offset (Source: raw article, Issues 1–4).

  3. The bug reproduces reliably on Kafka 3.9 through 4.2 — Redpanda provides a public Docker Compose reproducer in a companion GitHub repo (Source: raw article, "Reproducing the bug step-by-step" section).

  4. Root cause: time-based retention cannot guard against unbounded replica unavailability — a broker offline longer than delete.retention.ms (hardware failure, long maintenance, slow recovery) will miss both the critical record and its empty-batch remnant, with no mechanism to recover (Source: raw article, "The root cause" section).

  5. Coordinated compaction introduces two watermarks for tombstone safety: MCCO (maximum cleanly compacted offset), tracked per replica, and MTRO (maximum tombstone removal offset), tracked per replica set and computed as min(all MCCOs). A tombstone is safe to remove only below MTRO (Source: raw article, "The protocol for tombstone removal" section).

  6. A parallel pair of watermarks covers transaction markers: MXFO (maximum transaction-free offset), tracked per replica, and MXRO (maximum transaction-marker removal offset), tracked per replica set and computed as min(all MXFOs). COMMIT/ABORT markers are safe to remove only below MXRO (Source: raw article, "The protocol for transaction marker removal" section).

  7. Protocol operates in two phases — collection (leader asks followers for MCCO/MXFO) and distribution (leader computes MTRO/MXRO and pushes back to all replicas) (Source: raw article, "The protocol for tombstone removal" section).

  8. Correctness is a guarantee, compaction is best-effort: if a replica stays offline, MTRO/MXRO do not advance, which pauses tombstone and marker removal for every replica in that replica set. Storage accumulates but data safety is never compromised. Once the replica returns and compacts, cleanup resumes (Source: raw article, "Data safety comes first" section).

  9. MTRO/MXRO never go backward — once a cleanup decision is made it's permanent; late RPCs from previous leaders are ignored (Source: raw article, "Handling edge cases" section).

  10. Edge cases handled: leadership changes (new leader re-broadcasts from existing MTRO); membership changes (new replica MCCO initialised to group MTRO; removed replica's MCCO dropped, potentially advancing MTRO) (Source: raw article, "Handling edge cases" section).
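
The watermark rules in takeaways 5 to 10 can be illustrated with a minimal sketch. This is not Redpanda's implementation: the class, method, and replica names are hypothetical, RPCs are replaced by direct method calls, and "below MTRO" is assumed to mean strictly less than. The same shape would apply to MXFO/MXRO.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WatermarkCoordinator:
    """Leader-side state for one replica set (all names here are hypothetical)."""

    mcco: Dict[str, int] = field(default_factory=dict)  # replica id -> reported MCCO
    mtro: int = -1  # replica-set watermark; only ever moves forward

    def add_replica(self, replica: str) -> None:
        # A new replica starts at the group MTRO, so joining cannot pull MTRO back.
        self.mcco[replica] = self.mtro

    def remove_replica(self, replica: str) -> None:
        # Dropping the slowest replica's MCCO may let MTRO advance on recompute.
        self.mcco.pop(replica, None)

    def report(self, replica: str, offset: int) -> None:
        # Collection phase: a follower answers with its current MCCO.
        if replica in self.mcco:
            self.mcco[replica] = offset

    def recompute(self) -> int:
        # Distribution phase: MTRO = min(all MCCOs), clamped so it never decreases.
        if self.mcco:
            self.mtro = max(self.mtro, min(self.mcco.values()))
        return self.mtro


def apply_mtro(current: int, received: int) -> int:
    """Replica side: a stale value from a previous leader is ignored."""
    return max(current, received)


def tombstone_removable(offset: int, mtro: int) -> bool:
    # Assumption: "below MTRO" is read as strictly less than.
    return offset < mtro


coord = WatermarkCoordinator()
for name in ("r1", "r2", "r3"):
    coord.add_replica(name)
coord.report("r1", 100)
coord.report("r2", 80)
coord.report("r3", 40)
first = coord.recompute()          # slowest replica (r3) bounds the watermark

coord.report("r1", 200)            # r3 is offline and reports nothing new
coord.report("r2", 150)
stalled = coord.recompute()        # watermark stays put while r3 lags

coord.remove_replica("r3")
advanced = coord.recompute()       # removing r3 lets the watermark move
```

In this sketch the watermark is bounded by the slowest replica, stalls while that replica is silent, and moves forward only after the replica reports progress or leaves the set, which mirrors the "correctness first, compaction best-effort" trade-off in takeaway 8.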

Operational numbers

  • delete.retention.ms default: 24 hours
  • producer.id.expiration.ms default: 24 hours
  • Bug reproduces on Kafka 3.9–4.2 (all tested versions)
  • A reproducer is available; with aggressive settings it completes in ~10 minutes

Architectural significance

This is the first public disclosure of a multi-variant data-corruption bug in Kafka's compacted-topic + transactional-write interaction that cannot be fixed by parameter tuning alone: whatever finite value delete.retention.ms is set to, a broker can still be offline for longer than that value. The fix requires a protocol-level change, namely coordination between replicas about compaction progress before deletion decisions are made. The coordinated compaction protocol is structurally analogous to distributed garbage collection with leader-driven watermark aggregation. It has the same architectural shape as Redpanda's L0 GC epoch-based protocol for Cloud Topics, applied at the level of log-segment compaction.
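
To make the race concrete, here is a toy model of the tombstone-divergence variant. It is a deliberate simplification and not Kafka's or Redpanda's code: logs are plain lists of (offset, key, value) tuples, a value of None stands for a tombstone, the drop_tombstones flag stands in for delete.retention.ms having elapsed, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, str, Optional[str]]  # (offset, key, value); None = tombstone


def compact(log: List[Record], drop_tombstones: bool) -> List[Record]:
    """Keep the latest record per key; optionally drop tombstones too."""
    latest = {key: offset for offset, key, _ in log}  # last write wins
    kept = [r for r in log if latest[r[1]] == r[0]]
    if drop_tombstones:
        kept = [r for r in kept if r[2] is not None]
    return kept


def catch_up(follower: List[Record], leader: List[Record]) -> List[Record]:
    """Follower fetches every leader record past its own log end offset."""
    end = follower[-1][0] if follower else -1
    return follower + [r for r in leader if r[0] > end]


def materialize(log: List[Record]) -> Dict[str, str]:
    """Replay a log into the key/value view a consumer would build."""
    view: Dict[str, str] = {}
    for _, key, value in log:
        if value is None:
            view.pop(key, None)
        else:
            view[key] = value
    return view


base: List[Record] = [(0, "k1", "v1"), (1, "k2", "v2")]
writes: List[Record] = [(2, "k1", None), (3, "k3", "v3")]  # delete k1, add k3

# Uncoordinated: the leader drops the tombstone while the follower is offline.
leader = compact(base + writes, drop_tombstones=True)
follower = catch_up(list(base), leader)
diverged = materialize(leader) != materialize(follower)

# Coordinated: the tombstone is retained until the follower has caught up.
safe_leader = compact(base + writes, drop_tombstones=False)
safe_follower = catch_up(list(base), safe_leader)
consistent = materialize(safe_leader) == materialize(safe_follower)
```

In the uncoordinated run the follower never sees the tombstone at offset 2, so k1 survives in its view even though the leader has deleted it. Holding the tombstone back until the follower has fetched it removes the divergence, which is the property the watermarks are meant to enforce.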

Caveats

  • Vendor-voice article from Redpanda — the fix is Redpanda's implementation, not an upstream Kafka patch. Whether Kafka will adopt a similar fix is not stated.
  • No performance impact numbers for coordinated compaction (additional RPCs per compaction cycle, storage overhead during slow-replica scenarios).
  • The four-variant bug description focuses on the 3-broker scenario; behavior at larger replication factors not explicitly discussed.
  • Reproducer requires Docker Compose; no unit-test-level reproducer for CI integration.
