CONCEPT Cited by 1 source

Compaction–replication race

Definition

The compaction–replication race is a correctness bug in Apache Kafka (confirmed on versions 3.9–4.2) in which independent per-broker log compaction removes metadata records (tombstones or transaction control batches) before an offline or lagging replica can replicate them, causing permanent replica divergence. Parameter tuning cannot fix the race: no finite value of delete.retention.ms prevents it if a broker can be offline longer than that value.

Root cause

"When a broker falls behind or goes offline, it drops out of the ISR. Meanwhile, the remaining brokers keep accepting writes and keep compacting as usual. If a critical record (tombstone, COMMIT marker, or ABORT marker) is written while one replica is unavailable — and compaction removes it before the replica catches up — the replica never learns about it." (Source: sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data)

The safeguard is purely time-based: delete.retention.ms (default 24 h) for tombstones, plus producer.id.expiration.ms (default 24 h) for transaction markers. A hardware failure, maintenance window, or slow recovery exceeding these timers creates an unrecoverable divergence.
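The time-only nature of the safeguard can be made concrete with a small model. This is a minimal sketch, not Kafka's cleaner code: the Tombstone class and both function names are hypothetical, and it assumes the retention clock starts when the tombstone is written and that the cleaner runs as soon as the record becomes removable (the worst case).

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000
DELETE_RETENTION_MS = 24 * HOUR_MS  # delete.retention.ms default (24 h)


@dataclass
class Tombstone:
    key: str
    written_at_ms: int  # assumption: retention is measured from write time


def is_removable(t: Tombstone, now_ms: int,
                 retention_ms: int = DELETE_RETENTION_MS) -> bool:
    """Time-only check: note that no replica state is consulted."""
    return now_ms - t.written_at_ms >= retention_ms


def replica_can_miss_tombstone(t: Tombstone, offline_from_ms: int,
                               offline_until_ms: int,
                               retention_ms: int = DELETE_RETENTION_MS) -> bool:
    """True if the replica was offline for the tombstone's whole lifetime,
    assuming the cleaner removes it the moment it becomes removable."""
    removable_at_ms = t.written_at_ms + retention_ms
    return offline_from_ms <= t.written_at_ms and offline_until_ms >= removable_at_ms
```

For example, a tombstone written one hour into a 30-hour outage becomes removable at hour 25, before the replica returns; raising the retention value only moves that threshold, it does not remove it.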

Four manifestations

All share the same root cause — a replica missing a critical metadata record — but produce different failure modes:

| # | Variant | Lost record | Effect |
|---|---------|-------------|--------|
| 1 | Tombstone divergence | Tombstone for key K | Deleted data reappears on the lagging replica |
| 2 | Aborted-to-committed | ABORT marker | Aborted transaction data served as committed |
| 3 | Committed-to-aborted | COMMIT marker | Committed data reclassified as aborted and hidden |
| 4 | Stuck partition | COMMIT marker (+ empty-batch remnant) | read_committed consumers frozen at stale Last Stable Offset |
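Variant 1 can be reproduced in a toy model. The sketch below is an illustration only: logs are plain lists of (offset, key, value) tuples, a value of None stands for a tombstone, and compact, fetch_from, and materialize are hypothetical helpers rather than Kafka APIs.

```python
def compact(log, drop_tombstones):
    """Keep only the latest record per key; optionally drop tombstones."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)
    kept = sorted(latest.values())  # offsets are unique, so this sorts by offset
    if drop_tombstones:
        kept = [r for r in kept if r[2] is not None]
    return kept


def fetch_from(leader_log, next_offset):
    """Records a follower would fetch starting at its own log end offset."""
    return [r for r in leader_log if r[0] >= next_offset]


def materialize(log):
    """Key/value view a reader would build by replaying a log."""
    view = {}
    for _, key, value in log:
        if value is None:
            view.pop(key, None)
        else:
            view[key] = value
    return view


# Offset 0 is replicated to both brokers, then the follower goes offline.
leader = [(0, "K", "v1")]
follower = list(leader)

# While the follower is away, the leader accepts a tombstone for K and a write to J.
leader += [(1, "K", None), (2, "J", "v2")]

# One cleaner pass removes K=v1; a later pass, after the retention
# window has elapsed, drops the tombstone itself.
leader = compact(leader, drop_tombstones=False)
leader = compact(leader, drop_tombstones=True)

# The follower rejoins and resumes from its log end offset (1).
follower += fetch_from(leader, next_offset=1)

leader_view = materialize(leader)      # K is gone
follower_view = materialize(follower)  # K=v1 is still present
```

The follower never sees offset 1, so nothing tells it that K was deleted, and the two views stay different after it has fully caught up.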

Why time-based retention fails

Time-based retention (delete.retention.ms) is a heuristic that assumes replicas will always catch up within the retention window. This assumption breaks under:

  • Hardware failures requiring replacement (hours to days)
  • Long maintenance windows (OS upgrades, disk replacements)
  • Slow recovery after network partitions
  • Cascading failures that delay rejoin

Once the retention window passes, the metadata record is eligible for deletion regardless of replica state. The race is then purely a matter of timing between the cleaner thread and the replica's rejoin.

Fix

See concepts/coordinated-compaction and patterns/coordinated-compaction-protocol — Redpanda's solution replaces time-based heuristics with per-replica progress tracking, ensuring no metadata record is removed until every replica has compacted past the data it governs.
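The difference from a time-based check can be sketched as a removal gate keyed on replica progress. This is a simplified illustration of the idea described above, not Redpanda's actual protocol: safe_to_remove, the replica ids, and the per-replica "compacted through" offsets are all assumptions made for the example.

```python
def safe_to_remove(record_offset: int, compacted_through: dict[str, int]) -> bool:
    """Allow removal only when every replica reports progress at or past the record.

    compacted_through maps a (hypothetical) replica id to the highest offset
    that replica has compacted through. An offline replica keeps its last
    reported value, so it blocks removal no matter how much time passes.
    """
    if not compacted_through:
        return False  # no progress information: refuse rather than guess
    return min(compacted_through.values()) >= record_offset
```

With a tombstone at offset 100, progress of {"b1": 250, "b2": 250, "b3": 40} keeps the record in place because b3 has not reached it; once b3 reports 100 or more, removal is allowed. Elapsed time never enters the decision.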

Seen in

  • sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data
