CONCEPT Cited by 1 source
Compaction–replication race
Definition
The compaction–replication race is a correctness bug in Apache Kafka (confirmed on versions 3.9–4.2) in which independent per-broker log compaction removes metadata records (tombstones or transaction control batches) before an offline or lagging replica can replicate them, causing permanent replica divergence. Parameter tuning cannot fix the race: no finite value of delete.retention.ms prevents it if a broker can be offline longer than that value.
Root cause
"When a broker falls behind or goes offline, it drops out of the ISR. Meanwhile, the remaining brokers keep accepting writes and keep compacting as usual. If a critical record (tombstone, COMMIT marker, or ABORT marker) is written while one replica is unavailable — and compaction removes it before the replica catches up — the replica never learns about it." (Source: sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data)
The only safeguard is time-based: delete.retention.ms (default 24 h) for tombstones and producer.id.expiration.ms (default 24 h) for transaction markers. A hardware failure, maintenance window, or slow recovery that keeps a replica out longer than these timers lets compaction remove records the replica never saw, and the resulting divergence is unrecoverable.
Four manifestations
All share the same root cause — a replica missing a critical metadata record — but produce different failure modes:
| # | Variant | Lost record | Effect |
|---|---|---|---|
| 1 | Tombstone divergence | Tombstone for key K | Deleted data reappears on the lagging replica |
| 2 | Aborted-to-committed | ABORT marker | Aborted transaction data served as committed |
| 3 | Committed-to-aborted | COMMIT marker | Committed data reclassified as aborted and hidden |
| 4 | Stuck partition | COMMIT marker (+ empty-batch remnant) | read_committed consumers frozen at a stale Last Stable Offset |
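
Variant 1 can be illustrated with a toy model. The sketch below is not Kafka's cleaner or replication protocol: `Record`, `compact`, `fetch`, `view`, and `simulate` are hypothetical names, and the model ignores segments, leader epochs, and truncation. It only shows how a follower that fetches after the tombstone has been dropped ends up with a different view of key K than the leader.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    offset: int
    key: str
    value: Optional[str]  # None marks a tombstone


def compact(log: list[Record], drop_tombstones: bool) -> list[Record]:
    """Toy cleaner: keep the latest record per key, optionally drop tombstones."""
    latest = {rec.key: rec.offset for rec in log}  # later offsets overwrite earlier ones
    kept = [rec for rec in log if latest[rec.key] == rec.offset]
    if drop_tombstones:
        kept = [rec for rec in kept if rec.value is not None]
    return kept


def fetch(follower: list[Record], leader: list[Record]) -> list[Record]:
    """Toy replication: append leader records at or beyond the follower's log end."""
    log_end = follower[-1].offset + 1 if follower else 0
    return follower + [rec for rec in leader if rec.offset >= log_end]


def view(log: list[Record]) -> dict[str, str]:
    """Replay a log into the key/value state a reader would see."""
    state: dict[str, str] = {}
    for rec in log:
        if rec.value is None:
            state.pop(rec.key, None)
        else:
            state[rec.key] = rec.value
    return state


def simulate(rejoin_before_cleanup: bool) -> tuple[dict[str, str], dict[str, str]]:
    """Return (leader view, follower view) for one delete of key K."""
    leader = [Record(0, "K", "v1")]
    follower = list(leader)                  # in sync, then goes offline
    leader.append(Record(1, "K", None))      # tombstone written while follower is away
    if rejoin_before_cleanup:
        follower = fetch(follower, leader)   # follower sees the tombstone in time
    leader = compact(leader, drop_tombstones=True)  # retention window has passed
    follower = fetch(follower, leader)       # after cleanup, nothing is left to fetch
    return view(leader), view(follower)


if __name__ == "__main__":
    print(simulate(rejoin_before_cleanup=True))
    print(simulate(rejoin_before_cleanup=False))
```

When the follower rejoins before cleanup, both views agree that K is deleted. When it rejoins afterwards, the leader's view is empty while the follower still holds the old value for K.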
Why time-based retention fails
Time-based retention (delete.retention.ms) is a heuristic that assumes replicas will always catch up within the retention window. This assumption breaks under:
- Hardware failures requiring replacement (hours to days)
- Long maintenance windows (OS upgrades, disk replacements)
- Slow recovery after network partitions
- Cascading failures that delay rejoin
Once the retention window passes, the metadata record is eligible for deletion regardless of replica state. The race is then purely a matter of timing between the cleaner thread and the replica's rejoin.
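The timing argument can be written down as a small predicate. This is a simplified sketch, not Kafka's implementation: it assumes the retention window is measured from a single reference timestamp (`window_start_ms`, a hypothetical parameter) and models one cleaner pass at `cleaner_run_ms`. The function names are illustrative only. The point is that the eligibility rule takes no replica state as input.

```python
HOUR_MS = 60 * 60 * 1000


def removable(now_ms: int, window_start_ms: int, delete_retention_ms: int) -> bool:
    """Time-only rule: replica state is never consulted."""
    return now_ms - window_start_ms >= delete_retention_ms


def replica_misses_record(window_start_ms: int, cleaner_run_ms: int,
                          rejoin_ms: int, delete_retention_ms: int) -> bool:
    """True if the cleaner pass removes the record before the replica rejoins."""
    cleaned = removable(cleaner_run_ms, window_start_ms, delete_retention_ms)
    return cleaned and cleaner_run_ms < rejoin_ms


if __name__ == "__main__":
    retention = 24 * HOUR_MS  # default delete.retention.ms quoted above
    # Replica back after 20 h, cleaner pass at 25 h: the replica wins the race.
    print(replica_misses_record(0, 25 * HOUR_MS, 20 * HOUR_MS, retention))
    # Replica back after 30 h, cleaner pass at 25 h: the record is already gone.
    print(replica_misses_record(0, 25 * HOUR_MS, 30 * HOUR_MS, retention))
```

Raising `delete_retention_ms` only moves the threshold. For any finite value, an outage longer than that value still makes the second case possible.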
Fix
See concepts/coordinated-compaction and patterns/coordinated-compaction-protocol — Redpanda's solution replaces time-based heuristics with per-replica progress tracking, ensuring no metadata record is removed until every replica has compacted past the data it governs.
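The idea of gating removal on per-replica progress can be sketched as follows. This is an assumption-laden illustration, not Redpanda's protocol: `compacted_up_to`, `removal_frontier`, and `may_remove` are hypothetical names, and progress is modeled as one offset per replica (the first offset that replica has not yet compacted).

```python
def removal_frontier(compacted_up_to: dict[str, int]) -> int:
    """Lowest progress across replicas; an offline replica keeps its last reported value."""
    return min(compacted_up_to.values(), default=0)


def may_remove(governed_end_offset: int, compacted_up_to: dict[str, int]) -> bool:
    """Allow removal only once every replica has compacted past the governed data."""
    return governed_end_offset < removal_frontier(compacted_up_to)


if __name__ == "__main__":
    # Hypothetical progress report: broker-3 is offline and stuck at offset 120.
    progress = {"broker-1": 500, "broker-2": 500, "broker-3": 120}
    print(may_remove(300, progress))   # blocked by the slowest replica
    progress["broker-3"] = 500         # broker-3 rejoins and catches up
    print(may_remove(300, progress))   # now every replica is past offset 300
```

Unlike the time-based rule, the decision here depends on the slowest replica, so an arbitrarily long outage delays removal instead of racing it.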
Seen in
- sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data — canonical disclosure with public reproducer scripts for all four variants.
Related
- concepts/log-compaction — the mechanism in which the race occurs
- concepts/in-sync-replica-set — replica unavailability is the trigger
- concepts/tombstone — record type lost in variant 1
- concepts/transaction-control-batch — record type lost in variants 2–4
- concepts/replica-divergence — the resulting failure mode
- concepts/coordinated-compaction — Redpanda's fix
- concepts/delete-retention-ms — the time-based safeguard that proves insufficient
- systems/kafka