CONCEPT Cited by 1 source

Compaction–replication race¶

Definition¶

The compaction–replication race is a correctness bug in Apache Kafka (confirmed on versions 3.9–4.2) in which independent per-broker log compaction removes metadata records (tombstones or transaction control batches) before an offline or lagging replica has replicated them, causing permanent replica divergence. Parameter tuning cannot fix the race: no finite value of delete.retention.ms prevents it if a broker can stay offline longer than that value.

Root cause¶

"When a broker falls behind or goes offline, it drops out of the ISR. Meanwhile, the remaining brokers keep accepting writes and keep compacting as usual. If a critical record (tombstone, COMMIT marker, or ABORT marker) is written while one replica is unavailable — and compaction removes it before the replica catches up — the replica never learns about it." (Source: sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data)

The safeguard is purely time-based: delete.retention.ms (default 24 h) for tombstones and producer.id.expiration.ms (default 24 h) for transaction markers. A hardware failure, maintenance window, or slow recovery that outlasts these timers can leave the returning replica permanently diverged, because the records it missed no longer exist on any other broker.
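The tombstone case can be illustrated with a toy two-broker model. This is a minimal sketch, not Kafka code: `Record`, `compact`, `materialize`, and `simulate_tombstone_divergence` are hypothetical names, and the compaction rule is simplified to "keep the latest record per key, drop tombstones older than the retention window".

```python
from dataclasses import dataclass
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000  # the 24 h default quoted above


@dataclass(frozen=True)
class Record:
    offset: int
    key: str
    value: Optional[str]  # None marks a tombstone
    timestamp_ms: int


def compact(log, now_ms, delete_retention_ms):
    """Simplified compaction: latest record per key, minus expired tombstones."""
    latest = {}
    for rec in log:
        latest[rec.key] = rec  # later offsets overwrite earlier ones
    survivors = sorted(latest.values(), key=lambda r: r.offset)
    return [
        rec for rec in survivors
        if rec.value is not None or now_ms - rec.timestamp_ms <= delete_retention_ms
    ]


def materialize(log):
    """Replay a log into the key/value view a reader would build from it."""
    view = {}
    for rec in log:
        if rec.value is None:
            view.pop(rec.key, None)
        else:
            view[rec.key] = rec.value
    return view


def simulate_tombstone_divergence():
    # Offset 0 reaches both brokers, then the follower goes offline.
    leader = [Record(0, "K", "v1", 0)]
    follower = list(leader)

    # While the follower is away, the leader accepts a delete for K.
    leader.append(Record(1, "K", None, 1_000))

    # The outage outlasts the retention window, so compaction drops
    # both the superseded value and the expired tombstone.
    leader = compact(leader, now_ms=1_000 + DAY_MS + 1, delete_retention_ms=DAY_MS)

    # The follower rejoins and fetches from its own log end offset.
    next_offset = follower[-1].offset + 1
    follower.extend(rec for rec in leader if rec.offset >= next_offset)
    return materialize(leader), materialize(follower)
```

In this model the leader ends with an empty view while the follower still holds `K`, and nothing left in the leader's log can ever correct it.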

Four manifestations¶

All share the same root cause — a replica missing a critical metadata record — but produce different failure modes:

#	Variant	Lost record	Effect
1	Tombstone divergence	Tombstone for key K	Deleted data reappears on the lagging replica
2	Aborted-to-committed	ABORT marker	Aborted transaction data served as committed
3	Committed-to-aborted	COMMIT marker	Committed data reclassified as aborted and hidden
4	Stuck partition	COMMIT marker (+ empty-batch remnant)	`read_committed` consumers frozen at stale Last Stable Offset
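Variant 4 follows from how the Last Stable Offset is defined: it is the first offset of the earliest transaction that is still open, or the high watermark if none is open. The sketch below models only that arithmetic. The `(offset, producer_id, kind)` tuple layout and the function name are assumptions made for illustration, and the model does not reproduce the empty-batch remnant mentioned in the table.

```python
def last_stable_offset(log, high_watermark):
    """Return the LSO for a log of (offset, producer_id, kind) entries.

    kind is "data" for transactional records and "commit" or "abort"
    for the control marker that closes the producer's open transaction.
    """
    open_txn_start = {}  # producer_id -> first offset of its open transaction
    for offset, producer_id, kind in log:
        if kind == "data":
            open_txn_start.setdefault(producer_id, offset)
        else:
            open_txn_start.pop(producer_id, None)
    return min(open_txn_start.values(), default=high_watermark)


# Healthy replica: both transactions are closed by their markers.
healthy = [
    (0, 7, "data"), (1, 7, "data"), (2, 7, "commit"),
    (3, 8, "data"), (4, 8, "commit"),
]

# Diverged replica: the COMMIT marker at offset 2 was compacted away
# elsewhere before this replica could fetch it.
diverged = [entry for entry in healthy if entry[0] != 2]

healthy_lso = last_stable_offset(healthy, high_watermark=5)
diverged_lso = last_stable_offset(diverged, high_watermark=5)
```

On the healthy log the LSO reaches the high watermark. On the diverged log producer 7's transaction never closes, so the LSO stays at its first offset and a `read_committed` reader cannot advance past it, however much later data is committed.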

Why time-based retention fails¶

Time-based retention (delete.retention.ms) is a heuristic that assumes replicas will always catch up within the retention window. This assumption breaks under:

Hardware failures requiring replacement (hours to days)
Long maintenance windows (OS upgrades, disk replacements)
Slow recovery after network partitions
Cascading failures that delay rejoin

Once the retention window passes, the metadata record is eligible for deletion regardless of replica state. The race is then purely a matter of timing between the cleaner thread and the replica's rejoin.
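The timing argument can be written down as a small predicate. This is a simplified model under stated assumptions: a tombstone becomes eligible for removal exactly `delete_retention_ms` after it is written, and the cleaner runs at the listed times. The real cleaner's eligibility rule has more moving parts, and the function and parameter names here are hypothetical.

```python
HOUR_MS = 60 * 60 * 1000


def replica_misses_tombstone(written_ms, delete_retention_ms, cleaner_runs_ms,
                             offline_from_ms, caught_up_ms):
    """True if some cleaner run removes the tombstone before the replica catches up."""
    # Only tombstones written while the replica was unavailable are at risk.
    if not (offline_from_ms <= written_ms < caught_up_ms):
        return False
    eligible_at_ms = written_ms + delete_retention_ms
    return any(eligible_at_ms <= run < caught_up_ms for run in cleaner_runs_ms)


hourly_cleaner = range(0, 100 * HOUR_MS, HOUR_MS)

# 6 h outage, 24 h retention: the replica catches up in time.
short_outage = replica_misses_tombstone(
    HOUR_MS, 24 * HOUR_MS, hourly_cleaner, 0, 6 * HOUR_MS)

# 30 h outage, 24 h retention: the tombstone is gone before rejoin.
long_outage = replica_misses_tombstone(
    HOUR_MS, 24 * HOUR_MS, hourly_cleaner, 0, 30 * HOUR_MS)

# Doubling retention to 48 h only moves the threshold: a 72 h outage still loses it.
longer_retention = replica_misses_tombstone(
    HOUR_MS, 48 * HOUR_MS, hourly_cleaner, 0, 72 * HOUR_MS)
```

Whatever retention value is chosen, an outage that outlasts it makes the predicate true again, which is why raising the timer narrows the window without closing it.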

Fix¶

See concepts/coordinated-compaction and patterns/coordinated-compaction-protocol — Redpanda's solution replaces time-based heuristics with per-replica progress tracking, ensuring no metadata record is removed until every replica has compacted past the data it governs.
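The idea of gating removal on replica progress rather than on elapsed time can be sketched as follows. This is not Redpanda's actual protocol, only an illustration of the invariant stated above; `removable_tombstones` and the `compacted_up_to` mapping (replica name to the offset it has compacted past) are hypothetical.

```python
def removable_tombstones(tombstone_offsets, compacted_up_to):
    """Return the tombstones that every replica has already compacted past.

    compacted_up_to maps each replica to the offset below which it has
    finished compacting. An offline replica's entry simply stops advancing.
    """
    if not compacted_up_to:
        return []  # no progress reports, so nothing is provably safe to remove
    safe_point = min(compacted_up_to.values())
    return [offset for offset in tombstone_offsets if offset < safe_point]


tombstones = [5, 12, 40]

# Broker b3 is offline and stuck at offset 10: only the tombstone at 5 may go.
during_outage = removable_tombstones(tombstones, {"b1": 50, "b2": 50, "b3": 10})

# Once b3 has caught up, the remaining tombstones become removable.
after_rejoin = removable_tombstones(tombstones, {"b1": 50, "b2": 50, "b3": 50})
```

No clock appears in the decision: a replica that is offline for a week holds the safe point back for a week, at the cost of retaining those records for as long as the slowest replica needs.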

Seen in¶

sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data — canonical disclosure with public reproducer scripts for all four variants.

Related¶

concepts/log-compaction — the mechanism in which the race occurs
concepts/in-sync-replica-set — replica unavailability is the trigger
concepts/tombstone — record type lost in variant 1
concepts/transaction-control-batch — record type lost in variants 2–4
concepts/replica-divergence — the resulting failure mode
concepts/coordinated-compaction — Redpanda's fix
concepts/delete-retention-ms — the time-based safeguard that proves insufficient
systems/kafka