CONCEPT Cited by 1 source
Log compaction¶
Definition¶
Log compaction is a retention strategy in append-only distributed logs (canonically Apache Kafka) where the broker retains only the latest value for each key in a topic, rather than retaining records by time or size. Compaction transforms an unbounded event stream into a bounded key-value changelog — the compacted log represents the latest state of every key ever written.
Mechanism¶
In Kafka, a background log cleaner thread periodically scans closed log segments and removes older records that have been superseded by a newer record with the same key. Two special record types interact with compaction:
-
Tombstones — records with key K and null value. They signal deletion of key K. After compaction removes all prior values for K, the tombstone itself becomes removable after
delete.retention.ms(default 24 h). -
Transaction control batches — COMMIT/ABORT markers appended by transactional producers. These are subject to the same expiration-based cleanup: once resolved data is compacted away, the markers become removable.
The compaction–replication race (Kafka 3.9–4.2)¶
In Kafka, each broker compacts its own replica independently — there is no coordination between replicas about compaction progress. This creates a race condition: if a critical metadata record (tombstone or control batch) is written while a replica is offline and compacted away before that replica catches up, the replica permanently diverges from the leader.
"Tombstones and COMMIT/ABORT control batches are the only signals that their associated records were deleted, committed, or aborted, respectively. Once a tombstone or a control batch is compacted away, this information is gone." (Source: sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data)
See concepts/compaction-replication-race for the full four-variant bug taxonomy and concepts/coordinated-compaction for Redpanda's fix.
Seen in¶
- sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data — canonical source disclosing Kafka's uncoordinated compaction as a data-corruption vector. Four failure modes demonstrated: deleted data reappears, aborted data served as committed, committed data hidden, partition frozen for
read_committedconsumers. Redpanda's coordinated compaction protocol introduced as the fix.
Related¶
- concepts/tombstone — the deletion marker whose premature removal causes data resurrection
- concepts/transaction-control-batch — COMMIT/ABORT markers subject to the same race
- concepts/compaction-replication-race — the specific race condition in Kafka
- concepts/coordinated-compaction — Redpanda's protocol-level fix
- concepts/in-sync-replica-set — replica unavailability triggers the race
- concepts/lsm-compaction — analogous compaction at the storage-engine level
- systems/kafka
- systems/redpanda