CONCEPT Cited by 1 source

Coordinated compaction

Definition

Coordinated compaction is Redpanda's protocol-level fix for the compaction–replication race in log-compacted topics. Instead of each broker making independent, time-based decisions about when to delete tombstones and transaction control batches, replicas coordinate via a leader-driven watermark protocol to ensure no metadata record is removed until every replica has compacted past the data it governs.

"To behave correctly even during prolonged broker outages or slowness, Redpanda runs a small coordination protocol on top of compaction that keeps a tombstone or control marker in place until every replica has compacted the associated data records." (Source: sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data)

Design principle

"Correctness is a guarantee, compaction is best-effort." (Source: sources/2026-06-25-redpanda-kafkas-log-compaction-corrupts-data)

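If a replica stays offline indefinitely, the removal watermarks MTRO and MXRO (defined under Watermarks below) do not advance, so tombstone and marker cleanup pauses on every replica of the affected partitions. Storage accumulates, but data safety is never compromised. Once the replica rejoins and compacts, cleanup resumes automatically.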

Watermarks

For tombstone removal

Watermark | Scope | Definition
MCCO (Maximum Cleanly Compacted Offset) | Per-replica | Offset up to which this replica's log has been cleanly compacted, with no duplicate keys below this point
MTRO (Maximum Tombstone Removal Offset) | Per-replica-set | min(all MCCOs): the offset below which tombstones are safe to remove on any replica
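
As a rough illustration of how the two tombstone watermarks relate, the sketch below derives MTRO from a set of per-replica MCCOs and then picks the tombstones that fall below it. The broker names, the offsets, and both function names are hypothetical, and treating "below" as a strict `<` comparison is an assumption; the source does not specify how the boundary offset is handled.

```python
from typing import Iterable, List, Mapping


def max_tombstone_removal_offset(mccos: Mapping[str, int]) -> int:
    """MTRO for a replica set: the minimum MCCO across all replicas."""
    if not mccos:
        raise ValueError("a replica set needs at least one MCCO")
    return min(mccos.values())


def removable_tombstones(tombstone_offsets: Iterable[int], mtro: int) -> List[int]:
    """Tombstones strictly below MTRO (strictness is an assumption)."""
    return [off for off in tombstone_offsets if off < mtro]


# Hypothetical per-replica MCCOs for one partition.
mccos = {"broker-1": 1200, "broker-2": 950, "broker-3": 1100}
mtro = max_tombstone_removal_offset(mccos)

# Only tombstones that every replica has compacted past are eligible.
safe = removable_tombstones([100, 940, 950, 1150], mtro)
```

Here the slowest replica (`broker-2`) sets MTRO, so the tombstone at offset 1150 stays in place even though two of the three replicas have already compacted past it.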

For transaction marker removal

Watermark | Scope | Definition
MXFO (Maximum Transaction-Free Offset) | Per-replica | Offset up to which all transactions are fully resolved (committed or aborted)
MXRO (Maximum Transaction-Marker Removal Offset) | Per-replica-set | min(all MXFOs): the offset below which COMMIT/ABORT markers are safe to remove
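
One plausible reading of the MXFO definition, sketched below, is "the offset just before the earliest transaction that is still open". This is an assumption made for illustration, not a description of how Redpanda computes the value; the `Txn` type, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Txn:
    first_offset: int             # offset of the transaction's first data batch
    marker_offset: Optional[int]  # offset of its COMMIT/ABORT marker, None while open


def max_transaction_free_offset(txns: List[Txn], log_end_offset: int) -> int:
    """Highest offset below which every transaction is resolved (assumed semantics)."""
    open_starts = [t.first_offset for t in txns if t.marker_offset is None]
    if not open_starts:
        return log_end_offset      # nothing is open: the whole log is resolved
    return min(open_starts) - 1    # stop just before the earliest open transaction


# Hypothetical replica log: the transaction starting at offset 40 is still open.
txns = [Txn(10, 25), Txn(40, None), Txn(60, 70)]
mxfo = max_transaction_free_offset(txns, log_end_offset=100)

# MXRO is then the minimum MXFO over the replica set, as in the table above.
mxro = min({"replica-a": mxfo, "replica-b": 100}.values())
```

Under this reading, a single long-running transaction on one replica holds back marker removal for the whole replica set until it commits or aborts.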

Protocol phases

  1. Collection — the partition leader periodically asks each follower: "What's your MCCO/MXFO?"
  2. Distribution — the leader computes MTRO = min(all MCCOs) and MXRO = min(all MXFOs), then pushes these values back to every replica.
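
The two phases can be sketched as a single in-memory round. This is a simplified model under stated assumptions: the `Replica` class and `run_round` function are hypothetical names, replicas are plain objects rather than RPC endpoints, and timeouts, offline replicas, and the forward-only guard on the watermarks are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Replica:
    """In-memory stand-in for one replica of a partition (hypothetical)."""
    mcco: int
    mxfo: int
    mtro: int = 0
    mxro: int = 0

    def report(self) -> Tuple[int, int]:
        # Collection: answer the leader's "what's your MCCO/MXFO?"
        return self.mcco, self.mxfo

    def apply(self, mtro: int, mxro: int) -> None:
        # Distribution: adopt the group-wide removal watermarks.
        self.mtro, self.mxro = mtro, mxro


def run_round(replicas: Dict[str, Replica]) -> Tuple[int, int]:
    """One leader-driven round: collect, take the minimum, push back."""
    reports = [r.report() for r in replicas.values()]
    mtro = min(mcco for mcco, _ in reports)
    mxro = min(mxfo for _, mxfo in reports)
    for r in replicas.values():
        r.apply(mtro, mxro)
    return mtro, mxro


group = {
    "leader": Replica(mcco=500, mxfo=480),
    "follower-1": Replica(mcco=420, mxfo=450),
    "follower-2": Replica(mcco=470, mxfo=430),
}
mtro, mxro = run_round(group)
```

Note that the two minima can come from different replicas: in this example `follower-1` bounds MTRO while `follower-2` bounds MXRO.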

Invariants

  • MTRO/MXRO never go backward — once a cleanup decision is made, it's permanent. Late RPCs from previous leaders are ignored.
  • MCCO/MXFO only move forward — once data is cleanly compacted, it stays compacted.
  • Offline replicas freeze progress — their last-known MCCO/MXFO is used in the min computation, preventing MTRO/MXRO from advancing past them.
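
A minimal sketch of how these invariants might be enforced follows. It assumes each watermark update carries a Raft-style leader term that followers use to discard late RPCs; the source says only that such RPCs are ignored, not how. `FollowerWatermark` and `group_mtro` are hypothetical names, and only MTRO is modelled since MXRO would follow the same pattern.

```python
from typing import Dict


class FollowerWatermark:
    """Follower-side MTRO that only moves forward (illustrative)."""

    def __init__(self) -> None:
        self.term = 0  # highest leader term seen so far (assumed mechanism)
        self.mtro = 0

    def on_update(self, term: int, mtro: int) -> bool:
        """Apply a leader's update; return True only if MTRO advanced."""
        if term < self.term:
            return False          # late RPC from a previous leader: ignore
        self.term = term
        if mtro <= self.mtro:
            return False          # the watermark never goes backward
        self.mtro = mtro
        return True


def group_mtro(last_known_mcco: Dict[str, int]) -> int:
    """Leader-side minimum over last-known MCCOs, offline replicas included."""
    return min(last_known_mcco.values())


follower = FollowerWatermark()
accepted = follower.on_update(term=2, mtro=300)
stale = follower.on_update(term=1, mtro=900)    # older term, rejected
regress = follower.on_update(term=2, mtro=250)  # lower value, rejected

# Replica "b" went offline at MCCO 300; its last-known value stays in the map.
last_known = {"a": 800, "b": 300, "c": 900}
before = group_mtro(last_known)
last_known["a"] = 1000                          # a healthy replica keeps compacting
after = group_mtro(last_known)                  # still pinned by "b"
```

Because the minimum is taken over last-known values rather than live responses, progress on healthy replicas cannot move MTRO past an offline one.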

Edge cases

  • Leadership changes: the new leader uses the existing MTRO as its starting point, collects fresh MCCOs, and re-broadcasts the result even if the value is unchanged, because followers may have missed the last update.
  • Replica added: its MCCO is initialised to the group's current MTRO. This is correct because the new replica receives its log from a replica that has already compacted to that point.
  • Replica removed: its MCCO drops out of the min computation, which may allow MTRO to advance.
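
The membership cases above can be sketched with a small leader-side tracker. `WatermarkTracker` and its methods are hypothetical names, and only the MCCO/MTRO pair is modelled. Passing an existing `mtro` to the constructor stands in for a new leader starting from the previously agreed value.

```python
from typing import Dict


class WatermarkTracker:
    """Leader-side MCCO bookkeeping for one partition (hypothetical)."""

    def __init__(self, mccos: Dict[str, int], mtro: int = 0) -> None:
        self.mccos = dict(mccos)
        self.mtro = mtro           # a new leader would pass the existing MTRO
        self.recompute()

    def recompute(self) -> int:
        # max() keeps MTRO from ever moving backward.
        if self.mccos:
            self.mtro = max(self.mtro, min(self.mccos.values()))
        return self.mtro

    def report(self, name: str, mcco: int) -> None:
        # MCCO only moves forward, so keep the larger of old and new.
        self.mccos[name] = max(self.mccos.get(name, 0), mcco)
        self.recompute()

    def add_replica(self, name: str) -> None:
        # A new replica starts at the group's current MTRO.
        self.mccos[name] = self.mtro
        self.recompute()

    def remove_replica(self, name: str) -> None:
        # A removed replica no longer takes part in the minimum.
        self.mccos.pop(name, None)
        self.recompute()


tracker = WatermarkTracker({"a": 700, "b": 400, "c": 650})
initial = tracker.mtro

tracker.add_replica("d")           # joins at MCCO == current MTRO
after_add = tracker.mtro

tracker.report("d", 680)           # "d" catches up and compacts
tracker.remove_replica("b")        # the slowest replica leaves the group
after_remove = tracker.mtro
```

Adding `d` leaves MTRO unchanged because the newcomer starts exactly at the current value, while removing the slowest replica `b` lets MTRO jump to the next-lowest MCCO.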

Architectural analogy

Coordinated compaction is structurally similar to the epoch-based distributed GC used in Redpanda's Cloud Topics L0 garbage collection. Both use leader-driven aggregation of per-shard monotonic watermarks to determine safe-to-delete thresholds. In both, a stale watermark is always conservative: because the watermarks only move forward, an outdated value can delay deletion but can never permit an unsafe one.
