
REDPANDA 2026-06-30


How Redpanda Cloud Topics rethinks Kafka compaction

Summary

This post explains how Redpanda Cloud Topics fundamentally rethinks Kafka log compaction by decoupling it from broker-local replicas. In traditional disk-based Kafka, every broker independently compacts its own replica — burning redundant CPU, competing with produce/consume workloads, and introducing a coordination problem for tombstone removal. Cloud Topics eliminates all three issues: compaction operates once against the canonical copy in object storage, any broker (or any shard on any node) can run it, and a pull-based scheduling model prevents compaction starvation on overloaded shards.

Key takeaways

  1. Kafka compaction is a two-pass algorithm — scan the log to build a key→latest-offset map, then rewrite the log keeping only the latest record per key. The real challenge is making this scale: handling partitions with unbounded key counts and not starving the broker of CPU.

  2. SHA-256 hashing bounds the memory cost per key: Redpanda's key-offset map stores a SHA-256 hash (32 bytes) plus an 8-byte offset, so every entry takes a fixed 40 bytes regardless of key length. With the default 128 MiB allocation, the map holds ~3.3 million unique keys per pass (128 MiB / 40 bytes ≈ 3.36 million).

  3. Backward scanning of dirty ranges guarantees that the first offset indexed for any key is its latest one. This allows early termination when the map is full, and the log still converges to full deduplication over a finite number of compaction passes.

  4. Redundant compaction is the fundamental problem with disk-based Kafka — with replication.factor=3, the same logical data is compacted 3 times independently on 3 brokers, wasting CPU and creating a tombstone coordination race.

  5. Cloud Topics decouples compaction from the broker — data lives once in object storage; compaction runs once against the canonical copy. This eliminates redundant work, frees broker CPU for produce/consume, and removes the tombstone coordination problem entirely.

  6. Any shard on any node can be a compactor — because compaction operates on object-store data + metastore metadata rather than local replica state, work can be distributed freely across the cluster, independent of partition leadership.

  7. Pull-based scheduling prevents starvation — a scheduler (on shard 0 of each compaction node) maintains a priority queue based on dirty ratio (min.cleanable.dirty.ratio) and time since eligibility (max.compaction.lag.ms). Workers on every shard poll for work, decoupling compaction scheduling from partition placement.

  8. Multi-part uploads cap memory usage — compacted L1 objects are uploaded via multi-part upload, scaling memory with part size rather than total object size. No spill to disk required.

  9. Optimistic concurrency via compaction_epoch — the metastore tracks a per-partition integer compaction_epoch incremented on each successful update. If two nodes compact the same log concurrently, the stale writer's update is rejected — preventing reintroduction of stale data.
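
Takeaways 1 to 3 can be illustrated with a minimal sketch of the two passes. This is not Redpanda's implementation: the function names `build_key_offset_map` and `rewrite_log`, the in-memory list standing in for the log, and the `max_entries` cap are assumptions made for illustration, and tombstone handling is left out.

```python
import hashlib


def build_key_offset_map(log, max_entries):
    """Pass 1: scan newest-to-oldest so the first offset indexed per key is its latest.

    `log` is a list of (offset, key_bytes, value) tuples in ascending offset order.
    Returns (key_map, indexed_from); every record at or above `indexed_from`
    has its key in the map.
    """
    key_map = {}  # 32-byte SHA-256 digest -> latest offset for that key
    indexed_from = log[-1][0] + 1 if log else 0
    for offset, key, _value in reversed(log):
        digest = hashlib.sha256(key).digest()
        if digest not in key_map:
            if len(key_map) >= max_entries:
                break  # map is full: leave older records for a later pass
            key_map[digest] = offset
        indexed_from = offset
    return key_map, indexed_from


def rewrite_log(log, key_map):
    """Pass 2: drop every record superseded by a newer offset for the same key."""
    kept = []
    for offset, key, value in log:
        latest = key_map.get(hashlib.sha256(key).digest())
        if latest is not None and offset < latest:
            continue  # a newer record for this key exists
        kept.append((offset, key, value))
    return kept


# Hypothetical five-record log; the cap of 2 forces early termination.
log = [(0, b"a", "1"), (1, b"b", "1"), (2, b"a", "2"), (3, b"c", "1"), (4, b"b", "2")]
key_map, indexed_from = build_key_offset_map(log, max_entries=2)
print(indexed_from)  # 3
print(rewrite_log(log, key_map))
```

With the cap at 2, the scan indexes keys `b` and `c` and stops at offset 2. The stale `b` at offset 1 is still removed, while both `a` records survive until a later pass covers the range below `indexed_from`.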

Operational numbers

| Parameter | Value |
| --- | --- |
| Key-offset map entry size | 40 bytes (32-byte SHA-256 + 8-byte offset) |
| Default key-offset map allocation | 128 MiB |
| Keys per compaction pass | ~3.3 million |
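
The keys-per-pass figure follows directly from the other two numbers. The short calculation below reproduces it. The helper name `keys_per_pass` is hypothetical, and the calculation assumes the whole allocation goes to entries with no hash-table overhead, which the post does not specify.

```python
import hashlib

DIGEST_BYTES = hashlib.sha256().digest_size  # 32
OFFSET_BYTES = 8
ENTRY_BYTES = DIGEST_BYTES + OFFSET_BYTES    # 40 bytes per key-offset entry


def keys_per_pass(map_bytes):
    """Upper bound on unique keys indexed in one pass (assumes zero map overhead)."""
    return map_bytes // ENTRY_BYTES


print(keys_per_pass(128 * 1024 * 1024))  # 3355443
```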

Architecture: Cloud Topics compaction vs traditional Kafka

| Dimension | Traditional Kafka | Cloud Topics |
| --- | --- | --- |
| Compaction copies | RF independent compactions | 1 (canonical object store copy) |
| Where it runs | Each broker, competing with produce/consume | Any shard on any node |
| Scheduling | Coupled to partition leader placement | Pull-based priority queue; decoupled from placement |
| Tombstone coordination | Requires cross-broker protocol | Eliminated: single copy, no divergence possible |
| Memory scaling | Per-object size | Per-part size (multi-part upload) |
| Concurrency control | N/A (each broker owns its replica) | Optimistic: compaction_epoch in metastore |
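
The optimistic concurrency row can be sketched as a compare-and-increment on the epoch. The `Metastore` class, its `read_epoch` and `commit` methods, and `StaleEpochError` are hypothetical in-memory stand-ins. The post describes only the per-partition `compaction_epoch` and the rejection of stale writers, not an API.

```python
from dataclasses import dataclass, field


class StaleEpochError(Exception):
    """Raised when a compactor commits against an epoch that has since moved on."""


@dataclass
class Metastore:
    # Hypothetical in-memory stand-in for the real metastore.
    epochs: dict = field(default_factory=dict)   # partition -> compaction_epoch
    objects: dict = field(default_factory=dict)  # partition -> compacted object list

    def read_epoch(self, partition):
        return self.epochs.get(partition, 0)

    def commit(self, partition, expected_epoch, new_objects):
        """Apply a compaction result only if no other compactor committed first."""
        current = self.epochs.get(partition, 0)
        if current != expected_epoch:
            raise StaleEpochError(f"expected epoch {expected_epoch}, found {current}")
        self.objects[partition] = new_objects
        self.epochs[partition] = current + 1  # incremented on each successful update
        return current + 1


# Two compactors start from the same epoch; only the first commit wins.
store = Metastore()
epoch_a = store.read_epoch("orders/0")
epoch_b = store.read_epoch("orders/0")
store.commit("orders/0", epoch_a, ["compacted-by-a"])
try:
    store.commit("orders/0", epoch_b, ["compacted-by-b"])
except StaleEpochError as err:
    print("rejected:", err)
```

The losing compactor's work is discarded, which is the wasted-work cost of the optimistic approach, but the metastore never regresses to the stale result.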

Caveats

  • The post does not provide throughput benchmarks for Cloud Topics compaction (bytes/s, latency to convergence).
  • compaction_epoch conflict resolution is optimistic — under heavy concurrent compaction, wasted work is possible (though not a correctness issue).
  • The scheduling heuristic (dirty ratio + lag) may still allow starvation in extreme edge cases (many large partitions, limited cluster CPU).
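
To make the scheduling heuristic concrete, here is a minimal sketch of a pull-based queue. The post says only that priority is based on dirty ratio and time since eligibility. The specific ordering used here (partitions past the lag bound first, then dirtiest first), along with the `CompactionScheduler` and `PartitionStats` names and the `ms_since_eligible` field, are assumptions for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class PartitionStats:
    name: str
    dirty_ratio: float      # fraction of the log not yet compacted
    ms_since_eligible: int  # hypothetical: how long dirty data has been waiting


class CompactionScheduler:
    """Priority queue that workers poll, instead of work being pushed to them."""

    def __init__(self, min_dirty_ratio, max_lag_ms):
        self.min_dirty_ratio = min_dirty_ratio  # cf. min.cleanable.dirty.ratio
        self.max_lag_ms = max_lag_ms            # cf. max.compaction.lag.ms
        self._heap = []
        self._seq = 0  # tie-breaker so equal priorities stay first-in, first-out

    def offer(self, stats):
        """Enqueue a partition if it qualifies; return whether it was queued."""
        over_lag = stats.ms_since_eligible >= self.max_lag_ms
        if stats.dirty_ratio < self.min_dirty_ratio and not over_lag:
            return False
        # heapq is a min-heap: lag-bound violators sort first, then dirtiest.
        priority = (0 if over_lag else 1, -stats.dirty_ratio)
        heapq.heappush(self._heap, (priority, self._seq, stats.name))
        self._seq += 1
        return True

    def poll(self):
        """Called by a worker on any shard; returns a partition name or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


sched = CompactionScheduler(min_dirty_ratio=0.5, max_lag_ms=1000)
sched.offer(PartitionStats("a", 0.6, 0))
sched.offer(PartitionStats("b", 0.1, 0))     # below the ratio and within lag: skipped
sched.offer(PartitionStats("c", 0.2, 5000))  # past the lag bound: jumps the queue
sched.offer(PartitionStats("d", 0.9, 0))
print([sched.poll() for _ in range(4)])  # ['c', 'd', 'a', None]
```

Because idle workers pull the next item, a busy shard does not delay compaction of the partitions it hosts. The starvation caveat remains when the queue grows faster than the whole cluster can drain it.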
