Skip to content

Redpanda — Cloud Topics: Level Zero garbage collection

Summary

Redpanda's 2026-05-19 post is Part 1 of 2 on how Cloud Topics decides when an L0 object — the temporary, mixed-partition object-storage file produced by the Cloud Topics write path — is safe to delete. The post is structured as a guided rejection of the obvious solution (distributed reference counting) followed by the disclosed mechanism: every L0 object is stamped at creation with a monotonically increasing cluster epoch, each Cloud Topic partition tracks a sliding window of active epochs in a replicated state machine embedded in its Raft log, and a clusterwide safe-to-GC epoch M is constructed lazily by taking the minimum of per-partition M(p) values disseminated by an existing periodic metadata-dissemination service. The architectural property the post repeatedly emphasises is "no central index, no shared state, and no coordinated updates" — Cloud Topics' L0 GC is coordination-free by virtue of three composed primitives: (1) embedding a cluster-global monotonic stamp on each object, (2) maintaining per-partition GC state as a Raft-replicated state machine, and (3) computing the safe-to-delete epoch as a global minimum over already-disseminated local state, with monotonicity giving stale-data tolerance.

This is the first wiki source to canonicalise epoch-based distributed garbage collection as a first-class technique distinct from reference counting. It promotes the cluster-epoch primitive, the sliding-window relaxation, and the lazy-aggregation-from- monotonic-local-state shape into named wiki concepts and patterns. Part 2 of the series — the actual deletion mechanics, "how the garbage collector's design enables us to continually delete thousands of L0 objects without any locally persistent state, explicit coordination, or wasted work" — is forward-referenced but not yet ingested.

Key takeaways

  1. L0 garbage collection is the timing problem of "act too soon and you delete something still in use, act too late and you waste resources." The post's framing primitive: "The challenge is timing: act too soon and you delete something still in use, act too late and you waste resources." In Cloud Topics specifically, L0 objects are temporary by design — they exist only as the intermediate write target before the Reconciler lifts their data into per-partition L1 files. The GC system is what makes "temporary" operational. (Source: this post)

  2. Distributed reference counting is rejected as the solution. The post walks the reader through the obvious framing — each chunk of unreconciled data is a reference, an L0 object is safe to delete when its reference count reaches zero — then explicitly rejects it: "this framing belies an ocean of complexity. First, these reference counts must be durable. Partitions themselves are spread across the cluster to balance load, so we'll need to support updates from anywhere. Do they need to be linearizable? Reference count updates are tightly correlated with the partition-local state that reconcilers use to make progress. Maybe they need to be atomic then? Would we be satisfied with eventual consistency? These are real questions about distributed systems design." Verbatim conclusion: "Redpanda does not track L0 objects this way." Canonicalised as epoch-based distributed GC. (Source: this post)

  3. Cluster epoch: a monotonically increasing counter embedded in every L0 object ID at creation time. "The cluster epoch is a monotonically increasing counter that we embed in every L0 object ID at creation time. Since the epoch is updated periodically and only ever increases, any given epoch E must eventually age out of the cluster. Once we have reconciled every object created in epoch E, it stands to reason that any L0 object with that epoch can be safely deleted." The cluster epoch is the load-bearing primitive — it converts a reference-counting question ("does anyone still reference this specific object?") into a monotonic-bound question ("has the cluster moved past this epoch?"). Canonicalised as concepts/cluster-epoch and patterns/epoch-stamp-on-object-id-for-gc. (Source: this post)

  4. Per-partition safe-to-GC epoch M(p) plus global minimum gives the cluster-wide M. "Given an oracle M(p) that gives the safe-to-GC epoch for any Cloud Topic partition, we can construct an aggregate M that's globally safe-to-GC. […] Given partitions P={p0,…pn} and safe epochs Ms = min(M(p) over p in P), it follows that M = min(Ms) is safe to GC by epoch monotonicity and the definition of min." This is the algebraic shape the rest of the post exists to make implementable: each partition contributes one monotonic local watermark, the cluster takes the min, and that min is safe to GC anywhere. (Source: this post)

  5. First attempt — single-epoch-per-partition rejected on leadership-change failure mode. Redpanda's "initial design tracked a single epoch per partition, the max across all produced placeholder batches, and rejected anything older on the replication path." This trivially preserved monotonicity and always-validity but was "too strict in practice. If partition leadership moves to a node with a stale epoch cache, we'll reject every new write until cache expiry, which could be minutes away." The leadership-change-with-stale-cache failure mode is the load-bearing reason a single-epoch tracker is insufficient at distributed-system altitude. (Source: this post)

  6. Sliding window of active epochs gives the same monotonicity with leadership-change tolerance. "Rather than approaching this as a distributed cache coherence problem (hard!), we can bake resilience to this epoch lag right into the algorithm. Local to each partition, we maintain a sliding window of active epochs. When we see a new epoch for the first time, slide the window forward. We still get monotonicity by construction, but we gain some flexibility to accept writes that were in flight when the window moved." The window converts a strict-rejection policy into a range-acceptance policy, eliminating the stale-cache write-stall without losing the monotonic invariant. Canonicalised as concepts/sliding-window-epoch-tracking. (Source: this post)

  7. Per-partition GC state lives in a dedicated Raft-replicated state machine in the partition's Raft log. "Each Cloud Topic partition maintains this sliding epoch window through a dedicated replicated state machine embedded in the partition's Raft log." Three load-bearing fields:

    Field Advance
    max_applied_epoch when a strictly greater epoch is committed
    previous_applied_epoch when we apply a new max_applied_epoch
    min_epoch_lower_bound when reconciler catches up to max_applied_epoch

    "As discussed, [previous_applied, max_applied] describes the range of active epochs we expect to see, and anything below this range is rejected before entering the replication pipeline. M(p) is simply prev(min_epoch_lower_bound)." The trick the additional min_epoch_lower_bound field plays: the window slides forward as soon as a new epoch is observed, but the safe-to-GC watermark only advances once the reconciler has confirmed all L0 data up to that point has been lifted to L1. Window-advance is decoupled from safe-watermark-advance. Canonicalised as patterns/per-partition-rsm-for-gc-tracking. (Source: this post)

  8. Clusterwide aggregate M piggybacks on existing periodic metadata-dissemination. "Now that every Cloud Topic partition p tracks an inactive epoch in its own Raft log, all that's left is to combine these into a single global M. Turns out we can piggyback this information on an existing periodic metadata- dissemination service internal to Redpanda." No new gossip protocol, no new control-plane service, no new RPC — Cloud Topics' GC reuses Redpanda's existing fleet-state distribution substrate. Canonicalised as patterns/lazy-aggregate-from-monotonic-local-state. (Source: this post)

  9. Monotonicity makes stale-data tolerance free. "If a node is temporarily operating on stale metadata, that's fine. A nice side effect of epoch monotonicity is that once we prove some M is safe, it never becomes unsafe. Every epoch < M is gone forever. Or until int64 rollover." This is the property that makes the lazy-aggregation shape correct: a node operating on a stale view of M(p) for some partition can only compute a min(Ms) that is smaller than the true value (more conservative — never deletes anything that's still in use), so the system is correct under arbitrary staleness. Latency cost, not correctness cost. (Source: this post)

  10. Architectural property: "no central index, no shared state, and no coordinated updates." Verbatim slogan. The whole design exists to deliver this property. Distributed reference counting fails it because the counts are shared mutable state requiring coordinated updates. Cluster epoch + sliding window + lazy minimum aggregation delivers safe-to-GC semantics with zero coordination on the GC path. Canonicalised across concepts/epoch-based-distributed-gc, patterns/per-partition-rsm-for-gc-tracking, and patterns/lazy-aggregate-from-monotonic-local-state. (Source: this post)

Architectural numbers and primitives

The post is structural — there are no production benchmarks, no fleet sizes, no L0 object counts per epoch, no epoch advancement cadences. The disclosed concrete primitives are:

  • 3 fields per partition's GC state machine: max_applied_epoch, previous_applied_epoch, min_epoch_lower_bound.
  • 2-tuple defining the active-epoch window: [previous_applied, max_applied].
  • M(p) = prev(min_epoch_lower_bound) — the per-partition safe-to-GC oracle.
  • M = min(M(p) over p in P) — the clusterwide aggregate.
  • Epoch counter type: post mentions "int64 rollover" in passing — implying epoch is an int64.
  • Dissemination substrate: existing periodic metadata- dissemination service in Redpanda — name not disclosed.

What the post does not disclose:

  • Epoch-advancement frequency (per second? per minute? per L0 flush?).
  • The cluster-epoch advancement protocol — who decides to bump the epoch, and how is it propagated to brokers about to assemble a new L0?
  • Window width (how many epochs [previous_applied, max_applied] spans in practice).
  • The specific mechanism by which the reconciler signals that all L0 data for an epoch has been lifted to L1 — the post says "reconciler catches up to max_applied_epoch" but does not describe how this is detected.
  • Behaviour under int64 rollover (mentioned only as a quip).
  • Failure modes: what happens if min_epoch_lower_bound advances but the corresponding L1 file is later corrupted or the Reconciler crashes mid-rewrite? Are there compensating mechanisms?
  • Whether the metadata-dissemination service is itself Raft-replicated or eventually-consistent.
  • Part 2 — the actual deletion mechanism, including how "thousands of L0 objects" are deleted "continually" without "locally persistent state, explicit coordination, or wasted work."

Caveats

  1. Part 1 of 2 — the deletion mechanism is forward-referenced, not described. This post's contribution is the safe-to-GC decision. How the system actually goes and deletes objects identified as safe — the protocol for issuing DELETE RPCs to object storage at scale, the work distribution across brokers, the idempotency strategy if a broker crashes mid-delete — is "part 2" and not included. Forward-reference verbatim: "Stay tuned for part 2, where we discuss how the garbage collector's design enables us to continually delete thousands of L0 objects without any locally persistent state, explicit coordination, or wasted work."

  2. Pedagogical framing — the reference-counting rejection sequence is a teaching device, not a chronological record. The post structures itself as "here's the obvious solution → here's why it doesn't work → here's what we built." This is effective exposition but flattens what was likely a more iterative internal design process.

  3. No benchmark data. No L0 object counts, no GC throughput, no GC-induced object-storage cost numbers, no comparison vs alternative GC schemes. The post is a mechanism description, not a performance claim.

  4. Reconciler-catches-up signal is hand-waved. The most load-bearing field, min_epoch_lower_bound, advances "when reconciler catches up to max_applied_epoch." What the reconciler actually has to do to prove it has caught up — and how that proof is committed to the Raft log such that the RSM can advance the field — is not specified. This is likely structural (the reconciler emits a Raft log entry on catch-up, the RSM transitions on apply) but the post does not say.

  5. Single epoch per L0 — implicit assumption. The post asserts that an L0 object has the cluster epoch embedded at creation time, implying L0 objects do not span epochs. Whether this constraint is enforced architecturally (e.g. epoch advance flushes the in-memory cross-partition staging buffer) is not disclosed.

  6. Epoch advancement protocol not disclosed. "The cluster epoch is updated periodically" — but who updates it, on what schedule, and how is it propagated to brokers in a way that guarantees a broker assembling a new L0 always uses the latest epoch (or at least never an earlier-than-believed-current epoch)? The post does not disclose this. There is an implicit assumption that all brokers eventually see a given epoch but the convergence time is not bounded.

  7. previous_applied_epoch is described but its operational role is left implicit. The table lists it as a tracked field but the post does not call out what reads use it. Inferred role: it defines the lower bound of the active window [previous_applied, max_applied], so it's used on the admission-control path (rejecting writes with epoch < previous_applied). The post would benefit from making this explicit.

  8. Comparison to other epoch-based GC schemes is absent. The post does not cite prior art — e.g. epoch-based memory reclamation in lock-free data structures (Fraser; Hart et al.), epoch-based GC in concurrent hash maps, or related distributed GC schemes. The technique is presented as a Redpanda-specific solution rather than situated in the broader literature, which limits the reader's ability to triangulate trade-offs.

Source

Systems

  • systems/redpanda-cloud-topics — the system the post is about; ingest extends this page with the new L0 GC section
  • systems/redpanda — the broker substrate; Cloud Topics' GC reuses Redpanda's existing per-partition Raft logs and the internal metadata-dissemination service

Concepts

Patterns

Companies

Last updated · 542 distilled / 1,571 read