CONCEPT Cited by 1 source
Cluster epoch¶
Definition¶
A cluster epoch is a monotonically increasing, cluster-global
counter that is embedded into the durable identifier of an object
at the moment it is created, so that the question "is this object
still in use?" can be reduced to "has the cluster moved past this
epoch?" The epoch advances periodically; once every object created
in epoch E has been processed (e.g. compacted, lifted, indexed,
acknowledged), every object stamped with epoch E is by definition
no longer needed and can be safely reclaimed.
The named primitive is from Redpanda Cloud Topics:
"The cluster epoch is a monotonically increasing counter that we embed in every L0 object ID at creation time. Since the epoch is updated periodically and only ever increases, any given epoch E must eventually age out of the cluster. Once we have reconciled every object created in epoch E, it stands to reason that any L0 object with that epoch can be safely deleted." (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)
Why it works¶
The cluster epoch converts a reference-counting question into a monotonic-bound question:
| Approach | Question | Distributed-systems cost |
|---|---|---|
| Reference count | "Does any chunk of unreconciled data still reference this object?" | Durable + linearizable + coordinated reference counts; updates from anywhere; per-object metadata |
| Cluster epoch | "Has the cluster proven it has moved past epoch E?" | One global watermark; monotonic; lazy aggregation from local state suffices |
The monotonicity property is load-bearing in two ways:
- At the stamping site: an object's epoch is fixed at creation and never changes. There's nothing to update later.
- At the safe-to-reclaim site: once we prove some safe-to-GC epoch M, it never becomes unsafe; every epoch < M is gone forever (per the [[sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection|2026-05-19 Redpanda post]]: "once we prove some M is safe, it never becomes unsafe. Every epoch < M is gone forever. Or until int64 rollover."). This makes any aggregation that produces M safe under arbitrary staleness: a stale observer can only compute a smaller (more conservative) M.
What "epoch" means in this context¶
The term epoch is overloaded across distributed-systems literature. In this concept's specific sense:
- Cluster epoch (this concept): A coarse-grained logical timestamp shared by an entire cluster, advanced periodically, embedded in object IDs at creation. Used as the substrate for epoch-based distributed GC.
- Raft term / leader epoch (different concept): The monotonically increasing counter that uniquely identifies a leader's tenure in a Raft / Paxos protocol. Used for fencing stale leaders, not for object lifecycle.
- CUBIC epoch (different concept): The TCP CUBIC congestion-control window's recovery period. Domain-specific to congestion control.
- Epoch-based memory reclamation (related but distinct): The classical lock-free-data-structures technique where threads enter / exit "epochs" and memory is reclaimed once no thread is in an old epoch. The cluster-epoch concept generalises this to the cluster altitude with object stamps replacing thread enter/exit markers.
How the epoch advances¶
The post is explicit that the cluster epoch advances "periodically" but does not disclose the exact mechanism. The properties required are:
- Monotonic: the global counter only ever increases. No epochs are reused, no epochs go backwards.
- Eventually advanced everywhere: a stamping broker eventually sees the new epoch. Convergence time bounds the maximum lag.
- Cluster-global, not partition-local: all partitions share the same epoch namespace. This is what makes a single global watermark meaningful.
Anti-pattern: per-partition epochs. If each partition advanced its own epoch independently, the "safe to delete" question would require per-partition reasoning at every read site. The cluster-global property is what lets the safe-to-GC watermark be one number.
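Since the post does not disclose how the epoch is advanced, the following is only a minimal sketch of the monotonicity requirement as seen from one stamping broker. The class name `BrokerEpochView` and the idea that announcements may arrive out of order are assumptions made for illustration, not details from the source.

```python
class BrokerEpochView:
    """One broker's local view of the cluster epoch (hypothetical helper)."""

    def __init__(self, epoch: int = 0) -> None:
        self.epoch = epoch

    def observe(self, announced: int) -> int:
        # Keep the larger value, so a delayed or reordered announcement
        # can never move the local view backwards.
        self.epoch = max(self.epoch, announced)
        return self.epoch


view = BrokerEpochView()
for announced in [1, 2, 5, 3, 4, 6]:  # 3 and 4 arrive late (assumed scenario)
    view.observe(announced)
print(view.epoch)  # 6
```

Taking the maximum keeps the local view monotonic regardless of delivery order; how quickly every broker converges on the newest epoch is a separate property that this sketch does not model.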
What stamps the epoch¶
In the canonical Cloud Topics instance, the L0 object ID is stamped with the epoch at creation time. "We embed in every L0 object ID." The exact format is not disclosed, but the structure is "object ID = ⟨epoch, …⟩" — i.e. epoch is part of the identifier, not a side-channel attribute. This means:
- The epoch is recoverable from any reference to the object (storage path, metadata pointer, log entry).
- The object cannot be silently re-tagged.
- Listing objects in a bucket gives a histogram of epochs for free.
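The three consequences above can be illustrated with a small sketch. The real L0 object ID format is not disclosed, so the layout used here (a zero-padded epoch prefix followed by a random suffix) and the helper names `make_object_id`, `epoch_of`, and `epoch_histogram` are purely hypothetical.

```python
import uuid
from collections import Counter


def make_object_id(epoch: int) -> str:
    # Hypothetical layout: 20-digit zero-padded epoch, then a random suffix.
    # The epoch is fixed here, at creation, and is never rewritten.
    return f"{epoch:020d}-{uuid.uuid4().hex}"


def epoch_of(object_id: str) -> int:
    # The epoch is recovered from the identifier alone, with no side lookup.
    return int(object_id.split("-", 1)[0])


def epoch_histogram(object_ids):
    # Counting epochs over a listing needs only the object names.
    return Counter(epoch_of(oid) for oid in object_ids)


listing = [make_object_id(e) for e in (7, 7, 8, 9, 9, 9)]
hist = epoch_histogram(listing)
print(sorted(hist.items()))  # [(7, 2), (8, 1), (9, 3)]
```

Zero-padding is a design choice of this sketch: it makes lexicographic order of the IDs agree with numeric epoch order, which is convenient for prefix-ordered listings.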
Per-shard safe-to-GC watermark M(p)¶
For the cluster-epoch concept to deliver coordination-free GC, each
shard (in Cloud Topics: each partition) must publish a local
watermark M(p) such that "every object stamped with epoch ≤ M(p)
that depends on shard p has been processed." The clusterwide safe
epoch is then M = min(M(p)) over all shards.
The mechanism for tracking M(p) is
sliding-window epoch
tracking — a per-partition replicated state machine that maintains
the active range [previous_applied, max_applied] and advances a
separate min_epoch_lower_bound field once the local processor
(in Cloud Topics: the Reconciler) has caught up.
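As a rough illustration of how per-shard watermarks combine into a reclaim decision, the sketch below computes M = min(M(p)) and tests an object's stamped epoch against it. The function names and the shard labels are hypothetical; the comparison follows the "epoch ≤ M(p)" wording of the definition above, and a strict comparison would simply be more conservative.

```python
from typing import Mapping, Optional


def global_safe_epoch(local: Mapping[str, Optional[int]]) -> Optional[int]:
    # M = min(M(p)). If any shard has not published a watermark yet,
    # no epoch is provably safe, so return None.
    if not local or any(m is None for m in local.values()):
        return None
    return min(local.values())


def reclaimable(object_epoch: int, m: Optional[int]) -> bool:
    # Assumes the "epoch <= M" convention from the definition above.
    return m is not None and object_epoch <= m


watermarks = {"p0": 12, "p1": 9, "p2": 15}  # hypothetical M(p) values
M = global_safe_epoch(watermarks)
print(M)                   # 9
print(reclaimable(9, M))   # True
print(reclaimable(10, M))  # False
```

The slowest shard bounds the whole cluster: p1 holds M at 9 even though the other shards are further ahead.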
Aggregation to global M¶
The clusterwide aggregate M = min(M(p)) is constructed lazily
via an existing periodic metadata-dissemination substrate (in Cloud
Topics: Redpanda's internal metadata-dissemination service). No new
gossip protocol or coordination primitive is required —
monotonicity makes stale observations safe, so the aggregate can be
computed best-effort. See
patterns/lazy-aggregate-from-monotonic-local-state.
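To make the "stale observations are safe" argument concrete, here is a minimal sketch of an observer that builds its view of M from delayed, possibly reordered reports. The class `LazyWatermarkAggregator` and the report sequence are assumptions for illustration; the source does not describe the dissemination service's API.

```python
from typing import Dict, Iterable, Optional


class LazyWatermarkAggregator:
    """Best-effort view of M built from possibly stale M(p) reports."""

    def __init__(self, shards: Iterable[str]) -> None:
        self._seen: Dict[str, Optional[int]] = {s: None for s in shards}

    def on_report(self, shard: str, m_p: int) -> None:
        # M(p) only grows, so an older report arriving late is ignored.
        current = self._seen[shard]
        if current is None or m_p > current:
            self._seen[shard] = m_p

    def safe_epoch(self) -> Optional[int]:
        values = list(self._seen.values())
        if any(v is None for v in values):
            return None  # some shard has never reported: nothing is safe yet
        return min(values)


true_state = {"p0": 12, "p1": 9, "p2": 15}  # what the shards actually hold
agg = LazyWatermarkAggregator(true_state)
# The observer only ever sees older values (assumed scenario).
for shard, m_p in [("p0", 10), ("p1", 7), ("p2", 11), ("p0", 8)]:
    agg.on_report(shard, m_p)

stale_m = agg.safe_epoch()
print(stale_m, min(true_state.values()))  # 7 9
```

Because every observed M(p) is at most the true M(p), the observer's minimum is at most the true minimum. Staleness delays reclamation but cannot cause an incorrect delete.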
Trade-offs vs reference counting¶
| Property | Reference counting | Cluster epoch |
|---|---|---|
| Per-object metadata | Yes (the count) | No (just the stamp, immutable) |
| Coordinated updates | Yes (count must be consistent) | No (epoch never changes after stamping) |
| Granularity | Per-object | Coarse — entire epoch's worth at once |
| Latency to reclaim | Low (delete on count → 0) | Higher (wait for epoch to age out) |
| Sensitive to durability of count | Yes — losing counts breaks correctness | No — losing local state means recompute, no incorrect deletes |
| Sensitive to leader change | Possibly (count update at new leader) | Bounded — see concepts/sliding-window-epoch-tracking |
| Operational complexity | High in distributed setting | Low |
Cluster epochs trade reclaim latency for coordination cost. They reclaim more slowly (an object can't be deleted until its entire epoch's cohort has finished processing), but they avoid the entire distributed-systems design cost of maintaining durable, coordinated reference counts.
Constraints on the technique¶
The cluster-epoch primitive only works when:
- Objects are temporary by design. Long-lived objects don't benefit; an object that lives forever has no epoch to age out of. Cloud Topics' L0 files satisfy this (L0 is intermediate).
- The retention granularity is acceptable. All objects in epoch E are reclaimed together. If finer granularity is needed (e.g. some objects retained for replay, some for compliance), per-object lifecycle policies are needed alongside.
- Local watermark publishing is cheap. If M(p) is expensive to compute or publish, the lazy-aggregation shape doesn't pay off.
- There exists a clear "processed everywhere" condition per epoch. In Cloud Topics: "every object created in epoch E has been reconciled." If the "processed" condition is fuzzy or per-consumer, the concept doesn't apply cleanly.
Seen in¶
- sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection: canonical wiki instance. Cluster epoch embedded in L0 object IDs; per-partition M(p) tracked in a Raft-replicated state machine; cluster-wide M = min(M(p)) disseminated via the existing periodic metadata-distribution substrate.
Related¶
- concepts/epoch-based-distributed-gc: the GC technique that uses cluster epoch as its primitive.
- concepts/sliding-window-epoch-tracking: the per-partition state machine that publishes M(p).
- concepts/garbage-collection: the parent concept; reference counting is the rejected alternative.
- concepts/l0-l1-file-compaction-for-object-store-streaming: the file layout that creates the "temporary by design" precondition.
- patterns/epoch-stamp-on-object-id-for-gc: the pattern of embedding the epoch in the durable identifier.
- patterns/per-partition-rsm-for-gc-tracking: the pattern of publishing M(p) from a Raft-replicated state machine.
- patterns/lazy-aggregate-from-monotonic-local-state: the pattern of computing global M from per-shard M(p).
- systems/redpanda-cloud-topics: the canonical system.