
CONCEPT Cited by 1 source

Cluster epoch

Definition

A cluster epoch is a monotonically increasing, cluster-global counter that is embedded into the durable identifier of an object at the moment it is created, so that the question "is this object still in use?" can be reduced to "has the cluster moved past this epoch?" The epoch advances periodically; once every object created in epoch E has been processed (e.g. compacted, lifted, indexed, acknowledged), every object stamped with epoch E is by definition no longer needed and can be safely reclaimed.

The named primitive is from Redpanda Cloud Topics:

"The cluster epoch is a monotonically increasing counter that we embed in every L0 object ID at creation time. Since the epoch is updated periodically and only ever increases, any given epoch E must eventually age out of the cluster. Once we have reconciled every object created in epoch E, it stands to reason that any L0 object with that epoch can be safely deleted." (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)

Why it works

The cluster epoch converts a reference-counting question into a monotonic-bound question:

| Approach | Question | Distributed-systems cost |
| --- | --- | --- |
| Reference count | "Does any chunk of unreconciled data still reference this object?" | Durable + linearizable + coordinated reference counts; updates from anywhere; per-object metadata |
| Cluster epoch | "Has the cluster proven it has moved past epoch E?" | One global watermark; monotonic; lazy aggregation from local state suffices |

The monotonicity property is load-bearing in two ways:

  1. At the stamping site: an object's epoch is fixed at creation and never changes. There's nothing to update later.
  2. At the safe-to-reclaim site: once we prove some safe-to-GC epoch M, it never becomes unsafe — every epoch < M is gone forever (per the [[sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection|2026-05-19 Redpanda post]]: "once we prove some M is safe, it never becomes unsafe. Every epoch < M is gone forever. Or until int64 rollover."). This makes any aggregation that produces M safe under arbitrary staleness — a stale observer can only compute a smaller (more conservative) M.
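The staleness argument can be illustrated with a minimal sketch. The `L0Object` type and the `reclaimable` helper are hypothetical names introduced here, not part of Redpanda. The strict `<` comparison follows the quoted wording ("every epoch < M"); whether the real boundary is inclusive is not stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L0Object:
    """Hypothetical stand-in for an L0 object; only the epoch stamp matters here."""
    name: str
    epoch: int  # fixed at creation, never updated afterwards


def reclaimable(objects, safe_epoch):
    """Return the objects whose epoch is strictly below the proven-safe epoch M."""
    return [o for o in objects if o.epoch < safe_epoch]


objects = [L0Object("a", 3), L0Object("b", 4), L0Object("c", 7)]

fresh = reclaimable(objects, safe_epoch=5)  # observer holding an up-to-date M
stale = reclaimable(objects, safe_epoch=4)  # observer holding an older, smaller M

# A stale observer reclaims a subset of what a fresh one would: less, never more.
assert set(stale) <= set(fresh)
```

Because the predicate only compares a fixed stamp against a bound that never decreases, an observer that lags behind simply defers some deletions to a later sweep.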

What "epoch" means in this context

The term epoch is overloaded across distributed-systems literature. In this concept's specific sense:

  • Cluster epoch (this concept): A coarse-grained logical timestamp shared by an entire cluster, advanced periodically, embedded in object IDs at creation. Used as the substrate for epoch-based distributed GC.
  • Raft term / leader epoch (different concept): The monotonically increasing counter that uniquely identifies a leader's tenure in a Raft / Paxos protocol. Used for fencing stale leaders, not for object lifecycle.
  • CUBIC epoch (different concept): The TCP CUBIC congestion-control window's recovery period. Domain-specific to congestion control.
  • Epoch-based memory reclamation (related but distinct): The classical lock-free-data-structures technique where threads enter / exit "epochs" and memory is reclaimed once no thread is in an old epoch. The cluster-epoch concept generalises this to the cluster level, with object stamps replacing thread enter/exit markers.

How the epoch advances

The post is explicit that the cluster epoch advances "periodically" but does not disclose the exact mechanism. The properties required are:

  1. Monotonic: the global counter only ever increases. No epochs are reused, no epochs go backwards.
  2. Eventually advanced everywhere: a stamping broker eventually sees the new epoch. Convergence time bounds the maximum lag.
  3. Cluster-global, not partition-local: all partitions share the same epoch namespace. This is what makes a single global watermark meaningful.

Anti-pattern: per-partition epochs. If each partition advanced its own epoch independently, the "safe to delete" question would require per-partition reasoning at every read site. The cluster-global property is what lets the safe-to-GC watermark be one number.
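Since the source does not disclose how the epoch is advanced or disseminated, the following is only a sketch of one way a stamping broker could satisfy the first two properties. `BrokerEpochView` and its methods are hypothetical; the sketch assumes epoch announcements may arrive late or out of order.

```python
class BrokerEpochView:
    """Hypothetical broker-local view of the cluster epoch."""

    def __init__(self):
        self.epoch = 0

    def observe(self, announced):
        # Adopt only larger values, so a delayed or reordered announcement
        # can never move the local view backwards.
        self.epoch = max(self.epoch, announced)
        return self.epoch

    def stamp(self):
        # The epoch a newly created object would be stamped with right now.
        return self.epoch


view = BrokerEpochView()
history = [view.observe(e) for e in [1, 2, 5, 3, 6]]  # 3 arrives late

# The local view is non-decreasing even though the inputs were not.
assert all(a <= b for a, b in zip(history, history[1:]))
```

Taking the maximum on every observation keeps the view monotonic without any coordination; the lag between the true cluster epoch and the local view only delays reclamation.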

What stamps the epoch

In the canonical Cloud Topics instance, the L0 object ID is stamped with the epoch at creation time: the post describes the epoch as a counter "that we embed in every L0 object ID at creation time." The exact format is not disclosed, but the structure is "object ID = ⟨epoch, …⟩"; that is, the epoch is part of the identifier, not a side-channel attribute. This means:

  • The epoch is recoverable from any reference to the object (storage path, metadata pointer, log entry).
  • The object cannot be silently re-tagged.
  • Listing objects in a bucket gives a histogram of epochs for free.
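As an illustration of these consequences, here is a sketch with an invented ID layout. The real format is not disclosed, so the zero-padded `epoch/suffix` layout and the helper names `make_object_id` and `epoch_of` are assumptions made purely for the example.

```python
import uuid
from collections import Counter


def make_object_id(epoch: int) -> str:
    # Hypothetical layout: zero-padded epoch, then a random suffix.
    return f"{epoch:020d}/{uuid.uuid4().hex}"


def epoch_of(object_id: str) -> int:
    # The epoch is recoverable from the identifier alone.
    return int(object_id.split("/", 1)[0])


# Stand-in for a bucket listing: six objects across three epochs.
listing = [make_object_id(e) for e in (7, 7, 8, 9, 9, 9)]

# A per-epoch histogram falls out of the listing with no extra metadata.
histogram = Counter(epoch_of(oid) for oid in listing)
assert histogram == Counter({7: 2, 8: 1, 9: 3})
```

Zero-padding is a choice made for this sketch so that lexicographic order of IDs matches numeric epoch order; nothing in the source says the real IDs sort this way.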

Per-shard safe-to-GC watermark M(p)

For the cluster-epoch concept to deliver coordination-free GC, each shard (in Cloud Topics: each partition) must publish a local watermark M(p) such that "every object stamped with epoch ≤ M(p) that depends on shard p has been processed." The clusterwide safe epoch is then M = min(M(p)) over all shards.
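The minimum can be written out directly. In this sketch the function name, the partition names, and the choice to return `None` when a partition has not yet reported are assumptions; the source only gives the formula M = min(M(p)).

```python
def global_safe_epoch(watermarks, partitions):
    """Compute M = min(M(p)) over all partitions.

    Returns None when any partition is missing a watermark, since the
    minimum is then unknown (an assumption of this sketch).
    """
    if not partitions or any(p not in watermarks for p in partitions):
        return None
    return min(watermarks[p] for p in partitions)


partitions = ["p0", "p1", "p2"]

# The slowest partition (p1) determines the clusterwide safe epoch.
m_all = global_safe_epoch({"p0": 12, "p1": 9, "p2": 15}, partitions)

# With p2 unreported, no clusterwide bound can be claimed yet.
m_missing = global_safe_epoch({"p0": 12, "p1": 9}, partitions)
```

Here `m_all` is 9 and `m_missing` is `None`: one lagging or silent partition holds back reclamation for the whole cluster, which is the price of having a single global number.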

The mechanism for tracking M(p) is sliding-window epoch tracking — a per-partition replicated state machine that maintains the active range [previous_applied, max_applied] and advances a separate min_epoch_lower_bound field once the local processor (in Cloud Topics: the Reconciler) has caught up.

Aggregation to global M

The clusterwide aggregate M = min(M(p)) is constructed lazily via an existing periodic metadata-dissemination substrate (in Cloud Topics: Redpanda's internal metadata-dissemination service). No new gossip protocol or coordination primitive is required — monotonicity makes stale observations safe, so the aggregate can be computed best-effort. See patterns/lazy-aggregate-from-monotonic-local-state.
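A best-effort aggregator along these lines might look like the sketch below. `WatermarkAggregator` is a hypothetical name, and the floor value of 0 for partitions that have not reported is an assumption. It is not a description of Redpanda's metadata-dissemination service.

```python
class WatermarkAggregator:
    """Hypothetical observer that folds per-partition reports into a global M."""

    def __init__(self, partitions):
        # 0 is an assumed floor meaning "nothing proven safe yet".
        self.latest = {p: 0 for p in partitions}

    def receive(self, partition, watermark):
        # M(p) is monotonic, so an older report arriving late is ignored.
        self.latest[partition] = max(self.latest[partition], watermark)

    def safe_epoch(self):
        return min(self.latest.values())


true_state = {"p0": 12, "p1": 9, "p2": 15}  # what each partition has really proven

observer = WatermarkAggregator(true_state)
observer.receive("p0", 12)
observer.receive("p1", 7)   # stale: p1 has since advanced to 9
observer.receive("p2", 15)
observer.receive("p1", 5)   # even older report arrives late and is ignored

stale_m = observer.safe_epoch()

# The stale aggregate never exceeds the true minimum.
assert stale_m <= min(true_state.values())
```

The observer ends up with 7 while the true minimum is 9. The result is late but never wrong, which is why no extra coordination is needed to compute it.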

Trade-offs vs reference counting

| Property | Reference counting | Cluster epoch |
| --- | --- | --- |
| Per-object metadata | Yes (the count) | No (just the stamp, immutable) |
| Coordinated updates | Yes (count must be consistent) | No (epoch never changes after stamping) |
| Granularity | Per-object | Coarse: an entire epoch's worth at once |
| Latency to reclaim | Low (delete on count → 0) | Higher (wait for epoch to age out) |
| Sensitive to durability of count | Yes: losing counts breaks correctness | No: losing local state means recompute, no incorrect deletes |
| Sensitive to leader change | Possibly (count update at new leader) | Bounded (see concepts/sliding-window-epoch-tracking) |
| Operational complexity | High in distributed setting | Low |

The trade-off exchanges reclaim latency for coordination cost. Cluster epochs reclaim more slowly (an object can't be deleted until its entire epoch's cohort has finished processing), but in return they remove the need to design and operate durable, coordinated reference counts.

Constraints on the technique

The cluster-epoch primitive only works when:

  1. Objects are temporary by design. Long-lived objects don't benefit; an object that lives forever has no epoch to age out of. Cloud Topics' L0 files satisfy this (L0 is intermediate).
  2. The retention granularity is acceptable. All objects in epoch E are reclaimed together. If finer granularity is needed (e.g. some objects retained for replay, some for compliance), per-object lifecycle policies are needed alongside.
  3. Local watermark publishing is cheap. If M(p) is expensive to compute or publish, the lazy-aggregation shape doesn't pay off.
  4. There exists a clear "processed everywhere" condition per epoch. In Cloud Topics: "every object created in epoch E has been reconciled." If the "processed" condition is fuzzy or per-consumer, the concept doesn't apply cleanly.

Seen in
