CONCEPT Cited by 1 source
Sliding-window epoch tracking
Definition
Sliding-window epoch tracking is the per-shard mechanism that
publishes a monotonically non-decreasing safe-to-GC watermark
M(p) for use in epoch-based
distributed garbage collection, by maintaining a range of
active epochs [previous_applied, max_applied] rather than a
single tracked epoch. The window slides forward as new epochs are
observed, but a separate watermark advances only after a local
processor confirms catch-up, decoupling window advance from
safe-watermark advance.
The shape is canonicalised by Redpanda Cloud Topics for L0 garbage-collection state in each partition's Raft log. (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)
The motivating failure mode
The naive design tracks one epoch per partition, the maximum across all observed events, and rejects anything older. Monotonicity is preserved by construction. The Cloud Topics post explains why this is too strict:
"Our initial design tracked a single epoch per partition, the max across all produced placeholder batches, and rejected anything older on the replication path. This trivially supports both invariants, but it's too strict in practice. If partition leadership moves to a node with a stale epoch cache, we'll reject every new write until cache expiry, which could be minutes away. Not ideal."
The failure mode: a leadership change to a node with a stale view of the current epoch causes every new write to be rejected until that node's cache catches up. Operationally, that can mean minutes of write stall whenever leadership lands on a node with a stale epoch cache.
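The stall can be reproduced with a minimal sketch of the strict design. The class name `StrictEpochTracker`, the integer epochs, and the starting value of 0 are assumptions for illustration only and do not come from the post.

```python
class StrictEpochTracker:
    """Hypothetical single-epoch tracker: rejects anything older than the max."""

    def __init__(self) -> None:
        self.max_epoch = 0  # assumed starting epoch

    def admit(self, epoch: int) -> bool:
        # Anything older than the tracked maximum is rejected outright.
        if epoch < self.max_epoch:
            return False
        self.max_epoch = epoch
        return True


tracker = StrictEpochTracker()
tracker.admit(8)            # the partition has already seen epoch 8

stale_cached_epoch = 7      # a new leader whose epoch cache still says 7
results = [tracker.admit(stale_cached_epoch) for _ in range(3)]
print(results)              # [False, False, False]
```

Every write stamped with the stale epoch is rejected, and nothing changes until the leader's cache refreshes.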
The relaxation
Instead of a single epoch, maintain a sliding window:
"Rather than approaching this as a distributed cache coherence problem (hard!), we can bake resilience to this epoch lag right into the algorithm. Local to each partition, we maintain a sliding window of active epochs. When we see a new epoch for the first time, slide the window forward. We still get monotonicity by construction, but we gain some flexibility to accept writes that were in flight when the window moved."
The window "slides" — its lower edge moves forward when its upper edge does — but it accepts any write whose epoch falls inside the range, not just one matching the upper edge. This converts a distributed-cache-coherence problem ("how do we synchronously inform a new leader of the current epoch?") into an in-algorithm admission-control rule ("accept writes within the window we advertise").
State machine fields
The Cloud Topics instance uses three fields, embedded in a replicated state machine in each partition's Raft log:
| Field | Advance trigger |
|---|---|
| max_applied_epoch | when a strictly greater epoch is committed |
| previous_applied_epoch | when we apply a new max_applied_epoch |
| min_epoch_lower_bound | when reconciler catches up to max_applied_epoch |
Active range: [previous_applied_epoch, max_applied_epoch]. Writes
with an epoch below this range are rejected before entering the
replication pipeline; a write with an epoch above max_applied_epoch
slides the window forward.
The published per-shard safe-to-GC watermark:
M(p) = prev(min_epoch_lower_bound)
where prev(x) is the predecessor of x in the epoch number line
(typically x - 1).
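A minimal sketch of the three fields and their advance triggers, written as a plain in-memory class rather than a Raft-replicated state machine. Several details are assumptions, not statements from the post: the class and method names are hypothetical, epochs are integers starting at 0, and on Reconciler catch-up `min_epoch_lower_bound` is set to `previous_applied_epoch` (the smallest epoch the window can still admit). The source does not say which value the field takes, so this is only a conservative guess.

```python
from dataclasses import dataclass


def prev(epoch: int) -> int:
    """Predecessor on the epoch number line (assumed to be epoch - 1)."""
    return epoch - 1


@dataclass
class EpochWindow:
    previous_applied_epoch: int = 0
    max_applied_epoch: int = 0
    min_epoch_lower_bound: int = 0

    def admit(self, epoch: int) -> bool:
        """Admission control: reject below the window, slide on a new max."""
        if epoch < self.previous_applied_epoch:
            return False
        if epoch > self.max_applied_epoch:
            # Slide: the old upper edge becomes the new lower edge.
            self.previous_applied_epoch = self.max_applied_epoch
            self.max_applied_epoch = epoch
        return True

    def on_reconciler_caught_up(self, reconciled_through: int) -> None:
        """Advance the safety bound only once the reconciler reaches the max."""
        if reconciled_through >= self.max_applied_epoch:
            # ASSUMPTION: the bound moves to the window's lower edge.
            self.min_epoch_lower_bound = max(
                self.min_epoch_lower_bound, self.previous_applied_epoch
            )

    def safe_watermark(self) -> int:
        """M(p) = prev(min_epoch_lower_bound)."""
        return prev(self.min_epoch_lower_bound)


w = EpochWindow()
w.admit(5)                    # window becomes [0, 5]
w.admit(7)                    # window becomes [5, 7]
print(w.admit(6))             # True: in-flight write inside the window
print(w.admit(4))             # False: below previous_applied_epoch
print(w.safe_watermark())     # -1: the window slid, the watermark did not
w.on_reconciler_caught_up(7)
print(w.safe_watermark())     # 4
```

The window moved twice before the watermark moved at all, which is the decoupling the third field exists to provide.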
Why three fields, not two
A single field max_applied plus rejection-of-older-than-max gives
the strict design that fails on leadership-change. Two fields
[previous_applied, max_applied] give the sliding-window admission
control. Why a third field?
Because window advance is not the same as safe-to-GC advance:
- The window slides forward when a new max epoch is committed — this is fast and only depends on what the partition has seen.
- The safe-to-GC watermark advances only after a local processor (in Cloud Topics: the Reconciler) has confirmed it has finished with all data up to the previous max — this is slow and depends on actual processing.
Verbatim from the post:
"the computation of M(p) is actually a bit stricter than what we described before; that's because reconciler progress gives the final word on which epochs are safe to delete. So while the window itself slides forward as soon as a new epoch appears, we only advance the safe epoch once we're sure all the L0 data up to that point has been reconciled into L1."
The third field is the decoupling point between fast-advancing admission-control state and slow-advancing reclamation-safety state. Without it, either:
- Admission control would be coupled to reconciliation progress (and the leadership-change stall would return), or
- Safe-to-GC would advance based on observed-max alone (and the Reconciler's catch-up condition would not be respected, leading to deletes of still-in-use data).
What the window protects against
| Failure mode | Strict-rejection design | Sliding window |
|---|---|---|
| Leadership change with stale epoch cache | Minutes of write-stall | Continues accepting writes within window |
| In-flight writes during epoch advance | Rejected | Accepted (within window) |
| Out-of-order epoch observations from peers | Rejected | Accepted (within window) |
| Epoch below previous_applied_epoch | Rejected (correct) | Rejected (correct) |
| Epoch above max_applied_epoch | Triggers window slide | Triggers window slide |
The window is not a tolerance for unbounded staleness — writes
with epoch below previous_applied_epoch are still rejected. It's
a tolerance for recent staleness, sized to the cache-coherence
gap (typically seconds-to-tens-of-seconds during leadership
transitions).
Monotonicity invariant
Both the strict-rejection design and the sliding-window design preserve the monotonicity invariant the epoch-based GC technique relies on:
- max_applied_epoch is monotonically non-decreasing (only advances on commit of a strictly greater value).
- min_epoch_lower_bound is monotonically non-decreasing (only advances on Reconciler catch-up confirmation).
- M(p) = prev(min_epoch_lower_bound) is monotonically non-decreasing.
Therefore the clusterwide M = min(M(p)) is monotonically
non-decreasing, and once an object's epoch falls below M it
remains below M forever. The post calls this out: "once we
prove some M is safe, it never becomes unsafe."
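The aggregation step can be sketched in a few lines. The partition names and watermark values below are invented for illustration; the only property demonstrated is that the minimum over per-partition values that never decrease also never decreases.

```python
from typing import Mapping


def cluster_watermark(per_partition_m: Mapping[str, int]) -> int:
    """M = min over all partitions p of M(p)."""
    return min(per_partition_m.values())


# Hypothetical successive snapshots; each M(p) only ever moves forward.
snapshots = [
    {"p0": 3, "p1": 5, "p2": 4},
    {"p0": 3, "p1": 6, "p2": 4},
    {"p0": 5, "p1": 6, "p2": 4},
    {"p0": 5, "p1": 6, "p2": 7},
]

history = [cluster_watermark(s) for s in snapshots]
print(history)  # [3, 3, 4, 5]

# The cluster-wide watermark never regresses.
assert all(a <= b for a, b in zip(history, history[1:]))
```

The slowest partition pins M, so one lagging Reconciler delays reclamation cluster-wide but cannot make a previously safe M unsafe.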
Why a replicated state machine, not just cached state
Each partition's GC state lives in its Raft log, not in a broker-local cache. This buys:
- Durability: state survives broker restarts and crashes.
- Fencing: a stale leader cannot publish a stale M(p) after the partition has moved on; Raft term-fencing already prevents this.
- Atomicity with admission control: the same Raft log entry that advances max_applied_epoch is what admits the write. No race between "window says yes" and "epoch advanced while I was deciding."
- Free leadership-change behaviour: a new leader replays the Raft log to recover the same window state. No external cache to warm.
See patterns/per-partition-rsm-for-gc-tracking.
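The leadership-change property follows from the state being a deterministic fold over the committed log. The sketch below models only the two window fields and treats each log entry as a bare epoch number; `GcState`, `apply_entry`, and `replay` are hypothetical names, and a real Raft log entry carries far more than this.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class GcState:
    previous_applied_epoch: int = 0
    max_applied_epoch: int = 0


def apply_entry(state: GcState, epoch: int) -> GcState:
    """Pure transition: slide the window only on a strictly greater epoch."""
    if epoch > state.max_applied_epoch:
        return GcState(state.max_applied_epoch, epoch)
    return state


def replay(log: Iterable[int]) -> GcState:
    """Rebuild window state from scratch by folding over the committed log."""
    return reduce(apply_entry, log, GcState())


committed_log = [3, 3, 5, 4, 5, 8]   # epochs of committed (admitted) entries

old_leader = replay(committed_log)
new_leader = replay(committed_log)   # a different node, same log
print(new_leader)                    # GcState(previous_applied_epoch=5, max_applied_epoch=8)
assert old_leader == new_leader
```

Because the transition function is pure, any replica that has applied the same log prefix holds the same window, with no side channel needed to synchronise it.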
Sliding window vs other watermark mechanisms
| Mechanism | What's tracked | What advances it | Stall on leadership change |
|---|---|---|---|
| Single-epoch tracker (rejected) | max_observed_epoch | New observation | Yes: stale cache stalls writes |
| Sliding-window epoch tracking | Window [previous_applied, max_applied] plus min_epoch_lower_bound | Observation + processor catch-up | No: window absorbs cache lag |
| Last Reconciled Offset | Per-partition offset watermark | Reconciler progress | No: sibling shape, but for read routing |
| Stream-processing watermark | Event-time low-water-mark | Source progress + window-close | No: but different correctness model |
Cloud Topics uses both sliding-window epoch tracking (for GC admission and safe-to-delete) and Last Reconciled Offset (for read-routing between L0 and L1) — they're sibling per-partition watermarks serving different consumers. The window-and-watermark shape recurs at multiple altitudes within Cloud Topics.
Seen in
- sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection: canonical wiki instance. Per-partition replicated state machine embedded in the partition's Raft log; three fields (max_applied_epoch, previous_applied_epoch, min_epoch_lower_bound); admission control rejects epochs below the window; safe-to-GC watermark M(p) = prev(min_epoch_lower_bound) advances only on Reconciler catch-up.
Related
- concepts/cluster-epoch: the global counter the window tracks against.
- concepts/epoch-based-distributed-gc: the GC technique this mechanism publishes the local watermark for.
- concepts/garbage-collection: the parent concept.
- concepts/last-reconciled-offset: sibling per-partition watermark in Cloud Topics, serving read-routing instead of GC.
- patterns/per-partition-rsm-for-gc-tracking: the RSM-in-Raft-log pattern this mechanism inhabits.
- patterns/lazy-aggregate-from-monotonic-local-state: the global-aggregation pattern that consumes the published M(p).
- systems/redpanda-cloud-topics: the canonical system.