CONCEPT Cited by 1 source

Sliding-window epoch tracking¶

Definition¶

Sliding-window epoch tracking is the per-shard mechanism that publishes a monotonically non-decreasing safe-to-GC watermark M(p) for use in epoch-based distributed garbage collection, by maintaining a range of active epochs [previous_applied, max_applied] rather than a single tracked epoch. The window slides forward as new epochs are observed, but a separate watermark advances only after a local processor confirms catch-up, decoupling window advance from safe-watermark advance.

The shape is canonicalised by Redpanda Cloud Topics for L0 garbage-collection state in each partition's Raft log. (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)

The motivating failure mode¶

The naive design tracks one epoch per partition — the maximum across all observed events, with anything older rejected. Monotonicity preserved by construction. The Cloud Topics post explains why this is too strict:

"Our initial design tracked a single epoch per partition, the max across all produced placeholder batches, and rejected anything older on the replication path. This trivially supports both invariants, but it's too strict in practice. If partition leadership moves to a node with a stale epoch cache, we'll reject every new write until cache expiry, which could be minutes away. Not ideal."

The failure mode: leadership change to a node with a stale view of the current epoch causes every new write to be rejected until that node's cache catches up. Operationally — minutes of write-stall on every leadership transition.

The relaxation¶

Instead of a single epoch, maintain a sliding window:

"Rather than approaching this as a distributed cache coherence problem (hard!), we can bake resilience to this epoch lag right into the algorithm. Local to each partition, we maintain a sliding window of active epochs. When we see a new epoch for the first time, slide the window forward. We still get monotonicity by construction, but we gain some flexibility to accept writes that were in flight when the window moved."

The window "slides" — its lower edge moves forward when its upper edge does — but it accepts any write whose epoch falls inside the range, not just one matching the upper edge. This converts a distributed-cache-coherence problem ("how do we synchronously inform a new leader of the current epoch?") into an in-algorithm admission-control rule ("accept writes within the window we advertise").

State machine fields¶

The Cloud Topics instance uses three fields, embedded in a replicated state machine in each partition's Raft log:

Field	Advance trigger
`max_applied_epoch`	when a strictly greater epoch is committed
`previous_applied_epoch`	when we apply a new `max_applied_epoch`
`min_epoch_lower_bound`	when reconciler catches up to `max_applied_epoch`

Active range: [previous_applied_epoch, max_applied_epoch]. Writes with epoch outside this range are rejected before entering the replication pipeline.

The published per-shard safe-to-GC watermark:

M(p) = prev(min_epoch_lower_bound)

Where prev(x) is the predecessor of x in the epoch number line (typically x - 1).

Why three fields, not two¶

A single field max_applied plus rejection-of-older-than-max gives the strict design that fails on leadership-change. Two fields [previous_applied, max_applied] give the sliding-window admission control. Why a third field?

Because window advance is not the same as safe-to-GC advance:

The window slides forward when a new max epoch is committed — this is fast and only depends on what the partition has seen.
The safe-to-GC watermark advances only after a local processor (in Cloud Topics: the Reconciler) has confirmed it has finished with all data up to the previous max — this is slow and depends on actual processing.

Verbatim from the post:

"the computation of M(p) is actually a bit stricter than what we described before; that's because reconciler progress gives the final word on which epochs are safe to delete. So while the window itself slides forward as soon as a new epoch appears, we only advance the safe epoch once we're sure all the L0 data up to that point has been reconciled into L1."

The third field is the decoupling point between fast-advancing admission-control state and slow-advancing reclamation-safety state. Without it, either:

Admission control would be coupled to reconciliation progress (and the leadership-change stall would return), or
Safe-to-GC would advance based on observed-max alone (and the Reconciler's catch-up condition would not be respected, leading to deletes of still-in-use data).

What the window protects against¶

Failure mode	Strict-rejection design	Sliding window
Leadership change with stale epoch cache	Minutes of write-stall	Continues accepting writes within window
In-flight writes during epoch advance	Rejected	Accepted (within window)
Out-of-order epoch observations from peers	Rejected	Accepted (within window)
Epoch below `previous_applied_epoch`	Rejected (correct)	Rejected (correct)
Epoch above `max_applied_epoch`	Triggers window slide	Triggers window slide

The window is not a tolerance for unbounded staleness — writes with epoch below previous_applied_epoch are still rejected. It's a tolerance for recent staleness, sized to the cache-coherence gap (typically seconds-to-tens-of-seconds during leadership transitions).

Monotonicity invariant¶

Both the strict-rejection design and the sliding-window design preserve the monotonicity invariant the epoch-based GC technique relies on:

max_applied_epoch is monotonically non-decreasing (only advances on commit of a strictly greater value).
min_epoch_lower_bound is monotonically non-decreasing (only advances on Reconciler catch-up confirmation).
M(p) = prev(min_epoch_lower_bound) is monotonically non-decreasing.

Therefore the clusterwide M = min(M(p)) is monotonically non-decreasing, and once an object's epoch falls below M it remains below M forever. The post calls this out: "once we prove some M is safe, it never becomes unsafe."

Why a replicated state machine, not just cached state¶

Each partition's GC state lives in its Raft log, not in a broker-local cache. This buys:

Durability: state survives broker restarts and crashes.
Fencing: a stale leader cannot publish a stale M(p) after the partition has moved on — Raft term-fencing already prevents this.
Atomicity with admission control: the same Raft log entry that advances max_applied_epoch is what admits the write. No race between "window says yes" and "epoch advanced while I was deciding."
Free leadership-change behaviour: a new leader replays the Raft log to recover the same window state. No external cache to warm.

See patterns/per-partition-rsm-for-gc-tracking.

Sliding window vs other watermark mechanisms¶

Mechanism	What's tracked	What advances it	Stall on leadership change
Single-epoch tracker (rejected)	`max_observed_epoch`	New observation	Yes — stale cache stalls writes
Sliding-window epoch tracking	`[previous_applied, max_applied, min_epoch_lower_bound]`	Observation + processor catch-up	No — window absorbs cache lag
Last Reconciled Offset	Per-partition offset watermark	Reconciler progress	No — sibling shape, but for read routing
Stream-processing watermark	Event-time low-water-mark	Source progress + window-close	No — but different correctness model

Cloud Topics uses both sliding-window epoch tracking (for GC admission and safe-to-delete) and Last Reconciled Offset (for read-routing between L0 and L1) — they're sibling per-partition watermarks serving different consumers. The window-and-watermark shape recurs at multiple altitudes within Cloud Topics.

Seen in¶

sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection — canonical wiki instance. Per-partition replicated state machine embedded in the partition's Raft log; three fields (max_applied_epoch, previous_applied_epoch, min_epoch_lower_bound); admission control rejects epochs below the window; safe-to-GC watermark M(p) = prev(min_epoch_lower_bound) advances only on Reconciler catch-up.

concepts/cluster-epoch — the global counter the window tracks against.
concepts/epoch-based-distributed-gc — the GC technique this mechanism publishes the local watermark for.
concepts/garbage-collection — the parent concept.
concepts/last-reconciled-offset — sibling per-partition watermark in Cloud Topics, serving read-routing instead of GC.
patterns/per-partition-rsm-for-gc-tracking — the RSM-in-Raft-log pattern this mechanism inhabits.
patterns/lazy-aggregate-from-monotonic-local-state — the global-aggregation pattern that consumes the published M(p).
systems/redpanda-cloud-topics — the canonical system.