Skip to content

CONCEPT Cited by 1 source

Sliding-window epoch tracking

Definition

Sliding-window epoch tracking is the per-shard mechanism that publishes a monotonically non-decreasing safe-to-GC watermark M(p) for use in epoch-based distributed garbage collection, by maintaining a range of active epochs [previous_applied, max_applied] rather than a single tracked epoch. The window slides forward as new epochs are observed, but a separate watermark advances only after a local processor confirms catch-up, decoupling window advance from safe-watermark advance.

The shape is canonicalised by Redpanda Cloud Topics for L0 garbage-collection state in each partition's Raft log. (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)

The motivating failure mode

The naive design tracks one epoch per partition — the maximum across all observed events, with anything older rejected. Monotonicity preserved by construction. The Cloud Topics post explains why this is too strict:

"Our initial design tracked a single epoch per partition, the max across all produced placeholder batches, and rejected anything older on the replication path. This trivially supports both invariants, but it's too strict in practice. If partition leadership moves to a node with a stale epoch cache, we'll reject every new write until cache expiry, which could be minutes away. Not ideal."

The failure mode: leadership change to a node with a stale view of the current epoch causes every new write to be rejected until that node's cache catches up. Operationally — minutes of write-stall on every leadership transition.

The relaxation

Instead of a single epoch, maintain a sliding window:

"Rather than approaching this as a distributed cache coherence problem (hard!), we can bake resilience to this epoch lag right into the algorithm. Local to each partition, we maintain a sliding window of active epochs. When we see a new epoch for the first time, slide the window forward. We still get monotonicity by construction, but we gain some flexibility to accept writes that were in flight when the window moved."

The window "slides" — its lower edge moves forward when its upper edge does — but it accepts any write whose epoch falls inside the range, not just one matching the upper edge. This converts a distributed-cache-coherence problem ("how do we synchronously inform a new leader of the current epoch?") into an in-algorithm admission-control rule ("accept writes within the window we advertise").

State machine fields

The Cloud Topics instance uses three fields, embedded in a replicated state machine in each partition's Raft log:

Field Advance trigger
max_applied_epoch when a strictly greater epoch is committed
previous_applied_epoch when we apply a new max_applied_epoch
min_epoch_lower_bound when reconciler catches up to max_applied_epoch

Active range: [previous_applied_epoch, max_applied_epoch]. Writes with epoch outside this range are rejected before entering the replication pipeline.

The published per-shard safe-to-GC watermark:

M(p) = prev(min_epoch_lower_bound)

Where prev(x) is the predecessor of x in the epoch number line (typically x - 1).

Why three fields, not two

A single field max_applied plus rejection-of-older-than-max gives the strict design that fails on leadership-change. Two fields [previous_applied, max_applied] give the sliding-window admission control. Why a third field?

Because window advance is not the same as safe-to-GC advance:

  • The window slides forward when a new max epoch is committed — this is fast and only depends on what the partition has seen.
  • The safe-to-GC watermark advances only after a local processor (in Cloud Topics: the Reconciler) has confirmed it has finished with all data up to the previous max — this is slow and depends on actual processing.

Verbatim from the post:

"the computation of M(p) is actually a bit stricter than what we described before; that's because reconciler progress gives the final word on which epochs are safe to delete. So while the window itself slides forward as soon as a new epoch appears, we only advance the safe epoch once we're sure all the L0 data up to that point has been reconciled into L1."

The third field is the decoupling point between fast-advancing admission-control state and slow-advancing reclamation-safety state. Without it, either:

  • Admission control would be coupled to reconciliation progress (and the leadership-change stall would return), or
  • Safe-to-GC would advance based on observed-max alone (and the Reconciler's catch-up condition would not be respected, leading to deletes of still-in-use data).

What the window protects against

Failure mode Strict-rejection design Sliding window
Leadership change with stale epoch cache Minutes of write-stall Continues accepting writes within window
In-flight writes during epoch advance Rejected Accepted (within window)
Out-of-order epoch observations from peers Rejected Accepted (within window)
Epoch below previous_applied_epoch Rejected (correct) Rejected (correct)
Epoch above max_applied_epoch Triggers window slide Triggers window slide

The window is not a tolerance for unbounded staleness — writes with epoch below previous_applied_epoch are still rejected. It's a tolerance for recent staleness, sized to the cache-coherence gap (typically seconds-to-tens-of-seconds during leadership transitions).

Monotonicity invariant

Both the strict-rejection design and the sliding-window design preserve the monotonicity invariant the epoch-based GC technique relies on:

  • max_applied_epoch is monotonically non-decreasing (only advances on commit of a strictly greater value).
  • min_epoch_lower_bound is monotonically non-decreasing (only advances on Reconciler catch-up confirmation).
  • M(p) = prev(min_epoch_lower_bound) is monotonically non-decreasing.

Therefore the clusterwide M = min(M(p)) is monotonically non-decreasing, and once an object's epoch falls below M it remains below M forever. The post calls this out: "once we prove some M is safe, it never becomes unsafe."

Why a replicated state machine, not just cached state

Each partition's GC state lives in its Raft log, not in a broker-local cache. This buys:

  • Durability: state survives broker restarts and crashes.
  • Fencing: a stale leader cannot publish a stale M(p) after the partition has moved on — Raft term-fencing already prevents this.
  • Atomicity with admission control: the same Raft log entry that advances max_applied_epoch is what admits the write. No race between "window says yes" and "epoch advanced while I was deciding."
  • Free leadership-change behaviour: a new leader replays the Raft log to recover the same window state. No external cache to warm.

See patterns/per-partition-rsm-for-gc-tracking.

Sliding window vs other watermark mechanisms

Mechanism What's tracked What advances it Stall on leadership change
Single-epoch tracker (rejected) max_observed_epoch New observation Yes — stale cache stalls writes
Sliding-window epoch tracking [previous_applied, max_applied, min_epoch_lower_bound] Observation + processor catch-up No — window absorbs cache lag
Last Reconciled Offset Per-partition offset watermark Reconciler progress No — sibling shape, but for read routing
Stream-processing watermark Event-time low-water-mark Source progress + window-close No — but different correctness model

Cloud Topics uses both sliding-window epoch tracking (for GC admission and safe-to-delete) and Last Reconciled Offset (for read-routing between L0 and L1) — they're sibling per-partition watermarks serving different consumers. The window-and-watermark shape recurs at multiple altitudes within Cloud Topics.

Seen in

  • sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection — canonical wiki instance. Per-partition replicated state machine embedded in the partition's Raft log; three fields (max_applied_epoch, previous_applied_epoch, min_epoch_lower_bound); admission control rejects epochs below the window; safe-to-GC watermark M(p) = prev(min_epoch_lower_bound) advances only on Reconciler catch-up.
Last updated · 542 distilled / 1,571 read