PATTERN Cited by 1 source
Per-partition RSM for GC tracking¶
Pattern¶
When a sharded system needs each shard to publish a per-shard
garbage-collection watermark M(p), embed the GC state machine
into the shard's existing replicated consensus log as a
dedicated replicated state machine (RSM), rather than maintaining
GC state in a broker-local cache, an external metadata store, or a
separate Raft group. The RSM inherits the shard's existing
durability, leader-fencing, and HA properties for free, and
admission-control rules for the GC primitive (e.g. "reject writes
outside the active epoch window") co-locate atomically with the
write path itself.
Canonicalised by Redpanda Cloud Topics for L0 GC tracking:
"Each Cloud Topic partition maintains this sliding epoch window through a dedicated replicated state machine embedded in the partition's Raft log." (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)
Mechanics¶
┌─────────────────────────────────────────────────────────────┐
│ Partition p (Raft group) │
│ │
│ Existing Raft log: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ entry: data record │ │
│ │ entry: data record │ │
│ │ entry: GC RSM transition (epoch advance) │ │ ← co-located
│ │ entry: data record │ │
│ │ entry: GC RSM transition (reconciler catch-up) │ │ ← co-located
│ │ entry: data record │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Apply: deterministic state machine │
│ │ │
│ ▼ │
│ Per-partition state: │
│ max_applied_epoch ────┐ │
│ previous_applied_epoch ├─► M(p) = prev(min_epoch_lower_bound) │
│ min_epoch_lower_bound ──┘ (published to dissemination)│
│ │
│ Admission control reads from same state: │
│ if write.epoch ∉ [prev, max]: reject │
└─────────────────────────────────────────────────────────────┘
The RSM is deterministic — the same log replayed on a fresh replica produces the same state. This is the standard Raft RSM shape; the contribution of the pattern is using it for GC state specifically.
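The apply-and-replay shape can be sketched in a few lines of Python. This is an illustrative model, not Redpanda's implementation: the `(kind, epoch)` entry encoding and the names `GcState`, `apply`, and `replay` are hypothetical, `prev(e)` is assumed to mean `e - 1`, and the reconciler's catch-up marker is assumed to carry the epoch it has caught up to.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Hypothetical entry encoding: (kind, epoch). Real entries would be
# serialized Raft batches; only the GC-relevant kinds are modelled here.
Entry = Tuple[str, int]


@dataclass(frozen=True)
class GcState:
    max_applied_epoch: int = 0
    previous_applied_epoch: int = 0
    min_epoch_lower_bound: int = 0

    def watermark(self) -> int:
        """M(p) = prev(min_epoch_lower_bound); prev(e) is assumed to be e - 1."""
        return self.min_epoch_lower_bound - 1


def apply(state: GcState, entry: Entry) -> GcState:
    """Pure transition function: same state and entry always give the same result."""
    kind, epoch = entry
    if kind == "epoch_advance" and epoch > state.max_applied_epoch:
        # Slide the window: the old maximum becomes the previous epoch.
        return GcState(epoch, state.max_applied_epoch, state.min_epoch_lower_bound)
    if kind == "reconciler_caught_up":
        # Assumption: the lower bound only ever moves forward.
        bound = max(state.min_epoch_lower_bound, epoch)
        return GcState(state.max_applied_epoch, state.previous_applied_epoch, bound)
    # Data records and stale epoch advances leave the GC state untouched.
    return state


def replay(log: Iterable[Entry]) -> GcState:
    """Fold the whole log into GC state, as a fresh replica would."""
    state = GcState()
    for entry in log:
        state = apply(state, entry)
    return state


log = [
    ("data", 0),
    ("epoch_advance", 5),
    ("data", 0),
    ("epoch_advance", 7),
    ("reconciler_caught_up", 5),
    ("data", 0),
]
state = replay(log)
```

Because `apply` has no inputs other than the current state and the log entry, any replica that replays the same log ends up with an identical `GcState`.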
What this buys you¶
| Property | Per-partition RSM (this pattern) | Broker-local cache | External metadata store | Dedicated Raft group |
|---|---|---|---|---|
| Durability | Free (Raft log persisted) | Lost on broker restart | Yes | Yes |
| Leader fencing | Free (Raft term-fencing) | Cache-coherence problem | Custom | Yes |
| Atomicity with write path | Free (same Raft log) | Race window | Cross-system 2PC | Cross-Raft 2PC |
| Replay from log | Free (existing replay) | N/A | Custom | Yes |
| Operational overhead | Zero | Low | High (separate service) | Medium (extra Raft group) |
| Failure modes | Inherits partition's | New ones | New ones | New ones |
The pattern's load-bearing property is the atomicity-with-write-path row. The Cloud Topics admission-control rule, "reject anything older than the window before entering the replication pipeline", works because the window state and the write are co-located in the same Raft log. There is no race in which the window advances between the admission check and the commit.
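A simplified sketch of why co-location removes the race, assuming a single-threaded leader that applies each entry at the moment it appends it (real replication latency is not modelled). The `PartitionLeader` class and its methods are hypothetical names. The check follows the window rule from the diagram above; treating writes newer than `max_applied_epoch` as rejected, rather than as a trigger to advance the window, is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartitionLeader:
    previous_applied_epoch: int
    max_applied_epoch: int
    # One ordered log holds both data records and GC transitions.
    log: List[Tuple[str, int]] = field(default_factory=list)

    def advance_epoch(self, epoch: int) -> None:
        """Append an epoch-advance transition and slide the window."""
        if epoch > self.max_applied_epoch:
            self.log.append(("epoch_advance", epoch))
            self.previous_applied_epoch = self.max_applied_epoch
            self.max_applied_epoch = epoch

    def try_write(self, epoch: int) -> bool:
        """Admit a write only if its epoch lies inside [previous, max]."""
        if not (self.previous_applied_epoch <= epoch <= self.max_applied_epoch):
            return False  # rejected before it reaches the log
        # The check and the append see the same window state: no other
        # log entry can be ordered between them.
        self.log.append(("data", epoch))
        return True


leader = PartitionLeader(previous_applied_epoch=5, max_applied_epoch=7)
accepted_in_window = leader.try_write(6)
rejected_too_old = leader.try_write(4)
leader.advance_epoch(9)           # window is now [7, 9]
rejected_after_slide = leader.try_write(5)
accepted_after_slide = leader.try_write(8)
```

With a broker-local cache or an external store, the window read and the log append would be two separate operations, and the window could move between them.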
The leadership-change advantage¶
When partition leadership moves, the new leader replays the Raft log to recover all state — including the GC RSM. This means:
- No external cache to warm. A broker-local cache requires a synchronisation step on leadership change. The RSM doesn't.
- No coherence protocol. The Cloud Topics post calls out the alternative: "Rather than approaching this as a distributed cache coherence problem (hard!), we can bake resilience to this epoch lag right into the algorithm." The RSM-in-Raft-log shape is what makes baking resilience into the algorithm viable.
- Same correctness as the pre-existing partition. Whatever the partition's existing RPO and consistency properties are, the GC RSM has them too.
This composes with concepts/sliding-window-epoch-tracking (the specific RSM contents Cloud Topics uses) to give the "leadership-change does not stall writes" property.
Composition with patterns/epoch-stamp-on-object-id-for-gc¶
The RSM publishes M(p). The stamp pattern stamps each object
with its creation epoch. Together they form the per-shard half of
epoch-based distributed GC:
Stamp pattern Per-partition RSM Lazy aggregation
──────────── ────────────────── ─────────────────
obj.epoch ◄ ─ ─ ─ ─ M(p) = prev(min_lb) ─ ─ ► M = min(M(p))
(at creation) (in Raft log) (over all shards)
│
▼
if obj.epoch ≤ M:
delete(obj)
The third pattern, patterns/lazy-aggregate-from-monotonic-local-state, handles the M(p) → M aggregation.
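The three-way composition in the diagram reduces to a `min` over the published per-partition watermarks followed by a comparison against each object's stamped epoch. The sketch below assumes the watermarks have already been collected into a dictionary; `StoredObject`, `global_watermark`, and `collectable` are hypothetical names, and how M(p) values are disseminated is out of scope here.

```python
from typing import Dict, Iterable, List, NamedTuple


class StoredObject(NamedTuple):
    object_id: str
    epoch: int  # creation epoch stamped on the object


def global_watermark(per_partition: Dict[str, int]) -> int:
    """M = min over all shards of M(p)."""
    if not per_partition:
        # With no reports there is no safe bound, so refuse to guess.
        raise ValueError("no per-partition watermarks published yet")
    return min(per_partition.values())


def collectable(objects: Iterable[StoredObject], watermark: int) -> List[StoredObject]:
    """Objects whose stamped epoch is at or below M are safe to delete."""
    return [obj for obj in objects if obj.epoch <= watermark]


published = {"p0": 4, "p1": 9, "p2": 6}  # hypothetical M(p) values
objects = [
    StoredObject("obj-a", 3),
    StoredObject("obj-b", 4),
    StoredObject("obj-c", 5),
]
m = global_watermark(published)
to_delete = collectable(objects, m)
```

The slowest partition bounds the global watermark: here `p0` holds M at 4, so `obj-c` survives even though the other partitions have moved past epoch 5.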
Constraints¶
The pattern applies cleanly when:
- Each shard already has a replicated consensus log. Cloud Topics has Raft per partition. Without an existing per-shard log, the pattern's operational-overhead entry becomes "add a per-shard Raft group just for GC", which is usually not worth it.
- The GC state is small and bounded. A handful of integer fields. If the GC state were large (e.g. per-object reference counts), the storage and replication overhead would erode the pattern's benefits.
- Admission control wants the same atomicity. The pattern shines when the write path's admission rules read from the GC state. If the GC state is purely an output (no admission control), simpler designs work too.
- The local processor that drives M(p) advancement can commit to the log. In Cloud Topics: the Reconciler emits a "caught up" marker that the RSM consumes. Without a well-defined catch-up signal, min_epoch_lower_bound (or its equivalent) has nothing to advance from.
Anti-patterns¶
- Broker-local cache only. Loses durability and creates a cache-coherence problem on leadership change.
- Separate Raft group for GC. Doubles the number of Raft groups and introduces a cross-group commit problem: admission control must be kept in sync with writes across two Raft groups.
- Polling external metadata store on every write. Per-write network round-trip on the hot path; not viable at streaming throughput.
- GC state computed on demand from the data log. Forces a full log scan on every leadership change or every GC sweep. RSM-as-materialised-view is the answer.
Relationship to similar shapes¶
- Last reconciled offset (concepts/last-reconciled-offset) — sibling per-partition watermark in Cloud Topics, used for read routing instead of GC. Likely lives in the same Raft log via the same pattern.
- Kafka consumer group offset commits — a per-partition state written into a special internal topic. Same shape (per-partition state, durable, replicated) but in a different log substrate.
- Postgres xmin/xmax versioning — per-row visibility state alongside data. Same locality property (state with the data) but not RSM-shaped.
Seen in¶
- sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection — canonical wiki instance. Per-partition RSM in each Cloud Topic partition's Raft log; tracks [max_applied_epoch, previous_applied_epoch, min_epoch_lower_bound]; M(p) = prev(min_epoch_lower_bound) is the published per-partition safe-to-GC watermark; admission control rejects writes outside the [previous_applied_epoch, max_applied_epoch] window before they enter the replication pipeline.
Related¶
- concepts/cluster-epoch — the primitive the RSM tracks.
- concepts/epoch-based-distributed-gc — the parent technique this pattern is the per-shard half of.
- concepts/sliding-window-epoch-tracking — the specific RSM shape Cloud Topics uses.
- patterns/epoch-stamp-on-object-id-for-gc — sibling pattern: the stamping half of the GC technique.
- patterns/lazy-aggregate-from-monotonic-local-state — sibling pattern: the global-aggregation half.
- systems/redpanda-cloud-topics — the canonical system.