CONCEPT Cited by 2 sources
Garbage collection (storage)¶
Definition¶
In immutable / append-only storage systems, garbage collection (GC) is the stage that identifies which blobs, rows, or objects are no longer referenced and marks them safe to remove. GC does not free disk space on its own — that's the job of compaction.
The separation is explicit in Magic Pocket's framing:
Garbage collection identifies blobs that are no longer referenced and marks them as safe to remove, but it does not free space on its own. Compaction performs the physical reclamation. Because volumes cannot be modified once closed, we gather the live blobs from volumes, write them into new volumes, and retire the old ones. This is how deletes eventually translate into reusable space. (Source: sources/2026-04-02-dropbox-magic-pocket-storage-efficiency-compaction)
Why the two-stage split¶
- Reference tracking (GC) is a metadata-layer question — often requires traversing catalog state, user-facing reference tables, retention policies, snapshot holds, etc. Slow, correctness-critical, happens asynchronously.
- Physical reclamation (compaction) is a storage-layer question — rewrite live data, retire the old unit. Expensive in I/O, throttled, batched.
Coupling them would force every delete to pay the full reclamation cost synchronously, which is infeasible at the delete rates real systems see (Magic Pocket: millions per day). Decoupling lets both stages be tuned independently:
- GC cadence tracks reference-graph staleness tolerance.
- Compaction cadence tracks overhead tolerance.
Canonical instantiations¶
- Magic Pocket — GC marks blobs deleted; compaction (L1/L2/L3) physically reclaims volumes once enough marked-deleted blobs accumulate.
- LSM databases — tombstones mark deleted keys; compaction drops them during merges.
- Open table formats (Iceberg, Delta Lake) — manifest GC invalidates old Parquet files; physical delete is a separate managed-compaction step.
- Redpanda Cloud Topics L0
garbage collection — the decision of which intermediate L0
objects in object storage are safe to delete is made via
epoch-based distributed
GC rather than reference counting: every L0 object is stamped
at creation with a cluster epoch (a
monotonically increasing global counter), each partition tracks
a per-partition safe-to-GC watermark
M(p)in a Raft-replicated state machine using sliding-window epoch tracking, and the clusterwide safe-to-GC epochM = min(M(p))is computed lazily from disseminated per-partition watermarks. The post explicitly rejects distributed reference counting as introducing "an ocean of complexity" and emphasises "no central index, no shared state, and no coordinated updates" as the architectural property the technique delivers. (Source: sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection)
Reference counting vs epoch-based — the canonical distributed-GC choice¶
The Cloud Topics L0 GC post is the wiki's first explicit canonicalisation of reference counting vs epoch-based GC as a distributed-systems design choice:
| Reference counting | Epoch-based GC | |
|---|---|---|
| Question | "Does any consumer still reference this specific object?" | "Has the cluster moved past this epoch?" |
| Per-object state | The reference count | None (only the immutable creation stamp) |
| Update path | Counts updated as references are added/removed | No update — stamp is fixed |
| Coordination needs | Durable + linearizable count updates from anywhere | None — just dissemination of per-shard monotonic watermarks |
| Granularity | Per object | Per epoch cohort |
| Reclamation latency | Low (delete on count → 0) |
Higher (wait for epoch cohort) |
| Complexity at scale | High (the "ocean of complexity") | Low |
When the "is this safe to delete?" question is being answered across a distributed system without a central index, the trade-off is reclamation latency for coordination cost. Cloud Topics canonicalises picking the latency-cost side.
Seen in¶
- sources/2026-04-02-dropbox-magic-pocket-storage-efficiency-compaction — canonical two-stage statement (GC marks, compaction frees).
- sources/2026-05-19-redpanda-cloud-topics-level-zero-garbage-collection
— canonical distributed-GC instance using
epoch-based GC instead
of reference counting; cluster-epoch-stamped L0 objects + per-
partition Raft RSM watermarks + lazy
minaggregation; "no central index, no shared state, and no coordinated updates."