CONCEPT Cited by 2 sources
Tombstone (deletion marker)¶
A tombstone is a special marker written in place of a deleted record in an eventually-consistent or append-only store. It says "this key used to exist; do not resurrect it" — without which, a gossip or LSM-tree replication protocol could re-insert a "deleted" value from a lagging replica.
Why deletes can't just delete¶
In a gossip-replicated store (Cassandra, Dynamo, Riak), replicas receive updates at different times. If node A processes DELETE(k), node B is partitioned, and during the partition node C gossips its older copy of k to A, then A has no way to tell the resurrected key from a brand-new write — and reinserts it.
Tombstones fix this by making the delete a first-class value in the causal order: once A has the tombstone, any older-version write for k loses the (generation, version) merge.
"The tombstone is a special entry to invalidate the data entries that have a matching key without actual deletion of the data. The gossip protocol deletes the data from a node using a tombstone." (Source: sources/2023-07-16-highscalability-gossip-protocol-explained)
Tombstone lifecycle¶
- Write — an
UPDATEorPUTwith a delete marker and a timestamp ≥ any concurrent write it must shadow. - Gossip propagation — the tombstone is just another versioned key-value; it replicates the same way live values do.
- Compaction / GC — tombstones can be physically evicted only after a grace period long enough that every replica has definitely seen them. (Cassandra's
gc_grace_seconds, default 10 days — if a partitioned node comes back after this window, it can zombie-resurrect deleted data.)
Costs¶
- Storage amplification — tombstones accumulate until GC fires.
- Read amplification — scans across a key range must filter tombstoned entries (Cassandra's infamous "tombstone warnings" in slow queries).
- Compaction work — LSM compactors must carry tombstones forward through level merges until the grace period elapses.
Tombstone alternatives in CRDTs¶
CRDT designs often encode deletion differently — as a negative increment (counter CRDTs), a removal set (OR-sets), or a causal context (epoch-marked writes). These avoid explicit tombstones but pay in metadata instead. See concepts/crdt for the broader design space; systems/cr-sqlite (underneath Fly's Corrosion) uses per-row causal metadata rather than a tombstone column.
Seen in¶
- sources/2023-07-16-highscalability-gossip-protocol-explained — definitional source within the gossip-protocol framework.
- sources/2024-09-19-netflix-netflixs-key-value-data-abstraction-layer — Netflix KV DAL names Cassandra's item-level-delete tombstone cost as a primary motivation for its TTL-with-jitter mitigation: "some storage engines (any store which defers true deletion) such as Cassandra struggle with high volumes of deletes due to tombstone and compaction overhead." Record-level and range deletes on KV DAL explicitly collapse to one tombstone; only item-level deletes fall back to TTL-jitter. Introduces the doomstone failure mode: a write with a future-dated idempotency-token timestamp on a store sensitive to future timestamps becomes an immutable "deletion record in the future" that blocks legitimate writes — KV DAL's server-side drift rejection is the guard.
Related¶
- concepts/gossip-protocol
- concepts/eventual-consistency
- concepts/crdt
- concepts/lsm-compaction
- concepts/ttl-based-deletion-with-jitter — the Netflix-KV-DAL mitigation for high-volume item-level-delete tombstone load.
- concepts/idempotency-token — future-dated tokens create doomstones on stores that treat deletion markers with future timestamps as un-overwriteable.
- systems/apache-cassandra
- systems/netflix-kv-dal