Skip to content

CONCEPT Cited by 2 sources

Tombstone (deletion marker)

A tombstone is a special marker written in place of a deleted record in an eventually-consistent or append-only store. It says "this key used to exist; do not resurrect it" — without which, a gossip or LSM-tree replication protocol could re-insert a "deleted" value from a lagging replica.

Why deletes can't just delete

In a gossip-replicated store (Cassandra, Dynamo, Riak), replicas receive updates at different times. If node A processes DELETE(k), node B is partitioned, and during the partition node C gossips its older copy of k to A, then A has no way to tell the resurrected key from a brand-new write — and reinserts it.

Tombstones fix this by making the delete a first-class value in the causal order: once A has the tombstone, any older-version write for k loses the (generation, version) merge.

"The tombstone is a special entry to invalidate the data entries that have a matching key without actual deletion of the data. The gossip protocol deletes the data from a node using a tombstone." (Source: sources/2023-07-16-highscalability-gossip-protocol-explained)

Tombstone lifecycle

  1. Write — an UPDATE or PUT with a delete marker and a timestamp ≥ any concurrent write it must shadow.
  2. Gossip propagation — the tombstone is just another versioned key-value; it replicates the same way live values do.
  3. Compaction / GC — tombstones can be physically evicted only after a grace period long enough that every replica has definitely seen them. (Cassandra's gc_grace_seconds, default 10 days — if a partitioned node comes back after this window, it can zombie-resurrect deleted data.)

Costs

  • Storage amplification — tombstones accumulate until GC fires.
  • Read amplification — scans across a key range must filter tombstoned entries (Cassandra's infamous "tombstone warnings" in slow queries).
  • Compaction work — LSM compactors must carry tombstones forward through level merges until the grace period elapses.

Tombstone alternatives in CRDTs

CRDT designs often encode deletion differently — as a negative increment (counter CRDTs), a removal set (OR-sets), or a causal context (epoch-marked writes). These avoid explicit tombstones but pay in metadata instead. See concepts/crdt for the broader design space; systems/cr-sqlite (underneath Fly's Corrosion) uses per-row causal metadata rather than a tombstone column.

Seen in

Last updated · 319 distilled / 1,201 read