CONCEPT Cited by 1 source
Deletion vector¶
A deletion vector is a compact file-level representation of "these row positions in this data file are logically deleted", used by lakehouse table formats to mark rows as absent without rewriting the underlying data file.
When a row is deleted from a table that uses deletion vectors, the storage layer:
- Identifies the data file(s) containing the row(s) and their position(s) within those files.
- Writes (or updates) a small deletion vector file that records the deleted row positions — typically as a roaring bitmap or similar compact set encoding.
- Updates the table snapshot to associate the data file with its deletion vector.
- Leaves the original data file untouched.
At read time, the engine fetches both the data file and its associated deletion vector and applies the vector to skip the marked rows.
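The write and read paths described above can be sketched in a few lines. This is a minimal illustration, not any engine's implementation: `DataFile`, `Snapshot`, `delete_rows`, and `scan` are hypothetical names, and a plain `frozenset` of row positions stands in for the compact bitmap encoding a real table format would use.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    rows: list  # treated as immutable once written


@dataclass
class Snapshot:
    # Hypothetical association: data file path -> deleted row positions.
    deletion_vectors: dict = field(default_factory=dict)


def delete_rows(snapshot, data_file, predicate):
    """Return a new snapshot that marks matching rows deleted; the data file is not modified."""
    positions = {i for i, row in enumerate(data_file.rows) if predicate(row)}
    existing = snapshot.deletion_vectors.get(data_file.path, frozenset())
    new_vectors = dict(snapshot.deletion_vectors)
    new_vectors[data_file.path] = existing | positions
    return Snapshot(new_vectors)


def scan(snapshot, data_file):
    """Read the data file and skip positions listed in its deletion vector."""
    deleted = snapshot.deletion_vectors.get(data_file.path, frozenset())
    return [row for i, row in enumerate(data_file.rows) if i not in deleted]


f = DataFile("part-0000.parquet", [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}])
s0 = Snapshot()
s1 = delete_rows(s0, f, lambda r: r["id"] % 2 == 0)
print(scan(s1, f))  # rows with even ids are skipped
print(len(f.rows))  # the data file still holds all four rows
```

Because the delete produces a new snapshot rather than mutating the old one, a reader pinned to `s0` still sees every row, which mirrors how snapshot-based table formats isolate readers from concurrent deletes.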
This is the alternative to copy-on-write for row-level deletes — see concepts/copy-on-write-merge. CoW would require rewriting the entire data file just to remove a few rows; the deletion vector reduces that to a small bitmap write.
Why it matters¶
- Drastic reduction in write amplification for sparse deletes. Deleting 100 rows from a 1 GB Parquet file under copy-on-write means rewriting roughly 1 GB of data; under deletion vectors, it means writing a few-KB bitmap. For UPDATE / MERGE workloads that touch a small number of rows in each of many large files, the savings compound.
- Low-latency deletes / updates / merges. Operations that previously required rewriting whole data files can now commit after writing only a small bitmap and a snapshot metadata update.
- Compaction-friendly. The deletion vector + data file pair can be coalesced asynchronously by a compaction job that produces a new clean file with the deletions absorbed — moving the rewrite cost off the foreground write path entirely.
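The write-amplification argument is simple arithmetic. The sketch below uses the 1 GB file from the example above and assumes a 4 KiB deletion vector as a stand-in for "a few KB"; both constants and the helper names are illustrative assumptions, not measurements from any source.

```python
# All figures are illustrative assumptions, not benchmarks.
FILE_BYTES = 1 * 1024**3  # the 1 GB data file from the example
DV_BYTES = 4 * 1024       # assumed "few KB" bitmap; real size depends on the encoding


def bytes_written_copy_on_write(file_bytes: int) -> int:
    # Copy-on-write rewrites every surviving row, i.e. roughly the whole file.
    return file_bytes


def bytes_written_deletion_vector(dv_bytes: int) -> int:
    # Only the small deletion vector is written; the data file is untouched.
    return dv_bytes


ratio = bytes_written_copy_on_write(FILE_BYTES) / bytes_written_deletion_vector(DV_BYTES)
print(f"copy-on-write writes about {ratio:,.0f}x more bytes per file")  # 262,144x

# The gap scales linearly with the number of files a MERGE touches.
files_touched = 50  # hypothetical workload
saved_bytes = files_touched * (FILE_BYTES - DV_BYTES)
print(f"bytes not rewritten across {files_touched} files: {saved_bytes:,}")
```

The exact ratio matters less than its order of magnitude: under these assumptions the foreground write shrinks by roughly five orders of magnitude per file, and the deferred rewrite is paid later by compaction.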
The merge-on-read read-side cost¶
Deletion vectors are a form of concepts/merge-on-read: the read side has to do extra work to materialise the table's logical state. Specifically:
- Each scan must fetch and decode the deletion vector for every relevant data file.
- The engine must apply the bitmap to each data file's rows during read.
- File-level metadata (row counts, statistics) must be interpreted with the vector applied — naive count(*) over the data file alone overcounts.
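The three read-side steps above can be illustrated with a toy bitmap. This sketch uses a Python integer as the bitset purely for brevity; it is not the roaring-bitmap encoding, and `make_bitmap`, `is_deleted`, `live_rows`, and `live_row_count` are hypothetical helper names.

```python
def make_bitmap(positions):
    """Build a toy bitmap: bit i is set when row position i is deleted."""
    bitmap = 0
    for p in positions:
        bitmap |= 1 << p
    return bitmap


def is_deleted(bitmap, position):
    return ((bitmap >> position) & 1) == 1


def live_rows(rows, bitmap):
    """Apply the bitmap during the scan, skipping deleted positions."""
    return [r for i, r in enumerate(rows) if not is_deleted(bitmap, i)]


def live_row_count(file_record_count, bitmap):
    """Correct the file-level row count by the bitmap's cardinality."""
    return file_record_count - bin(bitmap).count("1")


rows = ["a", "b", "c", "d", "e"]
bitmap = make_bitmap([1, 3])

print(live_rows(rows, bitmap))            # ['a', 'c', 'e']
print(len(rows))                          # 5: the naive count from file metadata
print(live_row_count(len(rows), bitmap))  # 3: the count with the vector applied
```

The last two lines show the overcounting hazard: a count answered from the data file's metadata alone reports 5, while the logical table contains 3 rows.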
For most analytical workloads this read-side cost is small; for very high-frequency scans of files with large deletion vectors, periodic compaction is required to prevent vector growth from degrading read performance.
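A compaction pass that absorbs a deletion vector might look like the following sketch. The trigger shown here, a fixed deleted-row fraction, is an assumption chosen for illustration; the source does not disclose how any real optimizer decides when to compact, and `should_compact` and `compact` are hypothetical names.

```python
def should_compact(record_count, deleted_count, max_deleted_fraction=0.2):
    """Hypothetical policy: compact once the deleted fraction reaches a threshold."""
    return record_count > 0 and deleted_count / record_count >= max_deleted_fraction


def compact(rows, deleted_positions):
    """Rewrite the file without deleted rows; the new file needs no deletion vector."""
    clean_rows = [r for i, r in enumerate(rows) if i not in deleted_positions]
    return clean_rows, set()


rows = list(range(10))
deleted = {0, 1, 2}

if should_compact(len(rows), len(deleted)):
    new_rows, new_deleted = compact(rows, deleted)
    print(new_rows)     # [3, 4, 5, 6, 7, 8, 9]
    print(new_deleted)  # set()
```

After compaction, readers of the new file skip the fetch-and-apply step entirely, which is the read-side benefit that periodic compaction buys back.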
Iceberg v3 vs Delta Lake¶
Both Apache Iceberg and Delta Lake support deletion vectors:
- Delta Lake introduced deletion vectors first; they have been part of the Delta protocol since 2023.
- Iceberg v3 adds deletion vectors to the Iceberg spec; Iceberg v3 support reached GA on Databricks on 2026-05-28.
The 2026-05-28 announcement explicitly emphasises cross-format interoperability: "deletion vectors accelerate updates, merges, and deletes; row tracking supports more efficient incremental processing; and VARIANT provides a standard representation for semi-structured data. These features also work seamlessly across both Delta and Iceberg tables, enabling interoperability without rewriting data." (Source: sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance)
The cross-format interoperability is non-trivial: it means a deletion vector written by a Delta engine can be interpreted by an Iceberg engine reading the same physical files (under UniForm or similar bridges), and vice versa. This is what makes the joint Iceberg-v4 + Delta-5.0 format co-evolution direction architecturally tractable.
Caveats¶
- On-disk format details not in the announcing source. The 2026-05-28 announcement names deletion vectors as a v3 feature but does not document the on-disk format (bitmap encoding, file naming, snapshot-association mechanism). Refer to the Iceberg v3 spec and Delta's deletion-vector spec for details.
- Compaction policy interaction undisclosed. The announcement does not specify how Predictive Optimization decides when to compact files with deletion vectors into clean rewritten files.
- No quantitative numbers. No write-amp / read-amp / storage-overhead benchmarks in the source.
- Position-delete vs equality-delete. Iceberg historically also supported "equality deletes" (delete-by-predicate-match). Deletion vectors are position-based and are typically an improvement over the older position-delete files; the relationship between position-deletes and equality-deletes in v3 is not addressed in the announcing source.
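The difference between the position-based and equality-based deletes mentioned in the last caveat can be shown on a toy file. This is a conceptual sketch only: the function names are hypothetical, and neither function reflects the on-disk format of any spec.

```python
rows = [
    {"id": 1, "region": "eu"},
    {"id": 2, "region": "us"},
    {"id": 3, "region": "eu"},
]


def apply_position_deletes(rows, positions):
    """Position-based: the writer already resolved which row positions are deleted."""
    return [r for i, r in enumerate(rows) if i not in positions]


def apply_equality_deletes(rows, delete_keys):
    """Equality-based: the reader drops rows whose columns equal any delete key."""
    def matches(row, key):
        return all(row.get(col) == val for col, val in key.items())

    return [r for r in rows if not any(matches(r, key) for key in delete_keys)]


by_position = apply_position_deletes(rows, {0, 2})
by_equality = apply_equality_deletes(rows, [{"region": "eu"}])
print(by_position == by_equality)  # both leave only the "us" row
```

The two approaches place the matching work differently: a position-based delete requires the writer to locate the affected rows up front, while an equality-based delete defers the column comparison to every subsequent read.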
Seen in¶
- sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance — GA disclosure for Iceberg v3 deletion vectors on Databricks. Named as one of the three v3 features that "close important gaps between performance and interoperability"; cross-format compatibility with Delta deletion vectors disclosed verbatim. No mechanism depth.
Related¶
- systems/iceberg-v3 — the v3 milestone in which Iceberg adds deletion vectors.
- systems/apache-iceberg — parent table format.
- systems/delta-lake — sibling format with deletion-vector support.
- concepts/merge-on-read — the family of update strategies deletion vectors implement.
- concepts/copy-on-write-merge — the alternative strategy deletion vectors avoid for sparse deletes.
- concepts/open-table-format — umbrella concept.
- concepts/format-co-evolution-iceberg-delta — the broader Iceberg-v4 + Delta-5.0 alignment direction.