CONCEPT Cited by 1 source
Row tracking¶
Row tracking is a table-format feature that gives every row a stable, persistent identifier that survives table-level rewrites such as compaction, schema evolution, and file repositioning. Without row tracking, a row's "identity" is implicitly its (file, position) tuple, which changes whenever the storage layer rewrites the data.
Row tracking is the primitive that makes incremental processing efficient. A downstream consumer of a table (a CDC pipeline, an incremental materialized-view refresh, an ML feature recomputation job) needs to know which rows changed between commit A and commit B. Without stable row IDs, the consumer cannot distinguish a row that genuinely changed from a row that compaction merely rewrote with the same logical content.
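One way a consumer might answer "which rows changed between commit A and commit B" is sketched below. It assumes each row carries, next to its stable ID, the sequence number of the commit that last modified it logically. The names `TrackedRow`, `last_updated_seq`, and `rows_changed_between` are hypothetical and chosen for illustration; the announcing source does not describe the actual encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrackedRow:
    row_id: int            # hypothetical stable identifier
    last_updated_seq: int  # hypothetical: commit that last changed the row's content
    payload: str           # logical content


def rows_changed_between(rows: List[TrackedRow], from_seq: int, to_seq: int) -> List[TrackedRow]:
    """Return rows whose last logical modification falls in (from_seq, to_seq]."""
    return [r for r in rows if from_seq < r.last_updated_seq <= to_seq]


# Table state after commit 12. Commit 11 updated row 2 and inserted row 4.
# Commit 12 was a compaction: it rewrote files but left every
# last_updated_seq untouched, so it contributes nothing to the result.
table = [
    TrackedRow(row_id=1, last_updated_seq=5, payload="basic"),
    TrackedRow(row_id=2, last_updated_seq=11, payload="premium"),
    TrackedRow(row_id=3, last_updated_seq=7, payload="basic"),
    TrackedRow(row_id=4, last_updated_seq=11, payload="trial"),
]

changed_ids = [r.row_id for r in rows_changed_between(table, from_seq=10, to_seq=12)]
```

This filter only surfaces rows that still exist at commit B. Deleted rows would need a separate signal, which this sketch does not model.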
The problem row tracking solves¶
Consider a table with 1 billion rows. A compaction job rewrites the underlying data files for storage efficiency without changing any logical row content. From the perspective of a naive incremental consumer that watches the snapshot diff:
- Without row tracking: the diff says "all 1 billion rows are different" (because every row's (file, position) coordinate changed). The consumer must re-process the entire table.
- With row tracking: the diff says "zero rows changed; this was a compaction". The consumer skips the no-op compaction commit entirely.
This generalises to any commit that touches more files than necessary: schema-evolution commits that rewrite some files for layout, partial UPDATE commits that touch a few files of a table, MERGE commits where most "matched" rows weren't actually modified. Without row tracking, the consumer over-reports change. With row tracking, the consumer sees only the genuinely changed rows.
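The compaction scenario can be reproduced in miniature. The sketch below is a toy in-memory model, not any engine's implementation: `StoredRow`, `compact`, and the two diff functions are hypothetical names, and three rows stand in for the billion.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class StoredRow:
    row_id: int    # hypothetical stable identifier
    file: str      # data file currently holding the row
    position: int  # offset of the row inside that file
    payload: str   # logical content


def compact(rows: List[StoredRow], target_file: str) -> List[StoredRow]:
    """Rewrite all rows into one file; row_id and payload are preserved."""
    ordered = sorted(rows, key=lambda r: r.row_id)
    return [StoredRow(r.row_id, target_file, i, r.payload) for i, r in enumerate(ordered)]


def changed_by_position(before: List[StoredRow], after: List[StoredRow]) -> Set[tuple]:
    """Naive diff: a row's identity is its (file, position) coordinate."""
    old = {(r.file, r.position): r.payload for r in before}
    new = {(r.file, r.position): r.payload for r in after}
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def changed_by_row_id(before: List[StoredRow], after: List[StoredRow]) -> Set[int]:
    """Row-tracking diff: a row's identity is its stable row_id."""
    old = {r.row_id: r.payload for r in before}
    new = {r.row_id: r.payload for r in after}
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


before = [
    StoredRow(1, "a.parquet", 0, "alice"),
    StoredRow(2, "a.parquet", 1, "bob"),
    StoredRow(3, "b.parquet", 0, "carol"),
]
after = compact(before, "c.parquet")

position_diff = changed_by_position(before, after)  # every coordinate looks changed
row_id_diff = changed_by_row_id(before, after)      # empty: nothing changed logically
```

The same ID-keyed diff also covers the MERGE case: rows that were matched and rewritten with identical content compare equal under their stable ID and drop out of the result.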
Why it's an Iceberg v3 milestone¶
Delta Lake has had analogues of row tracking via the Delta Change Data Feed for years; Iceberg v3 brings parity by introducing row tracking at the Iceberg spec level. From the announcing source:
"row tracking supports more efficient incremental processing"
The architectural significance is twofold:
- Cross-format consumer parity. A consumer of an Iceberg-via-UniForm table now sees the same incremental-processing primitive it would see from a Delta table. This is part of the "these features also work seamlessly across both Delta and Iceberg tables" alignment.
- Unlocks materialized views exposed as Iceberg tables. The same announcement introduces materialized views that are exposed as Iceberg tables. Incremental refresh of those MVs depends on stable row identity in the source tables, which row tracking provides.
Use cases enabled¶
- CDC pipelines that consume an OTF table as their source no longer over-emit on compaction commits. (Compare patterns/cdf-incremental-replacing-full-rescan for the Delta-side analogue applied to Octopus Energy's margin pipeline.)
- Materialized-view incremental refresh becomes tractable on tables that experience frequent compaction. Without row tracking, every compaction would invalidate the MV's incremental-refresh state.
- ML feature recomputation triggered by row-level changes (not file-level changes) is now well-defined. Pinterest-style user-sequence pipelines that consume an OTF table as their event source benefit directly.
- Audit / lineage trails at row granularity become feasible — a row's lineage history is anchored by its stable identifier rather than its ephemeral (file, position).
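To illustrate the materialized-view case, here is a minimal sketch of an incremental refresh for a per-customer SUM, assuming the consumer receives row-level changes keyed by stable row ID. The change format (`row_id -> new row, or None for a delete`) and all names are hypothetical; real engines expose changes differently.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

Row = Tuple[str, float]  # (customer, amount)


def apply_changes(
    view_totals: Dict[str, float],
    source_by_id: Dict[int, Row],
    changes: Dict[int, Optional[Row]],
) -> None:
    """Fold row-level changes into a per-customer SUM without rescanning the source."""
    for row_id, new_row in changes.items():
        old_row = source_by_id.get(row_id)
        if old_row is not None:
            # Retract the old contribution of an updated or deleted row.
            customer, amount = old_row
            view_totals[customer] -= amount
        if new_row is not None:
            # Add the contribution of an inserted or updated row.
            customer, amount = new_row
            view_totals[customer] += amount
            source_by_id[row_id] = new_row
        else:
            source_by_id.pop(row_id, None)


source = {1: ("acme", 10.0), 2: ("acme", 5.0), 3: ("globex", 7.0)}
totals = defaultdict(float, {"acme": 15.0, "globex": 7.0})

# One commit: row 2 updated, row 3 deleted, row 4 inserted.
apply_changes(totals, source, {2: ("acme", 8.0), 3: None, 4: ("initech", 2.0)})

# A compaction commit produces no row-level changes, so the refresh is a no-op.
apply_changes(totals, source, {})
```

Because the retract-and-add step is keyed by row ID, the view's state stays valid across commits that only move rows between files.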
Caveats¶
- Spec-level details deferred. The announcing source names the feature but does not specify the row-ID encoding, generation strategy (sequential / hash / UUID), or migration path for tables that pre-date row tracking. See the Iceberg v3 spec for protocol detail.
- Storage overhead. Stable per-row IDs add per-row metadata; the announcement discloses no overhead numbers.
- Engine-side support varies. The row-tracking feature is GA on Databricks; other Iceberg engines need v3-aware readers to benefit.
- Interaction with deletion vectors and VARIANT undisclosed. How a row's tracked identity interacts with deletion-vector marking, schema-evolved columns, or VARIANT-typed fields is not addressed in the announcing source.
Seen in¶
- sources/2026-05-28-databricks-advancing-apache-iceberg-on-databricks-iceberg-v3-ga-open-sharing-and-unified-governance — GA announcement of Iceberg v3 row tracking on Databricks. The feature is named alongside deletion vectors and VARIANT as one of three v3 primitives; the "more efficient incremental processing" role is stated; no mechanism depth.
Related¶
- systems/iceberg-v3 — v3 milestone introducing row tracking to the Iceberg spec.
- systems/apache-iceberg — parent table format.
- systems/delta-lake — sibling format; CDF is its incremental-processing substrate.
- concepts/delta-change-data-feed — Delta's row-level incremental-processing primitive.
- concepts/change-data-capture — the umbrella consumer pattern row tracking supports.
- patterns/cdf-incremental-replacing-full-rescan — Delta-side application of incremental processing on a multi-terabyte substrate.
- concepts/open-table-format — umbrella concept.