CONCEPT Cited by 1 source

Delta Change Data Feed

Definition

Delta Change Data Feed (CDF) is a Delta Lake capability that exposes the row-level change events — inserts, updates, deletes — that occurred between two versions of a Delta table, as a structured stream a downstream pipeline can consume. Conceptually it is change data capture for a Delta table, but instead of tailing a database transaction log it reads Delta's own commit history.

CDF is the layer-transition primitive for the Medallion architecture when promotion from Bronze → Silver → Gold should be change-driven rather than full-rescan: a downstream Silver job subscribes to "only the rows that changed since I last ran" on the Bronze table, so compute cost and latency scale with the volume of change rather than with the size of the table.

Why it matters

Without CDF, a Bronze→Silver promotion has two unattractive options:

  • Full rescan — read every Bronze row, recompute every Silver row. Correct but expensive at scale.
  • Bespoke watermark logic — store a high-watermark column, query rows above it. Brittle: doesn't capture deletes, fails on out-of-order updates, has to be maintained per-table.
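
To make the watermark failure concrete, here is a toy, pure-Python model (no Delta involved). The `Row` type, the `updated_at` column, and the sample data are all hypothetical, chosen only to illustrate the two gaps named above: deletes and out-of-order updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str
    value: int
    updated_at: int  # hypothetical high-watermark column


def rows_above_watermark(table: dict, watermark: int) -> list:
    """Bespoke watermark query: rows whose updated_at exceeds the watermark."""
    hits = (r for r in table.values() if r.updated_at > watermark)
    return sorted(hits, key=lambda r: r.key)


# Table state at the previous run, which stored watermark = 10.
bronze = {
    "a": Row("a", 1, updated_at=5),
    "b": Row("b", 2, updated_at=8),
}

# Since then: "a" is updated, "b" is deleted, and "c" arrives late
# carrying an old timestamp (an out-of-order update).
bronze["a"] = Row("a", 100, updated_at=12)
del bronze["b"]
bronze["c"] = Row("c", 3, updated_at=9)

seen_keys = [r.key for r in rows_above_watermark(bronze, watermark=10)]
print(seen_keys)  # only "a": the delete of "b" and the late "c" are missed
```

In this model the watermark query reports one of three changes. A change feed keyed on commit version rather than on a data column would report all three, because every commit is recorded regardless of the timestamps inside the rows.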

CDF makes the change stream a first-class table feature: the pipeline asks Delta for "changes from version N to version M" and gets a structured DataFrame with the change type and the before/after row state. The promotion job becomes a streaming or micro-batch read against the change feed.
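
As a rough sketch of what "changes from version N to version M" looks like with the Delta Lake reader in PySpark: the option names `readChangeFeed`, `startingVersion`, and `endingVersion` are Delta's documented reader options, while the helper names and the table name are illustrative assumptions. The sketch assumes an existing SparkSession is passed in and that the table already has the `delta.enableChangeDataFeed` property set to `true`.

```python
from typing import Any, Optional


def cdf_read_options(start_version: int, end_version: Optional[int] = None) -> dict:
    """Reader options for a batch read of the change feed over a version range."""
    opts = {"readChangeFeed": "true", "startingVersion": str(start_version)}
    if end_version is not None:
        # Omitting endingVersion reads up to the latest available version.
        opts["endingVersion"] = str(end_version)
    return opts


def read_changes(spark: Any, table: str, start_version: int,
                 end_version: Optional[int] = None) -> Any:
    """Return a DataFrame of change rows for `table`.

    `spark` is assumed to be an existing SparkSession with Delta Lake configured.
    """
    return (
        spark.read.format("delta")
        .options(**cdf_read_options(start_version, end_version))
        .table(table)
    )
```

Calling `read_changes(spark, "bronze.raw_evidence", 12, 20)` (hypothetical table name) would return change rows for versions 12 through 20 inclusive. Each row carries the table's columns plus `_change_type` (`insert`, `update_preimage`, `update_postimage`, or `delete`), `_commit_version`, and `_commit_timestamp`.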

Composition with the audit chain

When CDF is paired with schema evolution and Delta time travel, the result is an unbreakable chain of custody between layers:

"From there, a promotion pipeline — reading from Delta Change Data Feed (CDF) — dynamically applies a mapping registry to transform raw evidence into a governed, canonical schema. By utilizing Delta Lake's schema evolution and time travel, Claroty maintains an unbreakable chain of custody; every asset record is traceable back to its original raw artifact and the specific mapping version that classified it." (Source: sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library)

The interesting move is the mapping registry: the classifier logic that turns Bronze raw evidence into the Silver canonical schema is itself versioned. CDF tells the promotion pipeline which rows changed; time travel + schema evolution tells the auditor which mapping version classified each row. Both lineages — data and classifier — are queryable.
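
One way to picture the two lineages is a promotion step that stamps both references onto every Silver record. The sketch below is pure Python and entirely hypothetical: the registry contents, the field names (`bronze_id`, `bronze_commit_version`, `mapping_version`), and the sample payload are assumptions for illustration, not Claroty's actual schema or code.

```python
from typing import Callable

# Hypothetical registry: mapping version -> function from raw payload to canonical fields.
MAPPING_REGISTRY: dict = {
    "v1": lambda raw: {"asset_type": raw.get("type", "unknown")},
    "v2": lambda raw: {
        "asset_type": raw.get("type", "unknown").lower(),
        "vendor": raw.get("vendor"),
    },
}


def promote(change_row: dict, mapping_version: str) -> dict:
    """Classify one Bronze change row and stamp both lineages on the result."""
    classify: Callable = MAPPING_REGISTRY[mapping_version]
    canonical = classify(change_row["payload"])
    return {
        **canonical,
        # Data lineage: which raw artifact, at which Bronze commit.
        "bronze_id": change_row["id"],
        "bronze_commit_version": change_row["_commit_version"],
        # Classifier lineage: which mapping version produced this record.
        "mapping_version": mapping_version,
    }


silver_row = promote(
    {"id": "evt-1", "payload": {"type": "PLC", "vendor": "Acme"}, "_commit_version": 7},
    mapping_version="v2",
)
print(silver_row)
```

With both references stored on the record, an auditor can look up the raw artifact by `bronze_id` and re-run the named mapping version to reproduce the classification.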

Properties

  • Append-only friendly. Bronze in the Medallion shape is typically append-only (raw events captured in the order they arrived). CDF on an append-only table produces a clean "new rows" stream — no update/delete bookkeeping required.
  • Schema-evolution-tolerant. When Bronze adds a column, CDF surfaces the new column in the change events; downstream Silver can choose to map it or ignore it. The promotion pipeline doesn't break.
  • Replayable from time travel. The combination CDF + time travel means a downstream consumer can ask "replay all changes from version 1234 onwards" — useful for reprocessing after a Silver-side bug, without re-ingesting from source.
  • Decouples producer and consumer cadence. The Bronze writer commits at its own rate; the Silver consumer drains the change feed at its own rate. No tight coupling.
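
The replay and decoupling properties both come down to the consumer remembering the last commit version it applied. A minimal pure-Python sketch, assuming change rows have already been collected as dictionaries: the `_change_type` values and `_commit_version` column mirror CDF's metadata columns, while the `id` and `value` fields and the function name are hypothetical.

```python
def apply_changes(silver: dict, changes: list, last_version: int) -> int:
    """Apply change rows newer than last_version to `silver`, keyed by "id".

    Returns the highest commit version seen, to be stored as the next checkpoint.
    """
    newest = last_version
    # Process in commit order so that later changes win.
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        version = row["_commit_version"]
        if version <= last_version:
            continue  # already handled on an earlier run
        kind = row["_change_type"]
        if kind in ("insert", "update_postimage"):
            silver[row["id"]] = row["value"]
        elif kind == "delete":
            silver.pop(row["id"], None)
        # "update_preimage" carries the old state, so there is nothing to apply.
        newest = max(newest, version)
    return newest


changes = [
    {"id": "a", "value": 1, "_change_type": "insert", "_commit_version": 1},
    {"id": "b", "value": 2, "_change_type": "insert", "_commit_version": 1},
    {"id": "a", "value": 1, "_change_type": "update_preimage", "_commit_version": 2},
    {"id": "a", "value": 10, "_change_type": "update_postimage", "_commit_version": 2},
    {"id": "b", "value": 2, "_change_type": "delete", "_commit_version": 3},
]

# First run: drain everything from the start.
silver = {}
checkpoint = apply_changes(silver, changes, last_version=0)

# Replay: restore Silver to its state as of version 1, then reapply from version 2.
replayed = {"a": 1, "b": 2}
replay_checkpoint = apply_changes(replayed, changes, last_version=1)
```

Both runs end in the same Silver state, which is the point of replay: the consumer picks its own starting version and cadence, and the Bronze writer is not involved.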

Failure modes

  • CDF retention vs Delta retention. CDF is served from the table's commit history and change data files, both of which follow the table's retention policy; once VACUUM deletes old change data files, or log cleanup removes old commits, CDF for that version range becomes inaccessible. Retention must be sized to the longest expected consumer lag.
  • Schema-evolution mismatch. A consumer pinned to schema v1 reading change events that include v2 columns will need to handle the new fields gracefully — same backward-compat problem as any schema-evolution story.
  • Classifier lineage not automatic. CDF tells you which rows changed but does not capture which mapping version classified them. The application must persist the classifier-version reference (e.g., as a column on the Silver table). Without it, the audit chain is half-complete.
  • CDF doesn't replace CDC from upstream. CDF reports changes inside the Delta table; if the source of those changes is an upstream OLTP database, the Bronze ingestion layer still needs its own CDC mechanism (Debezium, Kafka connector, etc.) to land changes into Bronze in the first place.
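
Two of the failure modes above lend themselves to small defensive checks on the consumer side. The helpers below are a hedged pure-Python sketch, not Delta APIs: the function names and the idea of passing in an "earliest available version" are assumptions, and how that version is obtained is left to the pipeline.

```python
def missed_versions(last_processed: int, earliest_available: int) -> range:
    """Versions the consumer still needs but that are no longer readable.

    An empty range means the consumer's lag is within retention.
    """
    next_needed = last_processed + 1
    return range(next_needed, max(next_needed, earliest_available))


def project(change_row: dict, known_columns: list) -> dict:
    """Keep only the columns this consumer understands.

    Columns added upstream are ignored; columns missing from the row become None.
    """
    return {col: change_row.get(col) for col in known_columns}


# Retention check: the consumer stopped at version 40, but only 45+ is still readable.
gap = list(missed_versions(last_processed=40, earliest_available=45))

# Schema tolerance: a consumer pinned to two columns reads a row with a new third column.
v1_view = project({"id": "evt-1", "type": "PLC", "vendor": "Acme"}, ["id", "type"])
```

A non-empty gap is a signal to fall back to a full rescan or a re-ingest for the missing range, rather than to resume silently from the earliest surviving version.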

Adjacent concepts

  • concepts/change-data-capture — the broader pattern of emitting row-level changes from a stateful store; CDF is Delta's instantiation of CDC for itself.
  • concepts/medallion-architecture — CDF is the layer-transition primitive when promotions are change-driven rather than full-rescan.
  • concepts/schema-evolution · Delta time travel — the two Delta capabilities that compose with CDF to produce an end-to-end audit chain.

Seen in

  • sources/2026-05-23-databricks-scaling-for-mhhs-octopus-energy-50x-cost-reduction: Multi-terabyte upstream-substrate face. Distinct from the Bronze→Silver promotion face below. The Octopus Energy MHHS margin-pipeline rebuild applies CDF to the unified multi-grain source-of-truth layer (the shared upstream that all three grain-aligned streams read from), turning what was a full-overwrite-on-every-run pattern into change-driven incremental processing. Verbatim: "Delta Lake's Change Data Feed (CDF) made true incremental processing viable at this grain. Instead of complete overwrites, the pipeline now reads only records that have actually changed since the last run. The result: rows processed per run dropped from 25 billion to 300 million — a 98.8% reduction. Data freshness improved from weekly to daily." Single highest-leverage optimisation of the rebuild; the headline $1M annualised cost avoidance excludes this upstream-incremental win ("the full efficiency gain is larger"). Canonicalised as patterns/cdf-incremental-replacing-full-rescan.

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library: Canonical wiki source for CDF as a layer-transition primitive for ER pipelines. Claroty's CPS Library uses CDF to drive the Bronze→Silver promotion: "raw, heterogeneous JSON payloads" land in append-only Bronze Delta tables, and "a promotion pipeline — reading from Delta Change Data Feed (CDF) — dynamically applies a mapping registry to transform raw evidence into a governed, canonical schema." The mapping registry is itself versioned via Delta schema evolution + time travel, so every CPS-ID is traceable to both the raw artifact and the specific mapping version that produced the canonical record.

Last updated · 542 distilled / 1,571 read