CONCEPT Cited by 1 source
Snapshot-diff inference CDC
Snapshot-diff inference CDC is the class of CDC ingestion where the upstream source does not emit a native change data feed; it emits periodic whole-table snapshots, and the CDC consumer infers inserts / updates / deletes by comparing consecutive snapshots.
Why it matters
Not every source system has a native change-log mechanism. Common reasons:
- The source database doesn't support CDC (or the team doesn't control the upstream database and can't enable it).
- The source is a vendor export (e.g. a partner's daily CSV / JSON dump).
- The source is an append-heavy table where the consumer only sees a full refresh.
In these cases, teams traditionally hand-roll snapshot-diff logic:
load both snapshots into staging tables, compute LEFT ANTI JOIN for
deletes, INNER JOIN + value-comparison for updates, LEFT ANTI
JOIN the other direction for inserts, then apply deltas with
MERGE INTO. This is the most error-prone flavour of hand-rolled
CDC — deletion inference is especially subtle (is a missing row a
delete, or a snapshot that was filtered upstream?), and the memory
footprint of diffing two full snapshots at TB scale is a standing
concern.
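The hand-rolled steps above can be sketched in plain Python, with dictionaries keyed by primary key standing in for staging tables. This is a minimal illustration under stated assumptions, not any particular team's pipeline: `diff_snapshots`, `apply_delta`, and the sample rows are hypothetical names introduced here, and the set operations play the role of the anti joins and the inner join.

```python
from typing import Any, Dict, Hashable, NamedTuple

Row = Dict[str, Any]
Snapshot = Dict[Hashable, Row]  # primary key -> row (assumes unique keys)


class Delta(NamedTuple):
    inserts: Snapshot
    updates: Snapshot
    deletes: Snapshot


def diff_snapshots(prev: Snapshot, curr: Snapshot) -> Delta:
    """Infer row-level changes between two whole-table snapshots."""
    # Keys only in the new snapshot: anti join of curr against prev.
    inserts = {k: curr[k] for k in curr.keys() - prev.keys()}
    # Keys only in the old snapshot: anti join of prev against curr.
    deletes = {k: prev[k] for k in prev.keys() - curr.keys()}
    # Keys in both whose rows differ: inner join plus value comparison.
    updates = {k: curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]}
    return Delta(inserts, updates, deletes)


def apply_delta(target: Snapshot, delta: Delta) -> Snapshot:
    """Stand-in for MERGE INTO: drop deletes, then upsert inserts and updates."""
    merged = {k: v for k, v in target.items() if k not in delta.deletes}
    merged.update(delta.inserts)
    merged.update(delta.updates)
    return merged


# Hypothetical sample data: two consecutive daily snapshots.
day1 = {1: {"name": "Ada", "tier": "gold"}, 2: {"name": "Bo", "tier": "free"}}
day2 = {1: {"name": "Ada", "tier": "platinum"}, 3: {"name": "Cy", "tier": "free"}}

delta = diff_snapshots(day1, day2)
rebuilt = apply_delta(day1, delta)
```

Note that the sketch treats every missing key as a delete, which is exactly the assumption that breaks when the upstream snapshot was filtered.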
The declarative alternative
AutoCDC inside Lakeflow Spark Declarative Pipelines treats snapshot-diff inference as a first-class input mode:
"AutoCDC treats snapshot-based CDC as a first-class pattern, automatically detecting row-level changes between snapshots and applying them incrementally without requiring custom diff logic or state management." (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
The pipeline author presents snapshots; the runtime derives deltas and applies them under the declared SCD type. For SCD Type 2, this means closing out previously-active rows and inserting new versions with updated validity windows, without requiring multi-step MERGE logic or custom snapshot state tracking.
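To make the SCD Type 2 behaviour concrete, the following is a minimal pure-Python sketch of what "closing out previously-active rows and inserting new versions" means when a new snapshot arrives. It is not AutoCDC's implementation or API; the function `apply_snapshot_scd2` and the record fields `key`, `row`, `valid_from`, and `valid_to` are hypothetical names chosen for illustration, and a missing key is assumed to mean the row was deleted.

```python
from typing import Any, Dict, Hashable, List

Row = Dict[str, Any]


def apply_snapshot_scd2(
    history: List[dict], snapshot: Dict[Hashable, Row], as_of: int
) -> List[dict]:
    """Return a new history after applying `snapshot`, taken at time `as_of`.

    Each history record is {"key", "row", "valid_from", "valid_to"};
    valid_to None marks the currently active version of a key.
    """
    result = []
    unchanged = set()
    for rec in history:
        if rec["valid_to"] is None:
            if snapshot.get(rec["key"]) == rec["row"]:
                unchanged.add(rec["key"])  # still active, nothing to do
            else:
                # Row changed or disappeared: close the validity window.
                rec = {**rec, "valid_to": as_of}
        result.append(rec)
    # Open a new version for every key that is new or changed.
    for key, row in snapshot.items():
        if key not in unchanged:
            result.append(
                {"key": key, "row": row, "valid_from": as_of, "valid_to": None}
            )
    return result


# Hypothetical snapshots taken at times 1 and 2.
h1 = apply_snapshot_scd2([], {1: {"tier": "gold"}, 2: {"tier": "free"}}, as_of=1)
h2 = apply_snapshot_scd2(h1, {1: {"tier": "platinum"}, 3: {"tier": "free"}}, as_of=2)

active = {r["key"]: r["row"] for r in h2 if r["valid_to"] is None}
closed = [(r["key"], r["valid_from"], r["valid_to"]) for r in h2 if r["valid_to"] is not None]
```

After the second snapshot, key 1 has a closed gold version and an active platinum version, key 2 is closed with no successor, and key 3 is newly active.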
Distinguishing from the other CDC ingest modes
This wiki canonicalises three distinct CDC ingest shapes:
| Mode | Source surface | Example |
|---|---|---|
| Log-based capture | Native change log mined from the source DB's storage/replication layer | Postgres logical replication, MySQL binlog, MongoDB change streams, Spanner change streams, MSSQL change tables, Oracle LogMiner |
| Change data feed (CDF) | Source emits per-row operation + sequencing as a stream | Databricks Delta CDF, Kafka Connect CDC topics |
| Snapshot-diff inference | Source emits periodic whole-table snapshots; consumer infers deltas | Vendor exports, non-CDC-capable upstream DBs, full-refresh pipelines |
The three modes have different correctness properties. Log-based capture and CDF have deterministic, log-ordered semantics; snapshot-diff is inference-based and must make assumptions about missing-row semantics that the other two do not.
Hard problems snapshot-diff runtimes must solve
Even with a declarative API, snapshot-diff inference is not free — the runtime has to make choices that affect correctness:
- Deletion semantics. Missing row = delete, or filtered? If the upstream snapshot applies a WHERE filter that elides some rows, a naive implementation deletes them downstream. AutoCDC's policy on this is not disclosed in the source post.
- Schema drift between snapshots. If snapshot N+1 has a column snapshot N didn't, the runtime must decide whether that's a late-arriving column value (backfill) or a new column (evolve).
- Memory footprint at TB scale. Diffing two 1-TB snapshots naively requires holding both in memory; production implementations spill to disk, stream-sort by key, or use bloom filters.
- Ordering across multiple snapshots arriving out of sequence. If snapshots N, N+1, N+2 arrive in order N+2, N, N+1, the runtime must re-establish logical order before emitting deltas. This is conceptually similar to out-of-sequence CDC event handling, but at snapshot granularity rather than event granularity.
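Two of the problems above, memory footprint and out-of-sequence arrival, lend themselves to a short sketch. The code below is illustrative only and makes no claim about how AutoCDC or any other runtime works: `merge_diff` assumes both snapshots are already sorted by a unique key (for example after an external sort) and walks them in lockstep so that only one row per side is held at a time, while `in_sequence` assumes each snapshot carries a dense integer sequence number and buffers lightweight handles (such as file paths) until the next expected one arrives. Both function names are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Row = Tuple[Any, Any]  # (key, value); inputs sorted by key, keys unique


def merge_diff(prev: Iterable[Row], curr: Iterable[Row]) -> Iterator[Tuple[str, Any, Any]]:
    """Yield (op, key, value) by walking two key-sorted snapshots in lockstep."""
    prev_it, curr_it = iter(prev), iter(curr)
    p, c = next(prev_it, None), next(curr_it, None)
    while p is not None or c is not None:
        if c is None or (p is not None and p[0] < c[0]):
            # Key exists only in the older snapshot.
            yield ("delete", p[0], p[1])
            p = next(prev_it, None)
        elif p is None or c[0] < p[0]:
            # Key exists only in the newer snapshot.
            yield ("insert", c[0], c[1])
            c = next(curr_it, None)
        else:
            # Same key on both sides: emit only if the value changed.
            if p[1] != c[1]:
                yield ("update", c[0], c[1])
            p, c = next(prev_it, None), next(curr_it, None)


def in_sequence(arrivals: Iterable[Tuple[int, Any]], first: int = 0) -> Iterator[Tuple[int, Any]]:
    """Buffer out-of-order (sequence_number, handle) pairs; release them in order."""
    pending: Dict[int, Any] = {}
    expected = first
    for seq, handle in arrivals:
        pending[seq] = handle
        while expected in pending:
            yield expected, pending.pop(expected)
            expected += 1


# Hypothetical data: key-sorted snapshots and handles arriving as 2, 0, 1.
prev = [(1, "gold"), (2, "free"), (4, "free")]
curr = [(1, "platinum"), (3, "free"), (4, "free")]
changes = list(merge_diff(prev, curr))
ordered = list(in_sequence([(2, "snap-2"), (0, "snap-0"), (1, "snap-1")]))
```

The sketch does not address deletion semantics or schema drift; those depend on policy decisions that the source post does not describe.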
Seen in
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines: Databricks canonicalises snapshot-diff inference as a first-class AutoCDC input mode alongside native CDF sources. Quote from Fortune 500 Aerospace & Defense adopter: "I tried AutoCDC from Snapshots in Python and was amazed at how 4 lines of code could replace what I was doing in 1,500 lines of code before." This is the concrete code-reduction evidence for why declarative snapshot-diff wins over hand-rolled in production. First wiki source to pin snapshot-diff inference as a named CDC ingest mode.