
CONCEPT Cited by 1 source

Snapshot-diff inference CDC

Snapshot-diff inference CDC is the class of CDC ingestion where the upstream source does not emit a native change data feed; it emits periodic whole-table snapshots, and the CDC consumer infers inserts / updates / deletes by comparing consecutive snapshots.

Why it matters

Not every source system has a native change-log mechanism. Common reasons:

  • The source database doesn't support CDC (or the team doesn't control the upstream database and can't enable it).
  • The source is a vendor export (e.g. a partner's daily CSV / JSON dump).
  • The source is an append-heavy table where the consumer only sees a full refresh.

In these cases, teams traditionally hand-roll snapshot-diff logic: load both snapshots into staging tables, compute LEFT ANTI JOIN for deletes, INNER JOIN + value-comparison for updates, LEFT ANTI JOIN the other direction for inserts, then apply deltas with MERGE INTO. This is the most error-prone flavour of hand-rolled CDC — deletion inference is especially subtle (is a missing row a delete, or a snapshot that was filtered upstream?), and the memory footprint of diffing two full snapshots at TB scale is a standing concern.
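In miniature, that hand-rolled diff is three keyed comparisons over the old and new snapshots. A minimal in-memory sketch for intuition (the function name, key layout, and sample rows are illustrative; at warehouse scale this is staging tables plus MERGE INTO, not Python dicts):

```python
def diff_snapshots(prev, curr):
    """Infer row-level changes between two whole-table snapshots.

    prev/curr: dict mapping primary key -> row (a dict of column values).
    Returns (inserts, updates, deletes), mirroring the three join legs:
    keys only in curr (insert), keys in both with differing values
    (update), keys only in prev (delete).
    """
    inserts = {k: curr[k] for k in curr.keys() - prev.keys()}
    deletes = {k: prev[k] for k in prev.keys() - curr.keys()}
    updates = {k: curr[k] for k in curr.keys() & prev.keys()
               if curr[k] != prev[k]}
    return inserts, updates, deletes

# Two consecutive snapshots of the same table, keyed by primary key.
snap_n  = {1: {"name": "a", "qty": 5}, 2: {"name": "b", "qty": 1}}
snap_n1 = {1: {"name": "a", "qty": 7}, 3: {"name": "c", "qty": 2}}
ins, upd, dele = diff_snapshots(snap_n, snap_n1)
```

Note the sketch already embeds the subtle assumption called out above: any key absent from the newer snapshot is treated as a delete, with no way to distinguish an upstream filter.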

The declarative alternative

AutoCDC inside Lakeflow Spark Declarative Pipelines treats snapshot-diff inference as a first-class input mode:

"AutoCDC treats snapshot-based CDC as a first-class pattern, automatically detecting row-level changes between snapshots and applying them incrementally without requiring custom diff logic or state management." (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)

The pipeline author presents snapshots; the runtime derives deltas and applies them under the declared SCD type. For SCD Type 2, this means closing out previously-active rows and inserting new versions with updated validity windows, without requiring multi-step MERGE logic or custom snapshot state tracking.
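A toy illustration of what "closing out previously-active rows" means under SCD Type 2 (a hand-rolled sketch for intuition only; `Scd2Row`, `apply_scd2`, and the integer version stamps are assumptions of this example, not the AutoCDC API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scd2Row:
    key: int
    value: str
    valid_from: int                  # snapshot version that opened this row
    valid_to: Optional[int] = None   # None = currently active

def apply_scd2(dim, changes, version):
    """Apply snapshot-derived deltas to an SCD Type 2 history table.

    dim: list of Scd2Row history rows. changes: key -> new value, with
    None meaning an inferred delete. Each changed key has its active row
    closed at `version`; non-deletes then open a new version.
    """
    for key, new_value in changes.items():
        for row in dim:
            if row.key == key and row.valid_to is None:
                row.valid_to = version          # close the active row
        if new_value is not None:
            dim.append(Scd2Row(key, new_value, version))
    return dim

# History after snapshot 0, then the deltas inferred from snapshot 1:
dim = [Scd2Row(1, "a", 0), Scd2Row(2, "b", 0)]
apply_scd2(dim, {1: "a2", 2: None, 3: "c"}, version=1)
```

The point of the declarative mode is that this close-and-reopen bookkeeping, plus the snapshot state tracking around it, is handled by the runtime rather than written by the pipeline author.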

Distinguishing from the other CDC ingest modes

This wiki canonicalises three distinct CDC ingest shapes:

  • Log-based capture. Source surface: a native change log mined from the source DB's storage/replication layer. Examples: Postgres logical replication, MySQL binlog, MongoDB change streams, Spanner change streams, MSSQL change tables, Oracle LogMiner.
  • Change data feed (CDF). Source surface: the source emits per-row operation + sequencing as a stream. Examples: Databricks Delta CDF, Kafka Connect CDC topics.
  • Snapshot-diff inference. Source surface: the source emits periodic whole-table snapshots; the consumer infers deltas. Examples: vendor exports, non-CDC-capable upstream DBs, full-refresh pipelines.

The three modes have different correctness properties. Log-based capture and CDF have deterministic, log-ordered semantics; snapshot-diff is inference-based and must make assumptions about missing-row semantics that the other two do not.

Hard problems snapshot-diff runtimes must solve

Even with a declarative API, snapshot-diff inference is not free — the runtime has to make choices that affect correctness:

  • Deletion semantics. Missing row = delete, or filtered? If the upstream snapshot applies a WHERE filter that elides some rows, a naive implementation deletes them downstream. AutoCDC's policy on this is not disclosed in the source post.
  • Schema drift between snapshots. If snapshot N+1 has a column snapshot N didn't, the runtime must decide whether that's a late-arriving column value (backfill) or a new column (evolve).
  • Memory footprint at TB scale. Diffing two 1-TB snapshots naively requires holding both in memory; production implementations spill to disk, stream-sort by key, or use Bloom filters.
  • Ordering across multiple snapshots arriving out of sequence. If snapshots N, N+1, N+2 arrive in order N+2, N, N+1, the runtime must re-establish logical order before emitting deltas — conceptually similar to out-of-sequence CDC event handling, but at snapshot granularity rather than event granularity.
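One standard answer to the memory-footprint problem is a streaming sort-merge diff: deliver both snapshots as key-sorted streams and walk them in lockstep, holding only one row per side at a time. A sketch under that assumption (illustrative names; the spill-to-disk sort that produces the key order is presumed to have happened upstream):

```python
def sorted_diff(prev_rows, curr_rows):
    """Diff two snapshots delivered as key-sorted (key, row) iterables.

    Yields ('insert' | 'update' | 'delete', key, row) without ever
    materialising either snapshot in full: a key present only on the
    prev side is a delete, only on the curr side an insert, and on both
    sides with differing values an update.
    """
    prev_it, curr_it = iter(prev_rows), iter(curr_rows)
    p = next(prev_it, None)
    c = next(curr_it, None)
    while p is not None or c is not None:
        if c is None or (p is not None and p[0] < c[0]):
            yield ("delete", p[0], p[1])        # key vanished from curr
            p = next(prev_it, None)
        elif p is None or c[0] < p[0]:
            yield ("insert", c[0], c[1])        # key new in curr
            c = next(curr_it, None)
        else:
            if p[1] != c[1]:
                yield ("update", c[0], c[1])    # same key, changed value
            p = next(prev_it, None)
            c = next(curr_it, None)

prev = [(1, "a"), (2, "b"), (4, "d")]
curr = [(1, "a"), (3, "c"), (4, "e")]
deltas = list(sorted_diff(prev, curr))
```

Note this sketch only addresses the memory bullet; the deletion-semantics and schema-drift questions above are policy decisions layered on top of the diff, not solved by it.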

Seen in

  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — Databricks canonicalises snapshot-diff inference as a first-class AutoCDC input mode alongside native CDF sources. Quote from a Fortune 500 Aerospace & Defense adopter: "I tried AutoCDC from Snapshots in Python and was amazed at how 4 lines of code could replace what I was doing in 1,500 lines of code before." — the concrete code-reduction evidence for why declarative snapshot-diff wins over hand-rolled CDC in production. First wiki source to pin snapshot-diff inference as a named CDC ingest mode.