AutoCDC¶
AutoCDC is Databricks' declarative Change Data Capture API inside
Lakeflow Spark
Declarative Pipelines. Pipeline authors declare the semantics
they want (keys, sequence column, delete predicate, SCD type); the
runtime implements ordering, deduplication, history maintenance, late-
arriving-data handling, and reprocessing safety. It replaces
hand-rolled MERGE logic (typically 40–200+ lines with staging
tables, window functions, sequencing assumptions) with ~6–10 lines of
declarative definition per pipeline.
API surface¶
The primary entry point is `dp.create_auto_cdc_flow` from the
`pyspark.pipelines` module (imported as `dp`). Example for
SCD Type 1 from the source post:

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col, expr

@dp.view
def users():
    return spark.readStream.table("cdc_data.users")

dp.create_streaming_table("target")

dp.create_auto_cdc_flow(
    target="target",
    source="users",
    keys=["userId"],
    sequence_by=col("sequenceNum"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1
)
```
SCD Type 2 changes one parameter — `stored_as_scd_type=2` — and the
runtime manages `__START_AT` / `__END_AT` version columns,
automatically closing out active rows and inserting new versions in
the correct sequence.
(Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
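The SCD Type 2 history-maintenance step can be illustrated outside Spark. A minimal pure-Python sketch of the row-versioning idea (the `__START_AT` / `__END_AT` column names follow the post; everything else here is illustrative, not the AutoCDC implementation):

```python
# Sketch of SCD Type 2 row versioning: each change closes the currently
# open row for the key (sets __END_AT) and appends a new version row.
# Pure-Python illustration only; the Lakeflow runtime does the real work.

def apply_scd2_change(history, key, values, seq):
    """history: list of dicts with 'key', 'values', '__START_AT', '__END_AT'."""
    for row in history:
        if row["key"] == key and row["__END_AT"] is None:
            row["__END_AT"] = seq  # close out the active version
    history.append({"key": key, "values": values,
                    "__START_AT": seq, "__END_AT": None})
    return history

history = []
apply_scd2_change(history, "u1", {"email": "a@x.com"}, seq=1)
apply_scd2_change(history, "u1", {"email": "b@x.com"}, seq=2)
# history now holds two versions of u1: [1, 2) closed, [2, open) active
```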
Parameters as semantic declarations¶
| Parameter | Role |
|---|---|
| `target` | The output streaming table (usually a Delta table). |
| `source` | The input view (a streaming table, CDF, or snapshot source). |
| `keys` | Primary key column(s) for change matching. |
| `sequence_by` | The ordering column that defines logical event order, independent of arrival order. Load-bearing primitive for out-of-sequence CDC event handling. |
| `apply_as_deletes` | Predicate identifying delete events (e.g. `operation = 'DELETE'`). |
| `stored_as_scd_type` | `1` for overwrite-in-place (current state only); `2` for row-versioned history with `__START_AT` / `__END_AT` validity windows. |
Each parameter collapses what would otherwise be a block of hand-rolled MERGE-plus-window-function logic.
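The declared semantics can be paraphrased in plain Python: for each key, the event with the highest `sequence_by` value wins regardless of arrival order, and events matching the delete predicate remove the row. A sketch of the intent only (event shape assumed; the runtime compiles the declaration to an optimised MERGE rather than running anything like this):

```python
# Sketch of SCD Type 1 semantics: the last event per key by sequenceNum
# wins, regardless of arrival order; DELETE events remove the key.
# Illustrative paraphrase only, not the runtime's implementation.

def apply_scd1(events, key="userId", sequence_by="sequenceNum"):
    latest = {}  # key value -> winning event
    for e in events:
        k = e[key]
        if k not in latest or e[sequence_by] > latest[k][sequence_by]:
            latest[k] = e
    # apply_as_deletes: operation = 'DELETE' drops the row from the target
    return {k: e for k, e in latest.items() if e["operation"] != "DELETE"}

events = [
    {"userId": 1, "sequenceNum": 2, "operation": "UPDATE", "name": "Bo"},
    {"userId": 1, "sequenceNum": 1, "operation": "INSERT", "name": "Al"},  # late arrival
    {"userId": 2, "sequenceNum": 3, "operation": "DELETE", "name": None},
]
apply_scd1(events)  # userId 1 resolves to the seq-2 row; userId 2 is deleted
```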
Input modes¶
AutoCDC handles three distinct input shapes, all via the same API surface:
- Change Data Feed (CDF) source — the input stream includes a per-row
  `operation` column (INSERT / UPDATE / DELETE) and a sequencing column.
- CDF source, SCD Type 2 — as above, but `stored_as_scd_type=2` triggers
  history-table maintenance with versioned rows.
- Snapshot source — the input is a sequence of whole-table snapshots. The
  runtime computes row-level inserts / updates / deletes between snapshots
  and applies them incrementally; no hand-rolled diff logic required.
  Canonicalised on the wiki as snapshot-diff inference CDC.
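The snapshot-diff idea can be sketched in plain Python. This is an illustration of the concept only; note that it infers deletion from row absence, which is an assumption — the post does not specify the runtime's actual deletion rule (see Caveats):

```python
# Sketch of snapshot-diff inference: compare two whole-table snapshots
# keyed by primary key and emit row-level INSERT / UPDATE / DELETE events.
# Deletion-from-absence is an assumption of this sketch, not documented
# runtime behaviour.

def diff_snapshots(prev, curr, key):
    prev_by_key = {row[key]: row for row in prev}
    curr_by_key = {row[key]: row for row in curr}
    changes = []
    for k, row in curr_by_key.items():
        if k not in prev_by_key:
            changes.append(("INSERT", row))
        elif row != prev_by_key[k]:
            changes.append(("UPDATE", row))
    for k, row in prev_by_key.items():
        if k not in curr_by_key:
            changes.append(("DELETE", row))
    return changes

snap1 = [{"userId": 1, "name": "Al"}, {"userId": 2, "name": "Bo"}]
snap2 = [{"userId": 1, "name": "Ann"}, {"userId": 3, "name": "Cy"}]
diff_snapshots(snap1, snap2, "userId")
# -> one UPDATE (userId 1), one INSERT (userId 3), one DELETE (userId 2)
```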
Properties inherited from Lakeflow SDP¶
AutoCDC composes with the Lakeflow SDP runtime; the semantic guarantees below come from SDP, not AutoCDC itself:
- Incremental progress tracking across restarts.
- Out-of-sequence arrival handling (AutoCDC uses `sequence_by` to
  re-establish logical order).
- Reprocessing safety — historical data can be replayed without
  double-applying.
- Schema evolution — upstream column additions do not break the pipeline.
- Failure recovery without losing changes.
"Lakeflow Spark Declarative Pipelines automatically tracks incremental progress and handles out-of-sequence data. Pipelines can recover from failures, reprocess historical data, and evolve over time without double-applying or losing changes." (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
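One reading of why reprocessing is safe under this model: if the target state is a function of the maximum `sequence_by` value seen per key, replaying history cannot double-apply. A toy demonstration of that reading (an assumption of this sketch; the runtime's actual mechanism is checkpoint-based, not full recomputation):

```python
# Toy demonstration of reprocessing safety: with sequence-based conflict
# resolution, replaying the same history is a no-op, because state depends
# on the max sequence number per key, not on how many times events arrive.
# Illustrative only; not the runtime's checkpointing mechanism.

def resolve(events):
    latest = {}
    for e in events:
        k = e["userId"]
        if k not in latest or e["sequenceNum"] > latest[k]["sequenceNum"]:
            latest[k] = e
    return latest

history = [
    {"userId": 1, "sequenceNum": 1, "name": "Al"},
    {"userId": 1, "sequenceNum": 2, "name": "Ann"},
]
assert resolve(history) == resolve(history + history)  # replay is idempotent
```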
Relationship to MERGE INTO¶
AutoCDC does not replace MERGE INTO; it layers on top. The
source post states this explicitly:
"While MERGE INTO remains a foundational Spark primitive, AutoCDC builds on it to handle out-of-sequence data and incremental processing more efficiently as data volumes grow."
The win is at the authoring layer — pipeline authors declare semantics once, the runtime chooses the optimal MERGE implementation, and Databricks Runtime-level optimisations (e.g. the Nov 2025 improvements disclosed in the post) propagate to all AutoCDC pipelines automatically. Hand-rolled MERGE with bespoke window functions does not benefit.
Performance disclosure (vendor-measured)¶
Databricks Runtime improvements disclosed in the 2026-04-22 post, measured since November 2025:
- SCD Type 1: 71% better performance per dollar.
- SCD Type 2: 96% better performance per dollar.
No absolute throughput (rows/sec), latency (p50/p99), or cost-per-TB figures are disclosed. The numbers are Databricks-measured and Databricks-disclosed; no external reproduction has been published.
Named adopters¶
| Adopter | Use case | Quote |
|---|---|---|
| Navy Federal Credit Union | Large-scale real-time event processing, billions of application events/day | "The simplicity of the Spark Declarative Pipelines programming model combined with its service capabilities resulted in an incredibly fast turnaround time." (Jian Zhou, Senior Engineering Manager) |
| Block | Streaming pipelines on Delta Lake | "The time required to define and develop a streaming pipeline has gone from days to hours." (Yue Zhang, Staff SWE, Data Foundations) |
| Valora Group | Swiss foodvenience retail analytics, master data CDC | "We gained a lot by doing CDC in SDP, because you don't write any code — it's all abstracted in the background. AutoCDC minimizes the number of lines… it's so easy to do." (Alexane Rose, Data and AI Architect) |
All three are in regulated verticals (banking, payments, retail), where CDC correctness is load-bearing.
Why it matters for sysdesign¶
AutoCDC canonicalises the declarative API surface for CDC — a
problem class (MERGE logic for out-of-order updates, deletes,
late-arriving data, idempotency) that was historically always
hand-rolled on Spark/Iceberg/Delta. The wiki's
declarative-vs-
imperative stream API axis gains a new instance: unlike Zalando's
Flink-SQL → DataStream-API rewrite (where declarative lost on the
10% of state-amplifying workloads), the CDC workload is squarely in
the 90%-declarative-wins camp. The architectural move is to bound
the correctness envelope of the API so Runtime-level improvements
can be applied universally.
Caveats¶
- Declarative boundary not formally specified. The post doesn't name which CDC workloads don't fit AutoCDC — which 10% the abstraction leaks on. Candidates: multi-column sequencing, tombstone-vs-missing deletion ambiguity in snapshot mode, conflicting per-key SCD-type policies in the same table.
- Snapshot-diff semantics underspecified in public docs. How deletion is inferred (missing row? tombstone?), how schema drift across snapshots is handled, what memory footprint full-snapshot diffing has at TB scale — not disclosed in the post.
- Vendor-gated. Databricks-only; no open-source runtime equivalent. Migration away requires returning to hand-rolled MERGE or re-implementing sequencing logic in another declarative pipeline system.
Seen in¶
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines
— first and only wiki source on AutoCDC. Databricks'
product-engineering post positions AutoCDC as the declarative
alternative to hand-rolled MERGE for CDC and SCD pipelines, with
three input modes (CDF SCD1, CDF SCD2, snapshot-diff), four named
semantic parameters (`keys`, `sequence_by`, `apply_as_deletes`,
`stored_as_scd_type`), and 71% / 96% perf-per-dollar improvements
disclosed for the respective SCD types since Nov 2025. Named
adopters: Navy Federal Credit Union, Block, Valora Group. First
canonical wiki system page for a declarative CDC API surface.
Related¶
- systems/lakeflow-spark-declarative-pipelines — host runtime
- systems/delta-lake — typical storage target
- systems/databricks — parent platform
- systems/databricks-genie-code — AI-assisted client generating AutoCDC declarations
- systems/apache-spark — underlying compute engine; MERGE INTO as the underlying primitive
- concepts/change-data-capture — parent concept
- concepts/slowly-changing-dimension — parent concept
- concepts/snapshot-diff-inference-cdc — input mode
- concepts/out-of-sequence-cdc-event-handling — semantic primitive
- patterns/declarative-cdc-over-hand-rolled-merge — the canonical pattern AutoCDC embodies
- patterns/merge-into-over-insert-overwrite — the prior-art hand-rolled pattern AutoCDC displaces at the authoring layer