AutoCDC

AutoCDC is Databricks' declarative Change Data Capture API inside Lakeflow Spark Declarative Pipelines. Pipeline authors declare the semantics they want (keys, sequence column, delete predicate, SCD type); the runtime implements ordering, deduplication, history maintenance, late-arriving-data handling, and reprocessing safety. It replaces hand-rolled MERGE logic (typically 40–200+ lines with staging tables, window functions, and sequencing assumptions) with ~6–10 lines of declarative definition per pipeline.

API surface

The primary entry point is dp.create_auto_cdc_flow from the pyspark.pipelines module (imported as dp). Example for SCD Type 1 from the source post:

from pyspark import pipelines as dp
from pyspark.sql.functions import col, expr

# Source view: stream the raw CDC feed. Each row carries an operation
# marker and a sequenceNum ordering column.
@dp.view
def users():
    return spark.readStream.table("cdc_data.users")

# Declare the output streaming table the flow will maintain.
dp.create_streaming_table("target")

dp.create_auto_cdc_flow(
    target="target",
    source="users",
    keys=["userId"],                                # primary key for change matching
    sequence_by=col("sequenceNum"),                 # logical event order, not arrival order
    apply_as_deletes=expr("operation = 'DELETE'"),  # rows matching this predicate are deletes
    stored_as_scd_type=1                            # SCD Type 1: keep current state only
)

SCD Type 2 changes one parameter — stored_as_scd_type=2 — and the runtime manages __START_AT / __END_AT version columns, automatically closing out active rows and inserting new versions in the correct sequence.

(Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
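A sketch of the Type 2 variant, reusing the imports and users view from the example above; the target name here is illustrative, not from the post:

dp.create_streaming_table("target_history")

# Same declaration as SCD Type 1 except for the final parameter; the runtime
# now maintains __START_AT / __END_AT windows, closing out the active row and
# inserting a new version for each change instead of overwriting in place.
dp.create_auto_cdc_flow(
    target="target_history",
    source="users",
    keys=["userId"],
    sequence_by=col("sequenceNum"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2
)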

Parameters as semantic declarations

  • target: the output streaming table (usually a Delta table).
  • source: the input view (a streaming table, CDF, or snapshot source).
  • keys: primary key column(s) for change matching.
  • sequence_by: the ordering column that defines logical event order, independent of arrival order; the load-bearing primitive for out-of-sequence CDC event handling.
  • apply_as_deletes: predicate identifying delete events (e.g. operation = 'DELETE').
  • stored_as_scd_type: 1 for overwrite-in-place (current state only); 2 for row-versioned history with __START_AT / __END_AT validity windows.

Each parameter collapses what would otherwise be a block of hand-rolled MERGE-plus-window-function logic.
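For contrast, a minimal sketch of the hand-rolled SCD Type 1 equivalent those parameters collapse. Staging-table and payload column names (users_cdc_staging, email) are illustrative, and a production version would still need staging orchestration, reprocessing safety, and late-event handling on top:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# keys -> the ON clause; sequence_by -> the ROW_NUMBER dedup plus the ordering
# guard; apply_as_deletes -> the DELETE branch; stored_as_scd_type=1 -> the
# overwrite-in-place shape of the WHEN clauses.
spark.sql("""
    MERGE INTO target t
    USING (
        SELECT userId, sequenceNum, operation, email
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY userId
                       ORDER BY sequenceNum DESC) AS rn
            FROM users_cdc_staging
        )
        WHERE rn = 1  -- newest event per key only
    ) s
    ON t.userId = s.userId
    WHEN MATCHED AND s.operation = 'DELETE' THEN
        DELETE
    WHEN MATCHED AND t.sequenceNum < s.sequenceNum THEN
        UPDATE SET t.email = s.email, t.sequenceNum = s.sequenceNum
    WHEN NOT MATCHED AND s.operation != 'DELETE' THEN
        INSERT (userId, sequenceNum, email)
        VALUES (s.userId, s.sequenceNum, s.email)
""")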

Input modes

AutoCDC handles three distinct input shapes, all via the same API surface:

  1. Change Data Feed (CDF) source — the input stream includes per-row operation (INSERT / UPDATE / DELETE) and a sequencing column.
  2. CDF source, SCD Type 2 — as above, but stored_as_scd_type=2 triggers history-table maintenance with versioned rows.
  3. Snapshot source — the input is a sequence of whole-table snapshots. The runtime computes row-level inserts / updates / deletes between snapshots and applies them incrementally; no hand-rolled diff logic required (see the sketch after this list). Canonicalised on the wiki as snapshot-diff inference CDC.
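
The post covers snapshot mode but does not show its call. Databricks' pipelines Python reference names a companion entry point, create_auto_cdc_from_snapshot_flow; the sketch below assumes the dp namespace exposes it the same way, with illustrative table names:

from pyspark import pipelines as dp

dp.create_streaming_table("target_from_snapshots")

# Snapshot-diff inference: the runtime diffs each new whole-table snapshot
# against the previous one and derives row-level inserts / updates / deletes.
# No sequence_by here: ordering comes from snapshot versions, not an event
# column. Table names are hypothetical, not from the post.
dp.create_auto_cdc_from_snapshot_flow(
    target="target_from_snapshots",
    source="cdc_data.users_snapshots",  # periodically refreshed full snapshot
    keys=["userId"],
    stored_as_scd_type=2                # keep versioned row history across snapshots
)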

Properties inherited from Lakeflow SDP

AutoCDC composes with the Lakeflow SDP runtime; the semantic guarantees below come from SDP, not AutoCDC itself:

  • Incremental progress tracking across restarts.
  • Out-of-sequence arrival handling (AutoCDC uses sequence_by to re-establish logical order; see the sketch below).
  • Reprocessing safety — historical data can be replayed without double-applying.
  • Schema evolution — upstream column additions do not break the pipeline.
  • Failure recovery without losing changes.

"Lakeflow Spark Declarative Pipelines automatically tracks incremental progress and handles out-of-sequence data. Pipelines can recover from failures, reprocess historical data, and evolve over time without double-applying or losing changes." (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)

Relationship to MERGE INTO

AutoCDC does not replace MERGE INTO; it layers on top of it. The source post makes this explicit:

"While MERGE INTO remains a foundational Spark primitive, AutoCDC builds on it to handle out-of-sequence data and incremental processing more efficiently as data volumes grow."

The win is at the authoring layer — pipeline authors declare semantics once, the runtime chooses the optimal MERGE implementation, and Databricks Runtime-level optimisations (e.g. the Nov 2025 improvements disclosed in the post) propagate to all AutoCDC pipelines automatically. Hand-rolled MERGE with bespoke window functions does not benefit.

Performance disclosure (vendor-measured)

Databricks Runtime improvements disclosed in the 2026-04-22 post, measured since November 2025:

  • SCD Type 1: 71% better performance per dollar.
  • SCD Type 2: 96% better performance per dollar.

No absolute throughput (rows/sec), latency (p50/p99), or cost-per-TB numbers are disclosed. The figures are Databricks-measured and Databricks-disclosed; no external reproduction has been published.

Named adopters

  • Navy Federal Credit Union: large-scale real-time event processing, billions of application events/day. "The simplicity of the Spark Declarative Pipelines programming model combined with its service capabilities resulted in an incredibly fast turnaround time." (Jian Zhou, Senior Engineering Manager)
  • Block: streaming pipelines on Delta Lake. "The time required to define and develop a streaming pipeline has gone from days to hours." (Yue Zhang, Staff SWE, Data Foundations)
  • Valora Group: Swiss foodvenience retail analytics, master data CDC. "We gained a lot by doing CDC in SDP, because you don't write any code — it's all abstracted in the background. AutoCDC minimizes the number of lines… it's so easy to do." (Alexane Rose, Data and AI Architect)

All three are in regulated verticals (banking, payments, retail), where CDC correctness is load-bearing.

Why it matters for sysdesign

AutoCDC canonicalises the declarative API surface for CDC — a problem class (MERGE logic for out-of-order updates, deletes, late-arriving data, idempotency) that was historically always hand-rolled on Spark/Iceberg/Delta. The wiki's declarative-vs-imperative stream API axis gains a new instance: unlike Zalando's Flink-SQL → DataStream-API rewrite (where declarative lost on the 10% of state-amplifying workloads), the CDC workload is squarely in the 90%-declarative-wins camp. The architectural move is to bound the correctness envelope of the API so Runtime-level improvements can be applied universally.

Caveats

  • Declarative boundary not formally specified. The post doesn't name which CDC workloads don't fit AutoCDC — which 10% the abstraction leaks on. Candidates: multi-column sequencing, tombstone-vs-missing deletion ambiguity in snapshot mode, conflicting per-key SCD-type policies in the same table.
  • Snapshot-diff semantics underspecified in public docs. How deletion is inferred (missing row? tombstone?), how schema drift across snapshots is handled, what memory footprint full-snapshot diffing has at TB scale — not disclosed in the post.
  • Vendor-gated. Databricks-only; no open-source runtime equivalent. Migration away requires returning to hand-rolled MERGE or re-implementing sequencing logic in another declarative pipeline system.

Seen in

  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — first and only wiki source on AutoCDC. Databricks' product-engineering post positions AutoCDC as the declarative alternative to hand-rolled MERGE for CDC and SCD pipelines, with three input modes (CDF SCD1, CDF SCD2, snapshot-diff), four named semantic parameters (keys, sequence_by, apply_as_deletes, stored_as_scd_type), and 71% / 96% perf-per-dollar improvements disclosed for the respective SCD types since Nov 2025. Named adopters: Navy Federal Credit Union, Block, Valora Group. First canonical wiki system page for a declarative CDC API surface.