PATTERN
Declarative CDC over hand-rolled MERGE¶
Pattern. When building a CDC or SCD pipeline, prefer a declarative API (where the author names semantics — keys, sequence column, delete predicate, SCD type — and the runtime chooses the implementation) over hand-rolled MERGE logic (where the author writes the MERGE clauses, window functions, staging tables, and sequencing assumptions by hand).
The problem¶
Hand-rolled CDC pipelines are one of the most painful pipeline classes to build and operate. Four structural sources of pain, per Databricks' 2026-04-22 framing:
- Out-of-order updates. Events arrive in a different order than they occurred; naive application corrupts state. Requires a sequencing column and explicit ordering (concepts/out-of-sequence-cdc-event-handling).
- Duplicate events. Retries, fan-out, multi-source dedup all produce duplicates; must be collapsed before MERGE.
- Delete application. Deletes must be applied correctly against both current-state and history tables; hand-rolled MERGE often conflates delete semantics with update semantics.
- Idempotency across retries and reprocessing. A pipeline that produces correct output on the first try but corrupts state on retry is a production bug waiting for the next outage (concepts/idempotent-job-design).
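The first two pain sources can be seen in a few lines of plain Python (a toy sketch, not the runtime's mechanism): applying events in arrival order corrupts state, while guarding on the sequence column tolerates both reordering and duplicates.

```python
# Toy sketch: one key's events arrive duplicated and out of order.
events = [
    {"userId": "u1", "sequenceNum": 2, "email": "new@x.com"},
    {"userId": "u1", "sequenceNum": 2, "email": "new@x.com"},  # retry duplicate
    {"userId": "u1", "sequenceNum": 1, "email": "old@x.com"},  # late arrival
]

# Naive application: last writer wins by *arrival* order.
naive = {}
for e in events:
    naive[e["userId"]] = e["email"]
# naive["u1"] now holds the stale sequence-1 value.

# Sequence-guarded application: apply an event only if its sequence is
# newer than what state already holds -- the invariant a sequencing
# column declaration expresses.
state, high_water = {}, {}
for e in events:
    if e["sequenceNum"] > high_water.get(e["userId"], float("-inf")):
        high_water[e["userId"]] = e["sequenceNum"]
        state[e["userId"]] = e["email"]
# state["u1"] keeps the sequence-2 value; duplicates and reordering are absorbed.
```

The same guard is also what makes reprocessing idempotent: replaying the batch against the retained high-water marks leaves state unchanged.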
SCD Type 2 adds:
- Version management — tracking `valid_from`/`valid_to` windows across row versions, ensuring exactly one "current" version per key exists.
- Late-arriving updates — a sequence-5 update arriving after a sequence-6 update must close the sequence-5 version window at 6 without corrupting the existing sequence-6 row.
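The late-arriving case can be sketched in plain Python (a toy version store, not the runtime's algorithm): inserting a sequence-5 version after a sequence-6 version exists must open the new window at 5 and close it at 6, leaving the sequence-6 row untouched.

```python
# Toy SCD Type 2 version store for one key: each row is
# {"seq", "value", "valid_from", "valid_to"}, with valid_to=None meaning current.
def insert_version(history, seq, value):
    """Insert a row version, keeping windows consistent under late arrival."""
    later = [h for h in history if h["seq"] > seq]
    earlier = [h for h in history if h["seq"] < seq]
    row = {"seq": seq, "value": value, "valid_from": seq, "valid_to": None}
    if later:
        # Late arrival: close this version's window at the next-newer sequence.
        row["valid_to"] = min(h["seq"] for h in later)
    if earlier:
        # Close the previous version's window at this sequence if it was open.
        prev = max(earlier, key=lambda h: h["seq"])
        if prev["valid_to"] is None or prev["valid_to"] > seq:
            prev["valid_to"] = seq
    history.append(row)
    history.sort(key=lambda h: h["seq"])

history = []
insert_version(history, 6, "dept-B")   # current version, window [6, None)
insert_version(history, 5, "dept-A")   # late arrival: window [5, 6), row 6 untouched
```

After both inserts, exactly one row per key remains current (`valid_to` of `None`), which is the invariant the bullet above names.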
And separately:
- Snapshot-diff inference when the source doesn't emit a native change feed (concepts/snapshot-diff-inference-cdc).
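Snapshot-diff inference can be illustrated in plain Python (a toy diff, not the runtime's implementation): compare two keyed snapshots and derive the insert/update/delete events the source never emitted natively.

```python
def diff_snapshots(old, new):
    """Derive CDC events from two snapshots, each a dict of key -> row value."""
    events = []
    for k, v in new.items():
        if k not in old:
            events.append(("INSERT", k, v))
        elif old[k] != v:
            events.append(("UPDATE", k, v))
    for k in old:
        if k not in new:
            events.append(("DELETE", k, None))
    return events

snap_monday = {"u1": "a@x.com", "u2": "b@x.com"}
snap_tuesday = {"u1": "a2@x.com", "u3": "c@x.com"}
events = diff_snapshots(snap_monday, snap_tuesday)
# events contain an UPDATE for u1, an INSERT for u3, and a DELETE for u2
```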
Each source of pain corresponds to a specific shape of hand-rolled logic — staging tables, window functions, `max_by` dedup, multi-step MERGE statements with both WHEN MATCHED DELETE and WHEN MATCHED UPDATE clauses, separate passes to close out and insert history rows. A typical production pipeline runs 40–200+ lines of SQL / PySpark.
The declarative answer¶
Name the semantics as API parameters, let the runtime handle the implementation. The canonical instance is AutoCDC:
```python
from pyspark import pipelines as dp          # Lakeflow SDP Python module
from pyspark.sql.functions import col, expr

dp.create_auto_cdc_flow(
    target="target",
    source="users",
    keys=["userId"],
    sequence_by=col("sequenceNum"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2,
)
```
Six parameters express the whole semantic envelope. Beyond naming the target and source, the pipeline author makes four declarations — `keys`, `sequence_by`, `apply_as_deletes`, `stored_as_scd_type` — and each pain source maps onto them:
| Pain source | Declarative parameter |
|---|---|
| Out-of-order updates | `sequence_by` — runtime enforces ordering |
| Duplicate events | Implicit in `sequence_by` — runtime picks the highest-sequence event per key |
| Delete application | `apply_as_deletes` — predicate tells the runtime which events are deletes |
| Idempotency / reprocessing | Inherited from the Lakeflow SDP runtime, not something the author declares |
| SCD version management | `stored_as_scd_type=2` — runtime owns the `__START_AT` / `__END_AT` columns |
| Late-arriving updates | Inherited from the SDP runtime plus `sequence_by` |
| Snapshot-diff inference | Alternative input mode; runtime derives the deltas |
The hand-rolled equivalent is ~25 lines of two-step MERGE plus dedupe; the declarative version is ~7 lines total.
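For contrast, the shape of that hand-rolled equivalent can be sketched in plain Python (the production version would be dedup plus MERGE in SQL/PySpark against Delta tables; the column names mirror the AutoCDC example):

```python
# Toy two-step hand-rolled pipeline: dedupe, then merge -- the steps the
# single declarative call subsumes. SCD Type 1 semantics, single batch.
def dedupe(events):
    """Step 1: keep only the highest-sequenceNum event per userId."""
    best = {}
    for e in events:
        k = e["userId"]
        if k not in best or e["sequenceNum"] > best[k]["sequenceNum"]:
            best[k] = e
    return best.values()

def merge(state, events):
    """Step 2: apply deletes and upserts against current state."""
    for e in dedupe(events):
        if e["operation"] == "DELETE":
            state.pop(e["userId"], None)
        else:
            state[e["userId"]] = e["email"]
    return state

state = {"u2": "old@x.com"}
batch = [
    {"userId": "u1", "sequenceNum": 1, "operation": "UPSERT", "email": "a@x.com"},
    {"userId": "u1", "sequenceNum": 2, "operation": "UPSERT", "email": "b@x.com"},
    {"userId": "u2", "sequenceNum": 3, "operation": "DELETE", "email": None},
]
merge(state, batch)
# state now holds only u1 with the sequence-2 email
```

Even this toy omits what the real version must also carry — persisting the per-key high-water mark across batches, SCD history close-out, and retry safety — which is where the 40–200+ lines come from.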
Why the declarative API wins¶
- Bounded correctness envelope. The runtime's invariants apply to every AutoCDC pipeline uniformly; you cannot accidentally write a pipeline that silently drops deletes or mis-orders updates. The hand-rolled version can.
- Runtime improvements propagate. Databricks disclosed 71% better performance per dollar on SCD Type 1 and 96% on SCD Type 2 from internal Runtime improvements since Nov 2025 — all AutoCDC pipelines benefited automatically. Hand-rolled MERGE with bespoke window functions does not.
- Evolution is cheaper. Adding a new key column, changing the delete predicate, or flipping SCD Type 1 → Type 2 is a parameter change. The same in hand-rolled MERGE is a pipeline rewrite.
- LLM codegen becomes tractable. When the code author is an LLM (Genie Code, Copilot, Cursor), the declarative API bounds what the LLM can get wrong. Hand-rolled MERGE has unbounded failure modes.
- Teams can reason about the pipeline. A 200-line hand-rolled MERGE pipeline has a bus factor of 1 — whoever wrote it. A 7-line AutoCDC definition has a bus factor of the whole team.
"We gained a lot by doing CDC in SDP, because you don't write any code — it's all abstracted in the background. AutoCDC minimizes the number of lines… it's so easy to do." — Alexane Rose, Data and AI Architect, Valora Group (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
When this pattern fits¶
- Standard CDC / SCD workloads. The pipeline needs SCD Type 1 or Type 2 semantics against a key with a clear sequencing column. This is the 90% case.
- Snapshot-diff inference. Upstream emits snapshots; the consumer needs delta semantics.
- Regulated verticals. When auditability and bounded-correctness arguments matter more than flexibility — banking, payments, regulated retail. The three named adopters in the source post (Navy Federal Credit Union, Block, Valora Group) all fit this profile.
When it doesn't fit¶
- Multi-column sequencing. If logical order requires comparing (`event_time`, `source_shard`, `within_shard_sequence`), a single-column `sequence_by` may not capture it; the pipeline may need to derive a composite sequencing column.
- Per-key SCD-type heterogeneity. AutoCDC's `stored_as_scd_type` is per-target-table. If some keys need Type 1 (employee name typos) and others Type 2 (employee department history), the pipeline must split into two AutoCDC flows.
- Non-standard merge semantics. If the deletion policy depends on multiple columns, or update-vs-insert disambiguation requires custom logic, hand-rolled MERGE retains authorial control that `apply_as_deletes` cannot match.
- Non-Databricks platforms. The declarative API is vendor-gated; outside Databricks, the pattern requires rebuilding the semantic surface on another platform.
Composition with MERGE INTO¶
This pattern does not replace the underlying MERGE INTO over INSERT OVERWRITE pattern. MERGE is still the operational primitive the runtime chooses. What this pattern does replace is hand-authoring the MERGE. The declarative API becomes the authoring surface; MERGE becomes an implementation detail the runtime owns.
"While MERGE INTO remains a foundational Spark primitive, AutoCDC builds on it to handle out-of-sequence data and incremental processing more efficiently as data volumes grow." (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
Anti-patterns¶
- Claiming LLM codegen "fixes" hand-rolled MERGE. Generating MERGE logic faster doesn't change its unbounded failure envelope. Either the LLM generates declarative AutoCDC (bounded) or the author accepts the correctness burden of MERGE (unbounded).
- Skipping `sequence_by` because "events arrive in order". Kafka partition rebalances, multi-source aggregation, and retries all reorder events. Declare the sequencing column even when arrival looks ordered; it's cheap insurance.
- Inlining hand-rolled dedup alongside declarative pipelines. Mixing `max_by(struct("*"), "sequenceNum")` staging logic with AutoCDC flows defeats the point — the declarative guarantees only apply to what the runtime owns.
- Treating the 71% / 96% perf gains as inherent properties of declarative APIs. They're Runtime improvements Databricks shipped; the architectural point is only that declarative APIs let such improvements propagate universally.
Seen in¶
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — canonical Databricks framing of the declarative CDC pattern. Introduces AutoCDC as the implementation; contrasts ~6–10 lines of AutoCDC against 40–200+ lines of hand-rolled MERGE across SCD Type 1, SCD Type 2, and snapshot-diff inference input modes. Named adopters (Navy Federal Credit Union, Block, Valora Group) report code-reduction gains ("4 lines could replace 1,500 lines", "days to hours") and confirm the pattern's bounded-correctness argument in regulated-vertical production. First canonical wiki pattern page for the declarative-CDC axis.