PATTERN
Declarative CDC over hand-rolled MERGE¶
Pattern. When building a CDC or SCD pipeline, prefer a declarative API (where the author names semantics — keys, sequence column, delete predicate, SCD type — and the runtime chooses the implementation) over hand-rolled MERGE logic (where the author writes the MERGE clauses, window functions, staging tables, and sequencing assumptions by hand).
The problem¶
Hand-rolled CDC pipelines are one of the most painful pipeline classes to build and operate. Four structural sources of pain, per Databricks' 2026-04-22 framing:
- Out-of-order updates. Events arrive in a different order than they occurred; naive application corrupts state. Requires a sequencing column and explicit ordering (concepts/out-of-sequence-cdc-event-handling).
- Duplicate events. Retries, fan-out, multi-source dedup all produce duplicates; must be collapsed before MERGE.
- Delete application. Deletes must be applied correctly against both current-state and history tables; hand-rolled MERGE often conflates delete semantics with update semantics.
- Idempotency across retries and reprocessing. A pipeline that produces correct output on the first try but corrupts state on retry is a production bug waiting for the next outage (concepts/idempotent-job-design).
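The first two pain sources can be seen in a few lines of plain Python (a toy sketch, not the runtime's mechanism): applying events in arrival order corrupts state, while guarding on the sequence column tolerates both reordering and duplicates.

```python
# Toy sketch: one key's events arrive duplicated and out of order.
events = [
    {"userId": "u1", "sequenceNum": 2, "email": "new@x.com"},
    {"userId": "u1", "sequenceNum": 2, "email": "new@x.com"},  # retry duplicate
    {"userId": "u1", "sequenceNum": 1, "email": "old@x.com"},  # late arrival
]

# Naive application: last writer wins by *arrival* order.
naive = {}
for e in events:
    naive[e["userId"]] = e["email"]
# naive["u1"] now holds the stale sequence-1 value.

# Sequence-guarded application: apply an event only if its sequence is
# newer than what state already holds -- the invariant a sequencing
# column declaration expresses.
state, high_water = {}, {}
for e in events:
    if e["sequenceNum"] > high_water.get(e["userId"], float("-inf")):
        high_water[e["userId"]] = e["sequenceNum"]
        state[e["userId"]] = e["email"]
# state["u1"] keeps the sequence-2 value; duplicates and reordering are absorbed.
```

The same guard is also what makes reprocessing idempotent: replaying the batch against the retained high-water marks leaves state unchanged.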
SCD Type 2 adds:
- Version management — tracking `valid_from`/`valid_to` windows across row versions, ensuring exactly one "current" version per key exists.
- Late-arriving updates — a sequence-5 update arriving after a sequence-6 update must close the sequence-5 version window at 6 without corrupting the existing sequence-6 row.
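The late-arriving case can be sketched in plain Python (a toy version store, not the runtime's algorithm): inserting a sequence-5 version after a sequence-6 version exists must open the new window at 5 and close it at 6, leaving the sequence-6 row untouched.

```python
# Toy SCD Type 2 version store for one key: each row is
# {"seq", "value", "valid_from", "valid_to"}, with valid_to=None meaning current.
def insert_version(history, seq, value):
    """Insert a row version, keeping windows consistent under late arrival."""
    later = [h for h in history if h["seq"] > seq]
    earlier = [h for h in history if h["seq"] < seq]
    row = {"seq": seq, "value": value, "valid_from": seq, "valid_to": None}
    if later:
        # Late arrival: close this version's window at the next-newer sequence.
        row["valid_to"] = min(h["seq"] for h in later)
    if earlier:
        # Close the previous version's window at this sequence if it was open.
        prev = max(earlier, key=lambda h: h["seq"])
        if prev["valid_to"] is None or prev["valid_to"] > seq:
            prev["valid_to"] = seq
    history.append(row)
    history.sort(key=lambda h: h["seq"])

history = []
insert_version(history, 6, "dept-B")   # current version, window [6, None)
insert_version(history, 5, "dept-A")   # late arrival: window [5, 6), row 6 untouched
```

After both inserts, exactly one row per key remains current (`valid_to` of `None`), which is the invariant the bullet above names.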
And separately:
- Snapshot-diff inference when the source doesn't emit a native change feed (concepts/snapshot-diff-inference-cdc).
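Snapshot-diff inference can be illustrated in plain Python (a toy diff, not the runtime's implementation): compare two keyed snapshots and derive the insert/update/delete events the source never emitted natively.

```python
def diff_snapshots(old, new):
    """Derive CDC events from two snapshots, each a dict of key -> row value."""
    events = []
    for k, v in new.items():
        if k not in old:
            events.append(("INSERT", k, v))
        elif old[k] != v:
            events.append(("UPDATE", k, v))
    for k in old:
        if k not in new:
            events.append(("DELETE", k, None))
    return events

snap_monday = {"u1": "a@x.com", "u2": "b@x.com"}
snap_tuesday = {"u1": "a2@x.com", "u3": "c@x.com"}
events = diff_snapshots(snap_monday, snap_tuesday)
# events contain an UPDATE for u1, an INSERT for u3, and a DELETE for u2
```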
Each source of pain corresponds to a specific shape of hand-rolled logic — staging tables, window functions, `max_by` dedup, multi-step MERGE statements with both WHEN MATCHED DELETE and WHEN MATCHED UPDATE clauses, separate passes to close out and insert history rows. A typical production pipeline runs 40–200+ lines of SQL / PySpark.
The declarative answer¶
Name the semantics as API parameters, let the runtime handle the implementation. The canonical instance is AutoCDC:
```python
from pyspark import pipelines as dp          # Lakeflow SDP Python module
from pyspark.sql.functions import col, expr

dp.create_auto_cdc_flow(
    target="target",
    source="users",
    keys=["userId"],
    sequence_by=col("sequenceNum"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2,
)
```
Six parameters express the whole semantic envelope. Beyond naming the target and source, the pipeline author makes four declarations — `keys`, `sequence_by`, `apply_as_deletes`, `stored_as_scd_type` — and each pain source maps onto them:
| Pain source | Declarative parameter |
|---|---|
| Out-of-order updates | `sequence_by` — runtime enforces ordering |
| Duplicate events | Implicit in `sequence_by` — runtime picks the highest-sequence event per key |
| Delete application | `apply_as_deletes` — predicate tells the runtime which events are deletes |
| Idempotency / reprocessing | Inherited from the Lakeflow SDP runtime, not something the author declares |
| SCD version management | `stored_as_scd_type=2` — runtime owns the `__START_AT` / `__END_AT` columns |
| Late-arriving updates | Inherited from the SDP runtime plus `sequence_by` |
| Snapshot-diff inference | Alternative input mode; runtime derives the deltas |
The hand-rolled equivalent is ~25 lines of two-step MERGE plus dedupe; the declarative version is ~7 lines total.
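For contrast, the shape of that hand-rolled equivalent can be sketched in plain Python (the production version would be dedup plus MERGE in SQL/PySpark against Delta tables; the column names mirror the AutoCDC example):

```python
# Toy two-step hand-rolled pipeline: dedupe, then merge -- the steps the
# single declarative call subsumes. SCD Type 1 semantics, single batch.
def dedupe(events):
    """Step 1: keep only the highest-sequenceNum event per userId."""
    best = {}
    for e in events:
        k = e["userId"]
        if k not in best or e["sequenceNum"] > best[k]["sequenceNum"]:
            best[k] = e
    return best.values()

def merge(state, events):
    """Step 2: apply deletes and upserts against current state."""
    for e in dedupe(events):
        if e["operation"] == "DELETE":
            state.pop(e["userId"], None)
        else:
            state[e["userId"]] = e["email"]
    return state

state = {"u2": "old@x.com"}
batch = [
    {"userId": "u1", "sequenceNum": 1, "operation": "UPSERT", "email": "a@x.com"},
    {"userId": "u1", "sequenceNum": 2, "operation": "UPSERT", "email": "b@x.com"},
    {"userId": "u2", "sequenceNum": 3, "operation": "DELETE", "email": None},
]
merge(state, batch)
# state now holds only u1 with the sequence-2 email
```

Even this toy omits what the real version must also carry — persisting the per-key high-water mark across batches, SCD history close-out, and retry safety — which is where the 40–200+ lines come from.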
Why the declarative API wins¶
- Bounded correctness envelope. The runtime's invariants apply to every AutoCDC pipeline uniformly; you cannot accidentally write a pipeline that silently drops deletes or mis-orders updates. The hand-rolled version can.
- Runtime improvements propagate. Databricks disclosed 71% better performance per dollar on SCD Type 1 and 96% on SCD Type 2 from internal Runtime improvements since Nov 2025 — all AutoCDC pipelines benefited automatically. Hand-rolled MERGE with bespoke window functions does not.
- Evolution is cheaper. Adding a new key column, changing the delete predicate, or flipping SCD Type 1 → Type 2 is a parameter change. The same in hand-rolled MERGE is a pipeline rewrite.
- LLM codegen becomes tractable. When the code author is an LLM (Genie Code, Copilot, Cursor), the declarative API bounds what the LLM can get wrong. Hand-rolled MERGE has unbounded failure modes.
- Teams can reason about the pipeline. A 200-line hand-rolled MERGE pipeline has a bus factor of 1 — whoever wrote it. A 7-line AutoCDC definition has a bus factor of the whole team.
"We gained a lot by doing CDC in SDP, because you don't write any code — it's all abstracted in the background. AutoCDC minimizes the number of lines… it's so easy to do." — Alexane Rose, Data and AI Architect, Valora Group (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
When this pattern fits¶
- Standard CDC / SCD workloads. The pipeline needs SCD Type 1 or Type 2 semantics against a key with a clear sequencing column. This is the 90% case.
- Snapshot-diff inference. Upstream emits snapshots; the consumer needs delta semantics.
- Regulated verticals. When auditability and bounded-correctness arguments matter more than flexibility — banking, payments, regulated retail. The three named adopters in the source post (Navy Federal Credit Union, Block, Valora Group) all fit this profile.
When it doesn't fit¶
- Multi-column sequencing. If logical order requires comparing (`event_time`, `source_shard`, `within_shard_sequence`), a single-column `sequence_by` may not capture it; the pipeline may need to derive a composite sequencing column.
- Per-key SCD-type heterogeneity. AutoCDC's `stored_as_scd_type` is per-target-table. If some keys need Type 1 (employee name typos) and others Type 2 (employee department history), the pipeline must split into two AutoCDC flows.
- Non-standard merge semantics. If the deletion policy depends on multiple columns, or update-vs-insert disambiguation requires custom logic, hand-rolled MERGE retains authorial control that `apply_as_deletes` cannot match.
- Non-Databricks platforms. The declarative API is vendor-gated; outside Databricks, the pattern requires rebuilding the semantic surface on another platform.
Composition with MERGE INTO¶
This pattern does not replace the underlying MERGE INTO over INSERT OVERWRITE pattern. MERGE is still the operational primitive the runtime chooses. What this pattern does replace is hand-authoring the MERGE. The declarative API becomes the authoring surface; MERGE becomes an implementation detail the runtime owns.
"While MERGE INTO remains a foundational Spark primitive, AutoCDC builds on it to handle out-of-sequence data and incremental processing more efficiently as data volumes grow." (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
Anti-patterns¶
- Claiming LLM codegen "fixes" hand-rolled MERGE. Generating MERGE logic faster doesn't change its unbounded failure envelope. Either the LLM generates declarative AutoCDC (bounded) or the author accepts the correctness burden of MERGE (unbounded).
- Skipping `sequence_by` because "events arrive in order". Kafka partition rebalances, multi-source aggregation, and retries all reorder events. Declare the sequencing column even when arrival looks ordered; it's cheap insurance.
- Inlining hand-rolled dedup alongside declarative pipelines. Mixing `max_by(struct("*"), "sequenceNum")` staging logic with AutoCDC flows defeats the point — the declarative guarantees only apply to what the runtime owns.
- Treating the 71% / 96% perf gains as inherent properties of declarative APIs. They're Runtime improvements Databricks shipped; the architectural point is only that declarative APIs let such improvements propagate universally.
Seen in¶
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — canonical Databricks framing of the declarative CDC pattern. Introduces AutoCDC as the implementation; contrasts ~6–10 lines of AutoCDC against 40–200+ lines of hand-rolled MERGE across SCD Type 1, SCD Type 2, and snapshot-diff inference input modes. Named adopters (Navy Federal Credit Union, Block, Valora Group) report code-reduction gains ("4 lines could replace 1,500 lines", "days to hours") and confirm the pattern's bounded-correctness argument in regulated-vertical production. First canonical wiki pattern page for the declarative-CDC axis.