
CONCEPT Cited by 1 source

Out-of-sequence CDC event handling

Out-of-sequence CDC event handling is the class of correctness problems that arise when CDC events arrive at the consumer in a different order than the logical order they occurred in at the source. The canonical solution is a sequencing column — a monotonically-increasing value the source emits with each change (sequenceNum, event_time, LSN, GTID) — that the consumer uses to establish logical order independent of arrival order.

The problem shape

Arrival order diverges from logical order for many reasons:

  • Kafka partition reassignment. A consumer-group rebalance can cause older records on one partition to appear after newer records on another.
  • Multi-shard source aggregation. Events from sharded upstream databases merge into a single topic with per-shard ordering but no cross-shard global order.
  • Retries. A failed batch is retried after newer batches have succeeded.
  • Stream-to-batch re-ingestion. A hand-rolled pipeline backfills a historical range while a live stream keeps arriving.
  • Deduplication across sources. Multiple CDC readers on the same source (e.g. per-region) dedupe into a single consumer stream with interleaved order.

Without explicit ordering, applying events in arrival order corrupts state: an older UPDATE overwrites a newer one; a logically earlier DELETE that arrives late wipes out the INSERT that followed it; a duplicated event is applied twice.
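
A minimal pure-Python sketch (not from the post) of the first failure, replaying the user-123 slice of the worked example below in arrival order:

# Naive apply: state keyed by userId, events applied in arrival order.
# The DELETE (seq 6) arrives before the UPDATE (seq 5), so the late
# UPDATE resurrects a ghost row after the delete has already run.
state = {}

arrival_order = [
    {"userId": 123, "op": "INSERT", "city": "Monterrey", "seq": 1},
    {"userId": 123, "op": "DELETE", "city": None,        "seq": 6},
    {"userId": 123, "op": "UPDATE", "city": "Chihuahua", "seq": 5},
]

for e in arrival_order:
    if e["op"] == "DELETE":
        state.pop(e["userId"], None)
    else:
        state[e["userId"]] = e

print(state)  # user 123 is back with city Chihuahua; the correct state is {}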

The sequencing-column primitive

The sequencing column is the load-bearing primitive:

  • Source emits a sequencing value with every change — database LSN / binlog coordinate / GTID / application-level timestamp / monotonic counter.
  • Consumer reads the sequencing column and uses it to decide which of two events for the same key is logically newer.
  • Conflict resolution picks the event with the higher (or lower, per policy) sequence value as authoritative.
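
A source-independent sketch of that conflict-resolution rule, keeping DELETEs as tombstones so a late-arriving older event can never out-rank them (event shape and names are illustrative):

# Sequence-guarded apply: remember the highest sequence seen per key,
# including the sequence of a winning DELETE.
state = {}       # userId -> latest surviving event
tombstone = {}   # userId -> sequence of the winning DELETE

def apply_event(e):
    key, seq = e["userId"], e["seq"]
    newest = max(state.get(key, {}).get("seq", -1), tombstone.get(key, -1))
    if seq <= newest:
        return  # logically older event or exact duplicate: ignore
    if e["op"] == "DELETE":
        state.pop(key, None)
        tombstone[key] = seq
    else:
        tombstone.pop(key, None)
        state[key] = e

events = [  # the same three user-123 events as above, in arrival order
    {"userId": 123, "op": "INSERT", "city": "Monterrey", "seq": 1},
    {"userId": 123, "op": "DELETE", "city": None,        "seq": 6},
    {"userId": 123, "op": "UPDATE", "city": "Chihuahua", "seq": 5},
]
for e in events:
    apply_event(e)
assert state == {} and tombstone == {123: 6}  # the late seq-5 UPDATE was ignored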

AutoCDC takes the sequencing column as a first-class API parameter: sequence_by=col("sequenceNum").
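
A hedged sketch of how that parameter is wired in, assuming the Lakeflow Declarative Pipelines Python module (imported as dlt) and its create_streaming_table / create_auto_cdc_flow entry points; only sequence_by=col("sequenceNum") is taken directly from the post, and the table names and other arguments are illustrative:

import dlt
from pyspark.sql.functions import col, expr

# Target streaming table that AutoCDC maintains (name is illustrative).
dlt.create_streaming_table("users")

dlt.create_auto_cdc_flow(
    target="users",
    source="cdc_data_users",                        # view/table over the raw CDC feed (assumed)
    keys=["userId"],
    sequence_by=col("sequenceNum"),                 # logical order, independent of arrival order
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1,                           # or 2 to keep full version history
)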

Worked example from the source

The CDC feed in the Databricks post is deliberately out-of-order:

userId  operation             sequenceNum
124     INSERT                1
123     INSERT                1
125     INSERT                2
126     INSERT                2
123     DELETE                6
125     UPDATE (Guadalajara)  6
125     UPDATE (Mexicali)     5
123     UPDATE (Chihuahua)    5

Arrival order puts the sequenceNum=6 DELETE for user 123 before the sequenceNum=5 UPDATE for user 123. A naive pipeline applies them in arrival order — the delete succeeds, then the update re-creates a ghost row. The correct outcome is that user 123 does not appear in the SCD Type 1 output at all. AutoCDC's sequence_by handles this automatically.

For SCD Type 2 the same out-of-order feed produces the correct version history:

id   name      city         __START_AT  __END_AT
123  Isabel    Monterrey    1           5
123  Isabel    Chihuahua    5           6
125  Mercedes  Tijuana      2           5
125  Mercedes  Mexicali     5           6
125  Mercedes  Guadalajara  6           NULL

User 123 has two closed versions (ended at sequence 6 when deleted); user 125 has three versions showing ordered city changes despite arrival being out of order. (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
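
For intuition only, a window-function sketch (not the AutoCDC implementation) that reproduces that version history once the feed is ordered by sequenceNum; feed is an assumed DataFrame over the CDC feed with the example's columns, and a real pipeline would merge against existing target history rather than recompute from scratch:

from pyspark.sql import Window
from pyspark.sql.functions import col, lead

# Order each user's events logically, not by arrival: every event opens a
# version at its sequenceNum, and the next event (by sequence) closes it.
w = Window.partitionBy("userId").orderBy("sequenceNum")

scd2 = (feed
    .withColumn("__START_AT", col("sequenceNum"))
    .withColumn("__END_AT", lead("sequenceNum").over(w))  # NULL for the still-open version
    .where(col("operation") != "DELETE")                  # a DELETE closes the prior version, adds no row
    .select("userId", "name", "city", "__START_AT", "__END_AT"))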

Why hand-rolled pipelines get this wrong

Hand-rolled CDC typically fails out-of-sequence handling in one of three ways:

  1. Implicit arrival-order assumption. The pipeline is written assuming event arrival order = event logical order, usually because early tests don't produce out-of-order streams.
  2. Dedup without ordering. The pipeline dedupes on key but picks an arbitrary survivor (e.g., "last arrived") rather than the highest-sequence survivor.
  3. Window-function dedupe without retry-safety. The pipeline uses ROW_NUMBER() OVER (PARTITION BY key ORDER BY sequence) correctly within a batch, but reprocessing the batch against already-applied target rows double-applies the top row.

The idiomatic fix in hand-rolled MERGE is the pre-dedupe step seen in the post's appendix:

from pyspark.sql.functions import max_by, struct

# Collapse the feed to one row per key: the event with the highest sequenceNum wins.
updates = (spark.read.table("cdc_data.users")
    .groupBy("userId")
    .agg(max_by(struct("*"), "sequenceNum").alias("row"))
    .select("row.*"))

Combined with a WHEN MATCHED AND s.sequenceNum > t.sequenceNum clause, this gives correct out-of-sequence behaviour. AutoCDC subsumes all of it into the single sequence_by parameter.
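
A hedged sketch of that MERGE using the Delta Lake Python API, continuing from the updates DataFrame above; the target table name ("users") and its schema are assumptions:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "users")  # target table name is assumed

(target.alias("t")
    .merge(updates.alias("s"), "s.userId = t.userId")
    # Delete only if the incoming DELETE is logically newer than the target row.
    .whenMatchedDelete(condition="s.operation = 'DELETE' AND s.sequenceNum > t.sequenceNum")
    # Update only if the incoming row is logically newer; retries and duplicates become no-ops.
    .whenMatchedUpdateAll(condition="s.sequenceNum > t.sequenceNum")
    # Never insert a key whose winning event is a DELETE (e.g. user 123).
    .whenNotMatchedInsertAll(condition="s.operation != 'DELETE'")
    .execute())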

Relationship to idempotency

Out-of-sequence handling is a necessary but not sufficient condition for idempotent CDC pipelines. Sequencing-based dedup makes event order robust; retry / exactly-once semantics make event count robust. Both are needed. AutoCDC inherits the latter from Lakeflow SDP's incremental-progress tracking; sequence_by provides the former.

Seen in

  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — Databricks canonicalises sequence_by as the declarative primitive for out-of-sequence CDC correctness inside AutoCDC. Out-of-order CDC feed worked example (users 123 / 125 with DELETE and UPDATE arriving before earlier UPDATE) illustrates both SCD Type 1 and SCD Type 2 behaviour. First wiki source to pin out-of-sequence handling as a first-class CDC correctness concern separate from idempotency.