Out-of-sequence CDC event handling¶
Out-of-sequence CDC event handling is the class of correctness problems that arise when CDC events arrive at the consumer in a different order than the logical order in which they occurred at the source. The canonical solution is a sequencing column: a monotonically increasing value the source emits with each change (`sequenceNum`, `event_time`, LSN, GTID) that the consumer uses to establish logical order independent of arrival order.
The problem shape¶
Arrival order diverges from logical order for many reasons:
- Kafka partition reassignment. A consumer-group rebalance can cause older records on one partition to appear after newer records on another.
- Multi-shard source aggregation. Events from sharded upstream databases merge into a single topic with per-shard ordering but no cross-shard global order.
- Retries. A failed batch is retried after newer batches have succeeded.
- Stream-to-batch re-ingestion. A hand-rolled pipeline backfills a historical range while a live stream keeps arriving.
- Deduplication across sources. Multiple CDC readers on the same source (e.g. per-region) dedupe into a single consumer stream with interleaved order.
Without explicit ordering, applying events in arrival order corrupts state: an older UPDATE overwrites a newer UPDATE; a late-arriving DELETE wipes out an INSERT that logically followed it; a duplicated event is applied twice.
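A minimal sketch of the failure, using the same events as the worked example below (the code itself is illustrative, not from the source): applied in arrival order, a stale UPDATE resurrects a row that a later-sequenced DELETE already removed; applied in sequence order, the state comes out correct.

```python
# User 123's events: the DELETE at sequence 6 arrives before the UPDATE
# at sequence 5 that it logically follows.
events = [
    {"key": 123, "op": "INSERT", "seq": 1, "city": "Monterrey"},
    {"key": 123, "op": "DELETE", "seq": 6, "city": None},
    {"key": 123, "op": "UPDATE", "seq": 5, "city": "Chihuahua"},  # late arrival
]

def apply_events(stream):
    state = {}
    for e in stream:
        if e["op"] == "DELETE":
            state.pop(e["key"], None)
        else:
            state[e["key"]] = e["city"]
    return state

print(apply_events(events))                                  # {123: 'Chihuahua'}: ghost row
print(apply_events(sorted(events, key=lambda e: e["seq"])))  # {}: correct, 123 was deleted
```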
The sequencing-column primitive¶
The sequencing column is the load-bearing primitive:
- Source emits a sequencing value with every change — database LSN / binlog coordinate / GTID / application-level timestamp / monotonic counter.
- Consumer reads the sequencing column and uses it to decide which of two events for the same key is logically newer.
- Conflict resolution picks the event with the higher (or lower, per policy) sequence value as authoritative.
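For a streaming consumer that cannot sort the whole feed up front, the same rule becomes a per-key high-water mark. A minimal sketch assuming a highest-sequence-wins policy and an in-memory table (illustrative only, not AutoCDC's internals):

```python
last_seq = {}  # key -> highest sequence value applied so far
state = {}     # key -> current row; None is a delete tombstone

def resolve(key, seq, row):
    # Higher sequence wins regardless of arrival order; the strict '>' also
    # makes exact duplicates a no-op. Keeping a tombstone instead of dropping
    # the key is what blocks a stale UPDATE from resurrecting a deleted row.
    if seq > last_seq.get(key, float("-inf")):
        last_seq[key] = seq
        state[key] = row

resolve(123, 6, None)                   # DELETE arrives first
resolve(123, 5, {"city": "Chihuahua"})  # stale UPDATE: rejected
assert state[123] is None
```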
AutoCDC takes the sequencing column as a first-class API parameter: `sequence_by=col("sequenceNum")`.
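In context, the declaration has roughly the following shape. The source confirms `sequence_by=col("sequenceNum")`; the surrounding function and arguments follow the Lakeflow SDP / DLT Python API, and the table, source, and column names are illustrative assumptions:

```python
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("users")

# sequence_by is the load-bearing argument: AutoCDC orders events per key
# by sequenceNum rather than by arrival order before applying them.
dlt.create_auto_cdc_flow(
    target="users",
    source="cdc_data.users",
    keys=["userId"],
    sequence_by=col("sequenceNum"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1,  # 2 keeps the full version history instead
)
```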
Worked example from the source¶
The CDC feed in the Databricks post is deliberately out-of-order:
| userId | operation | sequenceNum |
|---|---|---|
| 124 | INSERT | 1 |
| 123 | INSERT | 1 |
| 125 | INSERT | 2 |
| 126 | INSERT | 2 |
| 123 | DELETE | 6 |
| 125 | UPDATE (Guadalajara) | 6 |
| 125 | UPDATE (Mexicali) | 5 |
| 123 | UPDATE (Chihuahua) | 5 |
Arrival order puts the `sequenceNum=6` DELETE for user 123 before the `sequenceNum=5` UPDATE for user 123. A naive pipeline applies them in arrival order: the delete succeeds, then the update re-creates a ghost row. The correct outcome is that user 123 does not appear in the SCD Type 1 output at all. AutoCDC's `sequence_by` handles this automatically.
For SCD Type 2 the same out-of-order feed produces the correct version history:
| id | name | city | __START_AT | __END_AT |
|---|---|---|---|---|
| 123 | Isabel | Monterrey | 1 | 5 |
| 123 | Isabel | Chihuahua | 5 | 6 |
| 125 | Mercedes | Tijuana | 2 | 5 |
| 125 | Mercedes | Mexicali | 5 | 6 |
| 125 | Mercedes | Guadalajara | 6 | NULL |
User 123 has two closed versions (ended at sequence 6 when deleted); user 125 has three versions showing ordered city changes despite arrival being out of order. (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
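The `__START_AT` / `__END_AT` pairs fall directly out of the sequence column: sort a key's events by `sequenceNum`, and each version runs from its own sequence value to the next one's. A pure-Python sketch of that bookkeeping for user 125 (illustrative, not AutoCDC internals):

```python
# User 125's versions, already sorted by sequenceNum.
versions = [("Tijuana", 2), ("Mexicali", 5), ("Guadalajara", 6)]

# Each version's __END_AT is the next version's __START_AT; the last stays open.
history = [
    (city, start, versions[i + 1][1] if i + 1 < len(versions) else None)
    for i, (city, start) in enumerate(versions)
]
# [('Tijuana', 2, 5), ('Mexicali', 5, 6), ('Guadalajara', 6, None)]
# (city, __START_AT, __END_AT): matches the SCD Type 2 table above.
```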
Why hand-rolled pipelines get this wrong¶
Hand-rolled CDC typically fails out-of-sequence handling in one of three ways:
- Implicit arrival-order assumption. The pipeline is written assuming event arrival order = event logical order, usually because early tests don't produce out-of-order streams.
- Dedup without ordering. The pipeline dedupes on key but picks an arbitrary survivor (e.g., "last arrived") rather than the highest-sequence survivor.
- Window-function dedupe without retry-safety. The pipeline uses `ROW_NUMBER() OVER (PARTITION BY key ORDER BY sequence)` correctly within a batch, but reprocessing the batch against already-applied target rows double-applies the top row (see the sketch below).
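A sketch of that third failure mode, assuming a micro-batch DataFrame named `batch_df` (an illustrative name): the window dedupe is correct within the batch, but nothing in it consults what the target table has already absorbed.

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Batch-local dedupe: keeps the newest event per key *within this batch* only.
w = Window.partitionBy("userId").orderBy(col("sequenceNum").desc())
latest = (batch_df
    .withColumn("rn", row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))
# On a retry, `latest` is merged into the target a second time unless the
# MERGE also compares sequence values against the already-applied rows.
```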
The idiomatic fix in hand-rolled MERGE is the pre-dedupe step seen in the post's appendix:
```python
from pyspark.sql.functions import max_by, struct

# Collapse the CDC feed to the single highest-sequence event per key,
# so an older event can never be the one that reaches the MERGE.
updates = (spark.read.table("cdc_data.users")
    .groupBy("userId")
    .agg(max_by(struct("*"), "sequenceNum").alias("row"))
    .select("row.*"))
```
Combined with a `WHEN MATCHED AND s.sequenceNum > t.sequenceNum` clause, this gives correct out-of-sequence behaviour. AutoCDC subsumes all of it into the single `sequence_by` parameter.
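Put together, the hand-rolled version looks roughly like the following. The guarded `WHEN MATCHED` clause is what the text above describes; the full MERGE statement, target table name, and NOT MATCHED handling are illustrative assumptions:

```python
# Register the pre-deduped DataFrame so SQL can reference it.
updates.createOrReplaceTempView("updates")

spark.sql("""
  MERGE INTO users t
  USING updates s
  ON t.userId = s.userId
  WHEN MATCHED AND s.operation = 'DELETE' AND s.sequenceNum > t.sequenceNum
    THEN DELETE
  WHEN MATCHED AND s.sequenceNum > t.sequenceNum
    THEN UPDATE SET t.name = s.name, t.city = s.city,
                    t.sequenceNum = s.sequenceNum
  WHEN NOT MATCHED AND s.operation != 'DELETE'
    THEN INSERT (userId, name, city, sequenceNum)
    VALUES (s.userId, s.name, s.city, s.sequenceNum)
""")
```

Even this version has a residual gap: a DELETE that arrives before its row's INSERT falls through the NOT MATCHED branch, so order can still bite at batch boundaries.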
Relationship to idempotency¶
Out-of-sequence handling is a necessary but not sufficient condition for idempotent CDC pipelines. Sequencing-based dedup makes event order robust; retry / exactly-once semantics make event count robust. Both are needed. AutoCDC inherits the latter from Lakeflow SDP's incremental-progress tracking; `sequence_by` provides the former.
Seen in¶
- sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — Databricks canonicalises `sequence_by` as the declarative primitive for out-of-sequence CDC correctness inside AutoCDC. Out-of-order CDC feed worked example (users 123 / 125, with a DELETE and an UPDATE arriving before an earlier UPDATE) illustrates both SCD Type 1 and SCD Type 2 behaviour. First wiki source to pin out-of-sequence handling as a first-class CDC correctness concern separate from idempotency.