
CONCEPT Cited by 1 source

Out-of-sequence CDC event handling

Out-of-sequence CDC event handling is the class of correctness problems that arise when CDC events arrive at the consumer in a different order than the logical order they occurred in at the source. The canonical solution is a sequencing column — a monotonically-increasing value the source emits with each change (sequenceNum, event_time, LSN, GTID) — that the consumer uses to establish logical order independent of arrival order.

The problem shape

Arrival order diverges from logical order for many reasons:

  • Kafka partition reassignment. A consumer-group rebalance can cause older records on one partition to appear after newer records on another.
  • Multi-shard source aggregation. Events from sharded upstream databases merge into a single topic with per-shard ordering but no cross-shard global order.
  • Retries. A failed batch is retried after newer batches have succeeded.
  • Stream-to-batch re-ingestion. A hand-rolled pipeline backfills a historical range while a live stream keeps arriving.
  • Deduplication across sources. Multiple CDC readers on the same source (e.g. per-region) dedupe into a single consumer stream with interleaved order.

Without explicit ordering, applying events in arrival order corrupts state: an older UPDATE overwrites a newer one; a logically earlier DELETE that arrives late wipes out the INSERT that followed it; a duplicated event is applied twice.
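
A minimal pure-Python sketch (not from the post) of the first failure, replaying the user-123 slice of the worked example below in arrival order:

# Naive apply: state keyed by userId, events applied in arrival order.
# The DELETE (seq 6) arrives before the UPDATE (seq 5), so the late
# UPDATE resurrects a ghost row after the delete has already run.
state = {}

arrival_order = [
    {"userId": 123, "op": "INSERT", "city": "Monterrey", "seq": 1},
    {"userId": 123, "op": "DELETE", "city": None,        "seq": 6},
    {"userId": 123, "op": "UPDATE", "city": "Chihuahua", "seq": 5},
]

for e in arrival_order:
    if e["op"] == "DELETE":
        state.pop(e["userId"], None)
    else:
        state[e["userId"]] = e

print(state)  # user 123 is back with city Chihuahua; the correct state is {}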

The sequencing-column primitive

The sequencing column is the load-bearing primitive:

  • Source emits a sequencing value with every change — database LSN / binlog coordinate / GTID / application-level timestamp / monotonic counter.
  • Consumer reads the sequencing column and uses it to decide which of two events for the same key is logically newer.
  • Conflict resolution picks the event with the higher (or lower, per policy) sequence value as authoritative.
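
A source-independent sketch of that conflict-resolution rule, keeping DELETEs as tombstones so a late-arriving older event can never out-rank them (event shape and names are illustrative):

# Sequence-guarded apply: remember the highest sequence seen per key,
# including the sequence of a winning DELETE.
state = {}       # userId -> latest surviving event
tombstone = {}   # userId -> sequence of the winning DELETE

def apply_event(e):
    key, seq = e["userId"], e["seq"]
    newest = max(state.get(key, {}).get("seq", -1), tombstone.get(key, -1))
    if seq <= newest:
        return  # logically older event or exact duplicate: ignore
    if e["op"] == "DELETE":
        state.pop(key, None)
        tombstone[key] = seq
    else:
        tombstone.pop(key, None)
        state[key] = e

events = [  # the same three user-123 events as above, in arrival order
    {"userId": 123, "op": "INSERT", "city": "Monterrey", "seq": 1},
    {"userId": 123, "op": "DELETE", "city": None,        "seq": 6},
    {"userId": 123, "op": "UPDATE", "city": "Chihuahua", "seq": 5},
]
for e in events:
    apply_event(e)
assert state == {} and tombstone == {123: 6}  # the late seq-5 UPDATE was ignored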

AutoCDC takes the sequencing column as a first-class API parameter: sequence_by=col("sequenceNum").
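
A hedged sketch of how that parameter is wired in, assuming the Lakeflow Declarative Pipelines Python module (imported as dlt) and its create_streaming_table / create_auto_cdc_flow entry points; only sequence_by=col("sequenceNum") is taken directly from the post, and the table names and other arguments are illustrative:

import dlt
from pyspark.sql.functions import col, expr

# Target streaming table that AutoCDC maintains (name is illustrative).
dlt.create_streaming_table("users")

dlt.create_auto_cdc_flow(
    target="users",
    source="cdc_data_users",                        # view/table over the raw CDC feed (assumed)
    keys=["userId"],
    sequence_by=col("sequenceNum"),                 # logical order, independent of arrival order
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=1,                           # or 2 to keep full version history
)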

Worked example from the source

The CDC feed in the Databricks post is deliberately out-of-order:

userId  operation             sequenceNum
124     INSERT                1
123     INSERT                1
125     INSERT                2
126     INSERT                2
123     DELETE                6
125     UPDATE (Guadalajara)  6
125     UPDATE (Mexicali)     5
123     UPDATE (Chihuahua)    5

Arrival order puts the sequenceNum=6 DELETE for user 123 before the sequenceNum=5 UPDATE for user 123. A naive pipeline applies them in arrival order — the delete succeeds, then the update re-creates a ghost row. The correct outcome is that user 123 does not appear in the SCD Type 1 output at all. AutoCDC's sequence_by handles this automatically.

For SCD Type 2 the same out-of-order feed produces the correct version history:

id   name      city         __START_AT  __END_AT
123  Isabel    Monterrey    1           5
123  Isabel    Chihuahua    5           6
125  Mercedes  Tijuana      2           5
125  Mercedes  Mexicali     5           6
125  Mercedes  Guadalajara  6           NULL

User 123 has two closed versions (ended at sequence 6 when deleted); user 125 has three versions showing ordered city changes despite arrival being out of order. (Source: sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines)
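
For intuition only, a window-function sketch (not the AutoCDC implementation) that reproduces that version history once the feed is ordered by sequenceNum; feed is an assumed DataFrame over the CDC feed with the example's columns, and a real pipeline would merge against existing target history rather than recompute from scratch:

from pyspark.sql import Window
from pyspark.sql.functions import col, lead

# Order each user's events logically, not by arrival: every event opens a
# version at its sequenceNum, and the next event (by sequence) closes it.
w = Window.partitionBy("userId").orderBy("sequenceNum")

scd2 = (feed
    .withColumn("__START_AT", col("sequenceNum"))
    .withColumn("__END_AT", lead("sequenceNum").over(w))  # NULL for the still-open version
    .where(col("operation") != "DELETE")                  # a DELETE closes the prior version, adds no row
    .select("userId", "name", "city", "__START_AT", "__END_AT"))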

Why hand-rolled pipelines get this wrong

Hand-rolled CDC typically fails out-of-sequence handling in one of three ways:

  1. Implicit arrival-order assumption. The pipeline is written assuming event arrival order = event logical order, usually because early tests don't produce out-of-order streams.
  2. Dedup without ordering. The pipeline dedupes on key but picks an arbitrary survivor (e.g., "last arrived") rather than the highest-sequence survivor.
  3. Window-function dedupe without retry-safety. The pipeline uses ROW_NUMBER() OVER (PARTITION BY key ORDER BY sequence) correctly within a batch, but reprocessing the batch against already-applied target rows double-applies the top row.

The idiomatic fix in hand-rolled MERGE is the pre-dedupe step seen in the post's appendix:

from pyspark.sql.functions import max_by, struct

# Collapse the feed to one row per key: the event with the highest sequenceNum wins.
updates = (spark.read.table("cdc_data.users")
    .groupBy("userId")
    .agg(max_by(struct("*"), "sequenceNum").alias("row"))
    .select("row.*"))

Combined with a WHEN MATCHED AND s.sequenceNum > t.sequenceNum clause, this gives correct out-of-sequence behaviour. AutoCDC subsumes all of it into the single sequence_by parameter.
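
A hedged sketch of that MERGE using the Delta Lake Python API, continuing from the updates DataFrame above; the target table name ("users") and its schema are assumptions:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "users")  # target table name is assumed

(target.alias("t")
    .merge(updates.alias("s"), "s.userId = t.userId")
    # Delete only if the incoming DELETE is logically newer than the target row.
    .whenMatchedDelete(condition="s.operation = 'DELETE' AND s.sequenceNum > t.sequenceNum")
    # Update only if the incoming row is logically newer; retries and duplicates become no-ops.
    .whenMatchedUpdateAll(condition="s.sequenceNum > t.sequenceNum")
    # Never insert a key whose winning event is a DELETE (e.g. user 123).
    .whenNotMatchedInsertAll(condition="s.operation != 'DELETE'")
    .execute())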

Relationship to idempotency

Out-of-sequence handling is a necessary but not sufficient condition for idempotent CDC pipelines. Sequencing-based dedup makes event order robust; retry / exactly-once semantics make event count robust. Both are needed. AutoCDC inherits the latter from Lakeflow SDP's incremental-progress tracking; sequence_by provides the former.

Seen in

  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines — Databricks canonicalises sequence_by as the declarative primitive for out-of-sequence CDC correctness inside AutoCDC. Out-of-order CDC feed worked example (users 123 / 125 with DELETE and UPDATE arriving before earlier UPDATE) illustrates both SCD Type 1 and SCD Type 2 behaviour. First wiki source to pin out-of-sequence handling as a first-class CDC correctness concern separate from idempotency.