

Change Data Capture (CDC)

Change Data Capture (CDC) is the practice of materialising a table as an ongoing stream of insert / update / delete deltas rather than (or alongside) an authoritative "current state" table. Downstream consumers apply the stream to reconstruct current state, or snapshot it periodically.
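Reconstructing current state is a fold over the delta stream. A minimal sketch, assuming `(op, key, row)` tuples as the delta shape (the tuple layout and field names are illustrative, not any system's actual format):

```python
def apply_deltas(deltas):
    """Fold insert/update/delete deltas into a {key: row} current state."""
    state = {}
    for op, key, row in deltas:
        if op in ("insert", "update"):
            state[key] = row        # upsert: latest value per key wins
        elif op == "delete":
            state.pop(key, None)    # tolerate deletes for already-absent keys
    return state

stream = [
    ("insert", 1, {"name": "a"}),
    ("insert", 2, {"name": "b"}),
    ("update", 1, {"name": "a2"}),
    ("delete", 2, None),
]
print(apply_deltas(stream))  # {1: {'name': 'a2'}}
```

A consumer that snapshots periodically would persist `state` and then resume folding from the next delta onward.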

Why it shows up in this wiki

CDC is the upstream pattern that forces the downstream pattern of concepts/copy-on-write-merge in data-lake table formats like systems/apache-iceberg, systems/apache-hudi, and systems/delta-lake. Specifically:

  • Each CDC delta is one or more files on object storage.
  • Readers would otherwise have to merge every delta at read time to reconstruct table state — cheap for short streams, prohibitively expensive for exabyte-scale accumulations.
  • Compaction (copy-on-write merge) collapses the stream periodically to keep read cost bounded.

Amazon BDT's CDC framing

From the Spark-to-Ray migration post, Amazon's post-Oracle data catalog modelled every table as an unbounded stream of S3 files, each file holding records to insert, update, or delete:

"all tables in their catalog being composed of unbounded streams of Amazon S3 files, where each file contained records to either insert, update, or delete. It was the responsibility of each subscriber's chosen compute framework to dynamically apply, or 'merge,' all of these changes at read time to yield the correct current table state." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Two degenerate shapes they called out:

  • Millions of kilobyte-scale tiny files — small CDC batches that never got compacted.
  • Few terabyte-scale huge files — big bulk deltas that stress downstream readers in the opposite direction.

Both are anti-patterns that a compactor has to fix while merging.
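The file-sizing half of that fix can be sketched as a greedy planner that coalesces tiny inputs and splits oversized ones toward a target output size. The target constant and function names below are illustrative assumptions, not anything from Amazon's system:

```python
TARGET = 128 * 1024 * 1024  # assumed 128 MiB target output file size

def plan_output_sizes(input_sizes, target=TARGET):
    """Greedily pack input file sizes (bytes) into near-target outputs."""
    outputs, current = [], 0
    for size in sorted(input_sizes):
        # split terabyte-scale inputs into target-sized chunks
        while size > target:
            if current:
                outputs.append(current)
                current = 0
            outputs.append(target)
            size -= target
        # start a new output once the current one would overflow
        if current + size > target:
            outputs.append(current)
            current = 0
        current += size
    if current:
        outputs.append(current)
    return outputs
```

Real compactors plan over row groups and partition boundaries rather than raw byte counts, but the shape of the problem — coalesce the kilobyte-scale files, split the terabyte-scale ones — is the same.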

Two compaction modes on CDC

  • Append-only compaction — all deltas are pure inserts; merge reduces to concatenation, but file sizing still matters.
  • Upsert compaction — deltas contain updates (with a merge-key schema); only the latest value per key survives. The canonical Iceberg/Hudi/Delta workload.
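The two modes can be contrasted in a short sketch. The `(sequence, key, row)` tuple layout is an assumption for illustration — real table formats track ordering via commit/sequence metadata, not inline tuples:

```python
def compact_append_only(delta_files):
    """Pure inserts: merging reduces to concatenation (then re-sizing)."""
    return [record for f in delta_files for record in f]

def compact_upsert(delta_files):
    """Keep only the latest row per merge key, ordered by sequence number."""
    latest = {}
    for f in delta_files:
        for seq, key, row in f:
            if key not in latest or seq > latest[key][0]:
                latest[key] = (seq, row)
    return {key: row for key, (seq, row) in latest.items()}
```

Append-only output size is the sum of its inputs; upsert output is bounded by the number of distinct merge keys, which is what keeps read cost bounded after compaction.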
