Change Data Capture (CDC)¶
Change Data Capture (CDC) is the practice of materialising a table as an ongoing stream of insert / update / delete deltas rather than (or alongside) an authoritative "current state" table. Downstream consumers apply the stream to reconstruct current state, or snapshot it periodically.
Why it shows up in this wiki¶
CDC is the upstream pattern that forces the downstream pattern of concepts/copy-on-write-merge in data-lake table formats like systems/apache-iceberg, systems/apache-hudi, and systems/delta-lake. Specifically:
- Each CDC delta is one or more files on object storage.
- Readers would otherwise have to merge every delta at read time to reconstruct table state — cheap for short streams, impossible for exabyte-scale accumulations.
- Compaction (copy-on-write merge) collapses the stream periodically to keep read cost bounded.
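The read-time merge that compaction avoids can be sketched in a few lines. This is illustrative only — the `(op, key, value)` record shape is hypothetical, not any particular table format's delta encoding:

```python
# Sketch of read-time CDC merge: reconstruct current table state by
# replaying insert/update/delete deltas in commit order. The record
# shape (op, key, value) is hypothetical, not a real format's schema.

def apply_deltas(deltas):
    """Replay an ordered stream of (op, key, value) deltas into a dict."""
    state = {}
    for op, key, value in deltas:
        if op in ("insert", "update"):
            state[key] = value          # upsert: latest value per key wins
        elif op == "delete":
            state.pop(key, None)
    return state

deltas = [
    ("insert", 1, "a"),
    ("insert", 2, "b"),
    ("update", 1, "a2"),
    ("delete", 2, None),
    ("insert", 3, "c"),
]
# Replay cost is O(all deltas ever written) — the quantity that grows
# without bound unless a compactor periodically collapses the stream.
print(apply_deltas(deltas))  # {1: 'a2', 3: 'c'}
```

The point of the sketch is the cost model, not the data structure: every reader pays for the full history until a compactor folds it into a snapshot.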
Amazon BDT's CDC framing¶
From the Spark-to-Ray migration post, Amazon's post-Oracle data catalog modelled every table as an unbounded stream of S3 files, each file holding records to insert, update, or delete:
"all tables in their catalog being composed of unbounded streams of Amazon S3 files, where each file contained records to either insert, update, or delete. It was the responsibility of each subscriber's chosen compute framework to dynamically apply, or 'merge,' all of these changes at read time to yield the correct current table state." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)
Two degenerate shapes they called out:
- Millions of kilobyte-scale tiny files — small CDC batches that never got compacted.
- Few terabyte-scale huge files — big bulk deltas that stress downstream readers in the other direction.
Both are anti-patterns that a compactor has to fix while merging.
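A minimal sketch of the file-sizing half of that job — illustrative only, not Amazon BDT's actual compactor, and the 128 MiB target is an assumed value: coalesce tiny inputs and split huge ones toward a target output size.

```python
# Illustrative sketch, NOT Amazon BDT's compactor: normalize CDC file
# sizes while merging. Kilobyte-scale tiny files get packed together;
# terabyte-scale huge files get split into target-sized chunks.

TARGET = 128 * 1024 * 1024  # assumed 128 MiB output-file target

def plan_output_sizes(input_sizes, target=TARGET):
    """Greedy size plan: pack small inputs, split oversized ones."""
    outputs, bucket = [], 0
    for size in input_sizes:
        if size >= target:
            full, rest = divmod(size, target)
            outputs.extend([target] * full)  # split huge file
            bucket += rest
        else:
            bucket += size                   # accumulate tiny files
        while bucket >= target:
            outputs.append(target)
            bucket -= target
    if bucket:
        outputs.append(bucket)               # final partial file
    return outputs
```

Run against the two degenerate shapes: millions of 1 KiB inputs collapse into a handful of target-sized outputs, and one 5 TiB input fans out into many — either way, output sizes stay bounded near the target.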
Two compaction modes on CDC¶
- Append-only compaction — all deltas are pure inserts; merge reduces to concatenation, but file sizing still matters.
- Upsert compaction — deltas contain updates (with a merge-key schema); only the latest value per key survives. The canonical Iceberg/Hudi/Delta workload.
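The two modes can be contrasted in a short sketch (hypothetical record shapes; deletes omitted for brevity): append-only compaction really is just concatenation, while upsert compaction keeps only the last record per merge key.

```python
# Sketch of the two CDC compaction modes. Record and file shapes are
# hypothetical, not any specific table format's layout.

def compact_append_only(delta_files):
    """All deltas are pure inserts: merge reduces to concatenation."""
    return [record for f in delta_files for record in f]

def compact_upsert(delta_files, key):
    """Deltas may update existing keys: last write per key survives.
    (Delete markers omitted here for brevity.)"""
    latest = {}
    for f in delta_files:            # files visited in commit order
        for record in f:
            latest[key(record)] = record
    return list(latest.values())

files = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 1, "v": "a2"}],          # later delta updates key 1
]
print(compact_upsert(files, key=lambda r: r["id"]))
# [{'id': 1, 'v': 'a2'}, {'id': 2, 'v': 'b'}]
```

Even in the append-only case the compactor still rewrites files — concatenation is trivial logically, but the output must hit sane file sizes, which is the point of running it at all.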
Related¶
- concepts/copy-on-write-merge — the compaction strategy CDC logs require.
- concepts/open-table-format — formats that formalize the delta-log + snapshot model.
- concepts/lsm-compaction — the closest cousin pattern inside LSM-style stores.
Seen in¶
- sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — CDC as the source-shape driving the scalability problem that motivated Amazon BDT's compactor work, first on Spark then on Ray; the insert/update/delete delta semantics are named explicitly.
- sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia names CDC as one of the two canonical use cases (alongside SCD) that make MERGE INTO the right default on systems/apache-iceberg; the CDC workload shape is precisely what motivates MOR over COW for write efficiency.
- sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale — CDC as the substrate of an online invalidation-based cache. LiveGraph's stateless invalidator tails Postgres's WAL logical replication stream per DB shard — the same CDC stream the database itself uses to keep replicas up to date — and emits invalidation messages computed from the mutation plus the schema's query shapes. Distinct pattern from warehouse-style CDC compaction (concepts/copy-on-write-merge): here the CDC consumer is a latency-sensitive cache invalidator, not a periodic batch compactor. Demonstrates CDC as a first-class integration substrate between OLTP Postgres and downstream real-time data services, not just ETL / replication.
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — CDC as the substrate of a managed multi-tenant replication platform at company scale. Datadog built a unified CDC platform on Debezium + Kafka + Kafka Connect with five sink classes (Elasticsearch, Postgres-to-Postgres for shared-DB unwinding, Iceberg for analytics, Cassandra, cross-region Kafka). Operating points: page latency 30 s → 1 s (up to 97%), replication lag ~500 ms. Three new framings vs prior wiki CDC instances: (1) CDC as the primary split-database-and-search answer — explicitly the opposite direction from Cars24's Atlas-Search consolidation; (2) CDC provisioning as a workflow-engine problem (Temporal decomposing the 7-step manual runbook); (3) CDC's schema-evolution hard problem as a two-layer answer (offline + runtime). Named operating mode throughout: async, chosen explicitly over sync for scalability over strict consistency.
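The Figma-style invalidator entry above can be sketched as a stateless CDC consumer that maps each mutation to the cache keys it affects. This is a hedged illustration, not LiveGraph's or Debezium's actual API — the table names, query-shape registry, and key format are all hypothetical:

```python
# Hypothetical sketch of a stateless CDC-driven cache invalidator:
# each logical-replication change event is mapped, via per-table query
# shapes, to the cache keys it invalidates. Tables, keys, and event
# shape are invented for illustration.

QUERY_SHAPES = {
    # table -> function(row) -> cache keys affected by a mutation to row
    "comments": lambda row: [f"comments:file:{row['file_id']}"],
    "files":    lambda row: [f"file:{row['id']}"],
}

def invalidations_for(change):
    """Map one change event to the invalidation messages to emit."""
    derive = QUERY_SHAPES.get(change["table"])
    return derive(change["row"]) if derive else []

event = {"table": "comments", "op": "insert",
         "row": {"id": 7, "file_id": 42}}
print(invalidations_for(event))  # ['comments:file:42']
```

The consumer holds no table state — exactly the property that distinguishes this CDC use from compaction, where the consumer's whole job is to materialise state.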