
CONCEPT Cited by 11 sources

Change Data Capture (CDC)

Change Data Capture (CDC) is the practice of materialising a table as an ongoing stream of insert / update / delete deltas rather than (or alongside) an authoritative "current state" table. Downstream consumers apply the stream to reconstruct current state, or snapshot it periodically.

Why it shows up in this wiki

CDC is the upstream pattern that forces the downstream pattern of concepts/copy-on-write-merge in data-lake table formats like systems/apache-iceberg, systems/apache-hudi, and systems/delta-lake. Specifically:

  • Each CDC delta is one or more files on object storage.
  • Readers would otherwise have to merge every delta at read time to reconstruct table state — cheap for short streams, impossible for exabyte-scale accumulations.
  • Compaction (copy-on-write merge) collapses the stream periodically to keep read cost bounded.
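
The read-time merge described above can be sketched in a few lines. This is a minimal illustration, not any particular table format's reader: the `(op, key, row)` record shape and the `merge_at_read` name are assumptions made for the example, and real CDC files carry richer metadata.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Hypothetical delta record: (op, key, row). op is "insert", "update" or "delete".
Delta = Tuple[str, str, Optional[Dict[str, Any]]]


def merge_at_read(delta_files: Iterable[List[Delta]]) -> Dict[str, Dict[str, Any]]:
    """Replay every delta file, oldest first, to rebuild current table state."""
    state: Dict[str, Dict[str, Any]] = {}
    for delta_file in delta_files:
        for op, key, row in delta_file:
            if op == "delete":
                state.pop(key, None)  # deleting a missing key is a no-op
            elif op in ("insert", "update"):
                state[key] = row      # the later delta wins
            else:
                raise ValueError(f"unknown op: {op!r}")
    return state


# Two delta files: the second updates key "a" and deletes key "b".
deltas = [
    [("insert", "a", {"v": 1}), ("insert", "b", {"v": 2})],
    [("update", "a", {"v": 3}), ("delete", "b", None)],
]
current = merge_at_read(deltas)
```

The work done per read is proportional to the total number of deltas ever written, which is why an uncompacted stream eventually becomes unreadable.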

Amazon BDT's CDC framing

From the Spark-to-Ray migration post, Amazon's post-Oracle data catalog modelled every table as an unbounded stream of S3 files, each file holding records to insert, update, or delete:

"all tables in their catalog being composed of unbounded streams of Amazon S3 files, where each file contained records to either insert, update, or delete. It was the responsibility of each subscriber's chosen compute framework to dynamically apply, or 'merge,' all of these changes at read time to yield the correct current table state." (Source: sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2)

Two degenerate shapes they called out:

  • Millions of kilobyte-scale tiny files: small CDC batches that never got compacted.
  • A few terabyte-scale huge files: big bulk deltas that stress downstream readers in the opposite direction.

Both are anti-patterns that a compactor has to fix while merging.
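
As a rough illustration of the file-sizing half of that job, a compactor can group tiny files into batches near a target size and flag oversized files for splitting. The greedy grouping below is an assumed, simplified strategy for the sketch, not the approach described in the Amazon post; `group_files` and `split_count` are hypothetical names.

```python
from typing import List


def group_files(sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily pack file indexes into groups of at most target_bytes.

    A file larger than target_bytes ends up alone in its own group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


def split_count(size: int, target_bytes: int) -> int:
    """Number of output files needed to hold one oversized input file."""
    return -(-size // target_bytes)  # ceiling division


# Three small files, one huge file, two more small files (arbitrary units).
plan = group_files([10, 20, 30, 500, 5, 5], target_bytes=64)
```

Here the three small files are packed together, the 500-unit file is isolated so it can be split, and the trailing small files form a final group.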

Two compaction modes on CDC

  • Append-only compaction — all deltas are pure inserts; merge reduces to concatenation, but file sizing still matters.
  • Upsert compaction — deltas contain updates (with a merge-key schema); only the latest value per key survives. The canonical Iceberg/Hudi/Delta workload.
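
The difference between the two modes can be shown with a small sketch. It assumes each delta file is a list of row dicts ordered oldest first and that `merge_key` names the key column; both function names are hypothetical, and delete handling is left out to keep the contrast visible.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def compact_append_only(delta_files: List[List[Row]]) -> List[Row]:
    """Pure inserts: compaction is concatenation into fewer, larger files."""
    return [row for delta_file in delta_files for row in delta_file]


def compact_upsert(delta_files: List[List[Row]], merge_key: str) -> List[Row]:
    """Updates present: keep only the latest row seen for each merge key."""
    latest: Dict[Any, Row] = {}
    for delta_file in delta_files:      # oldest file first
        for row in delta_file:
            latest[row[merge_key]] = row
    return list(latest.values())


files = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 1, "v": "c"}],
]
appended = compact_append_only(files)
upserted = compact_upsert(files, merge_key="id")
```

Append-only output keeps all three rows, while upsert output keeps two: the newer value for `id` 1 replaces the older one.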

CDC-system migration framing

When a CDC system itself is migrated to another CDC system against the same source, the bad-data-propagation hazard becomes a cross-system concern: two pipelines are running side by side, and a divergence between them needs both detection (continuous row-count + checksum comparison) and containment (partition-level quality marking) before either system's bad partition feeds forward.
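
A minimal sketch of the detection and containment steps, assuming both pipelines expose their output as partition-keyed lists of row dicts. The fingerprint scheme, the "good" / "bad" labels, and all function names are illustrative assumptions; the source does not describe the actual implementation at this level.

```python
import hashlib
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Table = Dict[str, List[Row]]  # partition name -> rows


def partition_fingerprint(rows: List[Row]) -> Tuple[int, int]:
    """Return (row count, order-independent checksum) for one partition."""
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        # Summing modulo 2**64 makes the result independent of row order.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
    return len(rows), checksum


def mark_partitions(legacy: Table, candidate: Table) -> Dict[str, str]:
    """Detection: compare fingerprints and mark each partition good or bad."""
    marks: Dict[str, str] = {}
    for partition in legacy.keys() | candidate.keys():
        same = partition_fingerprint(legacy.get(partition, [])) == \
            partition_fingerprint(candidate.get(partition, []))
        marks[partition] = "good" if same else "bad"
    return marks


def publishable(marks: Dict[str, str]) -> List[str]:
    """Containment: only partitions marked good may feed downstream."""
    return sorted(p for p, mark in marks.items() if mark == "good")


legacy = {"2024-01-01": [{"id": 1, "v": 1}], "2024-01-02": [{"id": 2, "v": 2}]}
candidate = {"2024-01-01": [{"id": 1, "v": 1}], "2024-01-02": [{"id": 2, "v": 99}]}
marks = mark_partitions(legacy, candidate)
```

The second partition diverges between the two pipelines, so it is marked bad and withheld while the first continues to flow.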

Meta's data-ingestion-system migration (sources/2026-05-12-meta-migrating-data-ingestion-systems-at-meta-scale) is the wiki's canonical instance — covers the migration shape (patterns/shadow-then-reverse-shadow-migration) + the control loop that scales it (patterns/automated-job-lifecycle-promotion) + the bad-data containment primitive (patterns/partition-marking-stops-cdc-bleeding) + CDC-migration-specific optimisations (patterns/snapshot-reuse-from-legacy-during-migration, patterns/known-issue-exclusion-batch-selection).

Seen in

  • sources/2026-04-09-redpanda-oracle-cdc-now-available-in-redpanda-connect: sixth source-database engine added to Redpanda Connect's CDC family. Redpanda Connect v4.83.0 ships the oracledb_cdc input (systems/redpanda-connect-oracle-cdc) riding on Oracle LogMiner (the Oracle Enterprise Edition redo-log-mining utility) as the Oracle-native CDC mechanism. Canonicalised as Oracle LogMiner CDC on the wiki, sibling to Postgres logical replication / MySQL binlog / MongoDB oplog / Spanner change streams / SQL Server change tables. Canonical structural contribution: a fourth offset-durability class, in-source checkpointing (progress persisted in a checkpoint table inside Oracle itself, "no external cache required, no re-snapshot, and no gaps"). The Oracle connector also canonicalises precision-aware type mapping: integers from NUMBER(p, 0) → int64, decimals from NUMBER(p, s) with s > 0 → json.Number, driven off Oracle's ALL_TAB_COLUMNS data dictionary view and composed with Schema Registry for typed Avro encoding. Automatic mid-stream schema-drift detection (new columns detected automatically; dropped columns reflected after a restart). Oracle Wallet auth (cwallet.sso auto-login / ewallet.p12 PKCS#12) as the file-based credential-store substrate for regulated environments. The family now spans six engines (Postgres / MySQL / MongoDB / Spanner / MSSQL / Oracle). Competitive framing: single Go binary vs JVM + Kafka Connect cluster + Debezium. Enterprise-gated.

  • sources/2025-11-06-redpanda-253-delivers-near-instant-disaster-recovery-and-more: fifth source-database engine added to Redpanda Connect's CDC family. Redpanda 25.3 (Redpanda Connect 4.67.5) ships the Microsoft SQL Server CDC input (microsoft_sql_server_cdc) riding on MSSQL's native change tables mechanism, which "non-invasively captures every single insert, update, and delete from your SQL Server tables in real time". Joins the prior family (Postgres / MySQL / MongoDB / Spanner) from the 2025-03-18 post. Vendor benchmark: ~40 MB/sec sustained ingest + 3:15 initial snapshot on a 5M-row table versus an unnamed "alternative hosted Kafka + CDC service" at ~14.5 MB/s / ~8:04 snapshot. Fits the CDC driver ecosystem pattern framing. Enterprise-gated.

  • sources/2025-03-18-redpanda-3-powerful-connectors-for-real-time-change-data-capture: canonical wiki disclosure of CDC as the flagship connector-ecosystem class of Redpanda Connect (Redpanda's Kafka Connect alternative). Canonicalises four per-engine CDC mechanisms surfaced as first-class input connectors: Postgres (logical replication + replication slot), MySQL (binlog + external cache for offsets), MongoDB (change streams / oplog + external offset store), Cloud Spanner (change streams with transactional-row offset storage + dynamic partition-split/merge handling). Load-bearing new canonical primitive: parallel snapshot of a single large table or collection, i.e. intra-table chunk parallelism during the snapshot phase, claimed as Redpanda's differentiator vs Debezium: "Debezium (Kafka Connect) does not do this today." Structural framing: CDC's offset-durability axis splits three ways across engines (server-owned Postgres slot / external consumer cache for MySQL + MongoDB / transactional-row for Spanner), each with different HA-coupling + retention-tracking trade-offs. Canonical wiki instance of CDC-as-connector-ecosystem-class with no-Kafka-Connect-cluster-in-the-middle topology.

  • sources/2024-07-29-aws-amazons-exabyte-scale-migration-from-apache-spark-to-ray-on-ec2 — CDC as the source-shape driving the scalability problem that motivated Amazon BDT's compactor work, first on Spark then on Ray; the insert/update/delete delta semantics are named explicitly.

  • sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite — Expedia names CDC as one of the two canonical use cases (alongside SCD) that make MERGE INTO the right default on systems/apache-iceberg; the CDC workload shape is precisely what motivates MOR over COW for write efficiency.
  • sources/2026-04-21-figma-keeping-it-100x-with-real-time-data-at-scale: CDC as the substrate of an online invalidation-based cache. LiveGraph's stateless invalidator tails Postgres's WAL logical replication stream per DB shard (the same CDC stream the database itself uses to keep replicas up-to-date) and emits invalidation messages computed from the mutation + the schema's query shapes. Distinct pattern from warehouse-style CDC compaction (concepts/copy-on-write-merge): here the CDC consumer is a latency-sensitive cache invalidator, not a periodic batch compactor. Demonstrates CDC as a first-class integration substrate between OLTP Postgres and downstream real-time data services, not just ETL / replication.
  • sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform: CDC as the substrate of a managed multi-tenant replication platform at company scale. Datadog built a unified CDC platform on Debezium + Kafka + Kafka Connect with five sink classes (Elasticsearch, Postgres-to-Postgres for shared-DB unwinding, Iceberg for analytics, Cassandra, cross-region Kafka). Operating points: page latency 30 s → 1 s (up to 97%), replication lag ~500 ms. Three new framings vs prior wiki CDC instances: (1) CDC as the primary split-database-and-search answer, explicitly the opposite direction from Cars24's Atlas-Search consolidation; (2) CDC provisioning as a workflow-engine problem (Temporal decomposing the 7-step manual runbook); (3) CDC's schema-evolution hard-problem as a two-layer answer (offline + runtime). Named operating mode throughout: async, chosen explicitly over sync for scalability over strict consistency.
  • CDC as the public API surface of a sharded OLTP database. Matt Lord canonicalises the VStream gRPC API as Vitess's single CDC entrypoint: a unified keyspace-wide change stream across hundreds or thousands of shards, with a single VGTID progress token replacing per-shard cursor bookkeeping. Four driver ecosystems compose on the API (Debezium, Airbyte, Fivetran, PlanetScale Connect) — canonical wiki instance of the CDC driver ecosystem pattern. Key structural rule: "use a Vitess variant of the connector/driver rather than the MySQL one" — engine-native CDC tools are single-shard-blind on sharded databases, so the sharding layer must own the change stream or consumers silently miss rows. First canonical wiki datum on how a sharded RDBMS exposes CDC to downstream ETL ecosystems without delegating shard-enumeration to the consumer.
  • sources/2026-04-21-planetscale-postgres-high-availability-with-cdc: CDC as the lens through which a substrate-level design critique compares Postgres and MySQL. Sam Lambert (PlanetScale CEO, 2025-09-12) argues the two engines sit on opposite ends of an action-log-vs-state-log design space (canonicalised as pattern). Postgres's logical replication slot, a primary-local catalog object that carries restart_lsn + confirmed_flush_lsn, creates an operational coupling between HA actions and CDC subscriber behaviour: primary promotion becomes conditional on the CDC client having recently advanced the slot, or on dropping the slot and forcing a subscriber re-snapshot. Postgres 17 failover slots mirror slot state into WAL but keep the eligibility gate by design to preserve exactly-once semantics. MySQL's binlog is an action log with per-transaction GTIDs propagated by every replica running log_replica_updates=ON; CDC consumers persist their own GTID position, any replica is a valid resume point, and HA proceeds independently of CDC subscriber freshness. Canonical wiki framing: "slot progress is a single-node concern that must be coordinated across the cluster at failover time, and eligibility depends on subscriber behavior outside your control." First wiki source to canonicalise CDC as a design-critique lens rather than a capability lens.
  • sources/2025-06-24-redpanda-why-streaming-is-the-backbone-for-ai-native-data-platforms: CDC as the upstream of the fan-out-to-many-consumers architecture. Canonicalises the pattern patterns/cdc-fanout-single-stream-to-many-consumers: one CDC reader against the source database feeds a single streaming topic that multiple downstream consumers (full-text search, analytics, vector index, reactive agent) subscribe to, instead of each consumer reading the source DB directly. Verbatim trade-off: "CDC streams can strain databases (e.g., by delaying WAL cleanup)"; the single-reader topology bounds this to one slot / one retention-pin. Reactive-agent worked example: "triggering an agent when a user downgrades their plan can be done via the CDC stream on the user_plans table, without redesigning the application layer to support such reactivity." Composes with the wiki's streaming-as-backbone framing: CDC is the upstream connector class that feeds the backbone.

  • sources/2024-08-01-segment-0-6m-year-savings-by-using-s3-for-change-data-capture-for-dynamodb: CDC as a batch-warehouse-feed changelog, with object storage as the cheapest materialisation substrate at petabyte scale. Twilio Segment's objects pipeline ran a changelog of a ~1 PB / ~958 billion-item DynamoDB table to feed warehouse integrations. Canonical framing: the changelog is a secondary index on (item-id, modified-timestamp) materialised outside the base store because the in-database answer (DynamoDB GSI) is priced out at petabyte scale; verbatim: "due to the very large size of our table, creating a GSI for the table is not cost-efficient." See concepts/gsi-cost-anti-pattern-at-petabyte-scale. V1 externalised the changelog to Bigtable on GCP (cross-cloud), V2 to S3 on AWS (single-cloud), saving ~$0.6M/year. Canonical wiki instance of patterns/object-store-as-cdc-log-store: object storage as the primary durable substrate for a CDC changelog feeding batch consumers (distinct from streaming CDC consumption). Distinguishing axis vs the rest of the Seen in entries: this is batch-oriented CDC consumption. The changelog sits at rest for hours-to-days before warehouse integrations pull from it, so the storage-unit-cost per byte dominates and per-record-latency does not. Raw markdown truncated at ~37 lines; V2 mechanism details (S3 prefix layout, compaction, read path) are not wiki-canonicalised.

  • sources/2026-04-22-databricks-stop-hand-coding-change-data-capture-pipelines: CDC as the workload class whose correctness envelope moves from hand-authored to declarative. Databricks' 2026-04-22 product-engineering post pitches AutoCDC inside Lakeflow SDP as the declarative replacement for hand-rolled MERGE logic across three input modes: native change data feed (CDF), CDF with SCD Type 2 history, and snapshot-diff inference (the third canonical CDC ingest shape on this wiki, after log-based capture and CDF). Canonicalises four semantic primitives as API parameters: keys, sequence_by, apply_as_deletes, stored_as_scd_type. The sequence_by column is the load-bearing primitive for out-of-sequence CDC event handling, a named concept separate from idempotency that hand-rolled pipelines routinely get wrong. Code footprint claim: 6–10 lines AutoCDC vs. 40–200+ lines hand-rolled MERGE; Fortune 500 Aerospace & Defense adopter quote: "4 lines of code could replace what I was doing in 1,500 lines of code before." Databricks Runtime perf disclosures since Nov 2025: 71% better perf-per-dollar on SCD Type 1, 96% on SCD Type 2, propagated to all AutoCDC pipelines because the declarative API lets Runtime-level optimisations apply universally. Named regulated-vertical adopters: Navy Federal Credit Union (billions of events/day), Block (pipeline dev: days → hours), Valora Group (Swiss retail). Framing note: Databricks explicitly argues LLM codegen does not fix hand-rolled CDC ("LLMs can generate code, but they don't understand your data") and positions Genie Code as the AI-codegen client that produces AutoCDC declarations, not raw MERGE. Canonical new pattern: patterns/declarative-cdc-over-hand-rolled-merge. First wiki source to pin declarative CDC as a distinct pattern axis separate from native-CDF vs log-based vs snapshot-diff input-mode taxonomy.

  • sources/2024-12-03-redpanda-redpanda-243-extends-lakehouses-with-streaming-data-cdc: origin-point first-engine announcement for the Redpanda Connect CDC family. Redpanda 24.3 ships the postgres_cdc input (Enterprise-tier, beta) framed as the "beginning of a larger CDC effort ... optimized for Redpanda Connect's native Go (vs. Debezium's Java)". First engine; MySQL flagged as next ("Stay tuned for our upcoming CDC connector for MySQL, which uses binary log file replication!"). Capability-statement altitude; multi-replication-mode framing asserted but modes not enumerated. Mechanism depth (logical replication + slots + parallel snapshot differentiation vs Debezium) is covered in the 2025-03-18 CDC tour post.

  • DynamoDB Streams as a managed CDC feed driving a transactional-outbox relay. Zalando Payments's Order Store uses DynamoDB Streams configured with NEW_AND_OLD_IMAGES as the outbox of a transactional outbox: the primary table IS the outbox, and an AWS Lambda consumer emits domain events carrying both the full post-change item and a JSON-patch diff. Canonicalised as concepts/dynamodb-streams + patterns/dynamodb-streams-plus-lambda-outbox-relay. First wiki instance of CDC where the change stream and the primary table are the same storage object (no dedicated outbox table, no dual-write risk), framed explicitly against the 99.9% × 99.9% availability arithmetic.

  • Canonical wiki framing of InnoDB's silent-cascade-in-binlog as a CDC pathology. Any CDC consumer tailing the MySQL binlog (Debezium, VStream, Airbyte / Fivetran / PlanetScale Connect) is structurally missing data for FK child changes driven by parent ON DELETE CASCADE / ON UPDATE CASCADE / SET NULL. Canonicalises the Noach + Gupta verbatim: "Any Change Data Capture (CDC) tool, that tails the binary log or masquerades as a replica, will be missing data when reading a child table's events when the child table has SET NULL or CASCADE foreign key actions. You cannot reliably replay events on those tables, and you will end up with corrupt data, or trying to apply an impossible statement." The PlanetScale fix: Vitess reimplements FK cascades above MySQL via application-level orchestration, so all cascaded child writes go through the binlog. Downstream CDC consumers on top of Vitess see complete event streams; the native MySQL/InnoDB binlog gap is closed at the Vitess layer.
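
Several entries above turn on ordering, most directly the sequence-column idea in the Databricks entry. The sketch below shows the general technique of dropping stale events by comparing a sequence value per key. It is plain Python written for illustration and is not the AutoCDC API; the parameter names only echo the concepts named in that entry, and the event shape is an assumption.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

Event = Dict[str, Any]


def apply_in_sequence(
    events: Iterable[Event],
    key: str,
    sequence_by: str,
    is_delete: Callable[[Event], bool],
) -> Dict[Any, Event]:
    """Apply CDC events that may arrive out of order; highest sequence wins."""
    # key -> (sequence, row); a row of None is a tombstone kept so that a late,
    # older event cannot resurrect a deleted key.
    latest: Dict[Any, Tuple[Any, Optional[Event]]] = {}
    for event in events:
        k, seq = event[key], event[sequence_by]
        if k in latest and latest[k][0] >= seq:
            continue  # stale or duplicate event: ignore it
        latest[k] = (seq, None if is_delete(event) else event)
    return {k: row for k, (_, row) in latest.items() if row is not None}


# Events arrive out of order: the update for id 1 lands before its insert,
# and an older update for id 2 lands after its delete.
arrivals = [
    {"id": 1, "seq": 2, "op": "U", "v": "new"},
    {"id": 1, "seq": 1, "op": "I", "v": "old"},
    {"id": 2, "seq": 1, "op": "I", "v": "x"},
    {"id": 2, "seq": 3, "op": "D", "v": None},
    {"id": 2, "seq": 2, "op": "U", "v": "y"},
]
table = apply_in_sequence(arrivals, "id", "seq", lambda e: e["op"] == "D")
```

Replaying the same events twice yields the same table, but idempotency alone would not have rejected the late `seq` 1 insert; the sequence comparison is what does that.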

Last updated · 542 distilled / 1,571 read