Redpanda — 3 powerful connectors for real-time change data capture¶
Summary¶
Redpanda (2025-03-18) publishes a product-altitude tour of the four
CDC input connectors shipped with
Redpanda Connect — the company's Kafka-Connect alternative — and
positions them competitively against Debezium on
a single architectural axis: parallelised snapshotting of a single
large table or collection. The post canonicalises four per-engine
CDC implementations (postgres_cdc / mysql_cdc / mongodb_cdc /
gcp_spanner_cdc) and the load-bearing shape common to all four: a
snapshot phase → streaming phase transition rooted in each
engine's native change log (PostgreSQL
WAL under wal_level=logical, MySQL
binlog, MongoDB
change streams / oplog,
Spanner change streams). The
differentiating claim ("Debezium does not do this today") is that
Redpanda's Postgres and MongoDB connectors can shard a single large
table or collection into chunks and read them in parallel during the
snapshot phase — canonicalised on the wiki as
parallel snapshot. Tier-3
substrate-qualifying: canonicalises four per-engine CDC mechanisms
and introduces the parallel-snapshot-of-large-table variant as a
distinct wiki primitive.
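The Postgres instance of the snapshot → streaming transition named above can be made concrete. The post itself shows no commands; the following is a hedged sketch of the standard protocol-level sequence behind an exported-snapshot handoff (slot, table, and placeholder names are illustrative, not taken from the post):

```python
# Hedged sketch: the PostgreSQL snapshot -> streaming handoff as the
# SQL / replication-protocol steps a CDC client would issue.
steps = [
    # Replication connection: creating the slot exports a snapshot and
    # fixes the consistent_point LSN separating the two phases.
    "CREATE_REPLICATION_SLOT rp_cdc LOGICAL pgoutput EXPORT_SNAPSHOT",
    # Ordinary connection(s): read the tables at exactly that snapshot.
    "BEGIN ISOLATION LEVEL REPEATABLE READ",
    "SET TRANSACTION SNAPSHOT '<exported-snapshot-id>'",
    "SELECT * FROM big_table",  # the step Redpanda chunks in parallel
    "COMMIT",
    # Replication connection: stream WAL from the consistent point on.
    "START_REPLICATION SLOT rp_cdc LOGICAL <consistent_point>",
]
print("\n".join(steps))
```

Because the snapshot is exported at slot creation, the table reads and the subsequent WAL stream share one consistent boundary; no change can fall between the two phases.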
Key takeaways¶
- Redpanda Connect is the Kafka-Connect alternative here; CDC inputs are its flagship source class. "With hundreds of configurable connectors, Redpanda Connect is a fresh alternative to Kafka Connect that's more flexible, scalable, and simpler to deploy so you can easily integrate disparate data systems." Positions the connector ecosystem as composing with the streaming-broker layer (Redpanda proper) to deliver end-to-end pipelines without a separate Kafka Connect cluster.
- Snapshots + parallel snapshots are the defining differentiator vs Debezium. Canonical framing verbatim: "Snapshots capture a copy of the entire database state. Our connector reads the snapshot into Redpanda to capture the database's current state before transitioning to real-time change streaming (i.e. streaming new data). Parallel snapshots allow multiple tables (PostgreSQL) or collections (MongoDB) to be read concurrently, radically reducing the time needed to complete the snapshot phase. Now here's where we're different: Redpanda's PostgreSQL and MongoDB CDC connectors can also parallelise reads for large tables. That means tables or collections with millions of records can be split into smaller chunks and read in parallel." The claimed Debezium gap is explicit: "Debezium (Kafka Connect) does not do this today."
- Postgres CDC rides on logical replication + replication slot. "PostgreSQL captures changes at a transaction-level via logical replication, ensuring data consistency by only streaming fully committed transactions. No need to worry about partial or rolled-back data." Canonical mechanism verbatim: "When creating a replication slot in PostgreSQL, the connector exports consistent snapshot of your tables and seamlessly transitions to streaming ongoing changes from that snapshot point." The replication slot provides the transition boundary between snapshot and stream phases — the slot's LSN is the source-side offset checkpoint integrated with Redpanda Connect's at-least-once delivery guarantees: "PostgreSQL's built-in replication slots offset tracking and checkpointing is integrated with Redpanda Connect's at-least-once delivery guarantees ensuring you never drop any data."
- MySQL CDC rides on the binlog, and requires an external cache for offset storage. Canonical verbatim: "MySQL CDC uses binlog positions to track changes, requiring an external cache (Redis, a SQL database, or another datastore) to store binlog offsets." Canonicalised as external-offset-store — a structurally distinct mechanism from Postgres's in-database replication slot: Postgres owns offset durability via the slot catalog object, MySQL delegates it to the CDC consumer. "For consistency, this connector gets a global read lock during initial snapshots, records the binlog position, and then releases the lock to stream data from that precise point forward." Topology scope explicitly limited: "Currently supports standard MySQL setups and primary-replica configurations, with plans to extend support for high-availability clusters and Global Transaction ID (GTID) environments." GTID support absent at publication.
- MongoDB CDC rides on change streams + oplog, and gets parallel snapshots by splitting collections into chunks. "The connector employs parallel reads during snapshots, significantly boosting performance for large-scale data migrations by splitting collections into manageable chunks." Plus flexible document modes: "Customizable document handling for updates and deletes, supporting full-document lookups and pre/post image capture." Like MySQL, MongoDB CDC also requires external offset stores: "Uses external stores for oplog positions, similar to MySQL, giving you control over your checkpointing strategy."
- Cloud Spanner CDC rides on change streams with transactionally-stored progress and automatic partition splitting. Canonical verbatim claims: "Reads from a specified change stream within a Spanner database and converts each change record into a message" + "Automatically processes partitions that are merged and split, avoiding hotspots" + "Stores progress transactionally in a configurable spanner table for at least once delivery." Canonicalised as concepts/spanner-change-stream — the dynamic partition split/merge handling is the structural distinguishing feature from the static-partition-count CDC shape in Postgres / MySQL / MongoDB.
- Snapshot-to-streaming transition is the load-bearing shape across all four engines. The canonical two-phase lifecycle — a snapshot of current state followed by a transition-at-offset to the engine's change log — maps cleanly onto snapshot-plus-catch-up replication. The differentiator is how each engine marks the transition point: Postgres uses a replication slot LSN; MySQL uses a binlog position captured under a global read lock; MongoDB uses an oplog timestamp; Spanner uses a change-stream partition-sequence token stored transactionally. Redpanda's innovation is not the two-phase shape itself (Debezium has it too) but the intra-table parallelism during phase 1.
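The two-phase lifecycle common to all four engines can be sketched as a toy model. The "LSN" below stands in for whichever marker the engine uses (binlog position, oplog timestamp, partition token); none of this is connector code:

```python
# Toy model of the snapshot -> streaming transition.
wal = []        # committed changes: (lsn, key, value)
state = {}      # current table contents
lsn = 0

def commit(key, value):
    global lsn
    lsn += 1
    wal.append((lsn, key, value))
    state[key] = value

def take_snapshot():
    # Phase 1: pin the current log position, copy the current state.
    return lsn, dict(state)

def stream_since(boundary):
    # Phase 2: replay only changes committed after the pinned position.
    return [c for c in wal if c[0] > boundary]

commit("a", 1); commit("b", 2)
boundary, snapshot = take_snapshot()
commit("a", 3)                      # lands while the snapshot is shipping
replica = dict(snapshot)
for _, k, v in stream_since(boundary):
    replica[k] = v                  # replica converges to source state
```

The correctness argument is the same in every engine: because the boundary is pinned atomically with the snapshot, a change is delivered either inside the snapshot or by the stream, never neither.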
Structural framings canonicalised¶
- Parallel snapshot of a single large table/collection — the new concepts/parallel-snapshot-cdc concept covers splitting one table or collection into chunks (not one-stream-per-table) to accelerate the initial snapshot phase, and carries the claim that Debezium does not support this. The existing snapshot-plus-catchup-replication pattern canonicalised parallelism across tables as an optimisation ("Parallelise across tables (one stream per table or per destination shard)"); this post extends the primitive one level down — intra-table parallelism during snapshot.
- Offset durability as a structural axis — canonical three-way split: Postgres owns offset durability via a server-side replication slot (no external cache), MySQL and MongoDB delegate offset durability to an external cache (Redis, SQL, any datastore), and Spanner stores offsets transactionally in a Spanner table. The external-cache approach puts the reliability burden on the operator but allows heterogeneous checkpointing topologies.
- Redpanda Connect as composition peer to Redpanda proper — the post's implicit architecture is input connectors feeding Redpanda topics with no Kafka Connect cluster in between. This is the vendor-integrated alternative to the Debezium + Kafka Connect CDC pipeline, positioned as "simpler to deploy".
- CDC driver ecosystem instance — Redpanda Connect is the Debezium-ecosystem analogue for the Redpanda/Kafka world: one per-engine connector per database kind, each riding on that engine's native change log. Same shape as Debezium's Postgres / MySQL / MongoDB / Cassandra / Vitess connector family.
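The external-offset-store half of the durability axis can be made concrete with a minimal sketch. An in-memory dict stands in for Redis or a SQL table, and the interface names are invented for illustration — the post discloses no API:

```python
class ExternalOffsetStore:
    """Stand-in for the Redis/SQL cache the MySQL and MongoDB
    connectors delegate offset durability to (illustrative API)."""
    def __init__(self):
        self._offsets = {}

    def load(self, pipeline):
        return self._offsets.get(pipeline)

    def save(self, pipeline, offset):
        # A real store must make this write durable before the pipeline
        # acknowledges delivery, or at-least-once quietly degrades.
        self._offsets[pipeline] = offset

def resume_from(store, pipeline, snapshot_offset):
    # Prefer the cached offset; fall back to the position recorded
    # under the snapshot's global read lock on first run.
    return store.load(pipeline) or snapshot_offset

store = ExternalOffsetStore()
first = resume_from(store, "mysql_cdc", ("binlog.000001", 154))
store.save("mysql_cdc", ("binlog.000001", 4096))  # checkpoint after delivery
restart = resume_from(store, "mysql_cdc", ("binlog.000001", 154))
```

The contrast with Postgres is visible in the fallback branch: a replication slot makes the server itself the offset store, so there is no cached-position/snapshot-position split for the operator to manage.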
Operational numbers¶
No benchmarks, no latency measurements, no chunk-size recommendations, no production case studies disclosed. The post is entirely a feature tour; the parallel-snapshot speedup is asserted ("radically reducing the time needed to complete the snapshot phase") without quantification.
Caveats¶
- Product-announcement altitude. Feature tour, not architectural deep-dive. No chunk-splitting algorithm disclosed (how does the connector pick chunk boundaries for an unsharded table? By primary key range? By internal page ranges? Unknown). No parallelism upper bound. No coordination semantics between parallel snapshot readers (do they use a single transaction's snapshot, or independent transactions?).
- Debezium-gap claim is unverified. The post's load-bearing competitive claim ("Debezium does not do this today") is true for the stock upstream Debezium distribution as of publication (2025-03-18), but Debezium's 2.x blocking-snapshot metadata signalling and Debezium Server's custom sinks arguably give operators the ability to bolt on parallel-snapshot behaviour. Redpanda's stronger claim is that their connectors ship parallel snapshot as a default-on configuration.
- MySQL CDC topology scope explicitly restricted. No GTID, no multi-source replication, no Group Replication — only standard async primary-replica setups are covered. This is a significant gap vs Debezium's MySQL connector, which supports GTID-based offset tracking and is the 2025 industry default for MySQL CDC.
- No binlog-retention-vs-consumer-lag coupling discussion. The MySQL CDC connector requires the external cache to persist binlog offsets durably, but the post doesn't address what happens when a consumer's cached offset falls behind the source's binlog retention horizon — the same failure mode canonicalised in Matt Lord's VReplication post that the Postgres slot mechanism structurally prevents.
- No MongoDB oplog retention discussion. Analogous gap for the MongoDB CDC connector.
- Spanner partition-split/merge handling hand-waved. The post asserts "automatically processes partitions that are merged and split, avoiding hotspots" without disclosing the coordination mechanism — how does the connector detect a partition split? How does it avoid duplicate reads at the split boundary? How does transactional progress storage interact with partition topology changes?
- Vendor-competitive framing. Debezium is the named foil throughout; no engagement with Kafka Connect's own ecosystem depth, offset-storage framework, or operational maturity. The post is pitched as upgrade-path marketing.
- Enterprise-license gating. "All of the connectors mentioned in this blog are currently available in Redpanda Cloud and Self-Managed with an Enterprise license." Parallel snapshots are not a free-tier feature.
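The undisclosed chunk-splitting strategy (first caveat above) has an obvious candidate: split the table's primary-key range into contiguous sub-ranges and scan them concurrently. A toy sketch under that assumption — the real connector may use page ranges or something else entirely, and nothing here is taken from it:

```python
from concurrent.futures import ThreadPoolExecutor

def pk_chunks(min_pk, max_pk, n):
    """Split [min_pk, max_pk] into n contiguous primary-key ranges."""
    step = -(-(max_pk - min_pk + 1) // n)   # ceiling division
    return [(lo, min(lo + step - 1, max_pk))
            for lo in range(min_pk, max_pk + 1, step)]

def scan_chunk(table, lo, hi):
    # Stand-in for `SELECT ... WHERE pk BETWEEN lo AND hi`.
    return [row for row in table if lo <= row["pk"] <= hi]

def parallel_snapshot(table, n=4):
    ranges = pk_chunks(min(r["pk"] for r in table),
                       max(r["pk"] for r in table), n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = pool.map(lambda rng: scan_chunk(table, *rng), ranges)
    return [row for part in parts for row in part]

table = [{"pk": i} for i in range(1, 101)]
rows = parallel_snapshot(table)
```

Even this naive version surfaces the open coordination question from the caveat: for the chunks to compose into a consistent snapshot, all workers must read at the same database snapshot (e.g. Postgres's exported snapshot), not in independent transactions.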
Source¶
- Original: https://www.redpanda.com/blog/cdc-connectors-real-time-data-streaming
- Raw markdown:
raw/redpanda/2025-03-18-3-powerful-connectors-for-real-time-change-data-capture-0f2eb455.md
Related¶
- companies/redpanda
- systems/redpanda-connect
- systems/redpanda
- systems/debezium
- systems/postgresql
- systems/mysql
- systems/mongodb-server
- systems/cloud-spanner
- systems/kafka-connect
- concepts/change-data-capture
- concepts/logical-replication
- concepts/binlog-replication
- concepts/mongodb-change-streams
- concepts/spanner-change-stream
- concepts/parallel-snapshot-cdc
- concepts/external-offset-store
- concepts/postgres-logical-replication-slot
- concepts/postgres-wal-level-logical
- patterns/snapshot-plus-catchup-replication
- patterns/cdc-driver-ecosystem
- patterns/debezium-kafka-connect-cdc-pipeline