SYSTEM Cited by 5 sources
Debezium
Debezium is an open-source Change Data Capture (CDC) platform built on top of Apache Kafka Connect. It exposes per-database source connectors that tail a database's replication log (Postgres logical replication / MySQL binlog / MongoDB oplog / Cassandra commit log / …) and emit the change stream as keyed Kafka records, typically serialised as Avro with an associated schema published to a Kafka Schema Registry.
Stub page — expand on future Debezium-internals sources. The canonical wiki use case is Datadog's managed multi-tenant CDC replication platform, where Debezium is the source-side ingestion component of the Postgres/Cassandra → Kafka → sink-connector → Elasticsearch/Iceberg/Postgres pipeline.
Role in a CDC pipeline
Source DB ──(replication log: Postgres WAL / MySQL binlog / ...)──▶
Debezium source connector ──(Avro-serialised keyed records)──▶
Kafka topic + Kafka Schema Registry ──▶
Sink connector (Kafka Connect) ──▶ downstream system
Debezium serialises record values and their schemas together; schemas are pushed to the Schema Registry and compared against the stored schema under whatever compatibility mode is configured (Datadog runs backward compatibility — see patterns/schema-registry-backward-compat).
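As a rough illustration of the source-side stage in the diagram above, the sketch below builds a Kafka Connect registration payload for a Debezium Postgres source connector with Avro converters pointed at a Schema Registry. It is a minimal sketch and not Datadog's configuration: the function name, host names, user, password placeholder, registry URL, and the slot/publication naming scheme are all assumptions made for illustration. The property keys are the commonly documented Debezium Postgres connector and Confluent Avro converter settings.

```python
import json


def build_postgres_source_config(name, db_host, db_name, tables, registry_url):
    """Build a Kafka Connect registration payload for a Debezium Postgres source.

    All argument values are placeholders supplied by the caller; the slot and
    publication naming scheme below is an assumption for illustration.
    """
    safe = name.replace("-", "_")  # Postgres slot names allow [a-z0-9_] only
    avro = "io.confluent.connect.avro.AvroConverter"
    config = {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # built-in logical decoding plugin
        "database.hostname": db_host,
        "database.port": "5432",
        "database.user": "debezium",          # assumed replication user
        "database.password": "<password>",    # placeholder, inject from a secret store
        "database.dbname": db_name,
        "topic.prefix": name,                 # prefix for the emitted Kafka topics
        "slot.name": f"{safe}_slot",
        "publication.name": f"{safe}_pub",
        "table.include.list": ",".join(tables),
        # Serialise keys and values as Avro; schemas go to the Schema Registry.
        "key.converter": avro,
        "key.converter.schema.registry.url": registry_url,
        "value.converter": avro,
        "value.converter.schema.registry.url": registry_url,
    }
    return {"name": name, "config": config}


payload = build_postgres_source_config(
    "orders-cdc", "pg.example.internal", "shop",
    ["public.orders", "public.customers"], "http://schema-registry:8081",
)
print(json.dumps(payload, indent=2))
```

The resulting JSON is the shape typically POSTed to a Kafka Connect worker's connector-creation endpoint; keeping it as a pure function makes the per-tenant values easy to template.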
Prerequisites on the source database
For Postgres specifically, a Debezium source pipeline requires several operator-side configurations that Datadog explicitly enumerated in their 2025-11-04 retrospective:
- Enable logical replication by setting wal_level=logical.
- Create and configure Postgres users with the correct replication permissions.
- Establish replication objects — publishers (logical publications) and replication slots.
- Deploy Debezium instances (one or more, typically one per source database or shard) to capture changes.
- Create Kafka topics with appropriate partitioning and ensure each Debezium instance maps correctly to its topics.
- Set up heartbeat tables — a known Debezium requirement to advance the replication slot's LSN during quiet periods so the Postgres WAL can be recycled (unacked slots pin WAL indefinitely).
- Configure sink connectors (the downstream half) to move data from Kafka into the target system.
(Source: sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform)
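The database-side steps in the list above (logical WAL level, replication user, publication, slot, heartbeat table) can be sketched as a small generator of SQL statements. This is a hedged illustration and not Datadog's runbook: the function names, the heartbeat table layout, and the default table name are assumptions. The SQL itself uses standard Postgres commands.

```python
def postgres_cdc_prerequisites(user, publication, slot, tables,
                               heartbeat_table="public.debezium_heartbeat"):
    """Return the SQL statements an operator would run, in order.

    Identifiers are interpolated as-is, so this sketch is only suitable for
    trusted, pre-validated names (it does no quoting or escaping).
    """
    captured = ", ".join(list(tables) + [heartbeat_table])
    return [
        # Takes effect only after a Postgres restart.
        "ALTER SYSTEM SET wal_level = logical;",
        f"CREATE ROLE {user} WITH REPLICATION LOGIN;",
        # Hypothetical heartbeat table layout: one row that is upserted.
        f"CREATE TABLE IF NOT EXISTS {heartbeat_table} "
        "(id int PRIMARY KEY, ts timestamptz NOT NULL);",
        f"GRANT SELECT ON {captured} TO {user};",
        f"GRANT INSERT, UPDATE ON {heartbeat_table} TO {user};",
        f"CREATE PUBLICATION {publication} FOR TABLE {captured};",
        f"SELECT pg_create_logical_replication_slot('{slot}', 'pgoutput');",
    ]


def heartbeat_action_query(heartbeat_table="public.debezium_heartbeat"):
    """A periodic write that produces WAL traffic on an otherwise quiet database."""
    return (
        f"INSERT INTO {heartbeat_table} (id, ts) VALUES (1, now()) "
        "ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts"
    )


for statement in postgres_cdc_prerequisites(
        "debezium", "orders_pub", "orders_slot", ["public.orders"]):
    print(statement)
```

The heartbeat query is the kind of statement usually supplied through the connector's heartbeat settings, so that the slot's confirmed position keeps advancing during quiet periods.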
That 7-step manual runbook is the class of operational-complexity problem that motivated Datadog to build a Temporal-orchestrated automation layer over Debezium provisioning.
Schema evolution
Debezium emits schema updates alongside data when a source-DB schema change is captured; in Datadog's platform these are serialised to Avro and the schema is compared against the Schema Registry's stored schema under backward-compatibility rules. This limits schema changes to safe operations (adding optional fields, removing existing optional fields) without breaking downstream consumers.
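To make the "safe operations" rule concrete, here is a deliberately simplified check over flat record schemas. It is a sketch under stated assumptions and not the Schema Registry's actual algorithm: schemas are modelled as plain dicts of field name to spec, nested types and aliases are ignored, and the function name is hypothetical. It follows the stricter reading in the text and flags removal of a field that has no default, even though Avro's own reader/writer resolution ignores writer fields the reader does not declare.

```python
def backward_compat_violations(old_fields, new_fields):
    """Compare two flat record schemas and list backward-compatibility problems.

    Each argument maps field name -> {"type": ..., optionally "default": ...}.
    A field counts as optional when it carries a "default" key.
    """
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so a reader needs a default for it.
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif spec["type"] != old_fields[name]["type"]:
            problems.append(f"field '{name}' changed type")
    for name, spec in old_fields.items():
        if name not in new_fields and "default" not in spec:
            # Conservative rule taken from the text: only optional fields may go.
            problems.append(f"removed field '{name}' was not optional")
    return problems


old = {
    "id": {"type": "long"},
    "note": {"type": ["null", "string"], "default": None},
}
safe = {
    "id": {"type": "long"},
    "email": {"type": ["null", "string"], "default": None},  # added, optional
}
unsafe = {
    "note": {"type": ["null", "string"], "default": None},
    "email": {"type": "string"},  # added without a default
}
print(backward_compat_violations(old, safe))
print(backward_compat_violations(old, unsafe))
```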
The companion offline layer is Datadog's internal automated schema management validation system, which analyses migration SQL before it is applied to the database to catch pipeline-breaking changes (e.g. ALTER TABLE ... ALTER COLUMN ... SET NOT NULL on a column that in-flight Debezium messages might not populate). See patterns/schema-validation-before-deploy.
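A pre-deploy check of this kind can be approximated with a small linter over migration SQL. The sketch below is an assumption about the general shape of such a tool, not a description of Datadog's internal system: the rule table, function name, and the naive split on semicolons are all illustrative, and only the SET NOT NULL case named in the text is implemented.

```python
import re

# Rule name -> pattern. Only the example from the text is covered; a real
# validator would need a proper SQL parser and many more rules.
RISKY_PATTERNS = {
    "SET NOT NULL": re.compile(
        r"\bALTER\s+TABLE\b.*?\bALTER\s+(?:COLUMN\s+)?\S+\s+SET\s+NOT\s+NULL\b",
        re.IGNORECASE | re.DOTALL,
    ),
}


def find_risky_statements(migration_sql):
    """Return (rule, statement) pairs for statements that match a risky rule.

    Statements are split naively on ';', which is wrong for semicolons inside
    string literals or function bodies; acceptable only for a sketch.
    """
    hits = []
    for raw in migration_sql.split(";"):
        statement = raw.strip()
        if not statement:
            continue
        for rule, pattern in RISKY_PATTERNS.items():
            if pattern.search(statement):
                hits.append((rule, statement))
    return hits


migration = """
ALTER TABLE orders ADD COLUMN email text;
ALTER TABLE orders ALTER COLUMN note SET NOT NULL;
"""
for rule, statement in find_risky_statements(migration):
    print(f"{rule}: {statement}")
```

Running the check in CI before the migration is applied keeps the failure on the developer's side instead of surfacing later as a broken replication pipeline.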
Seen in
- sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale — canonical wiki introduction of Debezium 3.4's configurable position-tracking properties lsn.flush.mode and offset.mismatch.strategy (both Zalando contributions, 2025-12-16). The post is the sequel to the 2023-11 pgjdbc upstream fix: Debezium hard-disabled the pgjdbc KeepAlive-LSN-advancement feature that Zalando shipped in 2023 because it created legitimate slot-ahead-of-offset startup mismatches that broke the operator contract for most Debezium users. Zalando's remediation is two properties that make the previously-implicit position-tracking posture explicit per deployment: lsn.flush.mode=connector_and_driver re-enables the pgjdbc flush as opt-in; offset.mismatch.strategy=trust_slot/trust_greater_lsn lets operators who trust the slot as authoritative advance the offset to match on startup. The post canonicalises slot-vs-offset position tracking as a structural two-location problem, MemoryOffsetBackingStore as Zalando's ephemeral-offset-store choice that makes the slot authoritative by construction, and Patroni-managed slot-survives-failover discipline as the operational precondition. Both contributions ship with deprecated-boolean → new-enum auto-mapping per patterns/backward-compatible-config-migration. Scale context at publication: "hundreds of event streams" processing "hundreds of thousands of events per second" across Zalando's 100+ Kubernetes clusters; Zalando ran the feature for nearly two years on Debezium 2.7.4 with pgjdbc 42.7.2 and zero detected data loss before the disable-by-Debezium forced the opt-in redesign.
- sources/2026-04-09-redpanda-oracle-cdc-now-available-in-redpanda-connect — named competitive foil on the Oracle CDC axis. Redpanda's 2026-04-09 launch positions its single-Go-binary oracledb_cdc connector explicitly against Kafka-Connect-hosted Debezium. Verbatim: "Debezium is a solid project. If your team is already running Kafka Connect for other connectors, adding Oracle CDC on top is a reasonable lift. But if you're not already running Kafka Connect, you're standing up a significant amount of infrastructure — dedicated workers, connector offsets, a JVM heap to size, its own monitoring surface — for what should be a data pipeline." Both consumers ride on the same underlying substrate (Oracle LogMiner); the architectural split is at the host framework (Kafka Connect JVM vs Redpanda Connect single Go binary) and the offset-durability class (Kafka Connect offset topics vs in-source checkpointing in Oracle itself). Canonical wiki instance of the CDC driver ecosystem pattern on the Oracle engine — the same source DB supports multiple consumer-side ecosystems writing per-engine drivers against the LogMiner substrate.
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — core ingestion component of Datadog's managed multi-tenant CDC replication platform. Postgres-to-Elasticsearch was the seed pipeline; the platform generalised to Postgres→Postgres, Postgres→Iceberg, Cassandra→X, and cross-region Kafka replication. Datadog maintains custom forks of Debezium/Kafka-Connect where OSS transforms fell short, to introduce Datadog-specific logic and optimisations.
- — canonical wiki disclosure of the Debezium Vitess connector as a consumer of VStream's keyspace-wide change stream. Matt Lord names Debezium as one of four driver ecosystems composing on the VStream API alongside Airbyte, Fivetran, and PlanetScale Connect — canonical wiki instance of the CDC driver ecosystem pattern. Critical framing: "use a Vitess variant of the connector/driver rather than the MySQL one" — the Debezium MySQL connector is sharding-blind against a Vitess cluster (it sees a single shard's binlog); the Debezium Vitess connector talks to VTGate's VStream and gets the unified keyspace-wide stream by construction. Canonical wiki framing of Debezium as a per-engine connector family where the sharding-layer variant is structurally distinct from the single-instance variant.
- sources/2025-03-18-redpanda-3-powerful-connectors-for-real-time-change-data-capture — named foil against Redpanda Connect's four-connector CDC family. Canonical competitive claim verbatim: "Debezium (Kafka Connect) does not do this today" — Redpanda's PostgreSQL and MongoDB CDC connectors ship parallel snapshot of a single large table, which the stock upstream Debezium distribution does not at publication. Framing establishes Debezium as the Kafka-Connect-ecosystem reference point against which a CDC-connector family positions itself architecturally — snapshot parallelism granularity (inter-table vs intra-table) becomes the load-bearing differentiator.
- sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver — canonical wiki introduction of Debezium Engine (the embedded-library mode of Debezium, distinct from Kafka Connect mode) and of the pgjdbc transitive dependency that every JVM Debezium deployment (Engine or Connect) relies on for Postgres logical-replication wire protocol. Zalando runs "hundreds of Postgres-sourced event streams" on Debezium Engine (systems/zalando-postgres-event-streams); at that scale the runaway WAL-growth bug in pgjdbc's KeepAlive handling surfaced as a fleet-wide pain. Zalando upstreamed the fix to pgjdbc (PR #2941, shipped in pgjdbc 42.7.0) — canonical instance of patterns/client-driver-fix-over-application-workaround and a concrete Debezium-adjacent application of patterns/upstream-the-fix. The post also canonicalises heartbeat tables (Debezium's built-in formalisation of the dummy-write kludge) as the application-layer alternative to the driver-layer fix.
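The slot-vs-offset startup mismatch described in the 2025-12-18 Zalando entry above can be illustrated with a small resolver. This is a sketch of the idea only: the semantics attached to trust_slot and trust_greater_lsn are inferred from the property value names and the summary above, not taken from Debezium's implementation, the function names are hypothetical, and the remaining enum values are not modelled. The LSN parsing relies on the standard Postgres textual form of two hexadecimal halves separated by a slash.

```python
def parse_lsn(text):
    """Convert a Postgres LSN such as '0/16B3748' into a comparable integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def resolve_start_lsn(slot_lsn, offset_lsn, strategy):
    """Pick the position to resume from when slot and stored offset disagree.

    Assumed semantics (an interpretation, not Debezium's code):
      trust_slot        -> always resume from the replication slot's position
      trust_greater_lsn -> resume from whichever position is further ahead
    """
    if strategy == "trust_slot":
        return slot_lsn
    if strategy == "trust_greater_lsn":
        if parse_lsn(slot_lsn) >= parse_lsn(offset_lsn):
            return slot_lsn
        return offset_lsn
    raise ValueError(f"strategy not modelled in this sketch: {strategy}")


# The slot has moved ahead of the stored offset, for example after a
# driver-level flush advanced it during a quiet period.
print(resolve_start_lsn("1/0", "0/FFFFFFF0", "trust_greater_lsn"))
print(resolve_start_lsn("0/10", "0/20", "trust_slot"))
```

Comparing LSNs as integers instead of as strings matters: a plain string comparison would mis-order positions such as 0/FFFFFFF0 and 1/0 once the hexadecimal halves differ in length.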
Related
- systems/kafka-connect — Debezium is implemented as a set of Kafka Connect source connectors.
- systems/kafka-schema-registry — where Debezium-emitted schemas are validated under backward-compat rules.
- systems/kafka — destination for CDC records.
- systems/postgresql — primary source database in Datadog's platform; wal_level=logical is the feature gate.
- concepts/change-data-capture — the upstream concept.
- concepts/wal-write-ahead-logging — Postgres logical replication ≡ WAL streaming under wal_level=logical.
- concepts/logical-replication — row-level replication derived from the WAL; Debezium's Postgres feed.
- patterns/debezium-kafka-connect-cdc-pipeline — the full-stack pattern.
- systems/vitess — Debezium's Vitess source connector composes on VStream to get a single unified change stream across all shards of a keyspace.
- systems/vitess-vstream — the Vitess-side API Debezium's Vitess connector consumes.
- systems/planetscale-connect — PlanetScale's first-party managed consumer of the same VStream API.
- concepts/unified-change-stream-across-shards — the structural reason Vitess has its own Debezium connector rather than reusing the MySQL one.
- patterns/cdc-driver-ecosystem — the pattern Debezium is itself a canonical instance of (per-engine source connector set composed on per-engine vendor APIs).
- systems/debezium-engine — Debezium's embedded-library mode, the deployment shape Zalando's event-streaming platform uses.
- systems/pgjdbc-postgres-jdbc-driver — the Java driver every Debezium-Postgres pipeline transitively depends on.
- systems/zalando-postgres-event-streams — canonical Debezium Engine fleet deployment; motivated Zalando's upstream pgjdbc fix.
- concepts/runaway-wal-growth — the failure mode that surfaces at Debezium-deployment scale against low-traffic tables.
- concepts/keepalive-message-lsn-advancement — the pgjdbc 42.7.0+ mechanism that prevents it.
- concepts/dummy-write-heartbeat-kludge — Debezium's own heartbeat-interval config is the application-layer formalisation of this kludge.
- patterns/client-driver-fix-over-application-workaround — the architectural lever behind the pgjdbc upstream fix.
- systems/patroni — the HA substrate that enables slot-authoritative posture via slot-survives-failover discipline.
- concepts/lsn-flush-mode — Debezium 3.4's three-mode enum controlling who may flush LSNs (Zalando DBZ-9641).
- concepts/offset-mismatch-strategy — Debezium 3.4's four-strategy enum controlling slot-vs-offset startup mismatch resolution (Zalando DBZ-9688).
- concepts/slot-vs-offset-position-tracking — the structural two-location problem the 3.4 properties address.
- concepts/memory-offset-backing-store — Zalando's ephemeral-offset-store choice enabling slot-authoritative posture by construction.
- patterns/opt-in-driver-level-lsn-flush — the pattern lsn.flush.mode embodies.
- patterns/authoritative-slot-over-authoritative-offset — the pattern offset.mismatch.strategy makes operator-configurable.
- patterns/backward-compatible-config-migration — boolean → enum auto-mapping discipline used in both 3.4 contributions.