# Zalando — Contributing to Debezium: Fixing Logical Replication at Scale
## Summary
Zalando's 2025-12-18 engineering post is the sequel to their 2023-11 pgjdbc upstream fix (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver). When a subsequent Debezium release hard-disabled the pgjdbc KeepAlive-LSN-advancement feature that Zalando had shipped in pgjdbc 42.7.0, Zalando's upgrade path broke — the fix that protected their fleet of hundreds of Postgres-sourced event streams from runaway WAL growth was no longer reachable through configuration. The post describes the two follow-up upstream contributions Zalando made to Debezium (shipped in Debezium 3.4.0.Final, 2025-12-16):
- DBZ-9641 / PR #6881 — `lsn.flush.mode` config property (replacing the deprecated `flush.lsn.source` boolean) with three modes — `manual`, `connector` (default), `connector_and_driver` — making the pgjdbc keepalive-flush feature opt-in for operators who can verify it's safe for their deployment.
- DBZ-9688 / PR #6948 — `offset.mismatch.strategy` config property (replacing the deprecated `internal.slot.seek.to.known.offset.on.start` boolean) with four strategies — `no_validation` (default), `trust_offset`, `trust_slot`, `trust_greater_lsn` — letting users opt in to treating the Postgres replication slot as the authoritative position-tracking source when their operational reality supports it.
The load-bearing architectural insight: logical-replication position is tracked in two places (Debezium's offset store plus the Postgres replication slot's `confirmed_flush_lsn`), and the right answer to "which wins on startup mismatch?" depends on operator-side invariants that Debezium cannot know. Zalando trusted the slot because they have run Patroni-managed Postgres with slot-survives-failover discipline since 2018; most Debezium users trust the offset store because they run Kafka Connect offset topics as durable truth. The fix is to make the choice explicit per deployment, not to force a universal default.
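Concretely, the two new properties slot into a Debezium Postgres connector configuration like this — a hedged sketch: the two property names and their enum values are taken from the post, while the surrounding connector settings (hostnames, slot name) are illustrative placeholders:

```properties
# Illustrative Debezium Postgres connector snippet; only the two
# lsn.flush.mode / offset.mismatch.strategy lines come from the post.
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=db.example.internal
database.dbname=orders
slot.name=debezium_orders

# New in Debezium 3.4.0.Final (DBZ-9641); replaces flush.lsn.source.
# connector_and_driver re-enables the pgjdbc keepalive flush.
lsn.flush.mode=connector_and_driver

# New in Debezium 3.4.0.Final (DBZ-9688); replaces
# internal.slot.seek.to.known.offset.on.start.
# trust_slot treats the replication slot as the authoritative position.
offset.mismatch.strategy=trust_slot
```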
## Key takeaways
- The trigger: Debezium hard-disabled Zalando's 2023 pgjdbc fix. A Debezium PR (#6472) hard-coded `withAutomaticFlush(false)` in the replication stream builder, disabling the pgjdbc KeepAlive-LSN-advancement feature Zalando had shipped in pgjdbc 42.7.0. "For us, this was a blocker because we couldn't upgrade Debezium without losing the fix that kept our production systems stable." Debezium had good reasons — the feature conflicted with Debezium's own LSN management logic and users had reported issues — but a global disable foreclosed on Zalando's proven-at-scale deployment. (Source: sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale)
- First contribution — `lsn.flush.mode` makes the pgjdbc keepalive-flush opt-in. Zalando's DBZ-9641 / PR #6881 introduces a three-mode enum:
  - `manual` — LSN flushing is managed externally; the connector does not flush at all.
  - `connector` (default) — Debezium flushes after each logical-replication change event; the pgjdbc keep-alive thread does not flush LSNs.
  - `connector_and_driver` — both Debezium and pgjdbc may flush; pgjdbc's keep-alive flush advances the slot when the connector has no pending LSN to flush, preventing WAL accumulation from WAL activity outside monitored tables (CHECKPOINT / VACUUM / `pg_switch_wal()`).

  Backward-compat: the deprecated boolean `flush.lsn.source` auto-maps (`true` → `connector`, `false` → `manual`). Canonical instance of patterns/opt-in-driver-level-lsn-flush.
- The real problem: slot and offset can legitimately disagree on startup. Debezium tracks position in an offset store (Kafka / memory / other backing store); Postgres tracks position in the replication slot's `confirmed_flush_lsn`. On startup the two may differ, and there are structural reasons for the mismatch — the pgjdbc keepalive flush can advance the slot past the stored offset because the driver is flushing unmonitored WAL activity (vacuums and checkpoints that produce no logical-replication change events), which the connector cannot flush itself.
- Debezium's pre-3.4 default silently failed on mismatch. Pre-3.4 behaviour: the connector attempts to stream from the stored offset without slot-state validation; if the requested LSN is no longer in the WAL, Postgres returns a cryptic error. The optional strict flag `internal.slot.seek.to.known.offset.on.start=true` would immediately fail with "Saved offset is before replication slot's confirmed lsn," forcing the user to fully re-sync their database — even though no actual data had been lost. Both paths were operationally unsafe for the legitimate `connector_and_driver` keepalive-flush shape.
- Second contribution — `offset.mismatch.strategy` lets operators pick which source of truth to trust. Zalando's DBZ-9688 / PR #6948 introduces a four-value enum inspired by Kafka's `auto.offset.reset`:
  - `no_validation` (default) — stream from the stored offset without validating slot state. Preserves pre-3.4 default behaviour.
  - `trust_offset` — fail if the offset is behind the slot (indicating potential data loss); advance the slot to the offset via `pg_replication_slot_advance()` when the offset is ahead. Replaces `internal.slot.seek.to.known.offset.on.start=true`.
  - `trust_slot` — the slot is authoritative. If the offset is behind the slot, advance the offset to match the slot (skipping replay of the events between the two). Pairs with `lsn.flush.mode=connector_and_driver` for production-ready use with durable offset stores.
  - `trust_greater_lsn` — bidirectional sync; advance whichever side is behind.

  Backward-compat: the deprecated `internal.slot.seek.to.known.offset.on.start` boolean auto-maps (`false` → `no_validation`, `true` → `trust_offset`). Canonical instance of patterns/authoritative-slot-over-authoritative-offset as a per-deployment architectural posture rather than a framework-imposed default.
- Why Zalando trusts the slot (and most users don't). Zalando has run with `MemoryOffsetBackingStore` since 2018 — an ephemeral offset store — explicitly because they treat the Postgres replication slot as the authoritative source of stream position. The enabler is their long-running Patroni-managed HA Postgres stack — since the mid-2010s Zalando has run Postgres with automatic failover and replication-slot management that ensures slots survive failovers. "From day one of our logical replication rollout in late 2018, we implemented replication slot management that ensured slots survived failovers, so we could confidently trust the slot position as durable and correct." Most other Debezium users run Kafka Connect offset topics as durable ground truth and treat the replication slot as a Postgres implementation detail — which is why the keepalive-flush's slot-advance-past-offset behaviour broke their operator contract.
- Recovery use case: manual slot advance past corrupted WAL. `trust_slot` and `trust_greater_lsn` also help with operator-driven recovery — when an operator needs to advance a slot past corrupted WAL using `pg_replication_slot_advance()`, the connector can be configured to respect that change instead of refusing to start. This converts an otherwise-unrecoverable situation (full re-snapshot) into a targeted slot-advance operation.
- Recommended configuration combinations. The closing prescription ties the two properties together per use case:
  - WAL growth on low-activity databases → `lsn.flush.mode=connector_and_driver` + `offset.mismatch.strategy=trust_greater_lsn` (for persistent offset stores).
  - Manual recovery past corrupted WAL → `offset.mismatch.strategy=trust_slot` or `trust_greater_lsn`.
  - Maximum safety / detect unexpected slot changes → `offset.mismatch.strategy=trust_offset` — catches potential data-loss scenarios early.
- Backward-compatibility via auto-mapping deprecated booleans. Both contributions follow the same migration pattern: deprecate a boolean, introduce an enum, and map the boolean's values to equivalent enum values at load time. Canonical instance of patterns/backward-compatible-config-migration — it gives users a deprecation window to adopt the new config without breaking existing deployments.
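The boolean-to-enum auto-mapping both contributions share can be sketched in a few lines. This is a minimal illustration of the pattern, assuming a simple dict-based config; the function names are hypothetical and this is not Debezium's actual Java code, but the mappings mirror those described above:

```python
# Sketch of the backward-compatible config-migration pattern:
# prefer the new enum property, else auto-map the deprecated boolean.
# Function names are illustrative, not Debezium's implementation.

def resolve_lsn_flush_mode(config: dict) -> str:
    """DBZ-9641: flush.lsn.source (bool) -> lsn.flush.mode (enum)."""
    if "lsn.flush.mode" in config:
        return config["lsn.flush.mode"]
    # Deprecated boolean auto-maps: true -> connector, false -> manual.
    legacy = config.get("flush.lsn.source", True)
    return "connector" if legacy else "manual"

def resolve_offset_mismatch_strategy(config: dict) -> str:
    """DBZ-9688: internal.slot.seek.to.known.offset.on.start (bool)
    -> offset.mismatch.strategy (enum)."""
    if "offset.mismatch.strategy" in config:
        return config["offset.mismatch.strategy"]
    # Deprecated boolean auto-maps: true -> trust_offset, false -> no_validation.
    legacy = config.get("internal.slot.seek.to.known.offset.on.start", False)
    return "trust_offset" if legacy else "no_validation"
```

Existing deployments keep working unchanged through the deprecation window, because the absent-enum path reproduces the old boolean semantics exactly.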
## Architecture: the two-location position-tracking problem
Logical-replication position is persisted in two independent locations:
- Subscriber side — Debezium's offset store (Kafka Connect offset topic, Debezium Engine backing store, in-memory store). Durability properties depend on the store.
- Primary side — the Postgres logical replication slot's `confirmed_flush_lsn` (advances when the client acks) and `restart_lsn` (oldest WAL still required).
On a clean run both track in lockstep: connector acks an event → offset moves forward → slot moves forward. They diverge when someone advances the slot without the connector acking an event:
- pgjdbc keepalive flush (Zalando's 2023 fix) — driver acks an LSN reported in a server KeepAlive when all observed replication messages are flushed. Advances the slot; connector offset stays put until the next change event.
- `pg_replication_slot_advance()` (operator action) — manual recovery primitive to skip past corrupted WAL. The slot advances; the offset store is untouched.
Post-restart, the connector must decide: the stored offset and the slot's `confirmed_flush_lsn` disagree — which wins?
| Scenario | Slot LSN | Offset LSN | Safe action |
|---|---|---|---|
| Clean run | Equal | Equal | Stream from either |
| Keepalive-flushed while idle | Ahead | Behind | Advance offset to slot (slot is correct) |
| Connector crashed mid-batch | Equal / behind | Ahead | Advance slot to offset (replay nothing) |
| Operator ran `pg_replication_slot_advance()` | Ahead | Behind | Advance offset to slot (operator asserted skip) |
| Connector fresh-snapshot + corrupted slot | Way ahead | Way behind | Fail — real data loss |
No single default is correct across the four mismatch rows. Zalando's contribution is to make the decision explicit and configurable per deployment.
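The per-strategy startup decision can be modelled compactly. This is a hedged Python sketch of the semantics described above, not Debezium's implementation: LSNs are modelled as plain integers and the action names are illustrative:

```python
# Model of the offset.mismatch.strategy startup decision.
# Strategy semantics follow the post; everything else is illustrative.

def resolve_startup_position(strategy: str, offset_lsn: int, slot_lsn: int):
    """Return (action, lsn) for a stored-offset vs. slot position mismatch."""
    if strategy == "no_validation":
        # Pre-3.4 default: stream from the offset without checking the slot.
        return ("stream_from_offset", offset_lsn)
    if strategy == "trust_offset":
        if offset_lsn < slot_lsn:
            # Offset behind the slot may indicate data loss: fail loudly.
            raise RuntimeError(
                "Saved offset is before replication slot's confirmed lsn")
        if offset_lsn > slot_lsn:
            # Offset ahead: advance the slot (pg_replication_slot_advance()).
            return ("advance_slot_to_offset", offset_lsn)
        return ("stream_from_offset", offset_lsn)
    if strategy == "trust_slot":
        if offset_lsn < slot_lsn:
            # Slot is authoritative: move the offset forward, skip the gap.
            return ("advance_offset_to_slot", slot_lsn)
        return ("stream_from_offset", offset_lsn)
    if strategy == "trust_greater_lsn":
        # Bidirectional sync: whichever side is behind catches up.
        if offset_lsn < slot_lsn:
            return ("advance_offset_to_slot", slot_lsn)
        if slot_lsn < offset_lsn:
            return ("advance_slot_to_offset", offset_lsn)
        return ("stream_from_offset", offset_lsn)
    raise ValueError(f"unknown strategy: {strategy}")
```

Note how the keepalive-flushed-while-idle row maps to `trust_slot` (or `trust_greater_lsn`) advancing the offset, while `trust_offset` turns the same row into a hard failure — the table's "safe action" column depends entirely on which invariant the operator holds.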
## Why Zalando's architecture is structurally unusual
Three properties compose to make Zalando "apparently the only ones who liked" the pgjdbc keepalive flush:
- `MemoryOffsetBackingStore` — ephemeral offset store by design. On every connector restart the slot is authoritative; there is no offset to disagree with. Most Debezium users run Kafka Connect offset topics (persistent), where this shape doesn't apply.
- Patroni-managed Postgres — replication slots survive automatic failover by construction. Slots are not a primary-local detail that evaporates on promotion; they're durable across failover. Most Debezium users run Postgres where slot loss on failover is an accepted operational hazard.
- Long deployment track record — "ran it for nearly two years, processing billions of events with zero detected data loss from this mechanism." Gave Zalando empirical confidence to continue with the pgjdbc keepalive flush even after Debezium disabled it upstream.
Absent any one of the three, trusting the slot over the offset store is operationally unsafe. Absent all three (the typical Debezium-with-Kafka-Connect deployment), the keepalive flush must be disabled-by-default or it breaks the operator contract — which is why Debezium was right to hard-code it off, and why the correct remediation is opt-in, not a new default.
## Operational numbers
- Scale at publication: "hundreds of event streams" processing "hundreds of thousands of events per second" across Zalando's 100+ Kubernetes clusters at peak.
- History: infrastructure in operation since late 2018; billions of events processed.
- Pre-Debezium-disable production run: Debezium 2.7.4 pinned to pgjdbc 42.7.2, nearly two years, zero detected data loss from the keepalive-flush mechanism.
- Debezium release with both fixes: 3.4.0.Final (2025-12-16).
- Jira issues: DBZ-9641 (`lsn.flush.mode`), DBZ-9688 (`offset.mismatch.strategy`).
- PRs: #6881, #6948.
## Caveats and what the post does not cover
- No quantitative latency / throughput / WAL-size before-after graphs for the Debezium-3.4 rollout; the post is an architectural retrospective rather than a benchmarking report.
- The third-party Airbyte issue is cited as a symptom example but not deeply analysed; Zalando focuses on the structural design rather than the surface-level failure mode.
- The post does not describe Zalando's internal rollout discipline for the Debezium 3.4 upgrade (unlike the 2023 post's detailed patched-vs-unpatched Docker image split rollout).
- The `trust_slot` + `connector_and_driver` combination's failure modes when Patroni-equivalent slot-durability discipline is absent are not benchmarked; Zalando prescribes the combination but doesn't walk through what happens to a user who adopts it without the slot-survival invariant.
- No empirical comparison with alternative CDC-engine approaches to the same problem (e.g. Redpanda Connect's in-source checkpointing).
## Source
- Original: https://engineering.zalando.com/posts/2025/12/contributing-to-debezium.html
- Raw markdown: raw/zalando/2025-12-18-contributing-to-debezium-fixing-logical-replication-at-scale-29ed646d.md
- Jira: DBZ-9641, DBZ-9688
- PRs: debezium/debezium#6881, debezium/debezium#6948
- Debezium release notes: Debezium 3.4.0.Final
- Prior Zalando post: sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver (the 2023 pgjdbc upstream fix this post is a sequel to)
## Related
- systems/debezium — the framework the contributions landed in.
- systems/debezium-engine — Zalando's embedded deployment mode; the shape the contributions unblock at Zalando.
- systems/pgjdbc-postgres-jdbc-driver — the driver whose keepalive-flush feature the first contribution re-enables.
- systems/zalando-postgres-event-streams — the platform context; hundreds of streams on 100+ K8s clusters.
- systems/postgresql — the source database.
- systems/patroni — the HA substrate that makes slot-survives-failover a deployment-wide invariant.
- systems/kafka-connect — the default offset-store environment where `trust_offset` is the right posture.
- concepts/logical-replication — the replication mode.
- concepts/postgres-logical-replication-slot — the primary-side position.
- concepts/runaway-wal-growth — the failure mode the feature prevents.
- concepts/keepalive-message-lsn-advancement — the pgjdbc mechanism the 3.4 opt-in re-unlocks.
- concepts/lsn-flush-mode — the first new config property canonicalised.
- concepts/offset-mismatch-strategy — the second new config property canonicalised.
- concepts/slot-vs-offset-position-tracking — the structural problem both properties address.
- concepts/memory-offset-backing-store — Zalando's choice of ephemeral offset store.
- concepts/external-offset-store — the alternative (Kafka Connect offset topics) used by most Debezium deployments.
- patterns/opt-in-driver-level-lsn-flush — the pattern the `lsn.flush.mode` property embodies.
- patterns/authoritative-slot-over-authoritative-offset — the pattern the `offset.mismatch.strategy` property makes configurable.
- patterns/client-driver-fix-over-application-workaround — the umbrella architectural lever (2023 pgjdbc fix at driver altitude; 2025 Debezium fix at framework altitude).
- patterns/upstream-the-fix — the broader cross-company pattern.
- patterns/backward-compatible-config-migration — boolean → enum auto-mapping discipline.
- companies/zalando — author.