ZALANDO 2025-12-18

Zalando — Contributing to Debezium: Fixing Logical Replication at Scale

Summary

Zalando's 2025-12-18 engineering post is the sequel to their 2023-11 pgjdbc upstream fix (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver). When a subsequent Debezium release hard-disabled the pgjdbc KeepAlive-LSN-advancement feature that Zalando had shipped in pgjdbc 42.7.0, Zalando's upgrade path broke — the fix that protected their fleet of hundreds of Postgres-sourced event streams from runaway WAL growth was no longer reachable through configuration. The post describes the two follow-up upstream contributions Zalando made to Debezium (shipped in Debezium 3.4.0.Final, 2025-12-16):

  1. DBZ-9641 / PR #6881: the lsn.flush.mode config property (replacing the deprecated flush.lsn.source boolean) with three modes — manual, connector (default), connector_and_driver — making the pgjdbc keepalive-flush feature opt-in for operators who can verify it's safe for their deployment.
  2. DBZ-9688 / PR #6948: the offset.mismatch.strategy config property (replacing the deprecated internal.slot.seek.to.known.offset.on.start boolean) with four strategies — no_validation (default), trust_offset, trust_slot, trust_greater_lsn — letting users opt in to treating the Postgres replication slot as the authoritative position-tracking source when their operational reality supports it.

The load-bearing architectural insight: logical-replication position is tracked in two places (Debezium's offset store + the Postgres replication slot's confirmed_flush_lsn), and the right answer to "which wins on startup mismatch?" depends on operator-side invariants that Debezium cannot know. Zalando trusted the slot because they ran Patroni-managed Postgres with slot-survives-failover discipline since 2018; most Debezium users trust the offset store because they run Kafka Connect offset topics as durable truth. The fix is to make the choice explicit per deployment, not to force a universal default.

Key takeaways

  1. The trigger: Debezium hard-disabled Zalando's 2023 pgjdbc fix. A Debezium PR (#6472) hard-coded withAutomaticFlush(false) in the replication stream builder, disabling the pgjdbc KeepAlive-LSN-advancement feature Zalando had shipped in pgjdbc 42.7.0. "For us, this was a blocker because we couldn't upgrade Debezium without losing the fix that kept our production systems stable." Debezium had good reasons — the feature conflicted with Debezium's own LSN management logic and users had reported issues — but a global disable foreclosed Zalando's proven-at-scale deployment path. (Source: sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale)

  2. First contribution — lsn.flush.mode makes the pgjdbc keepalive-flush opt-in. Zalando's DBZ-9641 / PR #6881 introduces a three-mode enum (a config sketch follows this takeaway):

  • manual — LSN flushing is managed externally; the connector does not flush at all.
  • connector (default) — Debezium flushes after each logical-replication change event; the pgjdbc keep-alive thread does not flush LSNs.
  • connector_and_driver — both Debezium and pgjdbc may flush; pgjdbc's keep-alive flush advances the slot when the connector has no pending LSN to flush, preventing WAL accumulation from activity outside monitored tables (CHECKPOINT / VACUUM / pg_switch_wal()).

Backward-compat: the deprecated boolean flush.lsn.source auto-maps (true → connector, false → manual). Canonical instance of patterns/opt-in-driver-level-lsn-flush.
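
A minimal sketch of the opt-in, assuming a Properties-based connector setup; the lsn.flush.mode property and its connector_and_driver value come from the post, while the database coordinates are placeholders:

    import java.util.Properties;

    public class LsnFlushModeConfig {
        // Sketch: opting in to the pgjdbc keepalive flush via the
        // lsn.flush.mode property from DBZ-9641 (Debezium 3.4.0.Final+).
        // Database coordinates are illustrative placeholders.
        static Properties connectorProps() {
            Properties props = new Properties();
            props.setProperty("connector.class",
                    "io.debezium.connector.postgresql.PostgresConnector");
            props.setProperty("database.hostname", "db.example.internal");
            props.setProperty("database.port", "5432");
            props.setProperty("database.user", "debezium");
            props.setProperty("database.dbname", "orders");
            // Both the connector and the pgjdbc keep-alive thread may
            // advance the slot's confirmed_flush_lsn:
            props.setProperty("lsn.flush.mode", "connector_and_driver");
            return props;
        }
    }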

  3. The real problem: slot and offset can legitimately disagree on startup. Debezium tracks position in an offset store (Kafka / memory / other backing store); Postgres tracks position in the replication slot's confirmed_flush_lsn. On startup the two may differ, and there are structural reasons for the mismatch — the pgjdbc keepalive flush can advance the slot past the stored offset because the driver is flushing unmonitored WAL activity (vacuums and checkpoints that produce no logical-replication change events), which the connector cannot flush itself.

  4. Debezium's pre-3.4 default silently failed on mismatch. Pre-3.4 behaviour: the connector attempts to stream from the stored offset without slot-state validation; if the requested LSN is no longer in WAL, Postgres returns a cryptic error. The optional strict flag internal.slot.seek.to.known.offset.on.start=true would immediately fail with "Saved offset is before replication slot's confirmed lsn," forcing a full database re-sync — even though no actual data had been lost. Both paths were operationally unsafe for the legitimate connector_and_driver keepalive-flush shape.

  5. Second contribution — offset.mismatch.strategy lets operators pick which source of truth to trust. Zalando's DBZ-9688 / PR #6948 introduces a four-value enum inspired by Kafka's auto.offset.reset (resolution semantics are sketched after this takeaway):

  • no_validation (default) — stream from the stored offset without validating slot state. Preserves pre-3.4 default behaviour.
  • trust_offset — fail if the offset is behind the slot (indicating potential data loss); advance the slot to the offset via pg_replication_slot_advance() when the offset is ahead. Replaces internal.slot.seek.to.known.offset.on.start=true.
  • trust_slot — the slot is authoritative. If the offset is behind the slot, advance the offset to match the slot (skipping replay of the events between the two). Pairs with lsn.flush.mode=connector_and_driver for production-ready use with durable offset stores.
  • trust_greater_lsn — bidirectional sync; advance whichever side is behind.

Backward-compat: the deprecated internal.slot.seek.to.known.offset.on.start boolean auto-maps (false → no_validation, true → trust_offset). Canonical instance of patterns/authoritative-slot-over-authoritative-offset as a per-deployment architectural posture rather than a framework-imposed default.
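
The four strategies' startup semantics, restated as a hedged Java sketch; the enum values come from the post, but the class and method are invented for illustration and are not Debezium's internal API:

    public class MismatchResolution {
        enum Strategy { NO_VALIDATION, TRUST_OFFSET, TRUST_SLOT, TRUST_GREATER_LSN }

        // Returns the LSN to resume streaming from, given the stored-offset
        // LSN and the slot's confirmed_flush_lsn. Illustrative only.
        static long resolveStartLsn(Strategy strategy, long offsetLsn, long slotLsn) {
            switch (strategy) {
                case NO_VALIDATION:
                    // Pre-3.4 default: stream from the offset, no slot check.
                    return offsetLsn;
                case TRUST_OFFSET:
                    if (offsetLsn < slotLsn) {
                        throw new IllegalStateException(
                                "offset behind slot: potential data loss");
                    }
                    // Offset ahead: the slot would be advanced to match via
                    // pg_replication_slot_advance(), then stream from offset.
                    return offsetLsn;
                case TRUST_SLOT:
                    // Slot is authoritative: a lagging offset is advanced to
                    // the slot, skipping events between the two. (Behaviour
                    // with the offset ahead is not spelled out in the post;
                    // streaming from the offset is assumed here.)
                    return Math.max(offsetLsn, slotLsn);
                case TRUST_GREATER_LSN:
                default:
                    // Bidirectional sync: whichever side is behind catches up.
                    return Math.max(offsetLsn, slotLsn);
            }
        }
    }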

  6. Why Zalando trusts the slot (and most users don't). Zalando has run with MemoryOffsetBackingStore since 2018 — an ephemeral offset store — explicitly because they treated the Postgres replication slot as the authoritative source of stream position. The enabler is their long-running Patroni-managed HA Postgres stack — since the mid-2010s Zalando has run Postgres with automatic failover and replication-slot management that ensures slots survive failovers. "From day one of our logical replication rollout in late 2018, we implemented replication slot management that ensured slots survived failovers, so we could confidently trust the slot position as durable and correct." Most other Debezium users run Kafka Connect offset topics as durable ground truth and treat the replication slot as a Postgres implementation detail — which is why the keepalive-flush's slot-advance-past-offset behaviour broke their operator contract.

  7. Recovery use case: manual slot advance past corrupted WAL. trust_slot and trust_greater_lsn also help with operator-driven recovery — when an operator needs to advance a slot past corrupted WAL using pg_replication_slot_advance(), the connector can be configured to respect that change instead of refusing to start (a JDBC sketch follows). This converts an otherwise-unrecoverable situation (full re-snapshot) into a targeted slot-advance operation.
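
A sketch of that recovery flow over JDBC; pg_replication_slot_advance() is a real Postgres function, while the slot name, target LSN, and connection details are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class SlotAdvanceRecovery {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://db.example.internal:5432/orders",
                    "postgres", "secret");
                 PreparedStatement stmt = conn.prepareStatement(
                    "SELECT pg_replication_slot_advance(?, ?::pg_lsn)")) {
                stmt.setString(1, "debezium_orders_slot"); // placeholder slot name
                stmt.setString(2, "0/5A000000"); // first LSN past the corrupted WAL range
                stmt.execute();
            }
            // With offset.mismatch.strategy=trust_slot (or trust_greater_lsn),
            // the restarted connector respects the new slot position instead
            // of refusing to start.
        }
    }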

  8. Recommended configuration combinations. The closing prescription ties the two properties together per use case (restated as property pairs below):

  • WAL growth on low-activity databases → lsn.flush.mode=connector_and_driver + offset.mismatch.strategy=trust_greater_lsn (for persistent offset stores).
  • Manual recovery past corrupted WAL → offset.mismatch.strategy=trust_slot or trust_greater_lsn.
  • Maximum safety / detect unexpected slot changes → offset.mismatch.strategy=trust_offset — catches potential data-loss scenarios early.
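
The same prescription as property pairs, in a minimal sketch; the property names and enum values are from the post, the class and constant names are invented:

    import java.util.Map;

    public class RecommendedCombinations {
        // WAL growth on low-activity databases (persistent offset stores):
        static final Map<String, String> WAL_GROWTH = Map.of(
                "lsn.flush.mode", "connector_and_driver",
                "offset.mismatch.strategy", "trust_greater_lsn");

        // Manual recovery past corrupted WAL:
        static final Map<String, String> MANUAL_RECOVERY = Map.of(
                "offset.mismatch.strategy", "trust_slot");

        // Maximum safety: fail fast on unexpected slot changes.
        static final Map<String, String> MAXIMUM_SAFETY = Map.of(
                "offset.mismatch.strategy", "trust_offset");
    }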

  9. Backward-compatibility via auto-mapping deprecated booleans. Both contributions follow the same migration pattern: deprecate a boolean, introduce an enum, and map the boolean's values to equivalent enum values at load time (sketched below). Canonical instance of patterns/backward-compatible-config-migration — gives users a deprecation window to adopt the new config without breaking existing deployments.
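
A minimal sketch of that load-time auto-mapping, assuming a simple Properties-based loader; the property names and mappings are from the post, the method itself is illustrative rather than Debezium's actual code:

    import java.util.Properties;

    public class DeprecatedBooleanMigration {
        // Resolve lsn.flush.mode, honouring the deprecated flush.lsn.source
        // boolean during the deprecation window. Illustrative only.
        static String resolveLsnFlushMode(Properties props) {
            String mode = props.getProperty("lsn.flush.mode");
            if (mode != null) {
                return mode;          // new enum property wins
            }
            String legacy = props.getProperty("flush.lsn.source");
            if (legacy == null) {
                return "connector";   // documented default
            }
            // Auto-mapping: true -> connector, false -> manual.
            return Boolean.parseBoolean(legacy) ? "connector" : "manual";
        }
    }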

Architecture: the two-location position-tracking problem

Logical-replication position is persisted in two independent locations:

  • Subscriber side — Debezium's offset store (Kafka Connect offset topic, Debezium Engine backing store, in-memory store). Durability properties depend on the store.
  • Primary side — the Postgres logical replication slot's confirmed_flush_lsn (advances when the client acks) and restart_lsn (oldest WAL still required).
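
The primary-side half of this state can be inspected directly. A sketch querying the pg_replication_slots view (slot_name, confirmed_flush_lsn, and restart_lsn are real columns; connection details are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SlotPositionInspect {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://db.example.internal:5432/orders",
                    "postgres", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                    "SELECT slot_name, confirmed_flush_lsn, restart_lsn"
                    + " FROM pg_replication_slots")) {
                while (rs.next()) {
                    // Print each slot's acked position and oldest required WAL.
                    System.out.printf("%s confirmed_flush=%s restart=%s%n",
                            rs.getString(1), rs.getString(2), rs.getString(3));
                }
            }
        }
    }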

On a clean run both track in lockstep: connector acks an event → offset moves forward → slot moves forward. They diverge when someone advances the slot without the connector acking an event:

  • pgjdbc keepalive flush (Zalando's 2023 fix) — driver acks an LSN reported in a server KeepAlive when all observed replication messages are flushed. Advances the slot; connector offset stays put until the next change event.
  • pg_replication_slot_advance() (operator action) — manual recovery primitive to skip past corrupted WAL. Slot advances; offset store is untouched.

Post-restart, the connector must decide: the stored offset and the slot's confirmed_flush_lsn disagree. Which wins?

Scenario                                    | Slot LSN     | Offset LSN | Safe action
--------------------------------------------|--------------|------------|------------------------------------------------
Clean run                                   | Equal        | Equal      | Stream from either
Keepalive-flushed while idle                | Ahead        | Behind     | Advance offset to slot (slot is correct)
Connector crashed mid-batch                 | Equal/behind | Ahead      | Advance slot to offset (replay nothing)
Operator ran pg_replication_slot_advance()  | Ahead        | Behind     | Advance offset to slot (operator asserted skip)
Connector fresh-snapshot + corrupted slot   | Way ahead    | Way behind | Fail — real data loss

No single default is correct across the four mismatch rows. Zalando's contribution is to make the decision explicit and configurable per deployment.

Why Zalando's architecture is structurally unusual

Three properties compose to make Zalando "apparently the only ones who liked" the pgjdbc keepalive flush:

  1. MemoryOffsetBackingStore — an ephemeral offset store by design. On every connector restart the slot is authoritative; there is no offset to disagree with. Most Debezium users run Kafka Connect offset topics (persistent), where this shape doesn't apply (an engine-config sketch follows this list).
  2. Patroni-managed Postgres — replication slots survive automatic failover by construction. Slots are not a primary-local detail that evaporates on promotion; they're durable-across-failover. Most Debezium users run Postgres where slot loss on failover is an accepted operational hazard.
  3. Long deployment track record — "ran it for nearly two years, processing billions of events with zero detected data loss from this mechanism." This gave Zalando empirical confidence to continue with the pgjdbc keepalive flush even after Debezium disabled it upstream.
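
A sketch of this slot-authoritative posture as Debezium Engine properties, assuming the standard offset.storage setting and Kafka Connect's MemoryOffsetBackingStore class; the combination is only safe under the slot-survives-failover invariant described above:

    import java.util.Properties;

    public class SlotAuthoritativeEngine {
        // Zalando-style posture: ephemeral offsets, slot as source of truth.
        // Only safe when replication slots are guaranteed to survive
        // failover (e.g. Patroni-managed clusters with slot management).
        static Properties engineProps() {
            Properties props = new Properties();
            // In-memory offset store: offsets vanish on restart by design.
            props.setProperty("offset.storage",
                    "org.apache.kafka.connect.storage.MemoryOffsetBackingStore");
            // Let the pgjdbc keep-alive thread flush LSNs during idle periods...
            props.setProperty("lsn.flush.mode", "connector_and_driver");
            // ...and treat the slot as authoritative on startup mismatch.
            props.setProperty("offset.mismatch.strategy", "trust_slot");
            return props;
        }
    }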

Absent any one of the three, trusting the slot over the offset store is operationally unsafe. Absent all three (the typical Debezium-with-Kafka-Connect deployment), the keepalive flush must be disabled by default or it breaks the operator contract — which is why Debezium was right to hard-code it off, and why the correct remediation is opt-in, not a new default.

Operational numbers

  • Scale at publication: "hundreds of event streams" processing "hundreds of thousands of events per second" across Zalando's 100+ Kubernetes clusters at peak.
  • History: infrastructure in operation since late 2018; billions of events processed.
  • Pre-Debezium-disable production run: Debezium 2.7.4 pinned to pgjdbc 42.7.2, nearly two years, zero detected data loss from the keepalive-flush mechanism.
  • Debezium release with both fixes: 3.4.0.Final (2025-12-16).
  • Jira issues: DBZ-9641 (lsn.flush.mode), DBZ-9688 (offset.mismatch.strategy).
  • PRs: #6881, #6948.

Caveats and what the post does not cover

  • No quantitative latency / throughput / WAL-size before-after graphs for the Debezium-3.4 rollout; the post is an architectural retrospective rather than a benchmarking report.
  • The third-party Airbyte issue is cited as a symptom example but not deeply analysed; Zalando focuses on the structural design rather than the surface-level failure mode.
  • The post does not describe Zalando's internal rollout discipline for the Debezium 3.4 upgrade (unlike the 2023 post's detailed patched-vs-unpatched Docker image split rollout).
  • The trust_slot + connector_and_driver combination's failure modes when Patroni-equivalent slot-durability discipline is absent are not benchmarked; Zalando prescribes the combination but doesn't walk through what happens to a user who adopts it without the slot-survival invariant.
  • No empirical comparison with alternative CDC-engine approaches to the same problem (e.g. Redpanda Connect's in-source checkpointing).
