ZALANDO 2023-11-08

Zalando — Patching the PostgreSQL JDBC Driver

Summary

Zalando's 2023-11-08 engineering post (authored by the team running their internal Postgres-sourced event-streaming platform) diagnoses a long-standing runaway-WAL-growth bug in Postgres logical replication and describes the fix Zalando upstreamed into pgjdbc (the Postgres JDBC driver). The bug: an idle logical replication slot for a low-write-traffic table pins WAL on the primary indefinitely while other tables on the same server receive writes, eventually exhausting disk space and demoting the node to read-only. The fix: make pgjdbc respond to Postgres KeepAlive messages (which carry the current server WAL LSN) by advancing the subscriber's confirmed_flush_lsn once all seen replication messages have been flushed — the canonical "pure solution" that had been documented but never implemented. The PR merged upstream on 2023-08-31 and shipped in pgjdbc 42.7.0. All Java clients of pgjdbc, including Debezium, automatically get the fix — a canonical instance of patterns/client-driver-fix-over-application-workaround: solve it at the driver layer and the whole downstream ecosystem inherits the fix.

Key takeaways

  1. The failure mode is asymmetric tables on the same server. Zalando has hundreds of Postgres-sourced event streams in production. When a logical replication client subscribes to a low-write-traffic table (the "pink" table in the post's running example), but that table shares a Postgres server with high-write-traffic tables ("blue"), the pink slot's restart_lsn never advances. WAL events for both tables go to the same server-level WAL; the pink slot pins them indefinitely — "until the disk space is entirely used up by old WAL events that cannot be deleted until a write occurs in the pink table." (Source: sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver)

  2. Replication slots are a server-level construct over a server-level WAL. The post canonicalises the architectural lever: "The WAL exists at the server level and we cannot break it down into a table-level or schema-level concept. All changes for all tables in all schemas in all databases on that server go into the same WAL." Cross-references concepts/postgres-logical-replication-slot and concepts/wal-write-ahead-logging.

  3. The industry kludge: dummy writes. The most popular mitigation — also used inside Zalando historically and named in both Gunnar Morling's (ex-Debezium lead) blog post and Byron Wolfman's post — is scheduled dummy writes to the low-traffic table to force the slot to advance. The post frames this verbatim as "a kludge that doesn't address the real issue at the heart of the problem and mandates a constant extra workload overhead from now and forever more when setting up Postgres logical replication." See concepts/dummy-write-heartbeat-kludge — structurally distinct from general replication heartbeats because it exists purely to advance a replication slot, not to measure lag.

  4. The pure solution lives in KeepAlive messages. Postgres sends KeepAlive messages at regular intervals to maintain the connection; they contain "some identifiers, a timestamp, a single bit denoting if a reply is required, but most crucially, the KeepAlive message contains the current WAL LSN of the database server." Historically pgjdbc didn't respond — so the server-side slot state never advanced on the basis of KeepAlives. The fix has the client track two LSNs: the last Replication message received and the latest message confirmed. If they are equal (all seen changes are flushed) and a KeepAlive arrives with a higher LSN, the client infers "some irrelevant changes are happening on the database that the client doesn't care about" and safely acks the higher LSN back, which advances the server-side slot. See concepts/keepalive-message-lsn-advancement.

  5. Upstream the fix at the client driver layer. Because pgjdbc is a transitive dependency of Debezium (and therefore of every JVM Debezium-based CDC pipeline), the upstream fix propagates through the ecosystem automatically once downstream consumers pick up pgjdbc 42.7.0+. The post frames this as a deliberate choice: "If we could solve the issue at this level, then it would be abstracted away from — and solved for — all Java applications, Debezium included." Canonical instance of patterns/upstream-the-fix at the Postgres-JDBC-driver altitude, and of patterns/client-driver-fix-over-application-workaround — the driver is below every consuming framework.

  6. Rollout used a patched vs unpatched Docker-image split. Zalando's event-streaming apps don't use the latest Debezium (to preserve backwards compatibility with legacy downstream functionality), which means they don't get pgjdbc 42.7.0 via normal transitive updates. Instead Zalando modified their build scripts to override the transitive pgjdbc version with a locally-built 42.6.1-patched jar carrying the fix, then built two Docker images — one unchanged, one with the patched driver. The patched image went to the test environment first; the unpatched one stayed in production. Canonical instance of patterns/parallel-docker-image-prod-vs-test-for-patched-library and of the transitive-dependency override build primitive.

  7. Conservative by design — zero skipped events. The fix is deliberately restricted to only acking a higher LSN when the client has flushed all Replication messages it has seen. This means no KeepAlive-driven ack can ever skip an in-flight event — "sufficiently conservative enough to allow confirmation of LSNs while guaranteeing that no relevant events can be skipped." The production monitor is a flat WAL-size graph over multi-day windows on low-activity databases; before-fix and after-fix graphs over the same 36-hour window show runaway growth replaced by a stable trace.
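The conservative KeepAlive-ack rule in takeaways 4 and 7 can be sketched as pure logic. This is a minimal sketch; the class and method names are hypothetical illustrations, not pgjdbc's actual internals:

```java
import java.util.Optional;

// Hypothetical sketch of the client-side rule: only ack a KeepAlive's
// LSN when every Replication message seen so far has been flushed.
final class KeepAliveAckPolicy {
    private long lastReceivedLsn = 0;   // LSN of last Replication message seen
    private long lastFlushedLsn  = 0;   // LSN the client has confirmed back

    void onReplicationMessage(long lsn) { lastReceivedLsn = lsn; }

    void onFlush(long lsn) { lastFlushedLsn = lsn; }

    // Returns the LSN to ack in response to a KeepAlive, if it is safe to do so.
    Optional<Long> onKeepAlive(long serverLsn) {
        boolean allSeenFlushed = lastReceivedLsn == lastFlushedLsn;
        if (allSeenFlushed && serverLsn > lastFlushedLsn) {
            lastFlushedLsn = serverLsn;   // the server-side slot can safely advance
            return Optional.of(serverLsn);
        }
        return Optional.empty();          // in-flight work: leave the slot alone
    }
}
```

Under this sketch, a client that has flushed LSN 7 and then receives a KeepAlive at LSN 13 acks 13; a client with an unflushed message acks nothing, so the slot can never be moved past an in-flight event.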

Zalando's event-streaming platform context

Zalando exposes a low-code event-streaming primitive to internal builders: declare an event stream sourced from a Postgres table and the platform provisions a dedicated "Postgres-sourced event streams" micro-application. Each such app runs Debezium Engine (the embedded-mode variant of Debezium; distinct from the Debezium Kafka Connect source connectors) and publishes table-level change events to a variety of downstream technologies, with arbitrary AWS Lambda-based event transformations in the middle. Scale: "hundreds of these Postgres-sourced event streams out in the wild at Zalando" at publication time.

The post is therefore an applied Debezium-Engine deployment post — not a standalone pgjdbc post — and the runaway-WAL-growth failure mode shows up at Zalando precisely because they have a large number of low-traffic Postgres-sourced streams each owning their own logical replication slot on a shared-tenant Postgres server.

Architecture: replication slot and WAL advancement

The post's running example:

  • A Postgres server with three tables (blue / pink / purple).
  • Blue and pink each have a logical replication slot feeding a client.
  • All three tables write to the single server-level WAL.
  • Blue receives writes continuously; pink and purple receive almost no writes.

Sequence:

  1. Blue write at WAL position #7 → blue client acks → blue slot advances to #7.
  2. WAL position #7 cannot be reclaimed because the pink slot is still at #6 (even though pink isn't interested in blue events).
  3. Blue continues writing — WAL grows at #8, #9, ..., #13 — none reclaimable.
  4. A write finally occurs on pink at position #14 → pink client acks → pink slot advances to #14 → Postgres can finally reclaim WAL up to #13.

Without step 4 ever happening, WAL grows unbounded. The wal_level=logical configuration is the precondition that makes logical decoding possible but does not change the slot-level WAL-pinning property.
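The pinning arithmetic can be modelled in a few lines: the server may only reclaim WAL up to the minimum confirmed LSN across all slots, so one idle slot caps reclamation no matter how far the busy slot advances. A toy model, not Postgres internals:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model: the server may reclaim WAL only up to the minimum
// confirmed LSN across all replication slots.
final class WalReclamation {
    private final Map<String, Long> slotConfirmedLsn = new HashMap<>();

    void confirm(String slot, long lsn) { slotConfirmedLsn.put(slot, lsn); }

    // Highest WAL position the server is allowed to reclaim up to.
    long reclaimableUpTo() {
        return Collections.min(slotConfirmedLsn.values());
    }
}
```

Replaying the post's numbers: with blue at #7 and pink at #6, only WAL up to #6 is reclaimable; blue advancing to #13 changes nothing while pink sits at #6; only when pink acks #14 does reclamation jump to #13.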

The pgjdbc fix in detail

Before

Server → Client:   Replication msg (LSN=7)
Client → Server:   Ack (LSN=7)                ← replication slot advances
Server → Client:   KeepAlive (LSN=13)
Client → Server:   (no reply — bug)           ← slot stays at 7

After (PR #2941, merged 2023-08-31, in pgjdbc 42.7.0+)

Server → Client:   Replication msg (LSN=7)
Client → Server:   Ack (LSN=7)                ← slot advances to 7
Server → Client:   KeepAlive (LSN=13)
Client internal:   (lastSeen = 7, lastConfirmed = 7; all seen msgs flushed)
Client → Server:   Ack (LSN=13)               ← slot advances to 13

The client's invariant — ack an LSN higher than any Replication message received only when every message seen has been flushed — is what makes the fix safe: the ack can never move the slot past an in-flight event.

Operational numbers

  • Scale: hundreds of Postgres-sourced event streams at Zalando at publication time.
  • pgjdbc PR merge date: 2023-08-31.
  • pgjdbc release with fix: 42.7.0.
  • Zalando's local rollout: backported the fix into a locally-built 42.6.1-patched jar; built two Docker images (one unchanged, one with the patched driver) via a conditional override in the build scripts; deployed the patched image to the test cluster first.
  • Monitoring window: 36-hour WAL-size graph on a low-activity database. Pre-fix graph shows a growing ramp; post-fix graph is roughly flat.
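The post does not show Zalando's actual build scripts; purely as an illustration, a transitive-dependency override of the kind described could look like this in Maven (the `42.6.1-patched` version string comes from the post; everything else is a hypothetical sketch, and it assumes the patched jar is installed in a repository the build can reach):

```xml
<!-- Force every transitive use of pgjdbc (e.g. via Debezium) onto the
     locally-built patched jar. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>42.6.1-patched</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```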

Caveats and what the post does not cover

  • The post does not quantify the KeepAlive interval or the average time-to-first-reclaimable-WAL-segment post-fix. The flat-graph evidence is qualitative.
  • The post does not cover whether pgjdbc 42.7.0's fix interacts with non-pgoutput logical-decoding plugins (e.g. wal2json, decoderbufs); the implementation sits at the Postgres streaming-replication wire-protocol layer, so it should apply uniformly, but the post does not say.
  • The post does not describe behaviour when a slot is dropped or when a subscriber experiences a long outage. The fix addresses the steady-state runaway-WAL-growth failure mode, not disaster recovery of a lost slot.
  • The production rollout discipline (test cluster first, then production after visual graph confirmation) is lighter-weight than some of the more elaborate production-qualification disciplines on the wiki (e.g. Yelp's patterns/pre-flight-flight-post-flight-upgrade-stages).
