CONCEPT Cited by 2 sources

KeepAlive-message LSN advancement¶

Definition¶

KeepAlive-message LSN advancement is the technique of using Postgres logical-replication KeepAlive messages — periodic heartbeat frames sent server → client that carry the current server WAL LSN — as a cue for the client to acknowledge the server-reported LSN when it has no outstanding Replication messages to ack. Acking a higher LSN advances the server-side logical replication slot's confirmed_flush_lsn, which lets Postgres reclaim older WAL.

The problem it solves¶

Without KeepAlive-driven advancement, a subscriber's slot stalls whenever its subscribed tables have no changes, even while other tables on the same server are generating WAL. The idle slot then pins WAL for as long as its tables stay quiet, which leads to concepts/runaway-wal-growth.
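
To make the stall concrete, the gap between the server's current LSN and a slot's confirmed_flush_lsn can be computed from their textual `X/Y` forms. The following is a minimal sketch: the function names and the sample LSNs are hypothetical, and the gap is only a proxy for retained WAL, since the exact amount Postgres keeps also depends on the slot's restart_lsn.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual Postgres LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(server_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the server has written past the slot's confirmed position."""
    return parse_lsn(server_lsn) - parse_lsn(confirmed_flush_lsn)


# Illustrative values only: the idle slot last confirmed 0/1000000, while
# writes to other tables have pushed the server to 2/1000000.
lag = slot_lag_bytes("2/1000000", "0/1000000")
print(f"slot is {lag / 2**30:.0f} GiB behind the server")
```

With these made-up values the slot trails the server by 8 GiB, and the gap keeps growing until something acks a higher LSN.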

How the Postgres wire protocol enables it¶

Per Zalando's 2023-11-08 post (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver):

"the KeepAlive message contains very little data: some identifiers, a timestamp, a single bit denoting if a reply is required, but most crucially, the KeepAlive message contains the current WAL LSN of the database server."

Postgres sends KeepAlives periodically on a logical-replication connection as a heartbeat, whether or not the subscribed tables are changing; the current server LSN is the only field this advancement technique relies on.
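
As an illustration of how little the message carries, the sketch below decodes a primary KeepAlive as laid out in the Postgres streaming replication protocol: a `k` tag byte, the 64-bit end-of-WAL LSN, a 64-bit server clock in microseconds since 2000-01-01, and one reply-requested byte. The class and function names are hypothetical, and the input is assumed to be the CopyData body with its outer framing already stripped.

```python
import struct
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Postgres timestamps count microseconds from 2000-01-01 UTC.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class KeepAlive:
    server_lsn: int        # current end of WAL on the server
    sent_at: datetime      # server clock at transmission time
    reply_requested: bool  # server asks for an immediate status update


def parse_keepalive(payload: bytes) -> KeepAlive:
    """Decode the 18-byte KeepAlive body (network byte order, no padding)."""
    tag, wal_end, clock_us, reply = struct.unpack("!cQqB", payload)
    if tag != b"k":
        raise ValueError(f"not a KeepAlive message: tag {tag!r}")
    sent_at = PG_EPOCH + timedelta(microseconds=clock_us)
    return KeepAlive(wal_end, sent_at, reply != 0)


def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in Postgres' textual X/Y hex form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


# Hand-built sample message, for illustration only.
sample = struct.pack("!cQqB", b"k", 0x16B374D848, 0, 1)
message = parse_keepalive(sample)
print(format_lsn(message.server_lsn), message.reply_requested)
```

Only `server_lsn` matters for the advancement technique; the timestamp and reply flag are decoded here for completeness.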

The safety invariant¶

The advancement is conservative:

Track lastReceivedReplicationLSN — the LSN of the most recent Replication message delivered to the client.
Track lastConfirmedLSN — the LSN the client has acked back to the server.
On a KeepAlive with serverLSN > lastReceivedReplicationLSN and lastReceivedReplicationLSN == lastConfirmedLSN (everything seen has been flushed), the client can safely ack serverLSN.

The invariant — "all seen replication messages are flushed before acking a higher LSN" — guarantees no event can be skipped. The KeepAlive ack only advances the slot through WAL the client has no interest in.
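
The two-LSN rule above can be sketched as a small state machine. This is an illustrative model, not pgjdbc's code: the class and method names are hypothetical, LSNs are plain integers, and the "everything seen has been flushed" check is written as `>=` rather than `==`, an assumption of this sketch so that consecutive KeepAlives keep advancing after the first ack.

```python
from typing import Optional


class LsnTracker:
    """Hypothetical model of the two-LSN safety rule."""

    def __init__(self, start_lsn: int = 0) -> None:
        self.last_received_replication_lsn = start_lsn
        self.last_confirmed_lsn = start_lsn

    def on_replication_message(self, lsn: int) -> None:
        # A real change event was delivered; it must be flushed before any
        # higher LSN may be acked.
        self.last_received_replication_lsn = max(
            self.last_received_replication_lsn, lsn
        )

    def on_flushed(self, lsn: int) -> None:
        # The consumer has durably processed everything up to lsn.
        self.last_confirmed_lsn = max(self.last_confirmed_lsn, lsn)

    def on_keepalive(self, server_lsn: int) -> Optional[int]:
        """Return the LSN to ack, or None if acking would be unsafe or pointless."""
        caught_up = self.last_confirmed_lsn >= self.last_received_replication_lsn
        if caught_up and server_lsn > self.last_confirmed_lsn:
            self.last_confirmed_lsn = server_lsn
            return server_lsn
        return None


tracker = LsnTracker(start_lsn=100)
tracker.on_replication_message(150)
blocked = tracker.on_keepalive(200)   # event 150 is not flushed yet
tracker.on_flushed(150)
advanced = tracker.on_keepalive(200)  # nothing outstanding, safe to ack
```

In this run `blocked` is None and `advanced` is 200: the KeepAlive ack is withheld while event 150 is outstanding and only moves the slot once the client has nothing unflushed.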

Zalando frames the safety property verbatim:

"This approach is sufficiently conservative enough to allow confirmation of LSNs while guaranteeing that no relevant events can be skipped."

Canonical implementation: pgjdbc 42.7.0¶

Zalando's PR #2941 against pgjdbc merged on 2023-08-31 and shipped in pgjdbc 42.7.0. Before this fix, pgjdbc ignored KeepAlives entirely. With it, pgjdbc implements the two-LSN tracker and the safety invariant described above.

Because pgjdbc is a transitive dependency of every JVM CDC framework talking to Postgres — most notably Debezium and Debezium Engine — the fix propagates through the downstream ecosystem as consumers pick up pgjdbc 42.7.0+. Canonical instance of patterns/client-driver-fix-over-application-workaround.

Contrast with the kludge¶

Before this fix was available and rolled out, the industry mitigation was dummy writes: scheduled jobs that wrote rows to the low-traffic table to force the slot to advance. The two approaches are structurally distinct: the kludge operates at the application layer with visible operational overhead (every table with a CDC subscriber needs its own heartbeat writer), while KeepAlive-LSN advancement operates transparently at the driver / wire-protocol layer.

Seen in¶

sources/2025-12-18-zalando-contributing-to-debezium-fixing-logical-replication-at-scale: canonical wiki disclosure of the 2024 Debezium-side disable and the 2025-12 re-enablement as opt-in. After the pgjdbc 42.7.0 fix documented on this page was deployed, Debezium discovered that the feature interacted poorly with its own LSN management and hard-disabled it via PR #6472's withAutomaticFlush(false), which broke Zalando's upgrade path. Zalando contributed lsn.flush.mode (DBZ-9641 / PR #6881) to Debezium 3.4.0.Final to re-enable the mechanism as opt-in under lsn.flush.mode=connector_and_driver, with the safe default (connector) leaving pgjdbc's keepalive flush disabled. Canonical second-generation instance of opt-in driver-level LSN flush: the framework-layer sequel to the driver-layer fix.
sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver: canonical wiki introduction of the KeepAlive-LSN-advancement mechanism. The post covers the root cause (pgjdbc ignoring KeepAlives), the prior blog posts by Byron Wolfman and Gunnar Morling that pointed at the pure solution without implementing it, and Zalando's final implementation via pgjdbc PR #2941. The before/after message flow diagrams distinguish the two eras of pgjdbc behaviour.

Related¶

concepts/postgres-logical-replication-slot — the slot whose confirmed_flush_lsn the technique advances.
concepts/logical-replication — the mode the technique applies in.
concepts/wal-write-ahead-logging — the log the technique allows Postgres to reclaim.
concepts/runaway-wal-growth — the failure mode it prevents.
concepts/dummy-write-heartbeat-kludge — the kludge the technique replaces.
systems/pgjdbc-postgres-jdbc-driver — where the canonical implementation landed.
systems/debezium — the primary downstream beneficiary via transitive-dep upgrade.
patterns/client-driver-fix-over-application-workaround — the architectural lever.