KeepAlive-message LSN advancement¶
Definition¶
KeepAlive-message LSN advancement is the technique of using Postgres logical-replication KeepAlive messages — periodic heartbeat frames sent server → client that carry the current server WAL LSN — as a cue for the client to acknowledge the server-reported LSN when it has no outstanding Replication messages to ack. Acking a higher LSN advances the server-side logical replication slot's `confirmed_flush_lsn`, which lets Postgres reclaim older WAL.
The problem it solves¶
Without KeepAlive-driven advancement, a subscriber's slot stalls whenever the subscribed table has no changes, even while other tables on the same server are generating WAL. The slot for the quiet table pins WAL indefinitely → concepts/runaway-wal-growth.
How the Postgres wire protocol enables it¶
Per Zalando's 2023-11-08 post (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver):
"the KeepAlive message contains very little data: some identifiers, a timestamp, a single bit denoting if a reply is required, but most crucially, the KeepAlive message contains the current WAL LSN of the database server."
Postgres sends KeepAlives periodically on a logical-replication connection to keep TCP alive; the current server LSN field is the single load-bearing payload for this advancement technique.
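The frame layout described above (identifiers, timestamp, reply flag, server WAL LSN) can be sketched as a small decoder. This is an illustrative model following the streaming-replication protocol's primary-keepalive layout — a `'k'` tag byte, an Int64 WAL LSN, an Int64 timestamp in microseconds since 2000-01-01, and a reply-requested byte — not pgjdbc's actual code; the class and field names are invented for the sketch.

```java
import java.nio.ByteBuffer;

// Illustrative decoder for the primary keepalive copy-data payload.
// Layout: Byte1('k') | Int64 serverWalLsn | Int64 timestampMicros | Byte1 replyRequested
public class KeepAliveFrame {
    public final long serverWalLsn;         // the load-bearing field for LSN advancement
    public final long serverTimestampMicros; // microseconds since 2000-01-01
    public final boolean replyRequested;

    private KeepAliveFrame(long lsn, long ts, boolean reply) {
        this.serverWalLsn = lsn;
        this.serverTimestampMicros = ts;
        this.replyRequested = reply;
    }

    /** Parses a copy-data payload; returns null if it is not a keepalive ('k') frame. */
    public static KeepAliveFrame parse(ByteBuffer buf) {
        if (buf.get() != 'k') {
            return null; // e.g. 'w' = XLogData, a Replication message proper
        }
        long lsn = buf.getLong();
        long ts = buf.getLong();
        boolean reply = buf.get() == 1;
        return new KeepAliveFrame(lsn, ts, reply);
    }
}
```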
The safety invariant¶
The advancement is conservative:
- Track `lastReceivedReplicationLSN` — the LSN of the most recent Replication message delivered to the client.
- Track `lastConfirmedLSN` — the LSN the client has acked back to the server.
- On a KeepAlive with `serverLSN > lastReceivedReplicationLSN`, and `lastReceivedReplicationLSN == lastConfirmedLSN` (everything seen has been flushed), the client can safely ack `serverLSN`.
The invariant — "all seen replication messages are flushed before acking a higher LSN" — guarantees no event can be skipped. The KeepAlive ack only advances the slot through WAL the client has no interest in.
Zalando frames the safety property verbatim:
"This approach is sufficiently conservative enough to allow confirmation of LSNs while guaranteeing that no relevant events can be skipped."
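The two-LSN tracker and its invariant can be sketched as a small state machine. This is an illustrative model, not pgjdbc's actual classes; one modeling choice here (an assumption, not sourced) is that a keepalive-acked LSN is also recorded as "seen", so the equality invariant keeps holding after an advancement.

```java
// Illustrative two-LSN tracker implementing the safety invariant:
// ack a higher server LSN only when everything seen has been flushed.
public class LsnTracker {
    private long lastReceivedReplicationLsn; // most recent Replication message delivered
    private long lastConfirmedLsn;           // highest LSN acked back to the server

    /** Called for every Replication (XLogData) message delivered to the client. */
    public void onReplicationMessage(long lsn) {
        lastReceivedReplicationLsn = lsn;
    }

    /** Called when the client flushes and acks what it has seen. */
    public void onClientAck(long lsn) {
        lastConfirmedLsn = lsn;
    }

    /**
     * Called on a KeepAlive carrying the current server WAL LSN.
     * Returns the LSN to ack, or -1 when it is not safe to advance.
     */
    public long onKeepAlive(long serverLsn) {
        boolean everythingSeenIsFlushed = lastReceivedReplicationLsn == lastConfirmedLsn;
        if (serverLsn > lastReceivedReplicationLsn && everythingSeenIsFlushed) {
            lastReceivedReplicationLsn = serverLsn; // treat the server-reported LSN as seen
            lastConfirmedLsn = serverLsn;           // acking it advances confirmed_flush_lsn
            return serverLsn;
        }
        return -1; // outstanding messages: let the normal ack path catch up first
    }
}
```

The `-1` branch is the conservative half of the invariant: while any delivered Replication message is unflushed, the KeepAlive LSN is ignored and only the normal ack path moves the slot.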
Canonical implementation: pgjdbc 42.7.0¶
Zalando's PR #2941 against pgjdbc merged on 2023-08-31 and shipped in pgjdbc 42.7.0. Before this fix, pgjdbc ignored KeepAlives entirely. After, it implements the two-LSN tracker + safety invariant above.
Because pgjdbc is a transitive dependency of every JVM CDC framework talking to Postgres — most notably Debezium and Debezium Engine — the fix propagates through the downstream ecosystem as consumers pick up pgjdbc 42.7.0+. Canonical instance of patterns/client-driver-fix-over-application-workaround.
Contrast with the kludge¶
Before this fix was available, the industry mitigation was dummy writes — scheduled jobs that wrote rows to the low-traffic table to force the slot to advance. The two are structurally distinct: the kludge operates at the application layer with visible operational overhead (every table with a CDC subscriber needs its own heartbeat writer), while KeepAlive-LSN advancement operates transparently at the driver / wire-protocol layer.
Seen in¶
- sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver — canonical wiki introduction of the KeepAlive-LSN-advancement mechanism. Zalando's post traces the root cause (pgjdbc ignoring KeepAlives), credits Byron Wolfman's and Gunnar Morling's prior blog posts that pointed at the pure solution without implementing it, and documents Zalando's final implementation via pgjdbc PR #2941. The before/after message-flow diagrams distinguish the two eras of pgjdbc behaviour.
Related¶
- concepts/postgres-logical-replication-slot — the slot whose `confirmed_flush_lsn` the technique advances.
- concepts/logical-replication — the mode the technique applies in.
- concepts/wal-write-ahead-logging — the log the technique allows Postgres to reclaim.
- concepts/runaway-wal-growth — the failure mode it prevents.
- concepts/dummy-write-heartbeat-kludge — the kludge the technique replaces.
- systems/pgjdbc-postgres-jdbc-driver — where the canonical implementation landed.
- systems/debezium — the primary downstream beneficiary via transitive-dep upgrade.
- patterns/client-driver-fix-over-application-workaround — the architectural lever.