
CONCEPT Cited by 1 source

KeepAlive-message LSN advancement

Definition

KeepAlive-message LSN advancement is the technique of using Postgres logical-replication KeepAlive messages — periodic heartbeat frames sent server → client that carry the current server WAL LSN — as a cue for the client to acknowledge the server-reported LSN when it has no outstanding Replication messages to ack. Acking a higher LSN advances the server-side logical replication slot's confirmed_flush_lsn, which lets Postgres reclaim older WAL.
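As a rough illustration of what "acking an LSN" means on the wire, the sketch below packs a Standby Status Update frame, the client-to-server message that reports written, flushed, and applied positions. The field layout follows the Postgres streaming replication protocol; the function names (build_standby_status_update, pg_timestamp_micros) are hypothetical helpers for this example, not part of pgjdbc or any other driver.

```python
import struct
from datetime import datetime, timezone

# Postgres replication timestamps count microseconds since 2000-01-01 UTC.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def pg_timestamp_micros(now: datetime) -> int:
    """Convert an aware datetime to microseconds since the Postgres epoch."""
    delta = now - PG_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def build_standby_status_update(written: int, flushed: int, applied: int,
                                now: datetime,
                                reply_requested: bool = False) -> bytes:
    """Pack a Standby Status Update payload (hypothetical helper).

    Layout: 'r', three unsigned 64-bit LSNs, a signed 64-bit timestamp,
    and one byte asking the server to reply immediately. Big-endian.
    """
    return struct.pack(
        ">cQQQqB",
        b"r",
        written,
        flushed,   # the value that drives confirmed_flush_lsn on the slot
        applied,
        pg_timestamp_micros(now),
        1 if reply_requested else 0,
    )
```

On a real connection this payload travels inside a CopyData frame, and a driver such as pgjdbc builds and sends it internally; the sketch only shows which field carries the acknowledged LSN.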

The problem it solves

Without KeepAlive-driven advancement, a subscriber's slot stalls whenever the subscribed tables see no changes, even while other tables on the same server keep generating WAL. The idle slot never acks a newer LSN, so it pins WAL indefinitely → concepts/runaway-wal-growth.
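To make the stall concrete, the following sketch measures how far a slot's confirmed position lags behind the server's current WAL position, given the two LSNs in Postgres's textual form (two hexadecimal numbers separated by a slash). The helper names parse_lsn and unconfirmed_wal_bytes are hypothetical, and the sample LSN values are made up for illustration. Note that Postgres retains WAL from the slot's restart_lsn rather than from confirmed_flush_lsn, so the figure approximates the backlog rather than the exact disk usage.

```python
def parse_lsn(text: str) -> int:
    """Parse a textual LSN such as '16/B374D848' into a 64-bit integer.

    The part before the slash is the high 32 bits, the part after
    is the low 32 bits, both in hexadecimal.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def unconfirmed_wal_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the server has written past the slot's confirmed position."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


# Illustrative values only: an idle slot stuck at 0/1000000 while other
# tables have pushed the server to 0/3000000.
backlog = unconfirmed_wal_bytes("0/3000000", "0/1000000")
print(backlog)  # 33554432 bytes, i.e. 32 MiB that keeps growing while the slot is idle
```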

How the Postgres wire protocol enables it

Per Zalando's 2023-11-08 post (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver):

"the KeepAlive message contains very little data: some identifiers, a timestamp, a single bit denoting if a reply is required, but most crucially, the KeepAlive message contains the current WAL LSN of the database server."

Postgres sends KeepAlives periodically on a logical-replication connection as a liveness heartbeat; the current server LSN field is the only part of the payload that this advancement technique relies on.
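The fields Zalando lists map onto the Primary keepalive message of the streaming replication protocol: a 'k' tag byte, the server's current WAL end as a 64-bit integer, a 64-bit server clock in microseconds since 2000-01-01, and a one-byte reply-requested flag. A minimal decoding sketch follows, assuming the payload has already been extracted from its CopyData frame; the KeepAlive tuple and parse_keepalive function are hypothetical names for this example.

```python
import struct
from typing import NamedTuple


class KeepAlive(NamedTuple):
    server_wal_end: int        # current WAL LSN on the server
    server_clock_micros: int   # microseconds since 2000-01-01 UTC
    reply_requested: bool      # server wants a status update right away


def parse_keepalive(payload: bytes) -> KeepAlive:
    """Decode a Primary keepalive payload (hypothetical helper).

    Expects exactly 18 bytes: tag 'k', unsigned 64-bit LSN,
    signed 64-bit timestamp, one flag byte. All big-endian.
    """
    if len(payload) != 18 or payload[0:1] != b"k":
        raise ValueError("not a Primary keepalive message")
    _tag, wal_end, clock, reply = struct.unpack(">cQqB", payload)
    return KeepAlive(wal_end, clock, reply != 0)
```

Only server_wal_end matters for slot advancement; the reply flag governs when the client must answer, not which LSN it may confirm.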

The safety invariant

The advancement is conservative:

  • Track lastReceivedReplicationLSN — the LSN of the most recent Replication message delivered to the client.
  • Track lastConfirmedLSN — the LSN the client has acked back to the server.
  • On KeepAlive with serverLSN > lastReceivedReplicationLSN, and lastReceivedReplicationLSN == lastConfirmedLSN (everything seen has been flushed), the client can safely ack serverLSN.

The invariant — "all seen replication messages are flushed before acking a higher LSN" — guarantees no event can be skipped. The KeepAlive ack only advances the slot through WAL the client has no interest in.
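The bookkeeping above can be sketched as a small state machine. This is an illustrative model under stated assumptions, not the pgjdbc code: the class LsnTracker and its method names are hypothetical, LSNs are plain integers, and the "everything seen has been flushed" check is written as last_confirmed_lsn >= last_received_replication_lsn (rather than strict equality) so that consecutive KeepAlives can keep advancing the slot after the first KeepAlive-driven ack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LsnTracker:
    """Hypothetical two-LSN tracker modelling the safety invariant."""
    last_received_replication_lsn: int = 0
    last_confirmed_lsn: int = 0

    def on_replication_message(self, lsn: int) -> None:
        """Record a Replication message delivered to the client."""
        self.last_received_replication_lsn = lsn

    def on_flushed(self, lsn: int) -> None:
        """Record that the client has processed and acked up to lsn."""
        self.last_confirmed_lsn = max(self.last_confirmed_lsn, lsn)

    def on_keepalive(self, server_lsn: int) -> Optional[int]:
        """Return the LSN to ack in response to a KeepAlive, or None.

        Advancing is allowed only when no delivered Replication message
        is still waiting to be confirmed and the server is ahead of us.
        """
        nothing_outstanding = (
            self.last_confirmed_lsn >= self.last_received_replication_lsn
        )
        server_is_ahead = (
            server_lsn > self.last_received_replication_lsn
            and server_lsn > self.last_confirmed_lsn
        )
        if nothing_outstanding and server_is_ahead:
            self.last_confirmed_lsn = server_lsn
            return server_lsn
        return None
```

For example, a tracker that has received a message at LSN 100 but not yet flushed it returns None for a KeepAlive at 500; once on_flushed(100) is called, the same KeepAlive yields 500, and a later KeepAlive at 800 yields 800 until a new Replication message arrives.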

Zalando states the safety property in its own words:

"This approach is sufficiently conservative enough to allow confirmation of LSNs while guaranteeing that no relevant events can be skipped."

Canonical implementation: pgjdbc 42.7.0

Zalando's PR #2941 against pgjdbc merged on 2023-08-31 and shipped in pgjdbc 42.7.0. Before this fix, pgjdbc ignored KeepAlives entirely. Since the fix, it implements the two-LSN tracker and the safety invariant described above.

Because pgjdbc is a transitive dependency of every JVM CDC framework talking to Postgres — most notably Debezium and Debezium Engine — the fix propagates through the downstream ecosystem as consumers pick up pgjdbc 42.7.0+. Canonical instance of patterns/client-driver-fix-over-application-workaround.

Contrast with the kludge

Before this fix was available and rolled out, the industry mitigation was dummy writes: scheduled jobs that wrote rows to the low-traffic table to force the slot to advance. The two approaches are structurally distinct. The kludge operates at the application layer with visible operational overhead (every table with a CDC subscriber needs its own heartbeat writer), while KeepAlive-LSN advancement operates transparently at the driver / wire-protocol layer.

Seen in

  • sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver: canonical wiki introduction of the KeepAlive-LSN-advancement mechanism. Zalando's diagnosis traces the root cause (pgjdbc ignoring KeepAlives), Byron Wolfman's and Gunnar Morling's prior blog posts that pointed at the pure solution without implementing it, and Zalando's final implementation via pgjdbc PR #2941. The before/after message flow diagrams distinguish the two eras of pgjdbc behaviour.