
CONCEPT

Runaway WAL growth

Definition

Runaway WAL growth is a Postgres failure mode in which the primary's WAL volume grows without bound because one or more logical replication slots pin WAL that the server cannot reclaim. In the limit the WAL fills its disk and the primary can no longer accept writes.

The failure mode is operator-facing rather than silent: the disk-usage graph has a clear ramp, monitoring catches it, but the recovery options are limited — either drop the slot (and force the downstream subscriber to re-snapshot) or coax the slot to advance.
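The ramp is directly measurable: Postgres prints LSNs as two hex halves of a 64-bit WAL position, and the WAL a slot pins is the distance from the slot's restart_lsn to the current write position (what `pg_wal_lsn_diff` computes server-side). A minimal sketch of that arithmetic, with made-up LSN values:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Parse a Postgres LSN like '16/B374D848' into an absolute byte
    position. The text form is two hex halves of a 64-bit WAL offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the server must keep for a slot: the gap between the
    current write position and the slot's restart_lsn."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

# A stalled slot parked well behind the current write position:
lag = retained_wal_bytes("16/B374D848", "16/37E2D848")
print(lag)  # 2073165824 bytes, roughly 1.9 GiB of pinned WAL
```

Graphing this per slot (rather than raw disk usage) is what turns the "clear ramp" into an attributable one.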

The asymmetric-table shape

The canonical shape is three properties combined:

  1. The Postgres server has one WAL for all tables / schemas / databases on that server.
  2. One or more logical replication slots exist for low-write-traffic tables (the subscriber is interested in table X, which rarely changes).
  3. Other tables on the same server have high write traffic, producing WAL events continuously.

The slot for the low-traffic table holds a restart_lsn in the single shared WAL, regardless of which table's events fill it. Because the slot never advances (no events on table X → no acks from the subscriber → restart_lsn stays put), Postgres cannot recycle any WAL segment past that restart_lsn, including segments consumed entirely by events the slow slot doesn't care about.
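The recycling rule above can be sketched as a toy model (slot names and numbers hypothetical, and simplified from the real checkpointer logic): a segment is recyclable only if it lies wholly below the minimum restart_lsn across all slots, so one stalled slot retains every segment after it, no matter whose traffic produced them.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # Postgres's default WAL segment size, 16 MiB

def reclaimable_segments(segment_starts, slot_restart_lsns):
    """A segment can be recycled only if it ends at or before the oldest
    restart_lsn any slot still needs (toy model of the retention rule)."""
    horizon = min(slot_restart_lsns)  # the most-stalled slot sets the floor
    return [s for s in segment_starts if s + SEGMENT_SIZE <= horizon]

# High-traffic tables have filled 100 segments' worth of WAL...
segments = [i * SEGMENT_SIZE for i in range(100)]
# ...but one idle slot is still parked near the beginning.
slots = {"busy_stream": 99 * SEGMENT_SIZE, "idle_stream": 2 * SEGMENT_SIZE}
print(len(reclaimable_segments(segments, slots.values())))  # 2
```

The idle slot's restart_lsn, not the busy slot's, decides that 98 of 100 segments must stay on disk.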

Zalando's 2023-11 post gives the canonical statement of this shape:

"If the blue table were to continue receiving writes, but without a write operation occurring on the pink table, the pink replication slot would never have a chance to advance, and all of the blue WAL events would be left sitting around, taking up space. […] This will continue ad infinitum until the disk space is entirely used up by old WAL events that cannot be deleted until a write occurs in the pink table." (Source: sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver)

Mitigation spectrum

From kludge to pure solution:

  1. Dummy-write heartbeats. Cron jobs that write rows to the low-traffic table to force the slot to advance. See concepts/dummy-write-heartbeat-kludge — works, but is permanent operational overhead.
  2. Drop the slot during outages. Requires the subscriber to re-snapshot. Disaster-recovery, not steady-state mitigation.
  3. Upsize the WAL volume. Buys time, doesn't solve the problem.
  4. Client-driver KeepAlive response — the pure fix. Let the logical-replication client ack the server's KeepAlive-reported LSN once it has flushed every Replication message it has seen. See concepts/keepalive-message-lsn-advancement. Shipped in pgjdbc 42.7.0 via PR #2941 — Zalando's 2023 contribution.
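The pure fix (option 4) can be illustrated with a toy client, a sketch of the behaviour rather than pgjdbc's actual code or the wire protocol: when a server KeepAlive arrives carrying the server's current WAL end, a client with nothing outstanding acks that LSN instead of re-acking its old position, so the slot's restart_lsn advances even when no rows of interest ever arrive.

```python
class ToySubscriber:
    """Toy logical-replication client; models ack behaviour only."""

    def __init__(self):
        self.flushed_lsn = 0   # highest LSN durably applied downstream
        self.received_lsn = 0  # highest LSN seen in Replication messages

    def on_replication_message(self, lsn):
        """A real change event for a subscribed table: apply and ack it."""
        self.received_lsn = lsn
        self.flushed_lsn = lsn   # assume immediate flush for the sketch
        return self.flushed_lsn  # status update reported to the server

    def on_keepalive(self, server_wal_end):
        """Server KeepAlive carrying its current WAL end position.
        Blind pre-42.7.0 behaviour would re-ack only flushed_lsn, pinning
        the slot; the fix acks the server-reported LSN whenever every
        received message has already been flushed."""
        if self.flushed_lsn == self.received_lsn:
            self.flushed_lsn = max(self.flushed_lsn, server_wal_end)
        return self.flushed_lsn

sub = ToySubscriber()
sub.on_replication_message(100)  # one event on the subscribed table
print(sub.on_keepalive(50_000))  # idle table, busy server: ack jumps to 50000
```

The guard matters: if messages are still in flight (received ahead of flushed), the client must not ack past its flush point, or it would confirm data it has not durably applied.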

Who's susceptible

  • Any JVM pipeline running Debezium or Debezium Engine against a low-traffic table on a shared-WAL Postgres primary — before pgjdbc 42.7.0.
  • Non-JVM consumers with similarly blind KeepAlive behaviour (various wal2json integrations have had analogous bugs).
  • Any architecture with many per-stream replication slots (Zalando's low-code streaming platform shape, systems/zalando-postgres-event-streams) where the fleet-wide probability of having at least one low-traffic shared-server slot is high.

Seen in

  • sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver — canonical wiki introduction. Zalando's event-streaming team traces the mechanism step-by-step (single-server WAL, asymmetric table write rates, slot-level pinning), names the dummy-write kludge as the industry response, and ships the pure KeepAlive-LSN fix upstream in pgjdbc 42.7.0.