Runaway WAL growth¶
Definition¶
Runaway WAL growth is a Postgres failure mode in which the primary's WAL volume grows unboundedly because one or more logical replication slots pin WAL that the server cannot reclaim. In the limit the WAL fills the disk and the primary is demoted to read-only.
The failure mode is operator-facing rather than silent: the disk-usage graph shows a clear ramp, so monitoring catches it. The recovery options, however, are limited: either drop the slot (and force the downstream subscriber to re-snapshot) or coax the slot to advance.
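The ramp is easy to quantify: `pg_replication_slots` exposes each slot's `restart_lsn`, and the retained WAL is the byte distance between the current write position and that LSN. A minimal sketch of the arithmetic (the function names are illustrative; the server-side equivalent is `pg_wal_lsn_diff`, noted in the comment):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Decode a Postgres LSN like '16/B374D848' into an absolute byte position.
    The text form is two hex halves: (high << 32) | low."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the server must keep because a slot is pinned at restart_lsn.
    Server-side equivalent:
      SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
      FROM pg_replication_slots;"""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

# A slot stuck one full high-half step behind the write head: 4 GiB retained.
print(retained_wal_bytes("17/3000000", "16/3000000"))  # → 4294967296
```

Alerting on this number (rather than raw disk usage) catches the stuck slot while there is still time to intervene.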
The asymmetric-table shape¶
The canonical shape is three properties combined:
- The Postgres server has one WAL for all tables / schemas / databases on that server.
- One or more logical replication slots exist for low-write-traffic tables (the subscriber is interested in table X, which rarely changes).
- Other tables on the same server have high write traffic, producing WAL events continuously.
The slot for the low-traffic table holds its restart_lsn in place regardless of which table's events occupy the WAL. Because the slot never advances (no events on table X means no acks from the subscriber, so restart_lsn stays put), Postgres cannot recycle any WAL segment past that restart_lsn, including segments consumed entirely by events the slow slot doesn't care about.
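The pinning rule can be sketched as a toy model (hypothetical names; real recycling also involves checkpoints, `wal_keep_size`, and `max_slot_wal_keep_size`): a segment is recyclable only if it lies wholly below the minimum restart_lsn across all slots, so one stalled slot pins every later segment regardless of which table filled it.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size: 16 MiB

def recyclable_segments(segment_starts, slot_restart_lsns):
    """Toy model: a segment can be recycled only if it ends at or below the
    minimum restart_lsn across all slots. One stalled slot therefore pins
    every segment after it, even segments full of other tables' events."""
    floor = min(slot_restart_lsns)  # the laggard slot sets the floor
    return [s for s in segment_starts if s + SEGMENT_SIZE <= floor]

# Busy tables have filled 10 segments; a healthy slot is caught up at the
# write head, but the quiet table's slot is still parked at position 0.
segments = [i * SEGMENT_SIZE for i in range(10)]
print(len(recyclable_segments(segments, [10 * SEGMENT_SIZE, 0])))                # → 0
print(len(recyclable_segments(segments, [10 * SEGMENT_SIZE, 3 * SEGMENT_SIZE])))  # → 3
```

With the quiet slot at 0, nothing is recyclable even though the caught-up slot needs none of it; nudge the quiet slot forward three segments and three segments free up.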
Zalando's 2023-11 post canonicalises this verbatim:
"If the blue table were to continue receiving writes, but without a write operation occurring on the pink table, the pink replication slot would never have a chance to advance, and all of the blue WAL events would be left sitting around, taking up space. […] This will continue ad infinitum until the disk space is entirely used up by old WAL events that cannot be deleted until a write occurs in the pink table." (Source: sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver)
Mitigation spectrum¶
From kludge to pure solution:
- Dummy-write heartbeats. Cron jobs that write rows to the low-traffic table to force the slot to advance. See concepts/dummy-write-heartbeat-kludge — works, but is permanent operational overhead.
- Drop the slot during outages. Requires the subscriber to re-snapshot. Disaster-recovery, not steady-state mitigation.
- Upsize the WAL volume. Buys time, doesn't solve the problem.
- Client-driver KeepAlive response, the pure fix. Let the logical-replication client ack the server's KeepAlive-reported LSN once it has flushed all Replication messages it has seen. See concepts/keepalive-message-lsn-advancement. Shipped in pgjdbc 42.7.0 via PR #2941, Zalando's 2023 contribution.
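The shape of the pure fix can be sketched against the streaming-replication wire format (an illustrative decoder, not pgjdbc's actual code): a primary keepalive message (`'k'`) carries the server's current WAL end, and a client that has flushed everything it has received can ack that position in its standby status update (`'r'`), letting restart_lsn advance even when no rows for its table ever arrive.

```python
import struct

def handle_keepalive(msg: bytes, last_flushed_lsn: int) -> bytes:
    """Decode a primary keepalive ('k': walEnd, server timestamp, replyRequested)
    and build a standby status update ('r') acknowledging walEnd.

    The blind, pre-fix behaviour would echo last_flushed_lsn here; acking the
    keepalive's walEnd once all received messages are flushed is what lets a
    quiet slot's restart_lsn advance past other tables' WAL traffic.
    """
    kind, wal_end, _server_ts, _reply_requested = struct.unpack("!cqqb", msg)
    assert kind == b"k"
    ack_lsn = max(wal_end, last_flushed_lsn)
    # 'r': received LSN, flushed LSN, applied LSN, client timestamp, replyRequested
    return struct.pack("!cqqqqb", b"r", ack_lsn, ack_lsn, ack_lsn, 0, 0)

# Server reports its WAL end one high-half step past what the client has seen.
keepalive = struct.pack("!cqqb", b"k", 0x17_00000000, 0, 1)
reply = handle_keepalive(keepalive, last_flushed_lsn=0x16_00000000)
print(struct.unpack("!cqqqqb", reply)[2] == 0x17_00000000)  # → True
```

The message layouts (`'k'` and `'r'`) follow the documented streaming-replication protocol; the `last_flushed_lsn` bookkeeping and timestamp handling are simplified for the sketch.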
Who's susceptible¶
- Any JVM pipeline running Debezium or Debezium Engine against a low-traffic table on a shared-WAL Postgres primary — before pgjdbc 42.7.0.
- Non-JVM consumers with similarly-blind KeepAlive behaviour (various wal2json integrations have had analogous bugs).
- Any architecture with many per-stream replication slots (Zalando's low-code streaming platform shape, systems/zalando-postgres-event-streams), where the fleet-wide probability of having at least one low-traffic shared-server slot is high.
Related production disclosures¶
- Zalando ran this failure mode at fleet scale with "hundreds of Postgres-sourced event streams" (sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver) before upstreaming the pgjdbc fix.
- PlanetScale's 2025-09-12 post on HA-CDC coupling (sources/2026-04-21-planetscale-postgres-high-availability-with-cdc) canonicalises adjacent failure modes: dormant physical standbys pinning restart_lsn through logical-slot failover interactions, and quiet-period failover breaking CDC.
Seen in¶
- sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver — canonical wiki introduction. Zalando's event-streaming team traces the mechanism step-by-step (single-server WAL, asymmetric table write rates, slot-level pinning), names the dummy-write kludge as the industry response, and ships the pure KeepAlive-LSN fix upstream in pgjdbc 42.7.0.
Related¶
- concepts/postgres-logical-replication-slot — the primary-local object that pins WAL.
- concepts/logical-replication — the replication mode the slot feeds.
- concepts/postgres-wal-level-logical — precondition.
- concepts/wal-write-ahead-logging — the log that grows.
- concepts/keepalive-message-lsn-advancement — the pure fix.
- concepts/dummy-write-heartbeat-kludge — the kludge fix.
- concepts/ha-cdc-coupling — adjacent structural failure modes.
- systems/pgjdbc-postgres-jdbc-driver — where the fix landed.
- systems/debezium — the most common downstream of an affected slot.