
CONCEPT

Runaway WAL growth

Definition

Runaway WAL growth is a Postgres failure mode in which the primary's WAL volume grows without bound because one or more logical replication slots pin WAL that the server cannot reclaim. In the limit the WAL fills its disk and the primary can no longer accept writes.

The failure mode is operator-facing rather than silent: the disk-usage graph has a clear ramp, monitoring catches it, but the recovery options are limited — either drop the slot (and force the downstream subscriber to re-snapshot) or coax the slot to advance.
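The ramp is directly measurable: Postgres prints LSNs as two hex halves of a 64-bit WAL position, and the WAL a slot pins is the distance from the slot's restart_lsn to the current write position (what `pg_wal_lsn_diff` computes server-side). A minimal sketch of that arithmetic, with made-up LSN values:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Parse a Postgres LSN like '16/B374D848' into an absolute byte
    position. The text form is two hex halves of a 64-bit WAL offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL the server must keep for a slot: the gap between the
    current write position and the slot's restart_lsn."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

# A stalled slot parked well behind the current write position:
lag = retained_wal_bytes("16/B374D848", "16/37E2D848")
print(lag)  # 2073165824 bytes, roughly 1.9 GiB of pinned WAL
```

Graphing this per slot (rather than raw disk usage) is what turns the "clear ramp" into an attributable one.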

The asymmetric-table shape

The canonical shape is three properties combined:

  1. The Postgres server has one WAL for all tables / schemas / databases on that server.
  2. One or more logical replication slots exist for low-write-traffic tables (the subscriber is interested in table X, which rarely changes).
  3. Other tables on the same server have high write traffic, producing WAL events continuously.

The slot for the low-traffic table holds a restart_lsn in the single shared WAL, regardless of which table's events fill it. Because the slot never advances (no events on table X → no acks from the subscriber → restart_lsn stays put), Postgres cannot recycle any WAL segment past that restart_lsn, including segments consumed entirely by events the slow slot doesn't care about.
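The recycling rule above can be sketched as a toy model (slot names and numbers hypothetical, and simplified from the real checkpointer logic): a segment is recyclable only if it lies wholly below the minimum restart_lsn across all slots, so one stalled slot retains every segment after it, no matter whose traffic produced them.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # Postgres's default WAL segment size, 16 MiB

def reclaimable_segments(segment_starts, slot_restart_lsns):
    """A segment can be recycled only if it ends at or before the oldest
    restart_lsn any slot still needs (toy model of the retention rule)."""
    horizon = min(slot_restart_lsns)  # the most-stalled slot sets the floor
    return [s for s in segment_starts if s + SEGMENT_SIZE <= horizon]

# High-traffic tables have filled 100 segments' worth of WAL...
segments = [i * SEGMENT_SIZE for i in range(100)]
# ...but one idle slot is still parked near the beginning.
slots = {"busy_stream": 99 * SEGMENT_SIZE, "idle_stream": 2 * SEGMENT_SIZE}
print(len(reclaimable_segments(segments, slots.values())))  # 2
```

The idle slot's restart_lsn, not the busy slot's, decides that 98 of 100 segments must stay on disk.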

Zalando's 2023-11 post gives the canonical statement of this shape:

"If the blue table were to continue receiving writes, but without a write operation occurring on the pink table, the pink replication slot would never have a chance to advance, and all of the blue WAL events would be left sitting around, taking up space. […] This will continue ad infinitum until the disk space is entirely used up by old WAL events that cannot be deleted until a write occurs in the pink table." (Source: sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver)

Mitigation spectrum

From kludge to pure solution:

  1. Dummy-write heartbeats. Cron jobs that write rows to the low-traffic table to force the slot to advance. See concepts/dummy-write-heartbeat-kludge — works, but is permanent operational overhead.
  2. Drop the slot during outages. Requires the subscriber to re-snapshot. Disaster-recovery, not steady-state mitigation.
  3. Upsize the WAL volume. Buys time, doesn't solve the problem.
  4. Client-driver KeepAlive response — the pure fix. Let the logical-replication client ack the server's KeepAlive-reported LSN once it has flushed every Replication message it has seen. See concepts/keepalive-message-lsn-advancement. Shipped in pgjdbc 42.7.0 via PR #2941 — Zalando's 2023 contribution.
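The pure fix (option 4) can be illustrated with a toy client, a sketch of the behaviour rather than pgjdbc's actual code or the wire protocol: when a server KeepAlive arrives carrying the server's current WAL end, a client with nothing outstanding acks that LSN instead of re-acking its old position, so the slot's restart_lsn advances even when no rows of interest ever arrive.

```python
class ToySubscriber:
    """Toy logical-replication client; models ack behaviour only."""

    def __init__(self):
        self.flushed_lsn = 0   # highest LSN durably applied downstream
        self.received_lsn = 0  # highest LSN seen in Replication messages

    def on_replication_message(self, lsn):
        """A real change event for a subscribed table: apply and ack it."""
        self.received_lsn = lsn
        self.flushed_lsn = lsn   # assume immediate flush for the sketch
        return self.flushed_lsn  # status update reported to the server

    def on_keepalive(self, server_wal_end):
        """Server KeepAlive carrying its current WAL end position.
        Blind pre-42.7.0 behaviour would re-ack only flushed_lsn, pinning
        the slot; the fix acks the server-reported LSN whenever every
        received message has already been flushed."""
        if self.flushed_lsn == self.received_lsn:
            self.flushed_lsn = max(self.flushed_lsn, server_wal_end)
        return self.flushed_lsn

sub = ToySubscriber()
sub.on_replication_message(100)  # one event on the subscribed table
print(sub.on_keepalive(50_000))  # idle table, busy server: ack jumps to 50000
```

The guard matters: if messages are still in flight (received ahead of flushed), the client must not ack past its flush point, or it would confirm data it has not durably applied.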

Who's susceptible

  • Any JVM pipeline running Debezium or Debezium Engine against a low-traffic table on a shared-WAL Postgres primary — before pgjdbc 42.7.0.
  • Non-JVM consumers with similarly blind KeepAlive behaviour (various wal2json integrations have had analogous bugs).
  • Any architecture with many per-stream replication slots (Zalando's low-code streaming platform shape, systems/zalando-postgres-event-streams) where the fleet-wide probability of having at least one low-traffic shared-server slot is high.

Seen in

  • sources/2023-11-08-zalando-patching-the-postgresql-jdbc-driver — canonical wiki introduction. Zalando's event-streaming team traces the mechanism step-by-step (single-server WAL, asymmetric table write rates, slot-level pinning), names the dummy-write kludge as the industry response, and ships the pure KeepAlive-LSN fix upstream in pgjdbc 42.7.0.