AWS 2025-05-03 Tier 1

Understanding transaction visibility in PostgreSQL clusters with read replicas

Summary

AWS's response to Jepsen's 2025-04-29 report on transaction visibility in Amazon RDS for PostgreSQL Multi-AZ clusters. AWS confirms Jepsen's finding but clarifies that the behavior is inherent to community PostgreSQL (discussed on pgsql-hackers since 2013), not an RDS-specific bug, and not present in Single-AZ deployments, Aurora PostgreSQL Limitless, or Aurora DSQL. The anomaly is a violation of formal Snapshot Isolation known in the literature as Long Fork: two readers on different nodes can observe concurrent non-conflicting transactions in different commit orders. Root cause: on a Postgres primary, the order transactions become visible (removal from the in-memory ProcArray) diverges from the order they become durable (WAL commit-record write) — visibility is asynchronous with respect to durability. All isolation levels (Read Committed, Repeatable Read, Serializable) are affected; the behavior reproduces on self-managed Postgres. A proposed fix using Commit Sequence Numbers was discussed at PGConf.EU 2024; AWS is contributing to the multi-patch upstream effort. Piece doubles as the public-facing framing for why Aurora DSQL/Limitless use time-based MVCC (which avoids the Long Fork anomaly) instead of Postgres's ProcArray-based visibility model.
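The mechanism the summary describes (durability via the WAL commit record, visibility via asynchronous removal from the ProcArray) can be sketched as a toy simulation. This is illustrative Python only, not Postgres internals; every name here is invented:

```python
# Toy model of the Long Fork anomaly: visibility order diverges from
# WAL commit order when two transactions commit concurrently.

wal = []                    # durable commit order (what a replica replays)
proc_array = {"T1", "T2"}   # in-flight txns, invisible to new snapshots

def snapshot():
    # A new snapshot sees every transaction NOT still in proc_array.
    return {"T1", "T2"} - proc_array

# Each commit is two separate steps:
#   (a) append the WAL commit record (durability),
#   (b) remove itself from proc_array (visibility), asynchronously.
wal.append("T1")            # T1 step (a): durable first
wal.append("T2")            # T2 step (a)
proc_array.discard("T2")    # T2 step (b) happens to run first

primary_view = snapshot()   # taken between the two visibility steps
proc_array.discard("T1")    # T1 step (b)

# A replica replaying the WAL in commit order, one record in:
replica_view = {wal[0]}

print(primary_view)   # {'T2'}: the primary snapshot sees T2 but not T1
print(replica_view)   # {'T1'}: the replica sees T1 but not T2
```

The two readers observe the two non-conflicting commits in opposite orders, which is exactly the Long Fork violation of formal Snapshot Isolation.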

Key takeaways

  1. The bug is in community Postgres, not in RDS. The same Long Fork anomaly reproduces on a self-managed Postgres cluster with read replicas; the pgsql-hackers thread that first named it is from 2013. AWS explicitly de-escalates "RDS bug" framing while confirming Jepsen's empirical finding. (Source: this article.) It is not present in Single-AZ Postgres, Aurora PostgreSQL Limitless, or systems/aurora-dsql.
  2. Visibility order ≠ commit order is the mechanism. Postgres commits by (a) writing the transaction's WAL commit record (durability), then (b) asynchronously removing itself from the in-memory ProcArray (visibility to new snapshots). If T1 and T2 commit concurrently, T1's WAL record can land before T2's while T2 removes itself from ProcArray before T1 — so a snapshot taken between those two events sees T2 but not T1. A replica replaying WAL in commit order will see the opposite. (Source: this article.) This is the generalization the wiki captures as concepts/visibility-order-vs-commit-order.
  3. Affects all isolation levels. Read Committed, Repeatable Read, and Serializable all acquire snapshots against ProcArray, so all three exhibit Long Fork. Increasing isolation level does not work around it. (Source: this article.)
  4. Low practical impact, high enterprise-feature impact. Most applications serialize their operations through application-level constraints or direct row conflicts, so they are not actually vulnerable. But the anomaly blocks several classes of advanced enterprise capabilities (Source: this article, paraphrased):
       • Distributed-SQL consistency — impossible to obtain a consistent list of pending transactions across nodes; Aurora Limitless and systems/aurora-dsql sidestep this by using time-based MVCC instead of ProcArray.
       • Query routing / read-write splitting — routing reads to synchronously-caught-up replicas can expose non-repeatable reads.
       • Data synchronization — snapshot-then-WAL-replay can land in a state that was never observable on the primary.
       • Point-in-time restore to an LSN — can produce a state that was never observable, complicating application-level data-corruption analysis.
       • Storage-layout optimization — replacing tuple xids with logical/clock-based commit times at query time breaks query-result repeatability.
       • CPU utilization — on large Postgres servers with thousands of connections, snapshot acquisition (scanning ProcArray) is a measurable fraction of CPU in read-heavy workloads.
  5. Proposed upstream fix: Commit Sequence Numbers (CSN). Realign visibility order with commit order by stamping each commit with a monotonic CSN; snapshots become "read everything with CSN ≤ mySnapshot." The multi-patch series was presented at PGConf.EU 2024 ("High-concurrency distributed snapshots," Ants Aasma) and discussed on pgsql-hackers. AWS's PostgreSQL Contributors Team (formed 2022) is participating. (Source: this article.)
  6. Aurora DSQL and Aurora Limitless use time-based MVCC, not ProcArray. Both replace Postgres's visibility substrate with a clock-based consistent-snapshot model. This is the architectural payoff of extending Postgres rather than forking it (systems/aurora-dsql uses extensions to replace concurrency-control; Limitless does the same for its distributed shape) — the visibility mechanism is one of the components that gets replaced. (Source: this article + cross-reference to sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey.)
  7. AWS's recommended workarounds. Until CSNs land, AWS advises (a) never rely on implicit commit-ordering of independent concurrent transactions at the application layer, (b) introduce explicit synchronization when strict ordering is required — shared counters (ticket numbers, queue positions), timestamps (observed-at, execution-time), or database constraints (e.g. inventory >= 0). (Source: this article.)
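The Commit Sequence Number fix described above can be sketched the same way. This is a toy model of the assumed semantics ("read everything with CSN ≤ mySnapshot"), not the actual patch series:

```python
# Toy model of CSN-based visibility: stamping a monotonic counter at
# commit makes visibility order identical to commit order.
import itertools

csn_counter = itertools.count(1)
commit_csn = {}   # txid -> CSN, stamped at commit

def commit(txid):
    # The CSN is assigned in commit order, so no later transaction can
    # become visible before an earlier one.
    commit_csn[txid] = next(csn_counter)

def take_snapshot():
    # A snapshot is just a CSN horizon.
    return max(commit_csn.values(), default=0)

def visible(txid, snapshot_csn):
    # "Read everything with CSN <= mySnapshot"; uncommitted is invisible.
    return commit_csn.get(txid, float("inf")) <= snapshot_csn

commit("T1")
snap = take_snapshot()   # horizon after T1, before T2
commit("T2")

print(visible("T1", snap), visible("T2", snap))   # True False
```

Because CSNs are totally ordered, any snapshot that sees a transaction necessarily sees every transaction that committed before it, on every node, so the fork in the first example cannot occur.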

The worked Alice-and-Bob example

The post's intuition pump for Long Fork: page-view counters for Hacker News posts, stored as rows. Alice's app server routes to the primary; Bob's to a replica. Both refresh continuously.

  • Alice sees the Jepsen post reach #1 (screenshots it).
  • Bob watching the replica sees the same post peak at #2.
  • The commit log confirms that the post's counter was briefly beaten by another post due to a concurrent click.
  • Technically Bob is right (per commit log) and Alice is right (her eyes don't lie) — she witnessed a database state on the primary that, per commit order, was never supposed to exist. Without the replica and without the commit log, Alice's observation would be fully compliant with formal Snapshot Isolation — and in fact is what a standalone Postgres node returns.

The structural point: formal Snapshot Isolation says your snapshots compose globally into a consistent commit order. Postgres's implementation does not guarantee this, at any isolation level.
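AWS's recommended workaround of explicit synchronization maps directly onto this example: if each click drew a ticket from a shared counter (or a database sequence), every reader would reconstruct the same ranking order regardless of which snapshot it happened to take. A minimal sketch with invented names:

```python
# Toy model of the "explicit synchronization" workaround: assign order
# with a shared ticket counter instead of inferring it from concurrently
# observed snapshots.
import itertools

ticket = itertools.count(1)   # stands in for a DB sequence / shared counter
events = []

def record_click(post):
    # Every writer serializes through the shared counter, so the ordering
    # is carried in the data itself, not in snapshot timing.
    events.append((next(ticket), post))

record_click("jepsen-post")
record_click("other-post")

# Identical on every node: sort by ticket, not by observation order.
ordered = sorted(events)
print(ordered)
```

In a real deployment the counter would be a database sequence or a constrained column; the point is that ordering becomes explicit application state rather than an artifact of when each snapshot was taken.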

Operational numbers

  • Since 2013 — pgsql-hackers mailing-list thread first naming the behavior.
  • 2022 — AWS formed its PostgreSQL Contributors Team (dedicated to core Postgres engine contributions).
  • 2024-09 — PGConf.EU 2024 presentation of the CSN fix (Aasma).
  • 2025-04-29 — Jepsen report published.
  • 2025-05-03 — this AWS response (4 days later).

No throughput / latency / cost / fleet-scale numbers disclosed — this is a correctness / architecture response, not a retrospective. The most quantitative claim is that snapshot acquisition on ProcArray is "a measurable fraction of CPU" at thousands of connections on large Postgres servers.

Caveats

  • This is a vendor response to a third-party analysis. AWS's framing works to re-situate the anomaly as community-Postgres rather than RDS-specific; readers should cross-read with the original Jepsen report for the tested setup, repro procedure, and the full set of anomalies Jepsen observed (not just Long Fork).
  • "Rarely impacts application correctness in practice" is a claim, not a measurement. AWS offers no quantitative data on how often applications are vulnerable; the correctness-in-practice argument is that most real apps create explicit row conflicts or use app-level serialization, so the anomaly is invisible. Teams running read-write splitting, snapshot-then-replay data pipelines, or cross-node analytical queries should treat the anomaly as live.
  • The CSN fix is a multi-patch upstream effort, not a shipped feature. No landing date given.
  • Aurora DSQL and Aurora Limitless sidestep the anomaly, but only through full replacement of the visibility substrate. This is not a workaround available to self-managed Postgres users — it requires Postgres-extension-level surgery (which only works because Postgres exposes the relevant hook points).