
Postgres shared_buffers + OS page cache (double buffering)

Definition

PostgreSQL does not manage the full disk-cache hierarchy itself. Instead, it uses a two-layer caching strategy: its own in-process shared_buffers cache for recently accessed relation pages, sitting on top of the operating system's filesystem page cache. The same disk page can exist in both layers simultaneously, hence the "double buffering" label. (Source: sources/2025-07-08-planetscale-caching.)

Ben Dicken's framing

From the 2025-07-08 Caching post:

Postgres implements a two-layer caching strategy. First, it uses shared_buffers, an internal cache for data pages that store table information. This keeps frequently read row data in memory while less-frequently accessed data stays on disk. Second, Postgres relies heavily on the operating system's filesystem page cache, which caches disk pages at the kernel level. This creates a double-buffering system where data can exist in both Postgres's shared_buffers and the OS page cache. Many deployments set shared_buffers to around 25% of available RAM and let the filesystem cache handle much of the remaining caching work. The shared_buffers value can be configured in the postgres config file.
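The knob from the quote lives in postgresql.conf. A sketch with illustrative values for a 16 GB host (effective_cache_size is a planner hint, not an allocation; changing shared_buffers requires a restart):

# postgresql.conf, illustrative values for a 16 GB host
shared_buffers = 4GB          # ~25% of RAM; requires a server restart
effective_cache_size = 12GB   # planner's estimate of shared_buffers + OS page cache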

Why two layers

  • shared_buffers holds 8 KB Postgres pages keyed by (relfilenode, blocknum). Access is Postgres-protocol-native — the buffer manager can pin pages, track dirty bits for WAL, run MVCC snapshots, and drive eviction with Postgres-specific policies (clock-sweep with usage counts).
  • OS page cache holds raw 4 KB filesystem pages keyed by (file, offset). Access is via read(2) / pread(2): when shared_buffers misses, Postgres asks the kernel for the underlying blocks, and the kernel serves them from its page cache if they are resident.
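The two keys can be inspected from SQL. A sketch, assuming a table named accounts exists (the table name is hypothetical):

-- Postgres addresses a buffer by (relfilenode, blocknum);
-- the kernel caches the same bytes by (file, offset).
SELECT relfilenode,
       pg_relation_filepath(oid) AS os_file
  FROM pg_class
 WHERE relname = 'accounts';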

Splitting the work:

  • shared_buffers optimises hit-path work inside Postgres (no syscall, no deserialisation overhead).
  • OS page cache optimises miss-path work — when the Postgres buffer pool misses but the OS already has the page from an earlier read, the "miss" still doesn't hit the disk.
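The split shows up directly in EXPLAIN (ANALYZE, BUFFERS) output. A sketch against the same hypothetical accounts table:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM accounts;
-- Buffers: shared hit=... read=...
--   hit  = page found in shared_buffers (no syscall)
--   read = page pulled in via read(2); may still be an OS-cache hit, not a disk read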

The 25% rule

Many deployments set shared_buffers to around 25% of available RAM.

sources/2025-07-08-planetscale-caching

Why 25% rather than a higher fraction?

  • shared_buffers too small → Postgres misses more often, paying the syscall/deserialisation cost even on pages that are technically still in RAM (just in the OS layer, not the Postgres layer).
  • shared_buffers too large → the OS page cache gets starved, every Postgres miss really does hit the disk, and the duplicated pages (the same page held in RAM twice) burn memory for nothing.
  • Around 25% of RAM (roughly 1–8 GB on practical servers) is the empirical sweet spot across most workloads: big enough that the hot working set fits, small enough that the OS page cache has room for the rest.

Outside the 25% convention:

  • Small memory servers (< 1 GB RAM) — 128 MB or 15% rules sometimes apply.
  • Dedicated high-memory DB hosts (64+ GB RAM) — many teams push to 40% or explicit GB values, but returns diminish and the tuning burden rises.
  • Workload-specific tuning — large sequential scans prefer OS-cache dominance (it handles readahead better); point-heavy OLTP prefers shared_buffers dominance.
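Applying the rule of thumb from SQL rather than by editing the config file directly, a sketch (the 4GB value is illustrative, 25% of a 16 GB host):

ALTER SYSTEM SET shared_buffers = '4GB';  -- written to postgresql.auto.conf
-- shared_buffers cannot be changed at runtime; restart the server, then verify:
SHOW shared_buffers;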

Read measurement

Postgres hit rate against shared_buffers:

SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS hit_ratio
  FROM pg_stat_database;

Note: this is the Postgres hit rate, not the end-to-end disk-miss rate. A "miss" at the Postgres layer (counted in blks_read) may still hit the OS page cache with no disk I/O. To measure the full stack you need pg_buffercache + /proc/meminfo + iostat together.
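The pg_buffercache contrib extension mentioned above exposes one row per shared buffer. A sketch that summarises which relations occupy the pool (joining on relfilenode alone is a simplification that ignores databases and relation forks):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,                      -- 8 KB buffers held
       pg_size_pretty(count(*) * 8192) AS cached
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;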

Difference vs MySQL InnoDB

MySQL's equivalent is the InnoDB buffer pool — a single in-process cache that stores pages, handles doublewrite / change-buffer / adaptive-hash-index mechanics, and does not rely on the OS page cache (in its common O_DIRECT configuration, it explicitly bypasses it).

Two philosophies:

Axis                          | Postgres                | MySQL InnoDB
Relies on OS page cache       | Yes (heavily)           | No (common config)
DB-owned cache rule of thumb  | ~25% of RAM             | 50–75% of RAM
Double-buffered pages in RAM  | Yes (acceptable)        | No (considered waste)
Hit-rate measurement          | pg_stat_database        | SHOW ENGINE INNODB STATUS
Preferred I/O mode            | Buffered + OS readahead | O_DIRECT

Both approaches work. The Postgres approach leans on the OS being a good generalist; the MySQL approach leans on the DB knowing its workload better than the OS.
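The MySQL side of the comparison can be checked with standard InnoDB variables:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- typically 50-75% of RAM on dedicated hosts
SHOW VARIABLES LIKE 'innodb_flush_method';      -- O_DIRECT = bypass the OS page cache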

Why ACID-aware caching is harder

Dicken's closing observation (Source: sources/2025-07-08-planetscale-caching):

Arguably, these are more complex than a "regular" cache as they also have to be able to operate with full ACID semantics and database transactions. Both databases have to take careful measures to ensure these pages contain accurate information and metadata as the data evolves.

Key complexities:

  • Dirty page eviction must write back to WAL before overwriting, or crash-recovery loses durability.
  • MVCC snapshot visibility — multiple transactions may read conflicting versions of the same page.
  • Locking protocols (content locks, index locks) must be respected by the buffer manager.
  • Checkpoint synchronisation — the buffer pool is the handoff between in-memory dirty state and durable-on-disk state.
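Several of these pressures are visible in the stock statistics views; a sketch (column names per Postgres 16 and earlier; Postgres 17 moves the checkpoint counters into pg_stat_checkpointer):

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint,  -- dirty buffers written at checkpoints
       buffers_clean,       -- written ahead of need by the background writer
       buffers_backend      -- written by backends forced to evict a dirty page
  FROM pg_stat_bgwriter;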

These constraints are orthogonal to the eviction policy itself (clock-sweep, LRU, or otherwise), but they restrict which pages can be evicted at which moment.
