Postgres shared_buffers + OS page cache (double buffering)¶
Definition¶
PostgreSQL does not manage its own full disk-cache
hierarchy. Instead, it implements a two-layer caching
strategy — its own in-process shared_buffers cache
for recently-accessed relation pages, sitting on top of
the operating system's filesystem
page cache. The same disk
page can exist in both layers simultaneously — hence the
double buffering label. (Source:
sources/2025-07-08-planetscale-caching.)
Ben Dicken's framing¶
From the 2025-07-08 Caching post:
Postgres implements a two-layer caching strategy. First, it uses `shared_buffers`, an internal cache for data pages that store table information. This keeps frequently read row data in memory while less-frequently accessed data stays on disk. Second, Postgres relies heavily on the operating system's filesystem page cache, which caches disk pages at the kernel level. This creates a double-buffering system where data can exist in both Postgres's `shared_buffers` and the OS page cache. Many deployments set `shared_buffers` to around 25% of available RAM and let the filesystem cache handle much of the remaining caching work. The `shared_buffers` value can be configured in the postgres config file.
Why two layers¶
- `shared_buffers` holds 8 KB Postgres pages keyed by `(relfilenode, blocknum)`. Access is Postgres-protocol-native — the buffer manager can pin pages, track dirty bits for WAL, run MVCC snapshots, and drive eviction with Postgres-specific policies (clock-sweep with usage counts).
- OS page cache holds raw 4 KB filesystem pages keyed by `(file, offset)`. Access is via `read(2)`/`pread(2)` — to reach a Postgres page, the kernel still has to serve the underlying disk blocks.
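The page-size mismatch between the two key spaces can be made concrete with a short sketch. Everything here is illustrative (real relfilenodes map to segment files on disk, which this ignores): one 8 KB Postgres page covers two 4 KB kernel pages.

```python
PG_PAGE = 8192   # Postgres page size (8 KB)
OS_PAGE = 4096   # typical kernel page-cache granularity (4 KB)

def os_cache_keys(relfile: str, blocknum: int):
    """Map one shared_buffers key (relfilenode, blocknum) to the
    (file, offset) keys the OS page cache uses for the same bytes."""
    start = blocknum * PG_PAGE
    return [(relfile, off) for off in range(start, start + PG_PAGE, OS_PAGE)]

# One 8 KB Postgres page spans two 4 KB kernel pages:
print(os_cache_keys("16384", 3))
# → [('16384', 24576), ('16384', 28672)]
```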
Splitting the work:
- `shared_buffers` optimises hit-path work inside Postgres (no syscall, no deserialisation overhead).
- OS page cache optimises miss-path work — when the Postgres buffer pool misses but the OS already has the page from an earlier read, the "miss" still doesn't hit the disk.
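This hit-path/miss-path split can be simulated with two dicts standing in for the layers. The names (`fetch_page`, the key shapes) are invented for illustration, not Postgres internals:

```python
shared_buffers = {}   # Postgres layer: (relfilenode, blocknum) -> page bytes
os_page_cache = {}    # kernel layer, simplified to the same key space
DISK = {("16384", 0): b"row data..."}

stats = {"pg_hit": 0, "os_hit": 0, "disk_read": 0}

def fetch_page(key):
    if key in shared_buffers:           # hit path: no syscall at all
        stats["pg_hit"] += 1
    elif key in os_page_cache:          # Postgres miss, but still in RAM
        stats["os_hit"] += 1
        shared_buffers[key] = os_page_cache[key]
    else:                               # true miss: disk I/O
        stats["disk_read"] += 1
        os_page_cache[key] = DISK[key]  # kernel caches the read...
        shared_buffers[key] = DISK[key] # ...and Postgres copies it: double buffering
    return shared_buffers[key]

fetch_page(("16384", 0))   # cold: disk read, page now in both layers
fetch_page(("16384", 0))   # warm: shared_buffers hit
shared_buffers.clear()     # e.g. evicted by clock-sweep
fetch_page(("16384", 0))   # Postgres miss, OS hit: no disk I/O
print(stats)               # {'pg_hit': 1, 'os_hit': 1, 'disk_read': 1}
```

The third call is the interesting one: a `shared_buffers` miss that never reaches the disk, because the kernel still holds the page.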
The 25% rule¶
Many deployments set `shared_buffers` to around 25% of available RAM.
Why 25% rather than a higher fraction?
- `shared_buffers` too small → Postgres misses more often, paying the syscall/deserialisation cost even on pages that are technically still in RAM (just in the OS layer, not the Postgres layer).
- `shared_buffers` too large → the OS page cache gets starved, every Postgres miss really does hit the disk, and the double-buffering waste is at its worst (same page in RAM twice, burning memory for nothing).
- Around 25% of RAM (or 1–8 GB on practical servers) is the empirical sweet spot across most workloads: big enough that the hot working set fits, small enough that the OS page cache has room for the rest.
Outside the 25% convention:
- Small memory servers (< 1 GB RAM) — 128 MB or 15% rules sometimes apply.
- Dedicated high-memory DB hosts (64+ GB RAM) — many teams push to 40% or explicit GB values, but returns diminish and the tuning burden rises.
- Workload-specific tuning — large sequential scans prefer OS-cache dominance (it handles readahead better); point-heavy OLTP prefers `shared_buffers` dominance.
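As a sketch only, the conventions above could be encoded like this. The exact cutoffs (the 1 GB boundary, the 8 GB cap) are illustrative choices, not canonical values:

```python
def suggest_shared_buffers(total_ram_mb: int) -> int:
    """Rule-of-thumb shared_buffers size in MB, per the conventions above.
    Dedicated high-memory hosts may deliberately exceed this."""
    if total_ram_mb < 1024:
        # small servers: 128 MB floor, or ~15% of RAM
        return max(128, total_ram_mb * 15 // 100)
    # 25% of RAM, capped near the practical 1-8 GB band
    return min(total_ram_mb // 4, 8192)

print(suggest_shared_buffers(512))     # small VM → 128 MB floor
print(suggest_shared_buffers(16384))   # 16 GB host → 4096 MB (25%)
print(suggest_shared_buffers(65536))   # 64 GB host → 8192 MB cap
```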
Read measurement¶
Postgres hit rate against `shared_buffers`:
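The relevant counters are `blks_hit` and `blks_read` in `pg_stat_database`; the ratio itself is trivial. A minimal helper, with one common form of the query in the comment:

```python
# Counters come from pg_stat_database, e.g.:
#   SELECT blks_hit, blks_read FROM pg_stat_database
#   WHERE datname = current_database();

def pg_hit_rate(blks_hit: int, blks_read: int) -> float:
    """shared_buffers hit rate. blks_read counts Postgres-layer misses,
    which may still be served from the OS page cache without disk I/O."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0

print(pg_hit_rate(blks_hit=990_000, blks_read=10_000))  # → 0.99
```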
Note: this is the Postgres hit rate, not the
end-to-end disk-miss rate. A "miss" at the Postgres layer
(counted in blks_read) may still hit the OS page cache
with no disk I/O. To measure the full stack you need
pg_buffercache + /proc/meminfo + iostat together.
Difference vs MySQL InnoDB¶
MySQL's equivalent is the
InnoDB buffer pool — a
single in-process cache that stores pages, handles
doublewrite / change-buffer / adaptive-hash-index
mechanics, and does not rely on the OS page cache (in
its common O_DIRECT configuration, it explicitly
bypasses it).
Two philosophies:
| Axis | Postgres | MySQL InnoDB |
|---|---|---|
| Relies on OS page cache | Yes — heavily | No (common config) |
| DB-owned cache size rule-of-thumb | ~25% of RAM | 50–75% of RAM |
| Double-buffered pages in RAM | Yes — acceptable | No — considered waste |
| Hit rate measurement | pg_stat_database | SHOW ENGINE INNODB STATUS |
| Preferred IO mode | Buffered + OS readahead | O_DIRECT |
Both approaches work. The Postgres approach leans on the OS being a good generalist; the MySQL approach leans on the DB knowing its workload better than the OS.
Why ACID-aware caching is harder¶
Dicken's closing observation (Source: sources/2025-07-08-planetscale-caching):
Arguably, these are more complex than a "regular" cache as they also have to be able to operate with full ACID semantics and database transactions. Both databases have to take careful measures to ensure these pages contain accurate information and metadata as the data evolves.
Key complexities:
- Dirty page eviction must write back to WAL before overwriting, or crash-recovery loses durability.
- MVCC snapshot visibility — concurrent transactions may need to see different versions of rows on the same page.
- Locking protocols (content locks, index locks) must be respected by the buffer manager.
- Checkpoint synchronisation — the buffer pool is the handoff between in-memory dirty state and durable-on-disk state.
These are all orthogonal to the LRU-ness of the eviction itself but they constrain which pages can be evicted at which moment.
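A toy model of the first constraint, the write-ahead rule: a dirty page may not be written out until the WAL up to that page's LSN is durable. All names here (`Page`, `flush_wal`, `evict`) are invented for illustration, not Postgres internals:

```python
wal_flushed_upto = 0   # highest LSN durably on disk

def flush_wal(upto_lsn: int) -> None:
    """Stand-in for forcing WAL to durable storage up to a given LSN."""
    global wal_flushed_upto
    wal_flushed_upto = max(wal_flushed_upto, upto_lsn)

class Page:
    def __init__(self, key):
        self.key, self.dirty, self.lsn = key, False, 0

def evict(page: Page, disk: dict) -> None:
    """Write-back eviction that honours WAL-before-data ordering."""
    if page.dirty:
        if page.lsn > wal_flushed_upto:
            flush_wal(page.lsn)        # WAL first, always
        disk[page.key] = "page image @ lsn %d" % page.lsn
        page.dirty = False

disk = {}
p = Page(("16384", 0))
p.dirty, p.lsn = True, 42              # page modified at LSN 42
evict(p, disk)
assert wal_flushed_upto >= p.lsn       # crash-recovery durability preserved
```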
Seen in¶
- sources/2025-07-08-planetscale-caching — Ben Dicken canonicalises the two-layer shape + the 25%-of-RAM sizing rule + the "more complex than a regular cache because of ACID" framing.
Related¶
- concepts/innodb-buffer-pool — MySQL's counterpart (single-layer design, bypasses OS cache).
- concepts/linux-page-cache — the OS-level layer underneath.
- concepts/cache-hit-rate
- concepts/working-set-memory — what you're actually sizing for.
- concepts/storage-latency-hierarchy — the latency-gap double-buffering amortises.
- systems/postgresql
- systems/mysql