CONCEPT Cited by 1 source
Cache hit ratio as memory-pressure signal¶
Definition¶
The Postgres buffer-cache hit ratio —
db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read) —
is interpreted not as a tuning knob to maximise but as a
binary memory-pressure signal at an empirically chosen
threshold: below 80% means the database has insufficient
shared buffers or physical RAM.
"When clients request data, the database checks cached memory and if there is no relevant data there it has to read it from disk, thus queries become slower. Any values below 80% show that databases have insufficient amount of shared buffers or physical RAM. Data required for top-called queries don't fit into memory, and the database has to read it from disk."
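The definition above reduces to a ratio plus a fixed cutoff. A minimal sketch in Python, assuming the two raw counters are available (in stock Postgres they come from the pg_stat_database view as blks_hit and blks_read; the function names here are illustrative, not part of the methodology):

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """P1: fraction of block requests served from the buffer cache.

    Mirrors db.Cache.blks_hit / (db.Cache.blks_hit + db.IO.blk_read).
    """
    total = blks_hit + blks_read
    if total == 0:
        return 1.0  # no traffic yet: report "no pressure" rather than divide by zero
    return blks_hit / total


def memory_pressure(blks_hit: int, blks_read: int, threshold: float = 0.80) -> bool:
    """Binary signal: True means the working set has likely outgrown RAM."""
    return cache_hit_ratio(blks_hit, blks_read) < threshold


print(memory_pressure(9_500, 500))    # 95% hit ratio -> False
print(memory_pressure(7_000, 3_000))  # 70% hit ratio -> True
```

Note the signal is binary by design: the function answers "is capacity exhausted?", not "how well tuned is the cache?".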
Why the threshold is 80%, not 95% or 99%¶
Traditional database-tuning folklore advises aiming for "a high cache hit ratio" — often 95-99% — as a goal-directed KPI. Zalando inverts the framing: 80% is the breakpoint below which something has gone categorically wrong. Above 80%, cache hit ratio is a noisy workload-mix signal and not a useful tuning target. Below 80%, the working set has overflowed RAM and the remediation is capacity, not configuration.
The empirical justification is that disk IO is orders of
magnitude slower than RAM: a cache miss in Postgres
translates directly into block-read latency
(visible in the D3 signal,
os.diskIO.rdsdev.await, and the P2 signal, db.IO.blk_read_time)
and therefore query slowdown. A 20% miss rate across a large
number of queries produces visible SLO impact.
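The "orders of magnitude" argument can be made concrete with a back-of-the-envelope expected-latency calculation. The per-block service times below are assumed illustrative figures (a buffer hit on the order of a microsecond, a storage read on the order of a millisecond on networked storage); real values vary widely by hardware:

```python
# Assumed, illustrative per-block service times -- not measurements.
RAM_HIT_US = 1.0        # shared-buffer hit, ~1 microsecond
DISK_READ_US = 1_000.0  # storage read, ~1 millisecond


def effective_block_latency_us(hit_ratio: float) -> float:
    """Expected per-block latency for a given cache hit ratio."""
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * RAM_HIT_US + miss_ratio * DISK_READ_US


for ratio in (0.99, 0.95, 0.80):
    print(f"hit ratio {ratio:.0%}: {effective_block_latency_us(ratio):7.2f} us/block")
```

Under these assumptions the expected cost per block is roughly 11 µs at a 99% hit ratio but about 200 µs at 80% — the miss term dominates, which is why a 20% miss rate shows up in user-facing latency rather than staying a tuning curiosity.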
What the threshold means operationally¶
When cache hit ratio drops below 80%:
- The working set has outgrown the buffer cache. Top-called queries' hot blocks no longer stay resident.
- Remediation is capacity, not tuning. Either the instance needs more RAM (larger instance class), or Postgres's shared_buffers setting is too small for the physical RAM available (rare on RDS where this is managed).
- Disk IO metrics will rise in parallel. P1 dropping correlates with D1/D2 (IOPS) increasing and D3 (storage latency) potentially increasing, because more cache misses mean more reads from storage.
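The correlation pattern in the list above — P1 dropping while D-series read metrics rise — can be sketched as a diagnostic check. This is a hypothetical illustration: the Snapshot field names and the diagnose function are mine, not part of the 12-signal methodology; only the 0.80 threshold and the P1/D1/D3 pairing come from the text:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical field names mirroring the signals discussed above.
    p1_cache_hit_ratio: float     # P1
    d1_read_iops: float           # D1
    d3_storage_latency_ms: float  # D3


def diagnose(before: Snapshot, after: Snapshot) -> str:
    """If P1 crosses below 80% while read IOPS rise, the consistent story
    is a working set that outgrew memory -> remediate with capacity."""
    p1_dropped = after.p1_cache_hit_ratio < 0.80 <= before.p1_cache_hit_ratio
    iops_rose = after.d1_read_iops > before.d1_read_iops
    if p1_dropped and iops_rose:
        return "memory pressure: add RAM / larger instance class"
    return "no capacity signal from P1/D-series"


print(diagnose(Snapshot(0.93, 1200, 1.1), Snapshot(0.72, 4800, 4.0)))
```

The point of requiring both conditions is the diagnostic logic from the list: a P1 drop alone could be a workload-mix shift, but a P1 drop accompanied by rising read IOPS is the signature of cache misses turning into storage reads.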
The ratio is a cause-metric, not a symptom-metric¶
Cache hit ratio is classically a cause-metric: users do not experience cache hit ratio; they experience query latency. In Zalando's concepts/golden-signals-rds|12-signal methodology it is a P-series workload signal, intended for diagnosis after a symptom alert fires, not for paging on its own. See concepts/symptom-vs-cause-metric for the general distinction.
Relationship to other cache-hit-ratio concepts on the wiki¶
- concepts/cache-hit-rate — the general primitive. This page is the Postgres-specific calibration of the threshold.
- concepts/innodb-buffer-pool — MySQL InnoDB analogue; same framing applies.
- concepts/postgres-shared-buffers-double-buffering — the architectural quirk in Postgres where the OS page cache also caches blocks, so raw Postgres cache hit ratio understates true memory cache behaviour.
Seen in¶
- sources/2024-02-19-zalando-twelve-golden-signals — canonicalises the 80% threshold as signal P1 of the 12-golden-signals methodology.