PLANETSCALE 2026-03-30

PlanetScale — High memory usage in Postgres is good, actually

Summary

Simeon Griggs (PlanetScale, 2026-03-30) opens with a houseplant analogy: the dashboard showing memory at 80% looks alarming the way a houseplant looks thirsty, but most operators project human discomfort onto the wrong signal. Sustained high CPU is always bad (work that cannot be skipped, queue depths grow, tail latency rises); sustained high memory is usually good (the system is keeping hot data close to the CPU). The structural reason is that the single "memory used" percentage conflates two fundamentally different populations: reclaimable cache (active cache, inactive cache, memory-mapped pages — dropped by the OS the moment something else needs RAM) and non-reclaimable process memory (RSS — stack, heap, catalog/relcache, sort + hash-table work_mem allocations, prepared-plan session memory). Only the second drives OOM risk. A healthy Postgres host should have most of its RAM consumed by shared_buffers + OS page cache; a memory number that's high because RSS is climbing is memory pressure, a different phenomenon with a different investigation playbook.
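
The reclaimable-vs-non-reclaimable split described above can be sketched as a small classifier. This is an illustrative sketch, not PlanetScale's implementation; the category names mirror the Cluster Metrics taxonomy and the 70% threshold is a hypothetical choice.

```python
# Split a host's single "memory used" number into the two populations the
# post distinguishes: reclaimable cache vs. non-reclaimable process memory.
RECLAIMABLE = {"active_cache", "inactive_cache", "memory_mapped"}

def classify(breakdown_mb: dict, total_mb: int) -> dict:
    """Return percent of total RAM that is reclaimable cache vs. RSS."""
    cache = sum(v for k, v in breakdown_mb.items() if k in RECLAIMABLE)
    rss = breakdown_mb.get("rss", 0)
    return {
        "used_pct": 100 * (cache + rss) / total_mb,
        "cache_pct": 100 * cache / total_mb,
        "rss_pct": 100 * rss / total_mb,
    }

def memory_pressure(breakdown_mb: dict, total_mb: int,
                    rss_threshold_pct: float = 70) -> bool:
    """High total driven by cache is healthy; high total driven by RSS
    is memory pressure (threshold here is hypothetical)."""
    return classify(breakdown_mb, total_mb)["rss_pct"] > rss_threshold_pct

# A host at "80% used" where most of that is cache is the healthy case:
healthy = {"active_cache": 40_000, "inactive_cache": 20_000,
           "memory_mapped": 4_000, "rss": 16_000}
print(classify(healthy, 100_000))        # 80% used, but only 16% is RSS
print(memory_pressure(healthy, 100_000)) # False
```

The point of the sketch is that the alarming-looking 80% dissolves once the RSS share is separated out.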

Key takeaways

  • Memory is workspace, CPU is work. The 0–100% scale conceals the asymmetry: sustained high CPU means queries arrive faster than they can be processed (queueing, timeouts, cascading failures); sustained high memory usually means the system has amortised its disk reads into RAM. "Unlike sustained high CPU, high memory usage by itself does not mean performance is degraded."
  • Reading from RAM is ~1000× faster than reading from an NVMe drive. This is the entire argument for why Postgres aggressively consumes RAM: "reading a page from RAM is roughly 1,000 times faster than reading it from even a fast NVMe drive. A database that keeps frequently accessed data in memory avoids that penalty on every query." PlanetScale Metal's locally-attached NVMe is fast, but still not as fast as cache. (See concepts/storage-latency-hierarchy.)
  • Two caching layers coexist by design. shared_buffers holds 8 KB Postgres pages keyed by relation+block; OS page cache holds raw filesystem pages. "Postgres was designed with this in mind, and its own documentation notes that the operating system's cache is expected to handle data beyond what fits in shared_buffers." The post explicitly discourages tuning shared_buffers as a first troubleshooting step.
  • The dashboard number conflates four sub-categories. PlanetScale's Cluster Metrics breaks memory into active cache (OS recently touched, wants to keep), inactive cache (not accessed lately), memory-mapped (cached pages backed by real on-disk files), and RSS (Resident Set Size — Postgres process private memory). All three cache sub-categories are reclaimable by the OS. Only RSS is not.
  • RSS is what drives OOM risk. Canonical framing: "If total memory is high because cache is high, good! Frequently accessed data stays near the CPU for faster access. … If total memory is high because RSS is high, that is referred to as memory pressure and is a problem." See concepts/memory-pressure.
  • RSS grows for enumerated reasons, not just query volume. Canonical list (new concepts/postgres-rss-growth-drivers concept): (1) multiple work_mem allocations within a single query (per sort / hash-node, per-query, per-backend — multiplicative); (2) catalog bloat especially in multi-tenant schemas using a table-per-tenant pattern; (3) OS allocator inefficiency (memory returned to heap but not to OS); (4) misbehaving or misconfigured extensions; (5) cached plans and prepared statements accumulating per-session until deallocation.
  • Tuning work_mem is not a safe default knob. "A single complex query can allocate work_mem multiple times, and that multiplies across every active connection. Setting it too low forces more disk I/O; setting it too high globally can cause total memory usage to spike unpredictably under load. Neither direction is a safe default change without first understanding your workload's concurrency and query complexity."
  • Connection pooling is the most effective RSS lever in process-per-connection architectures. "Efficient connection pooling can be the best way to reduce RSS usage. Fewer active connections result in fewer copies of all that per-process overhead." PlanetScale ships PgBouncer in transaction mode — connections returned to the pool at transaction boundary. Canonical new patterns/connection-pooling-to-reduce-rss pattern.
  • RSS is per-process, not per-query — so you can't ask EXPLAIN for "RSS of this query." Canonical debugging playbook (new patterns/triangulate-rss-to-query pattern): (1) Cluster Metrics to identify when RSS rises; (2) Query Insights in that window for expensive patterns (high runtime / CPU / I/O / rows/blocks read) and OOM-adjacent activity; (3) re-run suspect queries with EXPLAIN (ANALYZE, BUFFERS, MEMORY) for operator-level memory — the concepts/explain-analyze-buffers-memory primitive; (4) check concurrent connection counts in the same window.
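
The work_mem multiplication warned about in the takeaways can be made concrete with back-of-envelope arithmetic. The numbers below are hypothetical, not from the post.

```python
# Worst-case work_mem footprint: one allocation per sort/hash node,
# repeated across every active backend.
def worst_case_work_mem_mb(work_mem_mb: int, nodes_per_query: int,
                           active_connections: int) -> int:
    return work_mem_mb * nodes_per_query * active_connections

# A "modest" 64 MB work_mem, 4 sort/hash nodes per query, 200 connections:
print(worst_case_work_mem_mb(64, 4, 200))  # 51200 MB, i.e. 50 GB of potential RSS
```

This is why neither raising nor lowering work_mem is a safe default change: the knob is multiplied by two workload-dependent factors (query complexity and concurrency) before it hits RSS.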
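
The pooling argument is the same arithmetic run in reverse. The baseline-RSS-per-backend figure below is hypothetical, chosen only to illustrate the shape of the savings.

```python
# In a process-per-connection model, every connection is a backend process
# carrying its own stack, heap, relcache, and cached plans.
def backend_rss_mb(connections: int, baseline_mb_per_backend: int = 10) -> int:
    return connections * baseline_mb_per_backend

app_clients = 1000
pool_size = 50  # PgBouncer in transaction mode multiplexes clients onto these
print(backend_rss_mb(app_clients))  # 10000 MB without a pooler
print(backend_rss_mb(pool_size))    # 500 MB behind the pool
```

Fewer backends means fewer copies of the per-process overhead, which is why pooling is framed as the most structurally effective RSS lever.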

Systems / concepts / patterns extracted

Systems

  • PostgreSQL — the substrate; the post's entire framing is Postgres-process-model-specific. Process-per-connection architecture is load-bearing for the RSS-grows-with-connections argument.
  • PlanetScale Postgres — the vendor surface where the dashboard metric appears; the breakdown into active cache / inactive cache / memory-mapped / RSS is PlanetScale's Cluster Metrics taxonomy.
  • PlanetScale Metal — named for locally-attached NVMe; the 1000× RAM-vs-NVMe gap argument is the reason Postgres uses cache even on the fastest storage tier.
  • PlanetScale Insights — Query Insights + Cluster Metrics named as the investigation surface; OOM markers are surfaced here.
  • PgBouncer — PlanetScale's in-proxy connection pooler in transaction mode named as the primary RSS reduction lever.
  • NVMe — named as the storage tier against which RAM is ~1000× faster.

Concepts

  • concepts/storage-latency-hierarchy. The ~1000× RAM-vs-NVMe gap that justifies aggressive caching even on the fastest storage tier.
  • concepts/memory-pressure. Total memory that is high because RSS, not reclaimable cache, is high; the problem state with its own investigation playbook.
  • concepts/postgres-rss-growth-drivers (NEW). The five enumerated drivers of RSS growth beyond raw query volume.
  • concepts/explain-analyze-buffers-memory. The EXPLAIN (ANALYZE, BUFFERS, MEMORY) primitive for operator-level memory attribution.

Patterns

  • patterns/connection-pooling-to-reduce-rss (NEW). In process-per-connection database architectures, each backend carries baseline RSS (stack, heap, catalog/relcache, cached plans); reducing the number of concurrent backends via a pooler is the most structurally effective RSS lever. Transaction-mode PgBouncer is the canonical production implementation.
  • patterns/triangulate-rss-to-query (NEW). Four-step investigation pipeline for per-process metrics that aren't directly attributable to queries: cluster metrics identify the window → query insights finds candidates → EXPLAIN (ANALYZE, BUFFERS, MEMORY) quantifies operator memory → connection counts disambiguate fan-out.
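
The first step of the triangulation pipeline (find the window in which RSS rises) can be sketched as a tiny window scan. The time-series format, a list of (minute, rss_mb) samples, is an assumption for this sketch; a real Cluster Metrics query would replace it.

```python
# Find the steepest RSS climb in a cluster-metrics time series; Query
# Insights would then be scanned inside the returned window.
def rss_rise_window(samples: list, width: int = 3) -> tuple:
    """Return (start_minute, end_minute) of the width-sample window with
    the largest RSS increase."""
    best = max(range(len(samples) - width),
               key=lambda i: samples[i + width][1] - samples[i][1])
    return samples[best][0], samples[best + width][0]

series = [(0, 4000), (1, 4050), (2, 4100), (3, 6000), (4, 7900), (5, 8000)]
print(rss_rise_window(series))  # (2, 5)
```

Steps 2 to 4 are manual in the post: expensive patterns and OOM-adjacent activity in that window, then EXPLAIN (ANALYZE, BUFFERS, MEMORY) on the suspects, then concurrent connection counts.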

Operational numbers / architectural data

  • ~1000× latency gap between RAM access and NVMe access — the quantitative foundation for caching at all. "Reading a page from RAM is roughly 1,000 times faster than reading it from even a fast NVMe drive."
  • 8 KB default Postgres page size (the fixed chunk of table/index data shared_buffers works with).
  • Four memory sub-categories in PlanetScale's Cluster Metrics: active cache, inactive cache, memory-mapped, RSS.
  • Three reclaimable (all cache types); one non-reclaimable (RSS).
  • Process-per-connection Postgres architecture — every connection spawns a backend process with its own baseline RSS footprint.
  • Transaction mode PgBouncer — connections returned to the pool after each transaction; the default pooling mode on PlanetScale.
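
The ~1000× figure above turns into a simple expected-latency formula once a cache hit rate is plugged in. The nanosecond values are illustrative orders of magnitude, not measurements from the post.

```python
# Expected page-read latency as a function of cache hit rate.
RAM_NS = 100        # ~100 ns to read a page already in RAM (illustrative)
NVME_NS = 100_000   # ~100 us from a fast NVMe drive: the ~1000x gap

def expected_read_ns(cache_hit_rate: float) -> float:
    return cache_hit_rate * RAM_NS + (1 - cache_hit_rate) * NVME_NS

# Even a 99% hit rate leaves average latency roughly 11x worse than all-RAM,
# which is why Postgres consumes RAM aggressively even on Metal's local NVMe:
print(expected_read_ns(0.99))  # ~1099 ns
print(expected_read_ns(1.00))  # 100.0 ns
```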

Caveats

  • Pedagogical / dashboard-explainer voice — no PlanetScale production retrospective, no customer war story, no numeric case (no "customer at X GB RSS hit OOM at Y connections"). Structure is dashboard-screenshot + narrative.
  • 1000× RAM-vs-NVMe figure is a standard order-of-magnitude claim, not a PlanetScale measurement. Actual gap varies by workload (random vs sequential, queue depth, cache-line-size effects).
  • RSS growth drivers enumerated but not quantified. Which driver dominates at PlanetScale scale is not disclosed. No distribution across customers, no "X% of OOM incidents trace to catalog bloat."
  • work_mem tuning guidance is qualitative. The post declines to provide a concrete rule because "neither direction is a safe default change" — but this means operators are left to tune empirically without a starting point.
  • The EXPLAIN (ANALYZE, BUFFERS, MEMORY) option is named but not worked through. A reader who doesn't already know EXPLAIN output won't know how to interpret the memory column.
  • Catalog bloat from multi-tenant schemas is flagged but not operationalised (no threshold, no measurement, no mitigation strategy beyond the parenthetical).
  • Tail-latency + heavy-disk-I/O pairing is named as a memory-pressure signal but the causal chain (working set doesn't fit → more misses → more disk → tail latency) is implicit; readers expected to connect the dots.
  • Extension-misbehaviour is named as a cause but not instrumented (how would an operator detect a leaking extension?).
  • No cross-engine comparison. MySQL InnoDB's single-layer buffer pool (no OS page cache reliance in O_DIRECT configurations) behaves differently; the post is Postgres-specific.
