
PATTERN

Triangulate RSS growth from metrics

Problem

RSS is a per-process metric, not a per-query metric. When Postgres RSS climbs toward memory-pressure limits, the operator needs to know which queries are responsible — but EXPLAIN (ANALYZE, BUFFERS) in isolation shows only I/O and runtime, not process-memory footprint, and Query Insights does not expose per-query memory.

Simeon Griggs (PlanetScale, 2026-03-30) names the gap explicitly:

"Query Insights helps you investigate query performance through CPU time, I/O, and latency, but it does not show per-query memory. You may see OOM markers and slow-query signals, but not query-specific RSS usage. RSS is a per-process metric, not a per-query metric. That means you cannot read 'RSS per query' directly from EXPLAIN or Query Insights."

(Source: sources/2026-04-21-planetscale-high-memory-usage-in-postgres-is-good-actually.)

Pattern

Four-signal triangulation workflow:

1. Cluster Metrics
   → identify the time window when RSS rose
2. Query Insights (same window)
   → find expensive query patterns (high runtime / CPU / I/O / rows-read)
   → find OOM-adjacent activity (error markers, connection resets)
3. EXPLAIN (ANALYZE, BUFFERS, MEMORY) on suspects
   → operator-level memory attribution at the plan-node level
      (see the SQL sketch after this list)
4. Connection count (same window)
   → rule out high-concurrency RSS contribution
      (many moderate-memory backends can outweigh a few memory-hungry queries)
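
A minimal sketch of steps 3 and 4, assuming a hypothetical suspect query against an illustrative events table (the table, columns, and LIMIT are placeholders, not from the source):

  -- Step 3: plan-node attribution. The MEMORY option requires Postgres 17+;
  -- on older versions, EXPLAIN (ANALYZE, BUFFERS) still reports each sort
  -- node's memory/disk usage in its output.
  EXPLAIN (ANALYZE, BUFFERS, MEMORY)
  SELECT *
  FROM events
  ORDER BY date DESC, id DESC
  LIMIT 100000;

  -- Step 4: backend count in the same window, to gauge the concurrency multiplier.
  SELECT count(*) FROM pg_stat_activity WHERE state <> 'idle';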

The four signals constrain the suspect space from different angles:

  • Cluster Metrics gives the when.
  • Query Insights gives the what (candidate query set).
  • EXPLAIN (ANALYZE, BUFFERS, MEMORY) gives the per-node memory cost.
  • Connection count gives the concurrency multiplier.

None of them alone is sufficient; together they typically isolate the root cause.

Worked example

RSS rose from 4 GB to 18 GB over 10 minutes, and OOM kill events appeared at the peak.

  1. Cluster Metrics: narrow to the 10-minute window.
  2. Query Insights for that window: one analytics query pattern — SELECT ... ORDER BY date DESC, id DESC LIMIT N with a large N — appears with 50+ concurrent executions, each running for 30-60 seconds.
  3. EXPLAIN (ANALYZE, BUFFERS, MEMORY): the query produces a large on-disk sort because the existing composite index's column order doesn't match the ORDER BY; the sort node shows a ~150 MB work-memory footprint per execution.
  4. Connection count: 50 concurrent backends × 150 MB per sort ≈ 7.5 GB of transient RSS from this one query shape alone, accounting for the bulk of the observed 14 GB RSS jump (the "50+" execution count and other per-backend allocations plausibly cover the rest).

Root cause: work_mem multiplication across concurrent executions of an analytics query without an index matching its ORDER BY.
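
A rough worst-case bound for the transient memory of one query shape, assuming each sort fills its work_mem budget and one sort node per query (assumptions, not from the source):

  -- Per-operation memory budget: each sort/hash node in each backend may
  -- claim up to this much before spilling to disk.
  SHOW work_mem;
  -- Bound: concurrent backends × sort nodes per query × work_mem.
  -- Worked example above: 50 × 1 × ~150 MB ≈ 7.5 GB of transient RSS.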

Remediation:

  • Short-term: throttle the analytics workload in Traffic Control.
  • Long-term: add an index matching the ORDER BY, or refactor the query to use index-order (keyset) pagination (both sketched below).
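
A minimal sketch of the long-term fixes, reusing the hypothetical events table from above (the index name and predicate values are illustrative):

  -- Matching index: lets ORDER BY ... LIMIT read rows in index order,
  -- avoiding the large sort entirely.
  CREATE INDEX CONCURRENTLY events_date_id_desc_idx
    ON events (date DESC, id DESC);

  -- Keyset (index-order) pagination: resume from the last-seen (date, id)
  -- instead of sorting and skipping a large window per request.
  SELECT *
  FROM events
  WHERE (date, id) < ('2026-03-01', 12345)  -- last row of the previous page
  ORDER BY date DESC, id DESC
  LIMIT 50;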

Variations

  • pg_stat_activity snapshot during the rise window to see per-backend query text — coarser than Query Insights but available without installing an extension (sketched below).
  • top / ps on the DB host for per-backend RSS — only accessible to operators with host-level access (not customers on managed services).
  • Cumulative pg_stat_statements diff across the window to find the queries with the biggest change in execution count (sketched below).
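
Hedged sketches of the first and third variations (the snapshot-table name is illustrative; the diff assumes the pg_stat_statements extension is installed):

  -- Variation 1: pg_stat_activity snapshot during the rise window.
  SELECT pid, state, query_start, left(query, 120) AS query
  FROM pg_stat_activity
  WHERE state <> 'idle'
  ORDER BY query_start;

  -- Variation 3: snapshot pg_stat_statements at the window start...
  CREATE TEMP TABLE stmts_before AS
    SELECT queryid, query, calls FROM pg_stat_statements;
  -- ...then diff after the window to rank by execution-count change.
  SELECT s.queryid, left(s.query, 80) AS query,
         s.calls - b.calls AS calls_delta
  FROM pg_stat_statements s
  JOIN stmts_before b USING (queryid)
  ORDER BY calls_delta DESC
  LIMIT 10;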
