Skip to content

PATTERN Cited by 1 source

Workload-class resource budget

Problem

A database cluster runs multiple workload classes concurrently: high-priority transactional traffic, medium-priority user-facing analytics, low-priority scheduled reports, job-queue workers. The classes compete for the same resources (CPU, I/O, backends, worker processes) but have different priorities and different blast radii if they fall behind.

Per-query timeouts (statement_timeout, transaction_timeout) can kill runaway individuals but cannot limit how many queries from a class can run concurrently. Continuously overlapping short queries from a low-priority class defeat all timeout-based controls — no individual query hits a duration limit, yet the class as a whole keeps the MVCC horizon pinned, starves autovacuum, or monopolizes I/O.

Autovacuum tuning can't help either — it has nothing it can legally reclaim while the horizon is pinned.

Solution

Tag queries with a workload class (typically via query-string metadata or a session variable) and apply a resource budget to each class with at least one of three dials:

  1. Maximum concurrent workers / backends — percentage of max_worker_processes that class can consume at any instant. Primary lever for protecting the MVCC horizon: cap action=analytics to e.g. 1 concurrent worker and the horizon opens between queries.
  2. Server share + burst limit — fraction of CPU/I/O the class may consume, with a burst ceiling. Smooths spikes from a class that occasionally runs heavy queries.
  3. Per-query limit — seconds of full-server resource use per individual query, independent of wall-clock time.

Queries that exceed a class's budget are blocked and the caller is expected to retry later, not killed outright. The total work happens, just at a rate the higher-priority class can coexist with.

Callers must implement retry logic — throttling without retry just moves the failure from "database dies" to "queries fail."

Tagging mechanism

Typical implementation: each class tags queries with SQLCommenter metadata appended to the SQL — e.g.

/* action=analytics,team=growth */
SELECT ... FROM ...

The resource-budget engine parses the comment, maps action=analytics to a budget definition, and enforces the dials. Non-tagged queries default to the "unclassified" bucket (typically unconstrained).

Canonical instance: PlanetScale Traffic Control

systems/planetscale-traffic-control is the canonical wiki instance — part of the Insights extension on PlanetScale Postgres. All three dials (server share + burst, per-query limit, max concurrent workers) are configurable per-budget. SQLCommenter tags select the budget.

Measured outcome in Griggs's 800-jobs/sec stress test with 3 concurrent overlapping 120-sec analytics queries (Source: sources/2026-04-11-planetscale-keeping-a-postgres-queue-healthy):

Metric Disabled Enabled (analytics cap = 1)
Queue backlog 155,000 jobs 0 jobs
Lock time 300+ ms 2 ms
Dead tuples at end 383,000 0–23,000 cycling
Analytics queries 3 concurrent, overlapping 1 at a time, 2 retrying
VACUUM effectiveness Blocked (horizon pinned) Normal (windows between queries)
Outcome Death spiral Completely stable
  • patterns/budget-enforced-quota-throttle (Pinterest Piqama) applies the same idea at Kubernetes-resource layer: quota-managed consumers must stay within an assigned budget or throttle.
  • AWS RDS Performance Insights + AWS Proactive Insights surface the telemetry but don't enforce class-level concurrency caps.
  • Postgres ALTER USER ... CONNECTION LIMIT is the coarsest version at the role granularity.
  • pgbouncer pool-mode pool_size per database/user can approximate concurrency caps at the proxy tier.

Each is a narrower version of the general shape: tag work by class, budget resources per class, retry-block excess.

When it doesn't apply

  • Single-workload clusters — no classes to distinguish.
  • Low-churn databases — MVCC horizon is a non-issue so the primary motivator is absent.
  • Workloads where every query is equal-priority (all-or-nothing throughput is the objective).

Seen in

  • sources/2026-04-11-planetscale-keeping-a-postgres-queue-healthy — Canonical wiki instance. Griggs argues that autovacuum tuning and Postgres-upstream timeouts are structurally insufficient for mixed-workload clusters because they target duration, not concurrency. Traffic Control is the pattern-instantiating mechanism, measured to convert a death-spiral to complete stability at 800 jobs/sec + overlapping analytics.
Last updated · 319 distilled / 1,201 read