PLANETSCALE 2026-03-31

PlanetScale — Graceful degradation in Postgres

Summary

Ben Dicken's 2026-03-31 post reframes PlanetScale Traffic Control (already canonicalised via the 2026-04-11 Keeping a Postgres queue healthy post) from a mixed-workload contention lens to a graceful-degradation-under-load lens. The worked application is a social-media platform whose traffic (authentication, post fetching, likes, impressions, notifications, comments, trending, DMs, etc.) is partitioned into three priority tiers — critical (auth, post fetching, post creation, profile), important (comments, search, DMs), best-effort (like/impression/bookmark counts, trending, notifications, analytics) — and each tier receives a distinct Traffic Control budget. Under a viral-event / bad-deploy / DDoS load spike, the best-effort tier is the first to be shed (either automatically via its low Server share + low max concurrent workers, or manually via a live budget-disable) while critical keeps working. Canonical PlanetScale framing: stop serving non-critical components for a few minutes and users barely notice; let everything contend equally and the app becomes unusable and users leave.

The post also canonicalises the warn → enforce operational lifecycle for Traffic Control budgets (ship in warn mode, watch flagged-over-budget counts in Insights, tune limits, then switch to enforce) and the [PGINSIGHTS] Traffic Control: warning channel delivered in-band on Postgres query responses so applications can observe budget pressure without user-facing impact.

Key takeaways

  1. Not all traffic is created equal — and the default database load-handling model treats it as if it were. Under normal load this doesn't matter; under spike load it does. "Every query has an equal shot at consuming CPU and I/O, which means a flood of impression-count queries can starve the ones that users care most about, like authenticating and loading their timeline."

  2. Three-tier priority classification as a reusable template. Critical (app is broken without these: auth, post creation, post fetching, author profiles) / Important (noticeable if missing, app still usable: comments, post search, DMs — "oh hello 𝕏.com") / Best-effort (nice to have: like + impression + bookmark counts, trending topics, notifications, analytics). Canonical new concepts/query-priority-classification concept. "Your tiers will look different depending on your application. The point is to identify what you're willing to shed under pressure so that the things that matter most keep working."

  3. SQLCommenter as the tagging substrate — a standard for appending key=value metadata as a SQL comment, parsed by Insights, used by Traffic Control for budget selection. You pick the keys: this post uses category=viewPost, priority=critical (both a fine-grained category axis and a coarse priority axis on every query).

    SELECT body, author_id, created_at FROM posts WHERE id = $1
    /* category='viewPost', priority='critical' */
    
    Budgets can then be set at either granularity — per-category (a dozen or so budgets, tuned independently) or per-priority (three coarse budgets). The post advocates the priority-level setup as the default starting point.
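A minimal sketch of how such tags might be attached at the call site. The tag_query helper is illustrative, not from the post; real SQLCommenter libraries also URL-encode values per the spec.

```python
# Sketch of SQLCommenter-style tagging at the call site.
# Helper name and escaping are illustrative.

def tag_query(sql: str, **tags: str) -> str:
    """Append key='value' metadata as a trailing SQL comment."""
    meta = ", ".join(f"{k}='{v}'" for k, v in sorted(tags.items()))
    return f"{sql} /* {meta} */"

sql = tag_query(
    "SELECT body, author_id, created_at FROM posts WHERE id = $1",
    category="viewPost",
    priority="critical",
)
# cursor.execute(sql, (post_id,))  # the tags travel with the query text
print(sql)
# → SELECT body, author_id, created_at FROM posts WHERE id = $1 /* category='viewPost', priority='critical' */
```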

  4. Three priority-tier budget recipe (verbatim values from the post):

     • critical-budget — apply to priority='critical' queries. No Server share, no Burst limit. Per-query max = 2 seconds (protects against rogue slow queries on the critical path).
     • important-budget — apply to priority='important'. Server share = 25% with moderate max concurrent workers. "Plenty of room for comments and notifications under normal conditions, but some will be blocked when traffic is unexpectedly high." Start in warn mode, switch to enforce after tuning.
     • best-effort-budget — apply to priority='bestEffort'. Server share = 20% + low max concurrent workers. "Under normal load, this budget provides more than enough resource share for these lightweight queries." Under a spike, this traffic can be dynamically reduced further or shut off completely, right from the PlanetScale app.

  5. Warn mode → enforce mode is the budget-tuning operational lifecycle, not a one-shot configuration. Canonical new concepts/warn-mode-vs-enforce-mode concept. "There's no need to get the tunings above perfect from day one. You can start every budget in warn mode. This will not kill any queries that exceed the budget. Rather, it will warn, and you can click into the budget to see how many queries are exceeding it over time." Only after the tuning stabilises does the budget flip to enforce. Canonical flow: comment → warn → monitor → enforce.

  6. In-band warning channel — over-budget events surface as [PGINSIGHTS] Traffic Control: warnings returned directly in the query response from Postgres, so applications can observe the impact "from within your application without any user-facing effects." Canonical wiki datum: a managed database can attach diagnostic metadata to query responses alongside the actual row data, using the extension layer as the piggyback channel.

  7. Live budget changes as the load-shedding lever. Under a spike, "we can click into the best-effort-budget and completely disable this traffic. Changes to budgets happen live, so we would immediately see the impact of this." Operators do not need to deploy an application change to shed load — the budget-config surface is the lever. Canonical new patterns/shed-low-priority-under-load pattern: when capacity is exhausted, cut the lowest-priority traffic class at the infrastructure layer (budget disable), not by application-code changes.

  8. "Temporary degradation of non-critical functionality" vs "total outage" — the architectural reframe. "What could have been a huge lost-opportunity (your app becomes unusable) is now only a temporary degradation of non-critical functionality. We've kept our users happy and avoided an application outage." The same mechanism that protects the MVCC horizon in a Postgres-queue workload (sources/2026-04-11-planetscale-keeping-a-postgres-queue-healthy) becomes the user-facing graceful-degradation lever when applied to priority-tiered user traffic. Two problems, one mechanism.
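The in-band warning channel could be observed client-side roughly like this, assuming — the post does not confirm the mechanics — that [PGINSIGHTS] warnings arrive as ordinary Postgres notice messages that client libraries already expose. The metrics hook is hypothetical.

```python
# Sketch: filtering a client's collected notices down to Traffic Control
# warnings. Assumes the warnings are delivered as standard Postgres notices.

PREFIX = "[PGINSIGHTS] Traffic Control:"

def budget_warnings(notices: list[str]) -> list[str]:
    """Return only the in-band Traffic Control budget warnings."""
    return [n.strip() for n in notices if PREFIX in n]

# With psycopg2, notices accumulate on the connection object:
#   conn = psycopg2.connect(dsn)
#   cur = conn.cursor()
#   cur.execute(sql)                        # rows come back as usual
#   for w in budget_warnings(conn.notices):
#       metrics.increment("tc.over_budget") # hypothetical metrics hook
```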

Systems extracted

  • PlanetScale Traffic Control — framed here at the user-facing graceful-degradation altitude rather than the mixed-workload contention altitude. Same three dials (server share + burst, max concurrent workers, per-query limit) applied to a different problem class.
  • PlanetScale Insights — the over-budget observability surface. Budget violations visible in-app; [PGINSIGHTS]-channel warnings returned in query responses; warn-mode traffic counts surface over a 3-hour window.
  • PlanetScale Postgres — the substrate. Traffic Control is Postgres-exclusive on PlanetScale; upstream / AWS-RDS / GCP-SQL Postgres does not have this feature.
  • Postgres — the engine under the application. The [PGINSIGHTS] warning mechanism is delivered inside the Postgres extension layer.

Concepts introduced

  • concepts/query-priority-classification — classify every query by user-perceived priority (critical / important / best-effort) so resource budgets can be applied per tier, not per endpoint.
  • concepts/warn-mode-vs-enforce-mode — budgets ship in warn mode (flag, don't kill), get tuned against flagged-over-budget counts in Insights, then flip to enforce.

Patterns introduced

  • patterns/shed-low-priority-under-load — the graceful-degradation-as-infrastructure pattern: classify traffic by user-perceived priority, apply per-class resource budgets that normally fit, cut the lowest-priority class at the budget layer under spike load. Sibling of the existing patterns/workload-class-resource-budget (same mechanism, different framing axis — the budget pattern is about coexistence; this pattern is about shedding).
  • patterns/workload-class-resource-budget — extended with the user-facing-priority instance. Previously framed via the 2026-04-11 Postgres-queue post at the MVCC-horizon / mixed-workload axis; this post adds the three-tier user-priority canonical application.

Operational numbers

  • critical-budget: no Server share / Burst limit; per-query cap = 2 seconds.
  • important-budget: Server share = 25%; moderate max concurrent workers.
  • best-effort-budget: Server share = 20%; low max concurrent workers; can be fully disabled live under spike.
  • Warn-mode data collection window illustrated in the post: 3-hour window, thousands of flagged-over-budget requests for a too-restrictive budget.
  • Spike scenario modelled: 10× increase in authentications, posts, likes, notifications, impressions, page loads driven by a "crazy news story or celebrity drama."
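The three budgets above, expressed as plain data for reference. PlanetScale configures Traffic Control budgets through its UI; this structure is purely illustrative, with the numeric values taken from the post.

```python
# The three-tier budget recipe as plain data. Format is illustrative;
# only the numbers (2 s cap, 25%, 20%) come from the post.

BUDGETS = {
    "critical-budget": {
        "match": {"priority": "critical"},
        "server_share": None,          # no Server share, no Burst limit
        "per_query_max_seconds": 2.0,  # guards against rogue slow queries
        "mode": "enforce",
    },
    "important-budget": {
        "match": {"priority": "important"},
        "server_share": 0.25,
        "max_concurrent_workers": "moderate",
        "mode": "warn",                # start in warn, flip to enforce after tuning
    },
    "best-effort-budget": {
        "match": {"priority": "bestEffort"},
        "server_share": 0.20,
        "max_concurrent_workers": "low",
        "mode": "enforce",             # can be disabled live under a spike
    },
}
```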

Caveats

  • Illustrative / pedagogical voice. No production-customer retrospective, no measured MTTR, no measured user-retention delta from Traffic Control under a real incident, no A/B comparison with a no-priority-tiers baseline.
  • Threshold-picking is declared "not hard" but the post doesn't give heuristics for the Server share split beyond the 25% / 20% example. The post's guidance is procedural (start in warn, tune down to where violations drop, flip to enforce) rather than numerical.
  • Classification burden is pushed to the application. Every query needs a SQLCommenter tag at the call-site; untagged queries fall into an unclassified default bucket. The operational cost of "tag every query path with the right priority" is not discussed.
  • Retry responsibility lives with the caller. Traffic Control blocks over-budget queries and expects the application to retry — if the application doesn't retry, throttling degrades from "smooth" to "fail." The post doesn't elaborate on retry-storm avoidance (exponential backoff, jitter) under a spike where many requests are simultaneously blocked.
  • PGINSIGHTS warning channel mechanics not fully specified. Whether the warning is a NoticeResponse Postgres-protocol message, a header, or an Insights-specific extension point is not disclosed; whether warnings are delivered on every over-budget query or sampled; whether a client library needs to know about the channel to observe them.
  • No quantification of the "temporary degradation" user-impact. The argument is structural ("users barely notice") rather than backed by user-engagement data for the worked shape.
  • Only Postgres scope. MySQL-side Traffic Control (if it exists on PlanetScale Metal for MySQL) is not discussed; the post is explicitly Postgres-centric.
  • Single-cluster scope. Multi-region / read-replica interactions with Traffic Control budgets are not discussed (do budgets apply per-instance or per-cluster? What happens during an automatic failover?).
