PlanetScale: Anatomy of a Throttler, part 1
Summary
Shlomi Noach (creator of gh-ost, co-author of the Vitess throttler,
ex-GitHub, now at PlanetScale) opens a three-part series on throttler
design for database systems. Part 1 is the language-establishing post:
it defines what a database throttler is and which workloads it is built
for, explains why generic rate limiters fail at databases, introduces
the collaborative-vs-barrier shape choice, and, most importantly, walks
five candidate MySQL throttler metrics (replication lag,
`threads_running`, transaction commit delay, queue length / load
average, and pool exhaustion), framing each as a symptom summarising
some underlying queue (queueing theory). Along the way it canonicalises
the architectural shape (multi-metric throttling), the sampling-interval
trade-off (oversample 2–5× the threshold range), the emergent
threshold-as-standard steady state, and the observability-substitute
consequence: introducing the throttler perturbs the system it was
installed to protect, and the throttler's own logs become the primary
debugging surface for throttled work. Parts 2 and 3 extend the framework
along the deployment-topology axis (singular vs distributed, fail-open
vs fail-closed, hibernation) and the client-side axis (client identity,
probabilistic rejection, enforcement-mode throttlers) respectively.
Key takeaways
-
Throttler = domain-specific backpressure primitive. "A throttler is a service or component that pushes back against an incoming flow of requests to ensure the system has the capacity to handle all received requests without being overwhelmed. In this series of posts, we illustrate design considerations for a database system throttler, whose purpose is to keep the database system healthy overall." (Source: sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-1). Canonical wiki framing for database throttler.
-
Target workload is asynchronous / batch / massive, not OLTP. "We focus on throttling asynchronous, batch, and massive operations that are not time critical. Examples could be ETLs, data imports, online DDL operations, mass purges of data, resharding, and so forth. The throttler will push back on those operations that can span minutes, hours, or days of operation. Other forms of throttling may push back on OLTP production traffic. This discussion applies equally to both." The 10-million-row import is the canonical worked example: 10M rows in 100-row batches = 100,000 subtasks, each gated by a throttler check.
-
Collaborative vs barrier is the first shape choice. "Some throttler implementations are collaborative, meaning they assume clients will respect their instructions. Others act as barriers between the app and the database." The rest of Part 1 focuses on the collaborative model; Part 3 canonicalises the barrier model as the answer to rogue / malfunctioning clients. Canonical wiki framing for collaborative throttler check API.
-
Subtask sizing is a two-sided tuning problem. "Each subtask should be small enough so as not to single-handedly tank the database's serving capacity, while large enough to compensate for the added throttler overhead and to enable meaningful progress." Canonical instance: 100 rows per subtask. Too small → throttler overhead dominates; too large → one subtask can degrade the database before the next check fires.
-
Generic rate limiters are the wrong primitive for databases. "Some generic throttlers only allow a regulated rate of requests, in anticipation that the consuming job will be able to process them at some known, fixed rate. With databases, things are less clear. A database can only handle so many queries at any given point in time, or over some period of time. However, not all queries are created equal. The database capacity of serving queries depends on the scope of queries, any hot spots or cold spots in affected data, the state of the page cache, overlap or lack thereof of data served by queries, to name a few factors." Throttlers admit on system-health signals, not on request rate.
-
Replication lag is the canonical MySQL throttling signal. "In the MySQL world, replication lag is probably the single most used throttling indicator, as multiple third-party and community tools use it to push back against long-running jobs. This is for good reasons: it is easy to measure and it has clear impact on the product and the business. For example, in the case of a database failover, replication lag impacts the time it takes for a replica/standby server to be promoted and made available to receive write requests. Read-after-write can be simplified when replication lag is low, allowing secondary servers to serve some of the read traffic." Canonical replication-lag framing.
-
Symptom vs cause: the load-bearing reframe. Noach walks each candidate metric and asks "what queue is this actually measuring?" His core insight: "Circling back to replication lag, much like concurrent queries, it is a symptom. E.g. disk I/O is saturated on the replica, hence the replica cannot keep up replaying the changelog, thereby accumulating lag. Or perhaps the lag is caused by slow network. Or both! Whatever the case is, what's interesting is that the replication mechanism itself is a queue: the changelog event queue." Every good throttling metric is a symptom metric summarising an underlying queue, and that's precisely why it's useful — the metric composes multiple failure modes into one actionable signal.
-
`threads_running` has no portable threshold. "But, what's an acceptable value? Are `50` concurrent queries OK? Are `100` OK? Pick a number, and you'll soon find it doesn't hold water. Some values are acceptable in early morning, while others are just normal during peak traffic hours. As your product evolves and its adoption increases, so do the queries on your database. What was true 3 months ago is not true today. And again, not all queries are created equal." Load-bearing framing against arbitrary-threshold metrics: thresholds must be derived from physics (hardware-bound, like fsync time) or from system configuration (connection-pool size), not guessed.
-
Queue-length vs queue-wait-time is the deeper choice. "A long queue at the airport isn't in itself a bad thing, some queues move quite fast, and yet it's often a predictor to wait times. Where wait time is impossible or difficult to measure, queue length can be an alternative." Canonical wiki framing for concepts/queue-length-vs-wait-time: transaction-commit-delay and replication-lag are wait-time metrics; `threads_running` and load-average are length metrics. Wait-time is preferred when measurable.
-
Pool exhaustion is the canonical "natural threshold" metric. "An exhausted pool is a strong indication of excessive load, while the difference between a `60%` and an `80%` used pool is not as clear an indication. Taking a step back, what does it mean that we exhaust some pool? Who decides the size of the pool in the first place? If someone picked a number such as `50` or `100`, isn't that number just artificial? It may well be, but pool size was likely chosen for some good reason(s). It is perhaps derived from some database configuration, which is itself derived from some hardware limitation." Canonical design principle: the throttler inherits existing system configuration as its threshold rather than introducing new artificial ones. See concepts/connection-pool-exhaustion.
-
Multi-metric is the recommended architectural shape. "A throttler should be able to push back based on a combination of metrics, and not limit itself to just one metric. We've illustrated some metrics above, and every environment may yet have its own load predicting metrics. The administrator should be able to choose an assorted set of metrics the throttler should work with, be able to set specific thresholds for each such metric, and possibly be able to introduce new metrics either programmatically or dynamically." Canonical framing for patterns/multi-metric-throttling: the recommended shape and the one Vitess 21 ships in production.
-
Throttler-as-observability-substitute (the Heisenbug analogue). "Many software developers will be familiar with the next scenario: you have a multithreaded or otherwise highly concurrent app. There's a bug, likely a race condition or a synchronization issue, and you wish to find it. You choose to print informative debug messages to standard output, and hope to find the bug by examining the log. Alas, when you do so, the bug does not reproduce. Or it may manifest elsewhere." The corresponding shift for throttled databases: "All of a sudden, there is less contention on the database, and certain apps that used to run just fine, exhibit contention/latency behavior. … But where previously you could clearly analyze database queries to find the root cause, the database now tells you little to nothing. It's now down to the throttler to give you that information." Canonical framing for patterns/throttler-observability-substitute.
-
Threshold-as-standard: the emergent steady state. "As we start the operation, we expect to see the replication lag graph jump up to the threshold value, and then more or less stabilize around that value, slightly higher and slightly lower, for the duration of the operation, which could be hours." And: "It is not uncommon for a system to run one or two operations for very long periods, which means what we consider as the throttling threshold (say, a `5sec` replication lag) becomes the actual standard." Implication: the throttler threshold is not an upper bound on rare excursions; it is the expected steady-state value for as long as a throttled workload keeps pushing. Canonical framing for concepts/throttler-threshold-as-standard.
-
Sampling interval bounds throttler responsiveness. Worked timing example: heartbeat injected at `12:00:00.000`, sampled at `12:00:00.995` (captures it), client checks at `12:00:01.990` and gets a response based on a sample that is nearly 1 second old, describing a heartbeat that was itself nearly 1 second old at capture time, so the information is nearly 2 seconds stale in total. "Long heartbeat intervals and outdated information have negative impacts on both our system health as well as the throttler's utilization." Canonical framing for sampling interval as a control-loop parameter.
-
Release-thundering-herd from long intervals + shared metric. "When multiple operations attempt to make progress all at once, all will be throttled while metrics are above threshold, and possibly all released at once when metrics return to low values, thus all pushing the metrics up at once." Long sampling intervals synchronise client releases; shorter intervals desynchronise them.
-
Oversample 2–5× the threshold range (Nyquist-inspired). "Borrowing from the world of networking hardware, it is recommended that metric interval and granularity oversample the range of allowed thresholds. For example, if the acceptable replication lag is at 5 seconds, then it's best to have a heartbeat/sampling interval of 1-2 seconds." Canonical framing for concepts/oversampling-metric-interval — a signal-processing-derived design rule for throttler sampling.
-
Higher sampling fidelity has a real cost. "Lower intervals and more accurate metrics reduce spikes and spread the workload more efficiently. That, too, comes at a cost, which we will discuss in a later post." Forward pointer to Part 2's discussion of hibernation and cost-aware sampling.
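
The timing walk in the sampling-interval takeaway can be reproduced with plain timestamp arithmetic. This is only an illustrative sketch: the three clock times come from the post, while the calendar date and the variable names are arbitrary placeholders.

```python
from datetime import datetime

# Clock times from the worked example; the date is an arbitrary placeholder.
heartbeat_written = datetime(2024, 1, 1, 12, 0, 0, 0)
sample_taken = datetime(2024, 1, 1, 12, 0, 0, 995_000)
client_check = datetime(2024, 1, 1, 12, 0, 1, 990_000)

# Age of the heartbeat at the moment the throttler sampled the replica.
lag_at_capture = sample_taken - heartbeat_written
# Age of that sample at the moment the client asked the throttler.
sample_age = client_check - sample_taken
# Age of the information the client's answer is actually based on.
total_staleness = client_check - heartbeat_written

print(lag_at_capture.total_seconds())   # 0.995
print(sample_age.total_seconds())       # 0.995
print(total_staleness.total_seconds())  # 1.99
```

With a 1-second heartbeat and a 1-second sample interval, the two delays stack, which is why the answer can be almost two intervals old.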
Systems named
| System | Role in post |
|---|---|
| systems/mysql | Canonical database engine for every worked metric. |
| systems/vitess | The production Vitess throttler is Noach's implicit reference architecture throughout the series. |
| systems/vitess-throttler | Named explicitly in Part 2; Part 1 is the language-establishing prelude for it. |
| systems/planetscale | Publisher; the throttler is part of the PlanetScale-managed Vitess stack. |
| pt-heartbeat | The canonical replication-lag-heartbeat tool the post implicitly references. Explicitly named in Part 2. |
Concepts canonicalised
| Concept | First wiki home or Part-1 extension |
|---|---|
| concepts/database-throttler | Canonical definition. |
| concepts/replication-lag | Canonical MySQL throttling-signal framing + threshold-as-standard dynamic. |
| concepts/threads-running-mysql | Symptom-metric framing; no portable threshold. |
| concepts/transaction-commit-delay | Wait-time symptom of the commit / fsync queue. |
| concepts/load-average | Queue-length proxy with hardware-sensitive threshold. |
| concepts/connection-pool-exhaustion | Natural-threshold metric inherited from system configuration. |
| concepts/queueing-theory | The substrate framing every throttler metric sits on. |
| concepts/symptom-vs-cause-metric | Canonical reframe: symptom metrics are more useful for throttling because they compose multiple failure modes. |
| concepts/queue-length-vs-wait-time | Deeper metric-design axis: prefer wait-time where measurable. |
| concepts/metric-sampling-interval | Control-loop parameter; staleness + jitter + release-thundering-herd problems. |
| concepts/oversampling-metric-interval | 2–5× oversampling rule of thumb. |
| concepts/throttler-threshold-as-standard | Emergent steady state of long-running throttled workloads. |
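
The queue-length vs wait-time distinction listed above can be made concrete with a rough steady-state approximation: the same queue length implies very different waits depending on how fast the queue drains. The function name and all numbers here are invented for illustration and do not come from the post.

```python
def expected_wait_s(queue_length: int, served_per_second: float) -> float:
    """Rough time for a new arrival to reach the front of a FIFO queue."""
    return queue_length / served_per_second


# Same length, different drain rates (hypothetical values).
print(expected_wait_s(200, 50.0))  # 4.0
print(expected_wait_s(200, 2.0))   # 100.0
```

This is why a length metric needs a context-dependent threshold, while a wait-time metric can be compared directly against a latency budget.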
Patterns canonicalised
| Pattern | First wiki home or Part-1 extension |
|---|---|
| patterns/collaborative-throttler-check-api | Canonical check-then-proceed client contract. |
| patterns/multi-metric-throttling | Recommended architectural shape. |
| patterns/heartbeat-based-replication-lag-measurement | Canonical measurement mechanism whose sampling-interval trade-offs Part 1 analyses. |
| patterns/throttler-observability-substitute | Architectural consequence: the throttler's log becomes the primary on-call surface for throttled work. |
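
The check-then-proceed contract and the multi-metric shape listed above can be sketched together. This is a minimal illustration under stated assumptions, not the Vitess implementation: `MultiMetricThrottler`, `run_job`, the metric names, and the readings are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MultiMetricThrottler:
    """Hypothetical throttler: rejects while any metric is at or above its threshold."""
    # metric name -> (function returning the current value, threshold)
    metrics: Dict[str, Tuple[Callable[[], float], float]]
    log: List[str] = field(default_factory=list)

    def check(self, client: str) -> bool:
        for name, (read, threshold) in self.metrics.items():
            value = read()
            if value >= threshold:
                # Record who was pushed back and why; operators read this log later.
                self.log.append(f"{client} rejected: {name}={value} >= {threshold}")
                return False
        return True


def run_job(rows: int, batch: int, throttler: MultiMetricThrottler,
            client: str, backoff_s: float = 0.0) -> int:
    """Process `rows` in `batch`-sized subtasks, checking the throttler before each."""
    done = 0
    while done < rows:
        if not throttler.check(client):
            time.sleep(backoff_s)  # collaborative: the client chooses to wait and retry
            continue
        done += min(batch, rows - done)  # stand-in for the real subtask
    return done


# Hypothetical readings: lag starts above the 5-second threshold, then recovers.
lag_readings = iter([7.0, 6.0, 2.0])
throttler = MultiMetricThrottler(metrics={
    "replication_lag_s": (lambda: next(lag_readings, 1.0), 5.0),
    "pool_in_use": (lambda: 40.0, 100.0),  # threshold inherited from a pool size of 100
})
processed = run_job(rows=300, batch=100, throttler=throttler, client="import-job")
print(processed, len(throttler.log))  # 300 2
```

The rejection log is the part that matters for the observability-substitute pattern: once work is throttled, the database shows little contention, and the log is where the reason for slow progress is recorded.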
Operational numbers and worked examples
| Datum | Value | Context |
|---|---|---|
| Worked import size | 10 million rows | The post's running example workload. |
| Worked subtask size | 100 rows per subtask | Canonical wiki datum for collaborative-throttler subtask sizing. |
| Worked replication-lag threshold | 5 seconds | Canonical worked example for the threshold-as-standard dynamic. |
| Worked heartbeat interval | 1 second | Canonical worked example for the sampling-staleness timing walk. |
| Worked sample interval | 1 second | Same. |
| Worked worst-case staleness | ~2 seconds | Sample captured at t+0.995 s, client reads at t+1.990 s. |
| Oversampling rule of thumb | 2–5× threshold range | Canonical Nyquist-inspired design rule. |
| Load-average heuristic | 1 per CPU | Named as "a common rough indicator" with the caveat "some database deployments famously push their servers to their limits with load averages soaring far above 1 per CPU." |
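
The arithmetic behind several rows of this table can be checked in a few lines. The row counts, batch size, and lag threshold come from the post; the per-check cost `CHECK_COST_MS` is a hypothetical figure, added only to show why very small subtasks let throttler overhead dominate.

```python
import math

TOTAL_ROWS = 10_000_000
CHECK_COST_MS = 1  # hypothetical cost of one throttler check; not from the post


def subtasks(rows_per_subtask: int) -> int:
    """Number of subtasks, and therefore throttler checks, for the import."""
    return math.ceil(TOTAL_ROWS / rows_per_subtask)


def check_overhead_s(rows_per_subtask: int) -> float:
    """Total time spent on throttler checks alone, in seconds."""
    return subtasks(rows_per_subtask) * CHECK_COST_MS / 1000


def oversampling_factor(threshold_s: float, interval_s: float) -> float:
    """How many samples fit inside one threshold-sized window."""
    return threshold_s / interval_s


print(subtasks(100))          # 100000
print(check_overhead_s(100))  # 100.0
print(check_overhead_s(1))    # 10000.0
print(oversampling_factor(5.0, 1.0), oversampling_factor(5.0, 2.0))  # 5.0 2.5
```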
Caveats
- Language-establishing post, not architecture disclosure. Part 1 is deliberately pedagogical — the post introduces terminology and design-space geometry that Parts 2 (deployment topology) and 3 (client side) build on. Readers who want the mechanism should read all three.
- Vitess throttler implied but not yet named. Part 1 never uses "Vitess" in body text; the reference implementation is implicit. Part 2 names `vttablet` and the tablet throttler explicitly.
- MySQL-specific metrics throughout. Replication lag, `threads_running`, transaction commit delay, and pool usage are MySQL-flavoured; the Postgres equivalents (`pg_stat_wal_receiver` / `pg_stat_activity` / `pg_stat_bgwriter` / `pg_stat_database`) are not discussed. Cross-engine generalisation is the reader's exercise.
- Heartbeat mechanism invoked but not mechanised. "Replication lag can be measured in different methods, and the most common one is by deliberate injection of heartbeat events on the primary, and by capturing them on a replica. More on this in another post." Canonical forward-reference to Part 2.
- Oversampling-cost deferred. "That, too, comes at a cost, which we will discuss in a later post." Forward-reference to Part 2's hibernation / cost-aware sampling discussion.
- No production numbers. No QPS/latency/rejection-rate/adoption figures; the post canonicalises design principles. Production telemetry for Vitess's throttler is scattered across later Noach + Vitess-release posts.
- Tier-3 source, default-include voice. Shlomi Noach is named on companies/planetscale as a default-include byline (ex-GitHub war-story + Vitess-throttler author). This post is Tier-3-clearing by construction.
Cross-source continuity
- Opens the three-part throttler series. Parts 2 and 3 extend the framework along the deployment-topology axis (sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-2) and the client-side axis (sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-3).
- Pairs with the same-day Vitess 21 release notes which ship the multi-metric throttler primitive canonicalised here as design-space geometry. The design-space post plus the shipping primitive bracket the throttler story on publication day 2026-04-21 (the Noach posts are 2024-08-29 publications re-surfaced by PlanetScale's 2026-04-21 RSS republishing campaign; the Vitess 21 release is contemporaneous with the re-surface).
- Substrate beneath patterns/multi-metric-throttling's Vitess-21 production canonicalisation. Part 1 articulates the design shape; Vitess 21's release notes canonicalise the shipping production artefact.
- Complements concepts/queueing-theory's throttler-metric subsection — that page names the queue-substrate of every metric Noach walks in Part 1; this source is the canonical origin of that framing on the wiki.
- Complements ** — that post canonicalises the two-tier-pool substrate that keeps headroom available upstream of the connection-pool-exhaustion signal this post canonicalises as a throttling metric.
- Complements ** — that post canonicalises replication-lag as a monitoring signal; this post canonicalises it as a throttling signal.
- Complements ** — that post canonicalises systems/planetscale-traffic-control as a Postgres-side throttler on per-workload-class resource budgets** (SQLCommenter-tag-based) rather than on cluster-health signals; both sit under concepts/database-throttler with different metric models. Part 1 canonicalises the signal-based side; Traffic Control canonicalises the tag-budget side.
- No existing-claim contradictions — strictly additive.
Source
- Original: https://planetscale.com/blog/anatomy-of-a-throttler-part-1
- Raw markdown: raw/planetscale/2026-04-21-anatomy-of-a-throttler-part-1-48d5011e.md
Related
- companies/planetscale — publisher; default-include Noach byline.
- systems/vitess — implicit reference architecture.
- systems/vitess-throttler — canonical production instance implementing the design space this post canonicalises.
- systems/mysql — engine of every worked metric.
- concepts/database-throttler — parent primitive defined here.
- concepts/queueing-theory — the substrate framing every throttler metric sits on.
- concepts/symptom-vs-cause-metric — the load-bearing reframe for throttler-metric selection.
- concepts/replication-lag / concepts/threads-running-mysql / concepts/transaction-commit-delay / concepts/load-average / concepts/connection-pool-exhaustion — the five candidate metrics walked in the post.
- concepts/metric-sampling-interval / concepts/oversampling-metric-interval — the sampling-interval trade-offs walked in the post.
- concepts/throttler-threshold-as-standard — the emergent steady state of long-running throttled workloads.
- concepts/queue-length-vs-wait-time — the deeper metric-design axis.
- patterns/collaborative-throttler-check-api — the client contract.
- patterns/multi-metric-throttling — the recommended architectural shape.
- patterns/heartbeat-based-replication-lag-measurement — the canonical measurement mechanism.
- patterns/throttler-observability-substitute — the architectural consequence.
- sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-2 — series continuation on deployment topology.
- sources/2026-04-21-planetscale-anatomy-of-a-throttler-part-3 — series closure on the client-side axis.