PlanetScale — Anatomy of a Throttler, part 1¶
Summary¶
Shlomi Noach (author of gh-ost and the Vitess throttler) opens a
series on throttler design for database systems. Part 1 is a
language-establishing post: what a throttler is, what metric(s) it
should push back on, and why every candidate metric is subtler than
it looks. The central claim is that every load-predicting metric
is a symptom, not a cause — useful because it summarises a chain
of underlying queues (disk, CPU, network, locks) rather than
instrumenting any one of them. The post walks through replication
lag, threads_running, transaction-commit delay, queue length, load
average, and connection-pool usage as throttling signals, argues
that a throttler should push back on a combination of metrics,
and closes by framing two properties every throttler operator has to
reason about: (1) introducing a throttler changes the application's
behaviour and shifts the observability burden off the database onto
the throttler; (2) metric sampling/heartbeat intervals must
oversample the acceptable-threshold range (networking-hardware
rule of thumb) to avoid stall/release oscillation.
Key takeaways¶
- A throttler pushes back on incoming flow so the system stays healthy. The target workload Noach focuses on is asynchronous, batch, massive operations — ETLs, online DDL, mass purges, resharding — that span minutes to days and cannot be allowed to single-handedly tank production. The job breaks itself into subtasks (e.g. 100 rows at a time from a 10M-row import) and asks the throttler for permission before each one.
- Collaborative vs. barrier throttlers. Some throttlers assume clients respect a check-then-proceed contract (collaborative — patterns/collaborative-throttler-check-api); others act as barriers between app and database. Either shape rejects when unhealthy; the job backs off and retries. Subtask size is tuned between small enough not to single-handedly tank the database and large enough to make progress against the throttler's per-check overhead.
- Not all queries are created equal — so throttle on health, not on rate. General-purpose rate-limiters assume fixed worker cost per item. A database's cost-per-query depends on query scope, hot-spot distribution, page-cache state, data overlap — none of which the throttler can know in advance. The throttler must therefore push back on health signals, not on request rate.
- Replication lag is the most-used MySQL throttling signal for a reason. Easy to measure, directly impacts the business (concepts/replication-lag governs failover promotion time, read-your-writes feasibility, and read-replica usefulness). Tools in the MySQL ecosystem (pt-online-schema-change, gh-ost, Vitess) default to replication-lag throttling.
- threads_running is a useful symptom but has no stable threshold. (concepts/threads-running-mysql) Acceptable values shift by time of day, by product evolution, by query mix. Pick 50, pick 100 — neither number holds water across environments. "An experienced administrator may only need to take one look at this metric … to say 'we're having an issue'", but a throttler cannot encode that intuition as a static threshold.
- Every load-predicting metric is a symptom of underlying queues. (concepts/symptom-vs-cause-metric) Replication lag = changelog-event queue delay (network queue + local disk write queue + wait + event processing time). A spike in concurrent queries = commit-queue backup (disk flush) or lock-wait. The metric is useful because it summarises a chain of queues, not because it pinpoints a single bottleneck.
- Queue delay is more robust than queue length, when measurable. (concepts/queue-length-vs-wait-time) "A long queue at the airport isn't in itself a bad thing — some queues move quite fast." Wait-time is the user-perceived metric; queue length is a predictor that's cheaper to capture when wait-time measurement is hard.
- Load average inherits the threshold-unreliability problem. (concepts/load-average) The classical 1-per-CPU rough indicator is system-dependent — "some database deployments famously push their servers to their limits with load averages soaring far above 1 per CPU." Same symptom-vs-cause caveat as threads_running.
- Connection-pool exhaustion is the one signal with a natural threshold. (concepts/connection-pool-exhaustion) Pool size was chosen based on database configuration (max connections / buffer-pool sizing / memory per backend), so "pool is 100% used" isn't an arbitrary line — the throttler inherits the existing system configuration as its threshold without introducing a new artificial number.
- A throttler should combine metrics, not pick one. (patterns/multi-metric-throttling) "A throttler should be able to push back based on a combination of metrics, and not limit itself to just one metric." Operators set per-metric thresholds; advanced setups should allow adding new metrics programmatically or dynamically.
- Introducing a throttler changes the behaviour you're trying to observe. Analogy to the multithreaded-debugging gotcha where adding printf changes the race: a throttler introduces new contention and reduces production contention, which unmasks apps that were running fine because they were piggybacking on queue head-of-line delay. The root-cause surface shifts from database query traces to throttler check logs — the apps themselves don't do anything; they just ask the throttler, so the database has nothing to tell you about them.
- The throttler threshold becomes the steady-state metric value. (concepts/throttler-threshold-as-standard) For a workload massive enough to push against the threshold, the metric graph looks like "the metric goes up to the threshold value, then back down, and pushes again" for the duration of the operation. "It is not uncommon for a system to run one or two operations for very long periods, which means what we consider as the throttling threshold (say, a 5 sec replication lag) becomes the actual standard." This is how a healthy throttled system looks.
- Metric sampling/heartbeat intervals must oversample the threshold range. (concepts/metric-sampling-interval, concepts/oversampling-metric-interval) Borrowed from the networking-hardware rule of thumb: if acceptable replication lag is 5 s, sample every 1–2 s, heartbeat every 1–2 s. Long intervals miss the uptick (the system degrades before the throttler engages), miss recovery (the throttler keeps blocking for the whole interval after the metric clears), and cause a release thundering-herd when many throttled jobs see the all-clear at the same moment and push the metric back above threshold in lockstep.
Architectural specifics¶
Target workload¶
- Async, batch, massive operations: ETLs, data imports, online DDL, mass data purges, resharding.
- Duration: minutes → hours → days.
- Example: 10M-row import broken into 100-row subtasks; each subtask is a throttler-check + small batch apply + repeat.
- Discussion also "applies equally" to throttling OLTP production traffic — the architectural choices are the same, only the deployed thresholds differ.
Throttler shapes¶
| Shape | Contract |
|---|---|
| Collaborative | Client checks before acting; respects the response. The Vitess throttler, gh-ost, and pt-online-schema-change are of this shape. |
| Barrier | Sits in the request path; rejects directly. |
Either way: if not OK, client backs off for some period and retries.
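The check-then-proceed contract can be sketched as a tiny client loop. This is a minimal illustration, not the Vitess or gh-ost API: `throttler_ok`, `apply_batch`, and `backoff_s` are hypothetical names standing in for whatever check call and backoff policy a real job uses.

```python
import time

def run_batch_job(throttler_ok, apply_batch, batches, backoff_s=0.5):
    """Collaborative-throttler client loop (sketch): ask permission
    before every subtask; on rejection, back off and retry the same
    subtask rather than skipping it."""
    for batch in batches:
        while not throttler_ok():   # check-then-proceed contract
            time.sleep(backoff_s)   # back off for some period, re-check
        apply_batch(batch)          # e.g. apply one 100-row chunk
```

The job never decides for itself whether the database is healthy; it only ever sees accept/reject from the throttler.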
Subtask-size trade-off¶
- Small enough not to single-handedly tank the database.
- Large enough to make progress net of per-check throttler overhead.
- No specific number quoted — depends on workload + database size + throttler check cost.
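The trade-off can be made concrete with a toy throughput model. All numbers below are invented for illustration (the post quotes none); the point is only the shape: a fixed per-check cost makes tiny batches check-bound, while large batches buy diminishing throughput at growing spike risk.

```python
def effective_rows_per_second(batch_rows, row_apply_s, check_overhead_s):
    """Toy model: each subtask pays a fixed throttler-check overhead
    plus a per-row apply cost (both hypothetical numbers)."""
    return batch_rows / (batch_rows * row_apply_s + check_overhead_s)

# Assuming a 5 ms check and 0.1 ms/row apply cost:
#   10-row batches   -> ~1,667 rows/s  (check overhead dominates)
#   100-row batches  -> ~6,667 rows/s
#   1000-row batches -> ~9,524 rows/s  (diminishing returns, bigger spikes)
```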
Metric catalogue Noach walks through¶
| Metric | Symptom of | Has natural threshold? |
|---|---|---|
| Replication lag | Changelog queue delay (net + disk + wait + processing) | Business-derived (failover RTO, read-your-writes tolerance) — yes, per product |
| threads_running | Commit-queue backup, lock-waits, page-cache misses | No — time-of-day / product-evolution / query-mix dependent |
| Transaction commit delay | Disk flush latency | Hardware-derived (SSD flush ≠ Aurora commit ≠ HDD); operator must know their values |
| Queue length | Predicts wait time where wait time is unmeasurable | No — airport-queue analogy |
| Load average per CPU | Run-queue + I/O-wait demand | Rough 1 per CPU rule-of-thumb; system-dependent |
| Pool exhaustion | Concurrent operations waiting on connection | Yes — natural, set by existing system configuration |
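The multi-metric shape the post recommends reduces to "reject if any configured metric breaches its threshold, and say which one". A minimal sketch — the threshold values here are placeholders, not recommendations:

```python
# Hypothetical per-metric thresholds; the post prescribes the shape
# (push back on a combination of metrics), not these numbers.
THRESHOLDS = {
    "replication_lag_s": 5.0,   # business-derived
    "threads_running": 100,     # environment-dependent, no stable value
    "pool_used_fraction": 1.0,  # natural threshold: pool 100% used
}

def check(samples):
    """Reject if ANY metric breaches its threshold. Returning the list
    of breached metrics matters: once apps are passive clients, the
    throttler's own logs are the root-cause surface."""
    breached = [name for name, limit in THRESHOLDS.items()
                if samples.get(name, 0) >= limit]
    return (not breached, breached)
```

Adding a metric is then a config change (a new key), which is the "add new metrics programmatically or dynamically" property in note form.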
Replication-lag measurement: heartbeat injection¶
- Deliberate heartbeat events injected on the primary (rate = one per second in the worked example).
- Captured on replica by matching heartbeat timestamps to wall-clock time at read.
- Granularity of lag measurement = heartbeat injection interval.
- Worked example of the sampling-skew problem: a heartbeat is injected at 12:00:00.000; a sample at 12:00:00.995 captures it; a client check at 12:00:01.990 is answered from that sample, which is by then ~1 s old and itself reflects a heartbeat that was already ~1 s old when sampled — the throttler acts on a picture that is ~2 s stale in total.
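The measurement itself is just clock arithmetic on the replica. A sketch with hypothetical names (the post does not specify a schema or function):

```python
def replication_lag_s(latest_heartbeat_ts, now):
    """Heartbeat-based lag: the primary writes a wall-clock timestamp
    once per injection interval; the replica subtracts the latest
    replicated heartbeat timestamp from its own clock. Granularity is
    bounded by the injection interval: even a fully caught-up replica
    can show up to one interval of apparent lag."""
    return now - latest_heartbeat_ts

# Worked-example staleness stacking: a check at 12:00:01.990 served
# from a 12:00:00.995 sample of a 12:00:00.000 heartbeat is acting on
# a picture ~2 s old in total.
```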
Steady-state throttled behaviour¶
For a workload large enough to push against the threshold:
"The operation will be granted access thousands of times or more, and will likewise also be rejected access thousands of times or more. That is how a healthy system looks with a throttler engaged."
Thresholding shifts the metric's meaning: "what we consider as the throttling threshold (say, a 5 sec replication lag) becomes the actual standard."
Observability shift¶
With a throttler in place:
- Apps do nothing except ask the throttler.
- Database has nothing to show about those jobs — no query traces, no lock profiles, no expensive-query logs.
- Root-cause signal moves to the throttler: what was it rejecting, when, why (which metric's threshold was breached).
- Canonical instance of patterns/throttler-observability-substitute — the throttler becomes the single source of truth about rejected work.
Multi-job dynamics¶
When multiple massive operations run concurrently under a throttler:
- All are throttled while metric is above threshold.
- All released simultaneously when metric drops — all push the metric up at once, causing re-throttle oscillation.
- Avoidance: shorter sampling/heartbeat intervals + oversampling the threshold range.
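The oscillation mechanism can be shown with a toy discrete-time model (all numbers invented, not from the post): with a coarse sampling interval, every job sees the same stale "all clear" and runs against a system whose true lag is already back above threshold.

```python
THRESHOLD = 5.0     # acceptable lag, seconds (hypothetical)
DECAY = 1.0         # lag drained per tick while jobs are held
PUSH_PER_JOB = 0.5  # lag each running job adds per tick
N_JOBS = 20

def ticks_run_while_unhealthy(sample_interval, total_ticks=60):
    """Count ticks where every job runs (released in lockstep off the
    cached sample) while the true metric is at/above threshold."""
    lag, sampled = 8.0, 8.0
    bad_ticks = 0
    for tick in range(total_ticks):
        if tick % sample_interval == 0:
            sampled = lag                  # refresh the cached metric
        running = sampled < THRESHOLD      # all jobs release together
        if running and lag >= THRESHOLD:
            bad_ticks += 1                 # acting on stale data
        lag = max(0.0, lag - DECAY) + (N_JOBS * PUSH_PER_JOB if running else 0.0)
    return bad_ticks
```

With per-tick sampling the cached metric never diverges from the true one, so no tick is spent running against an unhealthy system; with a 10-tick sampling interval the stale all-clear lets the herd push lag well past threshold before the next refresh. Hence the rule of thumb: oversample the threshold range (1–2 s samples for a 5 s threshold) rather than sampling at the threshold's own timescale.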
Operational numbers disclosed¶
- Worked example: 10M rows imported in 100-row subtasks (= 100,000 subtasks, 100,000 throttler checks minimum).
- Heartbeat interval worked example: 1 per second.
- Sample interval worked example: 1 per second.
- Acceptable replication lag worked example: 5 seconds.
- Oversampling recommendation: sample interval at 1–2 seconds for a 5-second threshold.
- "Thousands or more" grants and rejections per operation for a pushed-against-threshold workload.
Caveats¶
- Part 1 of a series. Noach flags that singular vs. distributed throttler design + throttler impact on the environment are deferred to later posts. Mechanism-level details (how the Vitess throttler implements its metric-collection loop, how it communicates across shards, how individual clients discover the throttler endpoint) are not in this post.
- MySQL-centric. Every named metric (replication lag via binlog / GTID, threads_running, the InnoDB transaction-commit queue) is a MySQL primitive. Postgres equivalents are named at a concept level only (replication lag is a cross-engine notion).
- No production numbers. No customer-fleet statistics on throttler-engagement rates, throttler-caused job slowdown measurements, or throttler-threshold-tuning war stories.
- Collaborative vs barrier trade-off not compared. The post names both shapes but picks collaborative for the discussion. Barrier-shape trade-offs (enforcement strength, performance cost, operator complexity) are not walked.
- Sampling math is qualitative. The oversampling rule of thumb is stated as borrowed-from-networking but not derived from Nyquist + actual workload-metric spectral analysis. No formal stability-analysis of the throttler control loop.
- "Metric threshold becomes standard" is stated as observation not as problem. The post doesn't discuss what to do if you don't want your steady-state system running at the threshold ceiling for hours — e.g. should the threshold be a soft signal with a secondary hard-stop? Deferred.
- Fit with Vitess. The post is authored by the Vitess throttler maintainer but framed as general throttler-design education, not as Vitess-specific internals. Related Vitess specifics — the /throttler HTTP endpoint, the /throttler/check API, the lag / load-avg / custom metric types — are not walked here.
Source¶
- Original: https://planetscale.com/blog/anatomy-of-a-throttler-part-1
- Raw markdown: raw/planetscale/2026-04-21-anatomy-of-a-throttler-part-1-48d5011e.md
Related¶
- concepts/database-throttler — canonical definition of the throttler primitive this post introduces.
- concepts/replication-lag — Noach's preferred MySQL-world throttling metric.
- concepts/symptom-vs-cause-metric — the general principle underpinning his case that replication lag + threads_running + load average are all useful as symptoms of underlying queues.
- concepts/queueing-theory — every metric Noach discusses is a summary of a queue somewhere in the stack (changelog, commit, run-queue, connection pool).
- patterns/multi-metric-throttling — the recommended architectural shape.
- patterns/collaborative-throttler-check-api — the check-before-you-act contract used by Vitess / gh-ost.
- patterns/heartbeat-based-replication-lag-measurement — the injection-and-capture mechanism underlying lag-based throttling.
- patterns/throttler-observability-substitute — the consequence that the throttler becomes the root-cause surface once apps are passive clients.
- systems/vitess-throttler — implicit subject of Noach's authorship; the Vitess implementation of the shape described here.
- companies/planetscale — Tier-3 blog, engineering voice alongside Ben Dicken / Matt Lord / Vicent Martí / Simeon Griggs / Harshit Gangal.