PLANETSCALE 2024-08-29

PlanetScale — Anatomy of a Throttler, part 1

Summary

Shlomi Noach (author of gh-ost and the Vitess throttler) opens a series on throttler design for database systems. Part 1 is a language-establishing post: what a throttler is, what metric(s) it should push back on, and why every candidate metric is subtler than it looks. The central claim is that every load-predicting metric is a symptom, not a cause — useful because it summarises a chain of underlying queues (disk, CPU, network, locks) rather than instrumenting any one of them. The post walks through replication lag, threads_running, transaction-commit delay, queue length, load average, and connection-pool usage as throttling signals, argues that a throttler should push back on a combination of metrics, and closes by framing two properties every throttler operator has to reason about: (1) introducing a throttler changes the application's behaviour and shifts the observability burden off the database onto the throttler; (2) metric sampling/heartbeat intervals must oversample the acceptable-threshold range (networking-hardware rule of thumb) to avoid stall/release oscillation.

Key takeaways

  1. A throttler pushes back on incoming flow so the system stays healthy. The target workload Noach focuses on is asynchronous, batch, massive operations — ETLs, online DDL, mass purges, resharding — that span minutes to days and cannot be allowed to single-handedly tank production. The job breaks itself into subtasks (e.g. 100 rows at a time from a 10M-row import) and asks the throttler for permission before each one.

  2. Collaborative vs barrier throttlers. Some throttlers assume clients respect a check-then-proceed contract (collaborative — patterns/collaborative-throttler-check-api); others act as barriers between app and database. Either shape rejects when unhealthy; the job backs off and retries. Subtask size is tuned between small enough not to single-handedly tank the database and large enough to make progress against the throttler's per-check overhead.
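The check-then-proceed contract can be sketched as a client-side loop. This is a minimal sketch, assuming hypothetical `throttler_ok` and `apply_batch` callables standing in for the real check API and the batch write; the names are not from the post:

```python
import time

def run_massive_job(rows, throttler_ok, apply_batch, batch_size=100,
                    backoff_s=1.0):
    """Collaborative-shape client: ask the throttler before every
    subtask; on rejection, back off and re-check until granted."""
    for start in range(0, len(rows), batch_size):
        while not throttler_ok():     # check-then-proceed contract
            time.sleep(backoff_s)     # back off, then retry the check
        apply_batch(rows[start:start + batch_size])
```

The batch size and back-off period are the tuning knobs the post describes: small enough not to single-handedly tank the database, large enough to make progress against the per-check overhead.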

  3. Not all queries are created equal — so throttle on health, not on rate. General-purpose rate-limiters assume fixed worker cost per item. A database's cost-per-query depends on query scope, hot-spot distribution, page-cache state, data overlap — none of which the throttler can know in advance. The throttler must therefore push back on health signals, not on request rate.

  4. Replication lag is the most-used MySQL throttling signal for a reason. Easy to measure, directly impacts the business (concepts/replication-lag governs failover promotion time, read-your-writes feasibility, and read-replica usefulness). Tools in the MySQL ecosystem (pt-online-schema-change, gh-ost, Vitess) default to replication-lag throttling.

  5. threads_running is a useful symptom but has no stable threshold. (concepts/threads-running-mysql) Acceptable values shift by time-of-day, by product evolution, by query mix. Pick 50, pick 100 — neither number holds water across environments. "An experienced administrator may only need to take one look at this metric … to say 'we're having an issue'", but a throttler cannot encode that intuition as a static threshold.

  6. Every load-predicting metric is a symptom of underlying queues. (concepts/symptom-vs-cause-metric) Replication lag = changelog-event queue delay (network queue + local disk write queue + wait + event processing time). A spike in concurrent queries = commit-queue backup (disk flush) or lock-wait. The metric is useful because it summarises a chain of queues, not because it pinpoints a single bottleneck.

  7. Queue delay is more robust than queue length, when measurable. (concepts/queue-length-vs-wait-time) "A long queue at the airport isn't in itself a bad thing — some queues move quite fast." Wait-time is the user-perceived metric; queue length is a predictor that's cheaper to capture when wait-time measurement is hard.

  8. Load average inherits the threshold-unreliability problem. (concepts/load-average) The classical 1 per CPU rough indicator is system-dependent — "some database deployments famously push their servers to their limits with load averages soaring far above 1 per CPU." Same symptom-vs-cause caveat as threads_running.

  9. Connection-pool exhaustion is the one signal with a natural threshold. (concepts/connection-pool-exhaustion) Pool size was chosen based on database configuration (max connections / buffer-pool sizing / memory per backend), so "pool is 100% used" isn't an arbitrary line — the throttler inherits the existing system configuration as its threshold without introducing a new artificial number.
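As a minimal illustration of why this threshold is "natural" (hypothetical helper, not from the post): the limit is a value the operator already chose, so the throttler invents no new number.

```python
def pool_exhausted(in_use, pool_size):
    # "Pool is 100% used" is not an arbitrary line: pool_size was
    # already derived from database configuration (max connections,
    # buffer-pool sizing, memory per backend), so the throttler
    # inherits an existing threshold rather than introducing one.
    return in_use >= pool_size
```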

  10. A throttler should combine metrics, not pick one. (patterns/multi-metric-throttling) "A throttler should be able to push back based on a combination of metrics, and not limit itself to just one metric." Operators set per-metric thresholds; advanced setups should allow adding new metrics programmatically or dynamically.
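A minimal sketch of the multi-metric principle, with hypothetical metric names and threshold values (the post states only the principle): reject if any configured metric breaches its threshold, and report which one, since the throttler becomes the observability source for rejected work.

```python
# Assumed example thresholds; operators set these per metric.
THRESHOLDS = {
    "replication_lag_s": 5.0,
    "threads_running": 100,
    "loadavg_per_cpu": 1.0,
}

def check(metrics):
    """Grant only if every configured metric is under its threshold;
    on rejection, name the breaching metric for observability."""
    for name, threshold in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > threshold:
            return False, name
    return True, None
```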

  11. Introducing a throttler changes the behaviour you're trying to observe. Analogy to the multithreaded-debugging gotcha where adding printf changes the race: a throttler introduces new contention and reduces production contention, which unmasks apps that were running fine because they were piggybacking on queue head-of-line delay. The root-cause surface shifts from database query traces to throttler check logs — the apps themselves don't do anything; they just ask the throttler, so the database has nothing to tell you about them.

  12. The throttler threshold becomes the steady-state metric value. (concepts/throttler-threshold-as-standard) For a workload massive enough to push against the threshold, the metric graph looks like "the metric goes up to the threshold value, then back down, and pushes again" for the duration of the operation. "It is not uncommon for a system to run one or two operations for very long periods, which means what we consider as the throttling threshold (say, a 5sec replication lag) becomes the actual standard." This is how a healthy throttled system looks.

  13. Metric sampling/heartbeat intervals must oversample the threshold range. (concepts/metric-sampling-interval, concepts/oversampling-metric-interval) Borrowed from networking-hardware rule of thumb: if acceptable replication lag is 5 s, sample every 1–2 s, heartbeat every 1–2 s. Long intervals miss uptick (system degrades before throttler engages), miss recovery (throttler blocks during the whole interval after the metric clears), and cause release-thundering-herd when many throttled jobs see the all-clear at the same moment and push the metric back above threshold in lockstep.
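The oversampling rule reduces to a one-liner. The factor of 4 below is an assumed value consistent with the post's "1–2 s for a 5 s threshold", not a number the post states:

```python
def choose_sample_interval(threshold_s, oversample_factor=4):
    """Networking-hardware rule of thumb per the post: sample several
    times within the acceptable-threshold window so upticks and
    recoveries are seen promptly. The factor is an assumption."""
    return threshold_s / oversample_factor
```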

Architectural specifics

Target workload

  • Async, batch, massive operations: ETLs, data imports, online DDL, mass data purges, resharding.
  • Duration: minutes → hours → days.
  • Example: 10M-row import broken into 100-row subtasks; each subtask is a throttler-check + small batch apply + repeat.
  • Discussion also "applies equally" to throttling OLTP production traffic — the architectural choices are the same, only the deployed thresholds differ.

Throttler shapes

| Shape         | Contract                                                                                                           |
|---------------|--------------------------------------------------------------------------------------------------------------------|
| Collaborative | Client checks before acting; respects the response. The Vitess throttler, gh-ost, and pt-online-schema-change are of this shape. |
| Barrier       | Sits in the request path; rejects directly.                                                                          |

Either way: if not OK, client backs off for some period and retries.

Subtask-size trade-off

  • Small enough not to single-handedly tank the database.
  • Large enough to make progress net of per-check throttler overhead.
  • No specific number quoted — depends on workload + database size + throttler check cost.
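A back-of-envelope way to reason about the trade-off, with illustrative timings that are assumptions rather than numbers from the post:

```python
def check_overhead_fraction(check_cost_s, batch_apply_s):
    """Fraction of wall time spent on throttler checks rather than on
    actual progress, assuming one check per subtask."""
    return check_cost_s / (check_cost_s + batch_apply_s)

# Illustrative only: a 1 ms check before a 10 ms batch is ~9% overhead;
# the same check before a 1 ms batch wastes half the wall time.
```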

Metric catalogue Noach walks through

| Metric                   | Symptom of                                             | Has natural threshold?                                                               |
|--------------------------|--------------------------------------------------------|--------------------------------------------------------------------------------------|
| Replication lag          | Changelog queue delay (net + disk + wait + processing) | Yes, per product — business-derived (failover RTO, read-your-writes tolerance)       |
| threads_running          | Commit-queue backup, lock-waits, page-cache misses     | No — time-of-day / product-evolution / query-mix dependent                           |
| Transaction commit delay | Disk flush latency                                     | Hardware-derived (SSD flush ≠ Aurora commit ≠ HDD); operator must know their values  |
| Queue length             | Predicts wait time where wait time is unmeasurable     | No — airport-queue analogy                                                           |
| Load average per CPU     | Run-queue + I/O-wait demand                            | Rough 1-per-CPU rule of thumb; system-dependent                                      |
| Pool exhaustion          | Concurrent operations waiting on a connection          | Yes — natural, set by existing system configuration                                  |

Replication-lag measurement: heartbeat injection

  • Deliberate heartbeat events injected on the primary (rate = one per second in the worked example).
  • Captured on replica by matching heartbeat timestamps to wall-clock time at read.
  • Granularity of lag measurement = heartbeat injection interval.
  • Worked example of the sampling-skew problem: heartbeat injected at 12:00:00.000, sampled at 12:00:00.995 (the heartbeat is already ~1 s old at sampling); a client check at 12:00:01.990 gets a response based on a sample that is itself ~1 s old — the throttler reads a ~1-s-old sample of a heartbeat that is by then ~2 s old.
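The worked example reduces to simple arithmetic: in the worst case a check response reflects a sample up to one sample interval old, of a heartbeat injected up to one heartbeat interval before that. A minimal sketch:

```python
def worst_case_reading_age(heartbeat_interval_s, sample_interval_s):
    """Upper bound on how stale the information behind a throttler
    check response can be: one sample interval (age of the sample)
    plus one heartbeat interval (age of the heartbeat it measured)."""
    return heartbeat_interval_s + sample_interval_s

# With the post's 1 s heartbeat and 1 s sampling, a check can be based
# on ~2 s-old information — acceptable against a 5 s lag threshold.
# With 5 s intervals against that same threshold, readings could be up
# to ~10 s stale, defeating the throttler.
```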

Steady-state throttled behaviour

For a workload large enough to push against the threshold:

time →
metric    __/\__/\__/\__/\__/\__/\__/\__/\__/\
threshold ────────────────────────────────────

"The operation will be granted access thousands of times or more, and will likewise also be rejected access thousands of times or more. That is how a healthy system looks with a throttler engaged."

Thresholding shifts the metric's meaning: "what we consider as the throttling threshold (say, a 5 sec replication lag) becomes the actual standard."

Observability shift

With a throttler in place:

  • Apps do nothing except ask the throttler.
  • Database has nothing to show about those jobs — no query traces, no lock profiles, no expensive-query logs.
  • Root-cause signal moves to the throttler: what was it rejecting, when, why (which metric's threshold was breached).
  • Canonical instance of patterns/throttler-observability-substitute — the throttler becomes the single source of truth about rejected work.

Multi-job dynamics

When multiple massive operations run concurrently under a throttler:

  • All are throttled while metric is above threshold.
  • All released simultaneously when metric drops — all push the metric up at once, causing re-throttle oscillation.
  • Avoidance: shorter sampling/heartbeat intervals + oversampling the threshold range.
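A toy discrete-time model (my construction, not from the post) illustrates the oscillation: with a longer sampling interval, the whole fleet of jobs is released and re-throttled in lockstep, producing longer excursions above the threshold and longer full stalls.

```python
def simulate(jobs=10, ticks=60, sample_interval=1, threshold=5, per_job=1):
    """Toy model: each running job adds per_job to the metric for one
    tick; the throttler sees the metric with sample_interval ticks of
    staleness and grants or rejects all jobs together (all-or-nothing
    release). Returns the metric trace."""
    history = [0]
    for _ in range(1, ticks):
        stale = history[max(0, len(history) - 1 - sample_interval)]
        running = jobs if stale <= threshold else 0
        history.append(per_job * running)
    return history

def longest_excursion(trace, threshold):
    """Longest run of consecutive ticks spent above the threshold."""
    best = cur = 0
    for value in trace:
        cur = cur + 1 if value > threshold else 0
        best = max(best, cur)
    return best
```

In this model, sampling every tick keeps excursions to two ticks, while sampling every five ticks stretches them to six: the stall/release swings grow with the staleness of the throttler's view.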

Operational numbers disclosed

  • Worked example: 10M rows imported in 100-row subtasks (= 100,000 subtasks, 100,000 throttler checks minimum).
  • Heartbeat interval worked example: 1 per second.
  • Sample interval worked example: 1 per second.
  • Acceptable replication lag worked example: 5 seconds.
  • Oversampling recommendation: sample interval at 1–2 seconds for a 5-second threshold.
  • "Thousands or more" grants and rejections per operation for a pushed-against-threshold workload.

Caveats

  • Part 1 of a series. Noach flags that singular vs. distributed throttler design + throttler impact on the environment are deferred to later posts. Mechanism-level details (how the Vitess throttler implements its metric-collection loop, how it communicates across shards, how individual clients discover the throttler endpoint) are not in this post.
  • MySQL-centric. Every named metric (replication lag via binlog / GTID, threads_running, InnoDB transaction-commit queue) is a MySQL primitive. Postgres-equivalents named at a concept level only (replication lag is a cross-engine notion).
  • No production numbers. No customer-fleet statistics on throttler-engagement rates, throttler-caused job slowdown measurements, or throttler-threshold-tuning war stories.
  • Collaborative vs barrier trade-off not compared. The post names both shapes but picks collaborative for the discussion. Barrier-shape trade-offs (enforcement strength, performance cost, operator complexity) are not walked.
  • Sampling math is qualitative. The oversampling rule of thumb is stated as borrowed-from-networking but not derived from Nyquist + actual workload-metric spectral analysis. No formal stability-analysis of the throttler control loop.
  • "Metric threshold becomes standard" is stated as observation not as problem. The post doesn't discuss what to do if you don't want your steady-state system running at the threshold ceiling for hours — e.g. should the threshold be a soft signal with a secondary hard-stop? Deferred.
  • Fit with Vitess. The post is authored by the Vitess throttler maintainer but framed as general throttler-design education, not as Vitess-specific internals. Related Vitess specifics — /throttler HTTP endpoint, /throttler/check API, lag / load-avg / custom metric types — are not walked here.
