
PATTERN

Throttler as observability substitute

Problem

Introducing a throttler into a production system changes the behaviour you were trying to observe. The change has the same shape as the multithreaded-debugging gotcha, where adding printf statements perturbs the race condition they were meant to diagnose:

"Many software developers will be familiar with the next scenario: you have a multithreaded or otherwise highly concurrent app. There's a bug, likely a race condition or a synchronization issue, and you wish to find it. You choose to print informative debug messages to standard output, and hope to find the bug by examining the log. Alas, when you do so, the bug does not reproduce. Or it may manifest elsewhere."

— Shlomi Noach, Anatomy of a Throttler, part 1

The specific shift for a throttled database:

  1. Contention drops on the database. Apps that were running "fine" were in fact piggy-backing on head-of-line queue delay — they looked OK only because no one went first.
  2. New contention appears. Throttler rejection becomes a new hot-path decision point; apps that were never rate-limited now experience retry loops.
  3. Observability on the database drops to near zero for throttled workloads. The app isn't doing anything — it's just asking the throttler. "The database now tells you little to nothing."

Solution

Shift the observability surface from the database to the throttler:

  1. Log every throttler check. Which client, which metric, which value, which threshold, accepted or rejected.
  2. Emit structured metrics from the throttler itself: rejection rate per client, per metric, per threshold breach.
  3. Provide the rejected-reason field on every rejection so the client (and operators reading client logs) know which metric caused the rejection.
  4. Rate-limit the observability itself. High-frequency collaborative clients issue a check per subtask, which at migration scale means millions of checks; aggregation is required.
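The four steps above can be sketched as a check function that records every decision and aggregates counts per (client, metric) rather than emitting one log line per check. This is a minimal illustration, not the series' implementation; the names `Decision`, `check`, and the counter layout are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Decision:
    """One throttler check: who asked, which metric, and why it passed or failed."""
    client: str
    metric: str
    value: float
    threshold: float
    accepted: bool

# Aggregated counters instead of a raw log line per check: at a
# check-per-subtask rate, raw lines run into the millions per migration.
accepts: Counter = Counter()     # (client, metric) -> accepted checks
rejections: Counter = Counter()  # (client, metric) -> rejected checks

def check(client: str, metric: str, value: float, threshold: float) -> Decision:
    accepted = value <= threshold
    (accepts if accepted else rejections)[(client, metric)] += 1
    return Decision(client, metric, value, threshold, accepted)

d = check("migration-job", "replication_lag_seconds", 7.5, 5.0)
# rejected: value 7.5 exceeds threshold 5.0, and the rejection counter
# for ("migration-job", "replication_lag_seconds") is incremented
```

The returned `Decision` carries everything a client-side log needs, while the counters are what the throttler exports as its own metrics.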

Why this is a pattern, not just "good logging"

The throttler-as-observability-substitute insight is architectural, not cosmetic. It reframes what the observability system is:

  • Before throttling: database query logs + slow query logs + lock-wait metrics + deadlock graphs are the root-cause surface.
  • After throttling: those signals go quiet for throttled traffic. Throttler check logs become the primary root-cause surface for throttled work. Database signals only matter for the un-throttled (e.g. OLTP) traffic.

Throttler logs that look like operational noise ("X requests rejected, Y requests accepted") are actually the main observability artefact for the throttled workload class.

What the throttler logs must contain

Field                                     Why
Client identity (app / service)           Correlate with client-side retry / progress logs.
Timestamp                                 Time-series debugging.
Scope (shard / region / tenant)           Isolate degradations to a subset.
Decision (accept / reject)                Basic audit.
Reason (metric name + value + threshold)  Root cause of the rejection.
Check latency                             Detect slowness in the throttler itself.

Without the reason field, the logs degenerate to an accept/reject count that matches what the client already knows from its own retry loop.
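As one way to make the field list concrete, the table maps naturally onto a structured, JSON-serializable log record. The record and field names below are illustrative, not drawn from the series:

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class ThrottlerCheckRecord:
    """One structured log line per throttler check, covering the fields above."""
    client: str            # app / service identity, joinable with client retry logs
    timestamp: float       # time-series debugging
    scope: str             # shard / region / tenant
    accepted: bool         # basic audit
    metric: str            # which metric was evaluated
    value: float
    threshold: float
    reason: str            # empty when accepted; metric + value + threshold when rejected
    check_latency_ms: float  # detect slowness in the throttler itself

rec = ThrottlerCheckRecord(
    client="migration-job-x",
    timestamp=time.time(),
    scope="shard-03",
    accepted=False,
    metric="replication_lag_seconds",
    value=7.5,
    threshold=5.0,
    reason="replication_lag_seconds=7.5 > threshold 5.0",
    check_latency_ms=0.4,
)
print(json.dumps(asdict(rec)))
```

With `reason` populated, the record answers "why was this rejected" on its own; without it, the record collapses to the accept/reject count the client already has.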

Implication for on-call

An on-call engineer responding to "migration job X is running slowly" cannot interrogate the database — there is nothing there to interrogate. Instead:

  1. Query the throttler's rejection logs for job X.
  2. Identify which metric is triggering rejections.
  3. Go to the underlying cause for that metric (e.g. replication lag → which queue in the replication chain is backing up).

The throttler log replaces the database log as the on-call starting point.
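Steps 1 and 2 of that runbook amount to grouping the job's rejection records by metric. A minimal sketch, assuming hypothetical rejection records with `client` and `metric` fields:

```python
from collections import Counter

# Hypothetical rejection records for job X, pulled from throttler logs.
records = [
    {"client": "job-x", "metric": "replication_lag_seconds"},
    {"client": "job-x", "metric": "replication_lag_seconds"},
    {"client": "job-x", "metric": "threads_running"},
    {"client": "job-y", "metric": "threads_running"},  # other job, excluded
]

# Step 1: filter to job X's rejections; step 2: find the dominant metric.
by_metric = Counter(r["metric"] for r in records if r["client"] == "job-x")
top_metric, count = by_metric.most_common(1)[0]
# top_metric == "replication_lag_seconds", count == 2
```

Step 3 then leaves the throttler entirely: with the triggering metric identified, the on-call engineer investigates that metric's underlying system (here, the replication chain).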

Noach's concluding framing

"All of a sudden, there is less contention on the database, and certain apps that used to run just fine, exhibit contention/latency behavior. The appearance of a new job suddenly affects the progress of another. But where previously you could clearly analyze database queries to find the root cause, the database now tells you little to nothing. It's now down to the throttler to give you that information. But even the throttler is limited, because all the apps do is to check the throttler for health status. They do not yet actually do anything."

The last sentence hints at the forward pointer: a throttler that only emits rejection counts tells you what was throttled but not what the throttled work was trying to do. Subsequent posts in the series address this — capturing workload intent alongside throttler decisions so the observability story stays intact after throttling is installed.
