
Latency rises before throughput ceiling

In a load-tested system under increasing concurrency, client-observed p99 latency rises before aggregate throughput outright plateaus. The correct saturation signal is "the QPS-per-thread derivative went negative and p99 is climbing faster than p50" — not "QPS flattened". By the time QPS plateaus, p99 has already been degrading for some time.

Canonical observation

Jonah Berquist, PlanetScale, 2022-09-01 (Source: sources/2026-04-21-planetscale-one-million-queries-per-second-with-mysql), on a 16-shard Vitess-on-MySQL cluster under sysbench-tpcc:

"we begin to see diminishing returns as we saturate the resources of each shard. This is noticeable above when the QPS increase was greater between 1024 threads and 2048 threads than it was between 2048 threads and 4096 threads. Similarly, in metrics from vtgate shown below, we see an increase in latency as we max out our throughput. This is particularly evident in our p99 latency."

Two independent signals fire early:

  1. QPS-per-thread derivative: (QPS(2048) - QPS(1024)) > (QPS(4096) - QPS(2048)). Each doubling of threads adds less QPS than the previous step did. This is a load-generator-side signal and requires no server-side instrumentation.
  2. p99 / p50 divergence: VTGate-measured p99 latency "spiking toward the end" while p50 rises more slowly. This is a server-side signal, visible in query-proxy metrics.

Both fire before QPS absolutely plateaus.
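
A minimal sketch of both checks over a thread ramp, in Python; every figure in it (thread counts, QPS, latencies) is a hypothetical placeholder rather than a measured PlanetScale number:

```python
# Two early saturation signals computed from a load-test ramp.
# All numbers are hypothetical placeholders, not the PlanetScale results.

ramp = [
    # (threads, qps, p50_ms, p99_ms) as seen by the load generator / proxy
    (1024,   610_000, 1.6,  6.0),
    (2048,   890_000, 2.1, 11.0),
    (4096, 1_010_000, 3.0, 28.0),
]

for (t0, q0, p50a, p99a), (t1, q1, p50b, p99b) in zip(ramp, ramp[1:]):
    qps_gain = q1 - q0                            # signal 1: gain per step should not shrink
    ratio_a, ratio_b = p99a / p50a, p99b / p50b   # signal 2: tail/median divergence
    print(f"{t0}->{t1} threads: +{qps_gain:,} QPS, p99/p50 {ratio_a:.1f} -> {ratio_b:.1f}")
```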

Why p99 leads p50

Queueing theory: under bounded-capacity service (each shard has finite CPU + IOPS), increasing concurrency fills per-shard queues. Queueing delay is heavy-tailed: the wait of the longest-waiting request in a queue of N items grows faster than the median wait as N rises. As the system approaches saturation, a small fraction of requests experience disproportionately large delays (queue pileup behind the slowest request), pulling p99 up before the median moves appreciably.
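
A toy illustration of this, assuming an M/M/1 model (Poisson arrivals, exponential service, a single server standing in for a shard); the closed-form waiting-time quantiles show the p99 wait climbing while the median wait is still near zero, and both blowing up as utilisation approaches 1. The service rate is an arbitrary placeholder:

```python
import math

# Toy M/M/1 model: one server stands in for a shard with service rate mu
# (requests/sec) and offered load lam = rho * mu. The queueing delay W has
#   P(W = 0) = 1 - rho,   P(W > t) = rho * exp(-(mu - lam) * t),
# so the p-th quantile is ln(rho / (1 - p)) / (mu - lam) once 1 - p < rho.

def wait_quantile_ms(p: float, mu: float, lam: float) -> float:
    rho = lam / mu
    if 1.0 - p >= rho:       # at this load, the p-th request does not queue at all
        return 0.0
    return math.log(rho / (1.0 - p)) / (mu - lam) * 1000.0

mu = 1000.0                  # hypothetical per-shard capacity, requests/sec
for rho in (0.3, 0.5, 0.7, 0.9, 0.95, 0.99):
    lam = rho * mu
    p50 = wait_quantile_ms(0.50, mu, lam)
    p99 = wait_quantile_ms(0.99, mu, lam)
    print(f"rho={rho:.2f}  p50 wait={p50:6.2f} ms  p99 wait={p99:7.2f} ms")
```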

In database substrates specifically, p99 is sensitive to:

  • Lock contention — rare-but-costly wait paths on hot rows.
  • Buffer pool / cache misses — the tail of queries that hit cold data.
  • Connection pool exhaustion — queries queued at the pool boundary (concepts/connection-pool-exhaustion).
  • Garbage collection / page eviction stalls — infrequent but tail-amplifying.

All four worsen smoothly as utilisation rises, meaning p99 is a near-continuous function of load — not a step function that fires only at outright saturation.

Operational implication: add capacity before plateau

The cheap response to saturation is to add shards, replicas, or CPU when the signal fires, not when QPS flattens. Horizontal sharding substrates like Vitess make this easy: moving from 16 shards to 32 shards is a resharding operation, and the 16-shard configuration's saturation signal is the cue to start the 32-shard reshard before the absolute ceiling is hit. See linear shard-count throughput scaling for the empirical evidence that the next point on the curve adds roughly the current capacity again.

Contrast the alternative of continuing to push threads against a saturated configuration: it degrades p99 further and eventually triggers client-side timeouts. In a shared-tenant deployment, that kind of sustained overload is a load-shedding signal rather than a tuning knob.

Diagnostic playbook

Given a load test or production traffic ramp:

  1. Plot QPS vs thread count (or concurrency). Compute the first difference — when the derivative drops meaningfully, the system is near saturation.
  2. Plot p50 vs p99 vs concurrency from the server-side proxy (VTGate for Vitess; pgbouncer / read-replica LB for Postgres). Track the p99/p50 ratio — divergence signals per-request queue pileup.
  3. Prefer VTGate-level metrics over client-side wall-clock latency for sharded deployments: VTGate sees the routing-and-pooling latency, which is what the shard actually delivers; client-side latency mixes in network + client processing and is noisier.
  4. The diagnostic "p99 climbing faster than p50 while QPS still grows" is the earliest cheap signal. Fire the response before the curve plateaus.
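
The playbook can be mechanised over a ramp of samples; a sketch with illustrative thresholds (a halving of the per-step QPS gain, a 25% step-over-step growth in the p99/p50 ratio) standing in for whatever cut-offs fit the workload:

```python
from typing import NamedTuple

class Sample(NamedTuple):
    threads: int
    qps: float
    p50_ms: float   # proxy-side (e.g. VTGate) median latency
    p99_ms: float   # proxy-side tail latency

def first_saturation_point(ramp: list[Sample],
                           gain_drop: float = 0.5,
                           ratio_growth: float = 1.25) -> int | None:
    """Return the thread count where both early signals fire, else None.

    Signal 1: the QPS gained on this step is less than gain_drop times the
    gain of the previous step (diminishing returns per added thread).
    Signal 2: the p99/p50 ratio grew by at least ratio_growth over the step
    (tail diverging from the median). Both thresholds are illustrative.
    """
    for prev, cur, nxt in zip(ramp, ramp[1:], ramp[2:]):
        prev_gain = cur.qps - prev.qps
        cur_gain = nxt.qps - cur.qps
        diminishing = prev_gain > 0 and cur_gain < gain_drop * prev_gain
        diverging = (nxt.p99_ms / nxt.p50_ms) >= ratio_growth * (cur.p99_ms / cur.p50_ms)
        if diminishing and diverging:
            return nxt.threads
    return None
```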

Relationship to other saturation signals

  • Little's Law corollary: L = λ × W — as W (latency) rises for fixed λ (arrival rate), the number in system L grows, driving further queue pileup. p99 rising is a direct measurement of W's tail (worked numbers follow this list).
  • USE / RED method: this is essentially the USE-method signal of utilisation rising past the 70–80% threshold to warn of impending saturation, operationalised on the latency axis instead of the CPU-utilisation axis. See patterns/utilization-saturation-errors-triage.
  • Tail-amplification: at scale, one slow shard degrades the p99 of scatter-gather queries disproportionately — see concepts/tail-latency-at-scale for the fan-out-amplification argument.
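
A worked instance of the Little's Law corollary, with illustrative numbers only: holding the arrival rate fixed while mean time in system rises multiplies the number of in-flight queries by the same factor.

```python
# Little's Law: L = lambda * W, with W the mean time in system.
# Illustrative numbers only.
lam = 1_000_000              # arrival rate, queries/sec
for w_ms in (2, 5, 10):      # mean time in system, milliseconds
    in_flight = lam * (w_ms / 1000)
    print(f"W = {w_ms} ms  ->  L = {in_flight:,.0f} queries in flight")
```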

Seen in

  • sources/2026-04-21-planetscale-one-million-queries-per-second-with-mysql — Jonah Berquist (PlanetScale, 2022-09-01) names the signal canonically with a 16-shard sysbench-tpcc run: QPS gain is larger on the 1024 → 2048 thread step than on the 2048 → 4096 step, and VTGate p99 "spikes toward the end" while the QPS curve is still rising. Canonical wiki illustration that the derivative of QPS-per-thread and the p99/p50 ratio both flag saturation before outright QPS plateau — and that the correct response is to add shards (per the linear-shard-count property) rather than continuing to push threads.