CONCEPT Cited by 3 sources

request_concurrency as autoscaling signal

Definition

request_concurrency is the count of in-flight requests being served by a pod at any given instant. As an autoscaling signal it sits in the family of reactive metrics — the autoscaler reads recent observed concurrency, compares against a per-pod target, and adjusts replica count to converge.

It is the canonical signal for GPU inference serving because:

  • GPU pods generally serve one or a few requests per forward-pass cycle; concurrency tracks GPU duty cycle directly.
  • Latency is the primary SLO; concurrency is the leading indicator that latency is about to rise (the queue is filling).
  • Unlike CPU%, concurrency does not lie when the workload is GPU-saturated — CPU% can be artificially low even at full load.

Canonical wiki disclosure

The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first canonical disclosure of request_concurrency-driven autoscaling for high-QPS LLM inference:

"To keep the platform cost-optimal for variable traffic patterns, the system autoscales dynamically with customer demand. The autoscaler tracks request_concurrency averaged across pods, with per-pod concurrency targets derived from benchmarking maximum sustainable RPS per replica."

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

Two design decisions named in the same disclosure:

  1. Average across pods, not per-pod max. This dampens the autoscaler against a single hot pod and aligns scale decisions with the fleet's average operating point.
  2. Per-pod target derived from benchmarking maximum sustainable RPS per replica. The number is empirical, not analytic: it is set by load-testing each pod shape until p99 latency starts to climb, then backing off slightly. It is re-derived whenever the pod definition changes (model version, GPU class, memory budget).

Why concurrency, not RPS

RPS (requests per second) is the more natural-feeling load metric but is structurally less suitable as an autoscaler signal for GPU serving:

| Signal | Property | Tradeoff |
|---|---|---|
| CPU % | Cheap to measure | Lies under GPU-bound load: looks idle even when the system is at the latency cliff |
| RPS | Direct load measure | Doesn't account for variable request shape; a pod handling 100 long requests is at higher load than one handling 100 short ones |
| request_concurrency | Tracks the actual in-flight work | Naturally weights heavy requests heavier; close to a leading indicator for latency |
| p99 latency | Direct SLO measure | Lagging signal: already breached by the time it crosses threshold |

Concurrency is the right altitude: it represents real load more faithfully than CPU% or RPS, and it moves earlier than p99 latency.
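The RPS row can be made concrete with a small calculation. The sketch below uses Little's Law with made-up numbers (the request rates and latencies are illustrative assumptions, not figures from the Databricks post) to show two pods that report the same RPS while holding very different amounts of in-flight work.

```python
def in_flight(rps: float, mean_latency_ms: float) -> float:
    """Little's Law: average in-flight requests = arrival rate x mean time in system."""
    return rps * mean_latency_ms / 1000


# Hypothetical pods: identical request rate, very different request duration.
short_requests = in_flight(rps=100, mean_latency_ms=50)    # 50 ms requests
long_requests = in_flight(rps=100, mean_latency_ms=2000)   # 2 s requests

print(f"short: {short_requests} in flight, long: {long_requests} in flight")
```

An RPS-based autoscaler would treat both pods as equally loaded; a concurrency-based one sees the second pod carrying far more simultaneous work.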

What "averaged across pods" means in practice

Each pod reports its current in-flight count. The autoscaler computes mean(concurrency_pod_i) over a short window (seconds). It compares to the target — call it T — and:

  • If mean > T throughout the sustained-up window → scale up
  • If mean < T throughout the sustained-down window → scale down
  • Otherwise hold

The asymmetric windows (aggressive-up + conservative-down) are themselves a separate design dimension; see patterns/asymmetric-aggressive-up-conservative-down-autoscaling for the canonical Superhuman shape.
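The decision rule above can be sketched as a small state machine. This is a minimal illustration, not the Databricks or Superhuman implementation: the class name `ConcurrencyScaler`, the use of consecutive sample counts in place of time windows, and the choice to reset history after each decision are all assumptions made for the example. How many replicas to add or remove is not specified by the source and is left out.

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class ConcurrencyScaler:
    """Hypothetical threshold autoscaler driven by fleet-mean concurrency."""
    target: float       # per-pod concurrency target T
    up_samples: int     # consecutive samples above T required to scale up
    down_samples: int   # consecutive samples below T required to scale down
    _history: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Keep only as many samples as the longer of the two windows needs.
        self._history = deque(maxlen=max(self.up_samples, self.down_samples))

    def _sustained(self, n: int, above: bool) -> bool:
        recent = list(self._history)[-n:]
        if len(recent) < n:
            return False
        if above:
            return all(m > self.target for m in recent)
        return all(m < self.target for m in recent)

    def observe(self, per_pod_in_flight: list[int]) -> str:
        """Record one sample of per-pod in-flight counts and return a decision."""
        self._history.append(fmean(per_pod_in_flight))
        if self._sustained(self.up_samples, above=True):
            self._history.clear()   # assumption: start a fresh window after acting
            return "scale_up"
        if self._sustained(self.down_samples, above=False):
            self._history.clear()
            return "scale_down"
        return "hold"


# Short up window, long down window: aggressive-up, conservative-down.
scaler = ConcurrencyScaler(target=8, up_samples=2, down_samples=5)
for sample in ([9, 11], [10, 12], [2, 3]):
    print(sample, scaler.observe(sample))
```

Setting `up_samples` smaller than `down_samples` is what makes the windows asymmetric in this sketch; a sample whose mean equals the target exactly counts toward neither window.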

Per-pod target derivation

The Databricks post is explicit that the per-pod concurrency target is "derived from benchmarking maximum sustainable RPS per replica". The mapping is straightforward:

target_concurrency = sustainable_RPS × p99_latency

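(Little's Law in the steady state: average in-flight requests equals arrival rate times average residence time. Strictly, the law uses mean latency; p99 latency is substituted here as an upper bound on residence time before the SLO breaks, so the result is a ceiling on in-flight work per pod rather than its average.)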

In practice the team measures sustainable_RPS by ramping a pod's load until the latency knee, picks a target slightly below the knee, and converts to concurrency. The number is re-measured when the model, GPU class, batch policy, or kernel set changes.
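As a worked instance of the conversion, the sketch below applies the formula to a benchmarked knee. The function name, the `headroom` factor standing in for "slightly below the knee", and the sample numbers are assumptions for illustration; the source does not state how far below the knee the target is set.

```python
import math


def target_concurrency(knee_rps: float, p99_latency_s: float, headroom: float = 0.9) -> int:
    """Convert a benchmarked latency-knee RPS into a per-pod concurrency target.

    sustainable_RPS is taken as knee_rps * headroom (assumed back-off), then
    target_concurrency = sustainable_RPS * p99_latency, rounded down and
    floored at 1 so a pod is always allowed at least one in-flight request.
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    sustainable_rps = knee_rps * headroom
    return max(1, math.floor(sustainable_rps * p99_latency_s))


# Hypothetical benchmark: latency starts to climb at 40 RPS, p99 there is 250 ms.
print(target_concurrency(knee_rps=40, p99_latency_s=0.25, headroom=0.75))
```

Rounding down keeps the target on the safe side of the benchmarked figure; the function would need to be re-run with fresh measurements whenever the model, GPU class, batch policy, or kernel set changes.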

Risks and mitigations

  • Concurrency lags arrival rate during a sharp ramp. A burst can push concurrency past the target before the autoscaler can act. The Superhuman fix: aggressive scale-up (small sustained-up window, fast provisioning).
  • Concurrency drops under partial outage (errors complete fast), causing spurious scale-down. The Superhuman fix: conservative scale-down + treat error rate as a separate signal.
  • A single hot pod inflates the average. The mean-not-max choice partially mitigates; a stronger fix is to use a robust statistic (median, trimmed mean) for the autoscaler input.
  • Concurrency = 0 across all pods during a quiet period creates flapping pressure; scale-down hold-time plus minimum-replica floor is the operational fix.
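The hot-pod risk above can be illustrated by comparing the mean against the two robust statistics mentioned. The pod counts and the 10% trim fraction are made-up values, and `trimmed_mean` is a helper written for this example rather than part of any autoscaler API.

```python
from statistics import fmean, median


def trimmed_mean(values: list[float], trim_fraction: float = 0.1) -> float:
    """Mean after dropping the lowest and highest trim_fraction of samples."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)   # samples dropped from each end
    return fmean(ordered[k:len(ordered) - k])


# Hypothetical fleet of ten pods; the last one is stuck with 45 in-flight requests.
pods = [4, 5, 5, 6, 5, 4, 6, 5, 5, 45]

print("mean:", fmean(pods))
print("median:", median(pods))
print("trimmed mean:", trimmed_mean(pods))
```

With a per-pod target of 8, the plain mean would cross the threshold because of one pod, while the median and trimmed mean stay close to what the other nine pods are actually doing.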

Seen in
