CONCEPT Cited by 3 sources

request_concurrency as autoscaling signal¶

Definition¶

request_concurrency is the count of in-flight requests being served by a pod at any given instant. As an autoscaling signal it sits in the family of reactive metrics — the autoscaler reads recent observed concurrency, compares against a per-pod target, and adjusts replica count to converge.

It is the canonical signal for GPU inference serving because:

GPU pods generally serve one or a few requests per forward-pass cycle; concurrency tracks GPU duty cycle directly.
Latency is the primary SLO; concurrency is the leading indicator that latency is about to rise (the queue is filling).
Unlike CPU%, concurrency does not lie when the workload is GPU-saturated — CPU% can be artificially low even at full load.
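
As a minimal sketch of the definition above, the in-flight count can be kept as a gauge that goes up when a request starts and down when it finishes. The names `InFlightGauge`, `handle_request`, and `run_inference` are hypothetical and do not come from any of the cited sources; a real serving stack would normally export this number from its request router or metrics library.

```python
import threading
from contextlib import contextmanager


class InFlightGauge:
    """Counts requests currently being served by this pod (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0

    @contextmanager
    def track(self):
        with self._lock:
            self._in_flight += 1
        try:
            yield
        finally:
            # Decrement on success and on error alike, so failures do not leak.
            with self._lock:
                self._in_flight -= 1

    def value(self) -> int:
        with self._lock:
            return self._in_flight


def run_inference(payload: str) -> str:
    # Stand-in for the real forward pass (hypothetical).
    return payload.upper()


def handle_request(gauge: InFlightGauge, payload: str) -> str:
    with gauge.track():
        return run_inference(payload)
```

The value returned by `value()` is what each pod would report to the autoscaler as its current concurrency.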

Canonical wiki disclosure¶

The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first canonical disclosure of request_concurrency-driven autoscaling for high-QPS LLM inference:

"To keep the platform cost-optimal for variable traffic patterns, the system autoscales dynamically with customer demand. The autoscaler tracks request_concurrency averaged across pods, with per-pod concurrency targets derived from benchmarking maximum sustainable RPS per replica."

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

Two design decisions named in the same disclosure:

Average across pods, not per-pod max. This dampens the autoscaler against a single hot pod and aligns scale decisions with the fleet's average operating point.
Per-pod target derived from benchmarking maximum sustainable RPS per replica. The number is empirical, not analytic: it is set by load-testing each pod shape until p99 latency starts to climb, then backing off slightly. It is re-derived whenever the pod definition changes (model version, GPU class, memory budget).

Why concurrency, not RPS¶

RPS (requests per second) is the more natural-feeling load metric but is structurally less suitable as an autoscaler signal for GPU serving:

Signal	Property	Tradeoff
CPU %	Cheap to measure	Lies under GPU-bound load: looks idle even when the system is at the latency cliff
RPS	Direct load measure	Doesn't account for variable request shape; a pod handling 100 long requests is at higher load than one handling 100 short ones
`request_concurrency`	Tracks the actual in-flight work	Naturally weights heavy requests heavier; close to a leading indicator for latency
p99 latency	Direct SLO measure	Lagging signal — already breached by the time it crosses threshold

Concurrency is the right altitude: it represents real load more faithfully than CPU% or RPS, and it moves earlier than p99 latency.
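
To illustrate the RPS row of the table, here is a small sketch with a made-up workload: two pods each receive one request per second, but one pod's requests last 0.5 s and the other's last 4 s. The helper names and the numbers are assumptions chosen for illustration only.

```python
Interval = tuple[float, float]  # (start_seconds, end_seconds) of one request


def requests_per_second(intervals: list[Interval], window_seconds: float) -> float:
    """Arrival rate over the observation window."""
    return len(intervals) / window_seconds


def in_flight_at(intervals: list[Interval], t: float) -> int:
    """Number of requests that have started but not finished at time t."""
    return sum(1 for start, end in intervals if start <= t < end)


def arrivals(duration_seconds: float, count: int = 10) -> list[Interval]:
    """One request per second, each lasting duration_seconds (made-up workload)."""
    return [(float(i), float(i) + duration_seconds) for i in range(count)]


short_pod = arrivals(duration_seconds=0.5)
long_pod = arrivals(duration_seconds=4.0)

for name, pod in (("short", short_pod), ("long", long_pod)):
    rps = requests_per_second(pod, window_seconds=10.0)
    print(name, "rps:", rps, "in-flight at t=5.25:", in_flight_at(pod, t=5.25))
```

Both pods report the same RPS, while the in-flight count at the sampled instant is 1 for the short-request pod and 4 for the long-request pod, which is the difference an RPS-based autoscaler cannot see.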

What "averaged across pods" means in practice¶

Each pod reports its current in-flight count. The autoscaler computes mean(concurrency_pod_i) over a short window (seconds). It compares to the target — call it T — and:

If mean > T for sustained-up window → scale up
If mean < T for sustained-down window → scale down
Otherwise hold

The asymmetric windows (aggressive-up + conservative-down) are themselves a separate design dimension; see patterns/asymmetric-aggressive-up-conservative-down-autoscaling for the canonical Superhuman shape.
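
The decision rule above can be sketched as a small controller. This is an illustration under stated assumptions, not the Databricks implementation: the class name, the window lengths (counted in consecutive samples), and the replica bounds are all hypothetical. The resize step borrows the proportional formula used by the Kubernetes Horizontal Pod Autoscaler, `ceil(current_replicas * current_metric / target_metric)`; the cited post does not say which resize rule the platform uses.

```python
import math
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class ConcurrencyAutoscaler:
    """Hypothetical controller: mean concurrency vs. target T, asymmetric windows."""

    target: float            # per-pod concurrency target T
    up_window: int = 2       # consecutive samples above T needed to scale up
    down_window: int = 6     # consecutive samples below T needed to scale down
    min_replicas: int = 1
    max_replicas: int = 100
    _above: int = field(default=0, init=False)
    _below: int = field(default=0, init=False)

    def observe(self, per_pod_concurrency: list[float]) -> int:
        """Take one sample per pod and return the desired replica count."""
        replicas = len(per_pod_concurrency)
        mean = fmean(per_pod_concurrency)

        if mean > self.target:
            self._above, self._below = self._above + 1, 0
        elif mean < self.target:
            self._above, self._below = 0, self._below + 1
        else:
            self._above = self._below = 0

        sustained_up = self._above >= self.up_window
        sustained_down = self._below >= self.down_window
        if not (sustained_up or sustained_down):
            return replicas  # hold

        self._above = self._below = 0
        # Proportional resize so the new fleet-wide mean lands near T.
        desired = math.ceil(replicas * mean / self.target)
        return max(self.min_replicas, min(self.max_replicas, desired))


scaler = ConcurrencyAutoscaler(target=4.0)
scaler.observe([6.0, 6.0])            # first sample above T: hold at 2 replicas
print(scaler.observe([6.0, 6.0]))     # second sample above T: ceil(2 * 6 / 4) = 3
```

A shorter `up_window` than `down_window` is what makes the policy aggressive on the way up and conservative on the way down.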

Per-pod target derivation¶

The Databricks post is explicit that the per-pod concurrency target is "derived from benchmarking maximum sustainable RPS per replica". The mapping is straightforward:

target_concurrency = sustainable_RPS × p99_latency

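(Little's Law in the steady state: average in-flight requests equals arrival rate times average residence time. Here p99 latency stands in for the average as a bound on residence time before the SLO breaks. Because p99 is at least as large as the mean for typical latency distributions, the product overestimates average in-flight count, so it is best read as a ceiling for the target.)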

In practice the team measures sustainable_RPS by ramping a pod's load until the latency knee, picks a target slightly below the knee, and converts to concurrency. The number is re-measured when the model, GPU class, batch policy, or kernel set changes.
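
The measure-then-convert procedure can be sketched as follows. Everything here is an assumption made for illustration: the `LoadStep` type, the knee heuristic (first step whose p99 exceeds 1.5x the lowest-load p99), the 10% headroom, and the sample numbers. The residence time is passed in explicitly so the caller decides which latency figure to plug into the formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadStep:
    rps: float          # offered load during this step of the ramp
    p99_seconds: float  # p99 latency observed at that load


def pick_sustainable(steps: list[LoadStep], knee_ratio: float = 1.5) -> LoadStep:
    """Return the highest-RPS step measured before p99 latency reaches the knee."""
    ordered = sorted(steps, key=lambda s: s.rps)
    baseline = ordered[0].p99_seconds
    best = ordered[0]
    for step in ordered:
        if step.p99_seconds > knee_ratio * baseline:
            break  # latency has started to climb: this is the knee
        best = step
    return best


def target_concurrency(sustainable_rps: float, residence_seconds: float,
                       headroom: float = 0.9) -> float:
    """Little's Law conversion, backed off slightly below the knee."""
    return sustainable_rps * headroom * residence_seconds


# Made-up ramp results for one pod shape.
ramp = [
    LoadStep(10, 0.20), LoadStep(20, 0.21), LoadStep(30, 0.24),
    LoadStep(40, 0.45), LoadStep(50, 1.20),
]
step = pick_sustainable(ramp)
target = target_concurrency(step.rps, step.p99_seconds)
print(step.rps, round(target, 2))
```

With these sample numbers the knee falls between 30 and 40 RPS, so the 30 RPS step is taken as sustainable and the target works out to roughly 6.5 in-flight requests per pod.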

Risks and mitigations¶

Concurrency lags arrival rate during a sharp ramp. A burst can push concurrency past the target before the autoscaler can act. The Superhuman fix: aggressive scale-up (small sustained-up window, fast provisioning).
Concurrency drops under partial outage (errors complete fast), causing spurious scale-down. The Superhuman fix: conservative scale-down + treat error rate as a separate signal.
A single hot pod inflates the average. The mean-not-max choice partially mitigates this; a stronger fix is to use a robust statistic (median, trimmed mean) as the autoscaler input.
Concurrency = 0 across all pods during a quiet period creates flapping pressure; scale-down hold-time plus minimum-replica floor is the operational fix.
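
To make the hot-pod risk concrete, the sketch below compares the plain mean with the two robust statistics mentioned above on a made-up fleet of ten pods, nine at concurrency 4 and one at 40. The `trimmed_mean` helper and the 10% trim fraction are assumptions for illustration.

```python
from statistics import fmean, median


def trimmed_mean(values: list[float], trim_fraction: float = 0.1) -> float:
    """Mean after dropping the lowest and highest trim_fraction of samples."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # samples to drop from each end
    return fmean(ordered[k:len(ordered) - k])


# Made-up fleet: nine healthy pods and one hot pod.
fleet = [4.0] * 9 + [40.0]

print("mean:        ", fmean(fleet))
print("median:      ", median(fleet))
print("trimmed mean:", trimmed_mean(fleet))
```

The plain mean is pulled up to 7.6 by the single hot pod, while the median and the 10% trimmed mean both stay at 4.0, so an autoscaler fed either robust statistic would not scale the whole fleet because of one outlier.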

Seen in¶

sources/2026-05-27-databricks-reliable-llm-inference-at-scale — the next evolution of this signal: Databricks explicitly steps beyond request_concurrency for LLM serving in favour of a model-unit (MU) utilisation ratio. It belongs to the same family (in-flight cost averaged across pods), but folds per-request cost into the metric instead of treating every request as cost 1. The structural argument: "a small number of expensive long-context requests can trigger different routing and scaling decisions than many cheap short requests." For uniform-cost workloads, request_concurrency remains correct (and simpler); for non-uniform-cost LLM workloads, MU utilisation is the right primitive. Same family, different cost unit. Reported result: >80% GPU savings vs static-peak provisioning on bursty workloads. See patterns/model-units-utilization-autoscaling.
sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical disclosure of request_concurrency-averaged-across-pods autoscaling at 200K+ QPS on the Databricks Model Serving platform; per-pod target derived from benchmarking maximum sustainable RPS per replica; tuned via joint shadow testing between Databricks and Superhuman.
sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — third source: the Custom Model Serving platform uses active concurrent requests as the horizontal axis of two-axis autoscaling. Here the per-pod target (target_concurrency) is not statically benchmarked but dynamically tuned at runtime by the vertical axis (model-aware vertical scaling) — the platform discovers each model's concurrency limit from hardware utilization + latency signals. At 300K+ QPS across heterogeneous model types.

Related¶

concepts/model-units / concepts/model-unit-utilization-ratio — the next-evolution signal for non-uniform-cost LLM workloads.
concepts/non-uniform-llm-request-cost — the structural property that breaks request-count-based load metrics at LLM scale.
concepts/reactive-autoscaling — the broader autoscaling family this signal sits in
concepts/predictive-autoscaling — the forecast-based alternative; concurrency-based autoscaling can be composed under a predictor for scheduled spikes
concepts/diurnal-autoscaling-risk — the traffic shape under which concurrency-based scaling has to act fast
concepts/customer-driven-metrics — concurrency is a workload-driven metric, not an infrastructure proxy
concepts/anti-flapping — the operational discipline against which scale-down conservatism is calibrated
concepts/spiky-traffic — the regime where concurrency-driven autoscaling has to be paired with aggressive-up policy
patterns/asymmetric-aggressive-up-conservative-down-autoscaling — the policy shape that pairs with this signal
patterns/model-units-utilization-autoscaling — the next-evolution pattern for LLM workloads.
systems/databricks-model-serving — canonical platform instance
systems/databricks-axon — the LLM router that reads the MU-evolved signal in the next-generation design.
systems/kubernetes — the orchestrator the autoscaler operates on top of