PATTERN Cited by 1 source
Model-aware concurrency tuning¶
Pattern¶
Instead of using a fixed, administrator-chosen concurrency limit per
pod, let the autoscaler discover each model's optimal concurrency at
runtime by observing hardware utilization, latency, and queue depth
under load — then adjust target_concurrency automatically.
The hardware stays the same. What changes is how many concurrent requests each pod is allowed to accept, tuned to the resource profile of the model running on it.
Why¶
Custom model serving platforms host wildly heterogeneous workloads. A fixed concurrency target is either:
- Too low for lightweight models → under-utilization, wasted GPUs
- Too high for heavy models → queueing, latency violations
Model-aware tuning resolves this without human intervention.
Implementation (Databricks APA)¶
- Observe multi-signal metrics (CPU/GPU util, memory, I/O wait, latency/queue-depth, GPU memory bandwidth, FP16/BF16 FLOPS).
- Only adjust when a metric crosses a stable threshold (per-metric tuning).
- Cap the maximum change per decision cycle (prevent large swings).
- Enforce min/max concurrency limits.
- Run at a 30-second cadence, slower than horizontal scaling's 5-second loop, because it relies on historical metrics and steady-state behavior rather than instantaneous traffic.
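The steps above can be sketched as a single decision function. This is a minimal illustration, not the Databricks implementation: the names `TunerConfig` and `next_target`, the threshold values, and the choice of signals (one utilization figure, p95 latency, queue depth) are all assumptions made for the example. It shows threshold-gated adjustment, a capped step per cycle, and min/max clamping.

```python
from dataclasses import dataclass


@dataclass
class TunerConfig:
    # Every value here is an illustrative assumption, not a figure from the source.
    min_concurrency: int = 1
    max_concurrency: int = 64
    max_step: int = 2              # cap on the change per decision cycle
    high_util: float = 0.85        # above this, the pod is treated as stressed
    low_util: float = 0.50         # below this, the pod is treated as having headroom
    latency_slo_ms: float = 200.0  # hypothetical latency objective
    max_queue_depth: int = 4       # hypothetical queue-depth threshold


def next_target(current: int, util: float, p95_latency_ms: float,
                queue_depth: int, cfg: TunerConfig) -> int:
    """Return the target concurrency for the next decision cycle."""
    stressed = (util > cfg.high_util
                or p95_latency_ms > cfg.latency_slo_ms
                or queue_depth > cfg.max_queue_depth)
    headroom = (util < cfg.low_util
                and p95_latency_ms < cfg.latency_slo_ms
                and queue_depth == 0)

    if stressed:
        proposed = current - cfg.max_step
    elif headroom:
        proposed = current + cfg.max_step
    else:
        proposed = current  # no threshold crossed: leave the target alone

    # Enforce the min/max concurrency limits.
    return max(cfg.min_concurrency, min(cfg.max_concurrency, proposed))
```

In a real controller this function would be called once per 30-second cycle with metrics aggregated over a recent window, so that a single noisy sample does not trigger a change.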
Asymmetric direction¶
- Quick to reduce concurrency when a pod shows stress (routing fewer requests to an overloaded replica protects latency immediately).
- Cautious about increasing concurrency (avoid oscillation from brief low-utilization windows).
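One way to express this asymmetry is to let a single stressed cycle lower the target immediately, while requiring several consecutive calm cycles before raising it. The sketch below assumes this streak-based approach. The class name `AsymmetricTuner`, the step sizes, and the three-cycle requirement are hypothetical choices for illustration, not details from the source.

```python
class AsymmetricTuner:
    """Lower concurrency quickly under stress; raise it only after sustained headroom."""

    def __init__(self, target: int, min_c: int = 1, max_c: int = 64,
                 down_step: int = 4, up_step: int = 1,
                 calm_cycles_needed: int = 3):
        # All defaults are illustrative assumptions.
        self.target = target
        self.min_c = min_c
        self.max_c = max_c
        self.down_step = down_step
        self.up_step = up_step
        self.calm_cycles_needed = calm_cycles_needed
        self.calm_streak = 0

    def observe(self, stressed: bool, has_headroom: bool) -> int:
        """Record one decision cycle and return the updated target."""
        if stressed:
            # React at once: one stressed cycle is enough to back off.
            self.calm_streak = 0
            self.target = max(self.min_c, self.target - self.down_step)
        elif has_headroom:
            # Increase only after several consecutive cycles with headroom.
            self.calm_streak += 1
            if self.calm_streak >= self.calm_cycles_needed:
                self.target = min(self.max_c, self.target + self.up_step)
                self.calm_streak = 0
        else:
            # Neither stressed nor idle: a brief calm window does not count.
            self.calm_streak = 0
        return self.target
```

With these defaults, a pod at a target of 16 drops to 12 after one stressed cycle, but needs three consecutive cycles with headroom to climb back to 13. An interrupted streak starts over, which is what damps oscillation.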
Production observation¶
"Most models are homogeneous." Once a model's profile is learned during onboarding, the vertical axis (per-pod concurrency tuning) goes quiet. It earns its keep during initial deployment and model updates.