PATTERN Cited by 1 source

Model-aware concurrency tuning

Pattern

Instead of using a fixed, administrator-chosen concurrency limit per pod, let the autoscaler discover each model's optimal concurrency at runtime by observing hardware utilization, latency, and queue depth under load — then adjust target_concurrency automatically.

The hardware stays the same. What changes is how many concurrent requests each pod is allowed to accept, tuned to the resource profile of the model running on it.

Why

Custom model serving platforms host wildly heterogeneous workloads. A fixed concurrency target is either:

  • Too low for lightweight models → under-utilization, wasted GPUs
  • Too high for heavy models → queueing, latency violations

Model-aware tuning resolves this without human intervention.

Implementation (Databricks APA)

  1. Observe multi-signal metrics (CPU/GPU util, memory, I/O wait, latency/queue-depth, GPU memory bandwidth, FP16/BF16 FLOPS).
  2. Only adjust when a metric crosses a stable threshold (per-metric tuning).
  3. Cap the maximum change per decision cycle (prevent large swings).
  4. Enforce min/max concurrency limits.
  5. Run at a 30-second cadence, slower than horizontal scaling's 5-second loop, because it relies on historical metrics and steady-state behavior rather than instantaneous traffic.
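
A minimal sketch of one decision cycle, covering steps 2 to 4. It is not the Databricks implementation: the names (`Threshold`, `TunerConfig`, `decide`), the threshold values, and the proportional estimate of the desired concurrency are all assumptions made for illustration. The metrics passed in are assumed to be already averaged over the decision window, so a single noisy sample does not trigger a change.

```python
import math
from dataclasses import dataclass

DECISION_INTERVAL_S = 30  # cadence from step 5; the caller would invoke decide() this often


@dataclass(frozen=True)
class Threshold:
    low: float   # below this, the metric shows headroom
    high: float  # above this, the metric shows stress


@dataclass(frozen=True)
class TunerConfig:
    thresholds: dict[str, Threshold]  # per-metric thresholds (step 2)
    min_concurrency: int = 1          # hard floor (step 4)
    max_concurrency: int = 64         # hard ceiling (step 4)
    max_step: int = 2                 # largest change per cycle (step 3)


def decide(current: int, metrics: dict[str, float], cfg: TunerConfig) -> int:
    """Return the next target concurrency for one pod."""
    stressed = any(metrics[n] > t.high for n, t in cfg.thresholds.items())
    headroom = all(metrics[n] < t.low for n, t in cfg.thresholds.items())
    if not stressed and not headroom:
        return current  # no threshold crossed: leave the target alone

    # Hypothetical proportional estimate: scale by the most loaded metric.
    pressure = max(metrics[n] / t.high for n, t in cfg.thresholds.items())
    desired = math.floor(current / max(pressure, 1e-9))

    # Cap the change per cycle, then enforce the absolute limits.
    delta = max(-cfg.max_step, min(cfg.max_step, desired - current))
    return max(cfg.min_concurrency, min(cfg.max_concurrency, current + delta))


# Illustrative configuration; the numbers are made up.
example_cfg = TunerConfig(
    thresholds={
        "gpu_util": Threshold(low=0.50, high=0.85),
        "p95_latency_ms": Threshold(low=100.0, high=250.0),
    }
)
```

With `example_cfg`, a pod at concurrency 8 whose metrics are far below every low threshold moves only to 10, because the per-cycle cap wins over the proportional estimate. That cap is what prevents large swings.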

Asymmetric direction

  • Quick to reduce concurrency when a pod shows stress (routing fewer requests to an overloaded replica protects latency immediately).
  • Cautious about increasing concurrency (avoid oscillation from brief low-utilization windows).
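
One way to sketch this asymmetry is a small gate placed in front of whatever proposes the new target: reductions pass through immediately, while increases must be proposed for several consecutive cycles before they are applied. The `AsymmetricGate` class and its `patience` parameter are hypothetical, and the source does not say how the caution is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class AsymmetricGate:
    patience: int = 4     # hypothetical: consecutive "increase" proposals required
    calm_streak: int = 0  # how many such proposals have been seen in a row

    def apply(self, current: int, proposed: int) -> int:
        """Filter a proposed target concurrency for one decision cycle."""
        if proposed < current:
            # Stress: reduce right away and forget any pending increase.
            self.calm_streak = 0
            return proposed
        if proposed > current:
            # Headroom: wait until it has persisted for `patience` cycles.
            self.calm_streak += 1
            if self.calm_streak >= self.patience:
                self.calm_streak = 0
                return proposed
            return current
        # No change proposed: a brief low-utilization window has ended.
        self.calm_streak = 0
        return current
```

With `patience=3`, two low-utilization cycles followed by a normal one leave the target unchanged, which is the oscillation the cautious direction is meant to avoid.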

Production observation

"Most models are homogeneous." Once a model's profile is learned during onboarding, the vertical axis (per-pod concurrency tuning) goes quiet. It earns its keep during initial deployment and model updates.
