CONCEPT Cited by 1 source
Model-aware vertical scaling¶
Definition¶
Model-aware vertical scaling adjusts how many concurrent requests
each pod accepts (target_concurrency) based on the model's actual
resource profile under load โ not by changing the hardware type.
The hardware stays the same; what changes is the admission policy.
This is "vertical" in the sense that it adjusts per-replica capacity, but the mechanism is concurrency tuning rather than CPU/RAM resizing.
Why it matters¶
Custom ML serving platforms host wildly different model types:
- A CPU-heavy XGBoost model may only serve 1 request per core
- An agent workload can run 100s of requests per core
- A fine-tuned 13B LLM benefits from batching multiple requests
A platform that treats all models identically will either under-utilise lightweight models or over-admit heavy ones. Model-aware vertical scaling learns each model's limit at runtime and right-sizes concurrency as model behavior evolves.
Signals used (Databricks APA)¶
- CPU and GPU utilization, memory utilization, I/O wait
- Current latency and queue-depth profile
- GPU-specific: memory bandwidth, FP16/BF16 FLOPS utilization
Behavioral notes¶
- "Most models are homogeneous" โ a model's resource profile under the same load stays mostly similar. Vertical scaling earns its keep during onboarding (learning a new model's profile), then goes quiet.
- Asymmetric: quick to reduce concurrency when stressed (protect latency), slow to increase (avoid oscillation).
Seen in¶
- sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model โ First canonical disclosure.