CONCEPT Cited by 1 source

Model-aware vertical scaling¶

Definition¶

Model-aware vertical scaling adjusts how many concurrent requests each pod accepts (target_concurrency) based on the model's actual resource profile under load — not by changing the hardware type. The hardware stays the same; what changes is the admission policy.

This is "vertical" in the sense that it adjusts per-replica capacity, but the mechanism is concurrency tuning rather than CPU/RAM resizing.

Why it matters¶

Custom ML serving platforms host wildly different model types:

A CPU-heavy XGBoost model may only serve 1 request per core
An agent workload can run 100s of requests per core
A fine-tuned 13B LLM benefits from batching multiple requests

A platform that treats all models identically will either under-utilise lightweight models or over-admit heavy ones. Model-aware vertical scaling learns each model's limit at runtime and right-sizes concurrency as model behavior evolves.

Signals used (Databricks APA)¶

CPU and GPU utilization, memory utilization, I/O wait
Current latency and queue-depth profile
GPU-specific: memory bandwidth, FP16/BF16 FLOPS utilization

Behavioral notes¶

"Most models are homogeneous" — a model's resource profile under the same load stays mostly similar. Vertical scaling earns its keep during onboarding (learning a new model's profile), then goes quiet.
Asymmetric: quick to reduce concurrency when stressed (protect latency), slow to increase (avoid oscillation).

Seen in¶

sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — First canonical disclosure.