Skip to content

CONCEPT Cited by 1 source

Model-aware vertical scaling

Definition

Model-aware vertical scaling adjusts how many concurrent requests each pod accepts (target_concurrency) based on the model's actual resource profile under load โ€” not by changing the hardware type. The hardware stays the same; what changes is the admission policy.

This is "vertical" in the sense that it adjusts per-replica capacity, but the mechanism is concurrency tuning rather than CPU/RAM resizing.

Why it matters

Custom ML serving platforms host wildly different model types:

  • A CPU-heavy XGBoost model may only serve 1 request per core
  • An agent workload can run 100s of requests per core
  • A fine-tuned 13B LLM benefits from batching multiple requests

A platform that treats all models identically will either under-utilise lightweight models or over-admit heavy ones. Model-aware vertical scaling learns each model's limit at runtime and right-sizes concurrency as model behavior evolves.

Signals used (Databricks APA)

  1. CPU and GPU utilization, memory utilization, I/O wait
  2. Current latency and queue-depth profile
  3. GPU-specific: memory bandwidth, FP16/BF16 FLOPS utilization

Behavioral notes

  • "Most models are homogeneous" โ€” a model's resource profile under the same load stays mostly similar. Vertical scaling earns its keep during onboarding (learning a new model's profile), then goes quiet.
  • Asymmetric: quick to reduce concurrency when stressed (protect latency), slow to increase (avoid oscillation).

Seen in

Last updated ยท 542 distilled / 1,571 read