CONCEPT Cited by 2 sources
Two-axis autoscaling¶
Definition¶
Two-axis autoscaling combines horizontal scaling (add/remove replicas) with vertical scaling (adjust per-replica capacity or concurrency limits) in a single coupled control loop. Each axis does what it is best at:
- Horizontal reacts to traffic changes: fast, availability-focused, prevents queueing.
- Vertical reacts to model/workload characteristics: efficient, resource-aware, prevents waste.
The axes are coupled, not independent: vertical scaling's output
(e.g. target_concurrency) feeds the horizontal scaling formula's
denominator, making the two behave as a single adaptive system.
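The coupling can be sketched in a few lines. This is a minimal illustration, not the formula of any system named on this page: it assumes the horizontal target is the ceiling of active concurrency divided by target_concurrency, and the function name, the clamping bounds, and the sample numbers are all hypothetical.

```python
import math


def desired_replicas(
    active_concurrency: float,
    target_concurrency: float,
    min_replicas: int = 1,     # hypothetical lower bound
    max_replicas: int = 100,   # hypothetical upper bound
) -> int:
    """Horizontal axis: replicas needed so each carries about target_concurrency.

    target_concurrency is the vertical axis's output; it sits in the
    denominator, so raising it lowers the replica count for the same traffic.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    raw = math.ceil(active_concurrency / target_concurrency)
    return max(min_replicas, min(max_replicas, raw))


# Same traffic, two different vertical outputs:
print(desired_replicas(240, 8))   # 30
print(desired_replicas(240, 12))  # 20
```

With traffic held constant, a vertical decision to raise per-replica concurrency from 8 to 12 shrinks the fleet from 30 to 20 replicas, which is the sense in which the two axes act as one system.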
Why single-axis autoscalers fail¶
| Approach | Strength | Weakness |
|---|---|---|
| Request-based only | Fast reaction | Treats all requests identically; over-provisions or thrashes |
| Resource-based only | Efficient | Utilization metrics trail traffic; by the time the autoscaler fires, p99 damage is done |
Two-axis autoscaling uses the fast signal (requests) for the fast decision (add replicas) and the slower signal (resource utilization) for the efficient decision (adjust concurrency per node).
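The split between a fast and a slow decision can be sketched as two steps running at different cadences over the same state. Everything below is an assumption made for illustration: the class and function names, the utilization thresholds (0.5 and 0.8), the step size of 1, and the 5-tick and 30-tick cadences are hypothetical and do not describe a specific implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class TwoAxisState:
    target_concurrency: int = 8  # vertical axis output (per-replica limit)
    replicas: int = 1            # horizontal axis output


def horizontal_step(state: TwoAxisState, active_concurrency: float) -> None:
    """Fast decision: size the fleet from the request signal."""
    state.replicas = max(1, math.ceil(active_concurrency / state.target_concurrency))


def vertical_step(state: TwoAxisState, utilization: float,
                  low: float = 0.5, high: float = 0.8) -> None:
    """Slow decision: nudge per-replica concurrency from resource utilization."""
    if utilization > high:
        state.target_concurrency = max(1, state.target_concurrency - 1)
    elif utilization < low:
        state.target_concurrency += 1


def run(samples, horizontal_every: int = 5, vertical_every: int = 30):
    """samples: one (active_concurrency, utilization) pair per tick."""
    state = TwoAxisState()
    history = []
    for t, (active, util) in enumerate(samples, start=1):
        if t % vertical_every == 0:
            vertical_step(state, util)      # runs rarely, changes the denominator
        if t % horizontal_every == 0:
            horizontal_step(state, active)  # runs often, uses the latest denominator
        history.append((t, state.target_concurrency, state.replicas))
    return state, history


# Steady traffic on under-utilized replicas for 60 ticks.
state, history = run([(160, 0.4)] * 60)
print(history[4])  # (5, 8, 20)
print(state)       # TwoAxisState(target_concurrency=10, replicas=16)
```

In this run the fast step sizes the fleet to 20 replicas by tick 5, and the slow step then raises per-replica concurrency twice (at ticks 30 and 60), letting the fleet shrink to 16 without any change in traffic.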
Canonical wiki instances¶
- APA (Custom Model Serving, 2026-06-10): horizontal on active concurrency (5s interval) + vertical on multi-metric model-aware concurrency tuning (30s interval). Sustains 300K+ QPS across heterogeneous models.
- Databricks Serverless Autoscaler (Spark Compute, 2026-05-06): horizontal (add VMs) + vertical (OOM-aware VM-restart on a larger VM).