CONCEPT Cited by 2 sources

Two-axis autoscaling

Definition

Two-axis autoscaling combines horizontal scaling (add/remove replicas) with vertical scaling (adjust per-replica capacity or concurrency limits) in a single coupled control loop. Each axis does what it is best at:

  • Horizontal reacts to traffic changes: fast, availability-focused, prevents queueing.
  • Vertical reacts to model/workload characteristics: efficient, resource-aware, prevents waste.

The axes are coupled, not independent: vertical scaling's output (e.g. target_concurrency) feeds the horizontal scaling formula's denominator, making the two behave as a single adaptive system.
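The coupling can be sketched in a few lines of Python. This is a minimal illustration, not the formula of any particular system: it assumes the horizontal rule is a ceiling division of observed active concurrency by target_concurrency, and the function name and the min_replicas/max_replicas bounds are hypothetical.

```python
import math


def desired_replicas(active_concurrency: float,
                     target_concurrency: float,
                     min_replicas: int = 1,
                     max_replicas: int = 100) -> int:
    """Horizontal axis: replicas needed to serve the current load.

    target_concurrency is the value produced by the vertical axis; it sits
    in the denominator, so raising it lowers the replica count.
    The bounds are illustrative assumptions, not taken from the source.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    raw = math.ceil(active_concurrency / target_concurrency)
    # Clamp to the allowed replica range.
    return max(min_replicas, min(max_replicas, raw))
```

With 240 in-flight requests, a target of 8 per replica yields 30 replicas. If the vertical axis later raises the target to 12, the same load needs only 20, which is the sense in which the two axes act as one system.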

Why single-axis autoscalers fail

Approach | Strength | Weakness
Request-based only | Fast reaction | Treats all requests identically; over-provisions or thrashes
Resource-based only | Efficient | Utilization metrics trail traffic; by the time the autoscaler fires, p99 damage is done

Two-axis autoscaling uses the fast signal (requests) for the fast decision (add replicas) and the slower signal (resource utilization) for the efficient decision (adjust concurrency per node).
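The split between a fast and a slow decision can be sketched as two steps running at different cadences over the same state. Everything named here is hypothetical: the TwoAxisState class, the utilization thresholds (0.5 and 0.8), the one-step concurrency nudge, and the 1:6 tick ratio (chosen to mirror a 5s/30s split) are assumptions for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class TwoAxisState:
    replicas: int = 1
    target_concurrency: int = 8  # per-replica concurrency limit


def vertical_step(state, utilization, low=0.5, high=0.8, min_c=1, max_c=64):
    """Slow loop: nudge per-replica concurrency from resource utilization."""
    if utilization > high:
        state.target_concurrency = max(min_c, state.target_concurrency - 1)
    elif utilization < low:
        state.target_concurrency = min(max_c, state.target_concurrency + 1)


def horizontal_step(state, active_concurrency):
    """Fast loop: size the fleet from in-flight requests."""
    state.replicas = max(
        1, math.ceil(active_concurrency / state.target_concurrency))


def run(samples, horizontal_every=1, vertical_every=6):
    """samples: iterable of (active_concurrency, utilization), one per tick."""
    state = TwoAxisState()
    history = []
    for tick, (active, util) in enumerate(samples):
        if tick % vertical_every == 0:
            vertical_step(state, util)
        if tick % horizontal_every == 0:
            horizontal_step(state, active)
        history.append((state.replicas, state.target_concurrency))
    return history
```

Running the vertical step first on ticks where both fire means the horizontal step always divides by the freshest target. Between vertical ticks, the replica count still follows traffic on every tick.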

Canonical wiki instances

  1. APA (Custom Model Serving, 2026-06-10): horizontal on active concurrency (5s interval) + vertical on multi-metric, model-aware concurrency tuning (30s interval). Sustains 300K+ QPS across heterogeneous models.
  2. Databricks Serverless Autoscaler (Spark Compute, 2026-05-06): horizontal (add VMs) + vertical (OOM-aware VM restart on a larger VM).
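An OOM-aware vertical step of the kind named in the list above might look roughly like the following sketch. It is not Databricks' implementation: the memory ladder, the "oom" exit-reason string, and both function names are hypothetical, chosen only to show the shape of a "restart on a larger VM" decision.

```python
# Hypothetical ladder of VM memory sizes in GiB (an assumption, not real SKUs).
VM_MEMORY_GIB = [16, 32, 64, 128]


def next_vm_size(current_gib, sizes=VM_MEMORY_GIB):
    """Return the smallest size strictly larger than current_gib, or None."""
    larger = [s for s in sorted(sizes) if s > current_gib]
    return larger[0] if larger else None


def handle_exit(current_gib, exit_reason):
    """Vertical decision after a VM exits.

    Returns (action, memory_gib), where action is "keep", "restart" or "fail".
    Only an out-of-memory exit triggers a move to a larger VM.
    """
    if exit_reason != "oom":
        return ("keep", current_gib)
    bigger = next_vm_size(current_gib)
    if bigger is None:
        # Already on the largest size; scaling up cannot help.
        return ("fail", current_gib)
    return ("restart", bigger)
```

For example, handle_exit(32, "oom") returns ("restart", 64), while a non-OOM exit leaves the size unchanged.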
