Databricks AutoPilot Pod Autoscaler (APA)¶
Definition¶
AutoPilot Pod Autoscaler (APA) is a custom Kubernetes controller at the center of Databricks Custom Model Serving. It continuously collects signals from the load balancer (active concurrency, queue depth) and from the pods themselves (CPU/GPU utilization, GPU memory, memory bandwidth, FP16/BF16 FLOPS utilization), and turns them into scaling decisions across two coupled axes.
Two-Axis Design¶
Horizontal axis — request-based scaling (fast)¶
Watches active concurrent requests per endpoint and adds or removes replicas. The replica count follows the Kubernetes HPA formula. Timing parameters:
- Decision interval: 5 seconds
- Scale-up lookback: 20 seconds (aggressive)
- Scale-down lookback: ~5 minutes (conservative)
- Request scrape: every 1 second
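For reference, the standard Kubernetes HPA rule is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. A minimal sketch of that rule applied to per-replica concurrency follows. The function and parameter names are illustrative and not taken from APA itself, and APA's exact variant of the formula is not given in the source.

```python
import math


def desired_replicas(current_replicas: int,
                     observed_per_replica: float,
                     target_per_replica: float) -> int:
    """Kubernetes HPA rule: ceil(currentReplicas * currentMetric / desiredMetric).

    Here the metric is assumed to be active concurrent requests per replica.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    return math.ceil(current_replicas * observed_per_replica / target_per_replica)


# 4 replicas each averaging 12 in-flight requests, against a target of 8 per replica.
print(desired_replicas(4, 12.0, 8.0))  # 6
```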
Vertical axis — model-aware concurrency tuning (efficient)¶
Periodically evaluates multi-signal metrics to determine how much
load a single replica can actually handle. Adjusts
target_concurrency — not the hardware type. Metrics include:
- CPU and GPU utilization, memory utilization, I/O wait
- Current latency and queue-depth profile
- GPU-specific: memory bandwidth, FP16/BF16 FLOPS utilization
- Decision interval: 30 seconds
- Uses historical metrics (not instantaneous traffic)
Coupling¶
Vertical scaling's target_concurrency output feeds the horizontal
formula's denominator. The two axes are coupled by design — not
independent.
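A small worked example of the coupling, assuming the horizontal formula reduces to total active requests divided by target_concurrency (the HPA rule with the per-replica average expanded). The function name and the load figures are hypothetical.

```python
import math


def replicas_needed(total_active_requests: float, target_concurrency: float) -> int:
    # target_concurrency is the denominator: raising it lowers the replica count.
    return math.ceil(total_active_requests / target_concurrency)


load = 240  # hypothetical number of in-flight requests across the endpoint
for target in (4, 8, 16):
    print(target, replicas_needed(load, target))
# 4 60
# 8 30
# 16 15
```

With the load held constant, each doubling of target_concurrency by the vertical axis halves the replica count the horizontal axis asks for.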
Safeguards¶
- Concurrency changes only on stable threshold crossings (per-metric tuning).
- Maximum change cap per decision cycle.
- Min/max concurrency limits always enforced.
- Cadence separation (30s vertical vs 5s horizontal) prevents interference.
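The first three safeguards can be sketched as a single clamped update. This is an illustration under assumptions, not APA's implementation: the function name, the `stable` flag (standing in for the per-metric stable-threshold check), and all numeric values are hypothetical.

```python
def next_target_concurrency(current: int, proposed: int, *,
                            max_step: int, min_limit: int, max_limit: int,
                            stable: bool) -> int:
    """Apply the safeguards to one vertical decision cycle."""
    candidate = current
    if stable:
        # Cap the change per decision cycle to at most max_step in either direction.
        step = max(-max_step, min(max_step, proposed - current))
        candidate = current + step
    # Min/max concurrency limits are enforced on every cycle, stable or not.
    return max(min_limit, min(max_limit, candidate))


# The metrics suggest 20, but the per-cycle cap only allows a move from 8 to 10.
print(next_target_concurrency(8, 20, max_step=2, min_limit=1, max_limit=32, stable=True))   # 10
# No stable threshold crossing: the target is left unchanged.
print(next_target_concurrency(8, 20, max_step=2, min_limit=1, max_limit=32, stable=False))  # 8
```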
Asymmetric Philosophy¶
- Horizontal scale-up: aggressive (react to spikes in seconds)
- Horizontal scale-down: conservative (wait ~5 minutes)
- Vertical concurrency reduction: quick (protect latency)
- Vertical concurrency increase: cautious (avoid oscillation)
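One way to realize the horizontal asymmetry is to evaluate scale-up and scale-down over different lookback windows. The sketch below assumes 1-second samples, a 20-second mean for scale-up, and a 5-minute peak for scale-down. The class, its methods, and the choice of mean versus peak are assumptions for illustration; the source states only the window lengths.

```python
import math
from collections import deque

UP_WINDOW = 20     # seconds of samples used for scale-up
DOWN_WINDOW = 300  # seconds of samples used for scale-down (~5 minutes)


class HorizontalScaler:
    def __init__(self, target_concurrency: float, replicas: int = 1):
        self.target = target_concurrency
        self.replicas = replicas
        self.samples = deque(maxlen=DOWN_WINDOW)  # one sample per second

    def record(self, total_active_requests: float) -> None:
        self.samples.append(total_active_requests)

    def decide(self) -> int:
        if not self.samples:
            return self.replicas
        recent = list(self.samples)[-UP_WINDOW:]
        # Scale up on the short-window mean: reacts within seconds.
        up = math.ceil(sum(recent) / len(recent) / self.target)
        # Scale down only if the peak of the long window fits in fewer replicas.
        down = math.ceil(max(self.samples) / self.target)
        if up > self.replicas:
            self.replicas = up
        elif down < self.replicas:
            self.replicas = down
        return self.replicas


scaler = HorizontalScaler(target_concurrency=8, replicas=2)
for _ in range(20):
    scaler.record(80)   # spike
print(scaler.decide())  # 10
for _ in range(100):
    scaler.record(8)    # load has dropped, but the spike is still inside the long window
print(scaler.decide())  # 10
for _ in range(200):
    scaler.record(8)    # the spike has now aged out of the 5-minute window
print(scaler.decide())  # 1
```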
"The cost of premature scale-down (a cold start at the worst possible moment) outweighs the cost of keeping a few idle replicas temporarily."
Performance¶
- Scales from 10 to 10K QPS in under 60 seconds (depending on model load time).
- Customers reported up to a 5× reduction in queueing and HTTP 429 (Too Many Requests) responses during spikes with the aggressive scale-up policy.
Relationship to Other Databricks Autoscalers¶
- systems/databricks-serverless-autoscaler — two-axis autoscaler for Spark compute (horizontal + vertical via OOM-aware VM restart). APA is the model-serving sibling with concurrency-tuning as the vertical axis.
- Model-units-based autoscaling (from the Axon/Dicer LLM router) — operates at the LLM-specific tier with MU utilisation as the signal; APA is model-agnostic.
Seen in¶
- sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — First architecture-level disclosure.