
Databricks AutoPilot Pod Autoscaler (APA)¶

Definition¶

AutoPilot Pod Autoscaler (APA) is a custom Kubernetes controller at the center of Databricks Custom Model Serving. It continuously collects signals from the load balancer (active concurrency, queue depth) and from the pods themselves (CPU/GPU utilization, GPU memory, memory bandwidth, FP16/BF16 FLOPS utilization), and turns them into scaling decisions across two coupled axes.

Two-Axis Design¶

Horizontal axis — request-based scaling (fast)¶

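Watches the active concurrent requests on each endpoint and adds or removes replicas. The formula follows the Kubernetes HPA, with current_requests as the endpoint-wide total and target_concurrency as the per-replica target: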

desired_replicas = ceil(current_requests / target_concurrency)

Decision interval: 5 seconds
Scale-up lookback: 20 seconds (aggressive)
Scale-down lookback: ~5 minutes (conservative)
Request scrape: every 1 second
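
A minimal sketch of how the formula and the two lookback windows might fit together, assuming one concurrency sample per second. The source does not say how samples are aggregated within a window. Averaging the short window for scale-up and taking the peak of the long window for scale-down are assumptions made here for illustration, as are all function and constant names.

```python
import math

SCALE_UP_LOOKBACK_S = 20     # from the text: aggressive scale-up window
SCALE_DOWN_LOOKBACK_S = 300  # from the text: roughly 5 minutes


def desired_replicas(current_requests: float, target_concurrency: float) -> int:
    """ceil(current_requests / target_concurrency), as in the formula above."""
    return math.ceil(current_requests / target_concurrency)


def horizontal_decision(samples, current_replicas, target_concurrency):
    """Pick a replica count from per-second concurrency samples (newest last).

    Assumption: scale-up looks at the mean of the short window, scale-down
    at the peak of the long window, so a brief dip never removes replicas.
    """
    samples = list(samples)
    if not samples:
        return current_replicas
    recent = samples[-SCALE_UP_LOOKBACK_S:]
    history = samples[-SCALE_DOWN_LOOKBACK_S:]
    up = desired_replicas(sum(recent) / len(recent), target_concurrency)
    down = desired_replicas(max(history), target_concurrency)
    if up > current_replicas:
        return up              # react to a spike within the short window
    if down < current_replicas:
        return down            # shrink only after the whole long window is quiet
    return current_replicas
```

In this sketch a spike that has lasted 20 seconds is enough to add replicas, while a drop in traffic removes them only once the peak over the full 5-minute window has fallen.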

Vertical axis — model-aware concurrency tuning (efficient)¶

Periodically evaluates multi-signal metrics to determine how much load a single replica can actually handle. It adjusts target_concurrency, not the hardware type. Metrics include:

CPU and GPU utilization, memory utilization, I/O wait
Current latency and queue-depth profile
GPU-specific: memory bandwidth, FP16/BF16 FLOPS utilization

Decision interval: 30 seconds
Input: historical metrics (not instantaneous traffic)

Coupling¶

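The target_concurrency produced by the vertical axis is the denominator of the horizontal formula, so the two axes are coupled by design: raising target_concurrency lowers the replica count needed for the same load, and lowering it raises that count.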
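
The coupling can be seen by holding load constant and changing only target_concurrency. The numbers below are illustrative, not taken from the source.

```python
import math


def desired_replicas(current_requests: float, target_concurrency: float) -> int:
    """Horizontal formula: ceil(current_requests / target_concurrency)."""
    return math.ceil(current_requests / target_concurrency)


load = 400  # hypothetical active concurrent requests on one endpoint

# The vertical axis decides each replica can safely take more concurrent work...
before = desired_replicas(load, target_concurrency=8)
after = desired_replicas(load, target_concurrency=10)

# ...so the horizontal axis needs fewer replicas for the same load.
print(before, after)
```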

Safeguards¶

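target_concurrency changes only when a metric crosses its threshold and stays across it (thresholds are tuned per metric).
Each decision cycle caps how far target_concurrency can move.
Minimum and maximum concurrency limits are always enforced.
Cadence separation (30s vertical vs 5s horizontal) keeps the two loops from interfering with each other.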
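
The first three safeguards might be sketched as follows. The source gives no thresholds, step sizes, or limits, so every parameter and function name here is hypothetical, and "stable" is interpreted as a threshold being exceeded for several consecutive observations.

```python
def stable_crossing(values, threshold: float, required: int) -> bool:
    """True only if the last `required` readings are all above `threshold`."""
    recent = list(values)[-required:]
    return len(recent) == required and all(v > threshold for v in recent)


def next_target_concurrency(current: int, proposed: int, *,
                            max_step: int, min_c: int, max_c: int) -> int:
    """Move toward `proposed`, capped per cycle and clamped to [min_c, max_c]."""
    step = max(-max_step, min(max_step, proposed - current))
    return max(min_c, min(max_c, current + step))
```

A single noisy reading does not count as a crossing, and even a large proposed change moves target_concurrency by at most max_step per cycle, which bounds how sharply the horizontal formula's denominator can shift between decisions.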

Asymmetric Philosophy¶

Horizontal scale-up: aggressive (react to spikes in seconds)
Horizontal scale-down: conservative (wait ~5 minutes)
Vertical concurrency reduction: quick (protect latency)
Vertical concurrency increase: cautious (avoid oscillation)
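
One way to express the vertical asymmetry is to allow a larger per-cycle step for reductions than for increases. The step sizes below are hypothetical. The source describes the direction of the asymmetry but gives no numbers.

```python
def asymmetric_step(current: int, proposed: int,
                    up_step: int = 1, down_step: int = 4) -> int:
    """Reduce target concurrency quickly, raise it cautiously."""
    if proposed < current:
        # Quick reduction: protect latency when a replica is overloaded.
        return max(proposed, current - down_step)
    # Cautious increase: small steps avoid oscillation.
    return min(proposed, current + up_step)
```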

"The cost of premature scale-down (a cold start at the worst possible moment) outweighs the cost of keeping a few idle replicas temporarily."

Performance¶

Scales from 10 to 10K QPS in under 60 seconds (depending on model load time).
Customers reported up to a 5× reduction in queueing and HTTP 429 responses during spikes with the aggressive scale-up policy.

Relationship to Other Databricks Autoscalers¶

systems/databricks-serverless-autoscaler — two-axis autoscaler for Spark compute (horizontal + vertical via OOM-aware VM restart). APA is the model-serving sibling with concurrency-tuning as the vertical axis.
Model-units-based autoscaling (from the Axon/Dicer LLM router) — operates at the LLM-specific tier with MU utilisation as the signal; APA is model-agnostic.

Seen in¶

sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — First architecture-level disclosure.