
SYSTEM Cited by 1 source

Databricks AutoPilot Pod Autoscaler (APA)

Definition

AutoPilot Pod Autoscaler (APA) is a custom Kubernetes controller at the center of Databricks Custom Model Serving. It continuously collects signals from the load balancer (active concurrency, queue depth) and from the pods themselves (CPU/GPU utilization, GPU memory, memory bandwidth, FP16/BF16 FLOPS utilization), and turns them into scaling decisions across two coupled axes.

Two-Axis Design

Horizontal axis — request-based scaling (fast)

Watches active concurrent requests per endpoint and adds or removes replicas. The formula follows the Kubernetes HPA:

desired_replicas = ceil(current_requests / target_concurrency)
  • Decision interval: 5 seconds
  • Scale-up lookback: 20 seconds (aggressive)
  • Scale-down lookback: ~5 minutes (conservative)
  • Request scrape: every 1 second
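A minimal sketch of how the formula and the two lookback windows could fit together. The class name, the per-second sample buffer, and the choice to take the peak over each window are illustrative assumptions; the source does not say how APA aggregates samples within a lookback.

```python
import math
from collections import deque


def desired_replicas(current_requests: float, target_concurrency: float) -> int:
    """desired_replicas = ceil(current_requests / target_concurrency)."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    return math.ceil(current_requests / target_concurrency)


class HorizontalScaler:
    """Hypothetical decision loop; window aggregation (peak) is an assumption."""

    SCALE_UP_LOOKBACK_S = 20      # aggressive: react to recent spikes
    SCALE_DOWN_LOOKBACK_S = 300   # conservative: roughly 5 minutes

    def __init__(self) -> None:
        self.samples = deque()  # (timestamp_s, active_requests), scraped ~every 1 s

    def record(self, timestamp_s: float, active_requests: float) -> None:
        self.samples.append((timestamp_s, active_requests))
        # Keep only what the longest lookback needs.
        cutoff = timestamp_s - self.SCALE_DOWN_LOOKBACK_S
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def _peak(self, now_s: float, lookback_s: float) -> float:
        values = [r for t, r in self.samples if t >= now_s - lookback_s]
        return max(values, default=0)

    def decide(self, now_s: float, current_replicas: int, target_concurrency: float) -> int:
        """Intended to be called once per decision interval (every 5 s)."""
        up = desired_replicas(self._peak(now_s, self.SCALE_UP_LOOKBACK_S), target_concurrency)
        down = desired_replicas(self._peak(now_s, self.SCALE_DOWN_LOOKBACK_S), target_concurrency)
        if up > current_replicas:
            return up      # a spike in the last 20 s is enough to scale up
        if down < current_replicas:
            return down    # scale down only if the whole 5 min window was quiet
        return current_replicas
```

Under these assumptions, a burst that ended four minutes ago still holds the replica count up, while a burst that started ten seconds ago raises it on the next decision.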

Vertical axis — model-aware concurrency tuning (efficient)

Periodically evaluates multi-signal metrics to determine how much load a single replica can actually handle. Adjusts target_concurrency, not the hardware type. Metrics include:

  1. CPU and GPU utilization, memory utilization, I/O wait
  2. Current latency and queue-depth profile
  3. GPU-specific: memory bandwidth, FP16/BF16 FLOPS utilization

  • Decision interval: 30 seconds
  • Uses historical metrics (not instantaneous traffic)

Coupling

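Vertical scaling's target_concurrency output feeds the horizontal formula's denominator, so the two axes are coupled by design rather than independent. For the same traffic, lowering target_concurrency raises the desired replica count, and raising it lowers the count.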
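The coupling can be seen by holding traffic constant and varying only target_concurrency. The traffic figure and concurrency values below are made-up illustrations, and the helper name is hypothetical.

```python
import math


def replicas_for_targets(current_requests: float, targets: list[int]) -> dict[int, int]:
    """Apply desired_replicas = ceil(current_requests / target_concurrency)
    for several candidate target_concurrency values."""
    return {t: math.ceil(current_requests / t) for t in targets}


# Illustrative numbers only: 240 in-flight requests on one endpoint.
print(replicas_for_targets(240, [8, 6, 4]))
# → {8: 30, 6: 40, 4: 60}
```

A vertical decision that halves target_concurrency from 8 to 4 doubles the replica count the horizontal axis asks for, without any change in traffic.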

Safeguards

  1. Concurrency changes only when a metric crosses its threshold and stays there (thresholds are tuned per metric).
  2. Maximum change cap per decision cycle.
  3. Min/max concurrency limits always enforced.
  4. Cadence separation (30s vertical vs 5s horizontal) prevents interference.
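The first three safeguards can be sketched as a small state machine. This is an illustration under stated assumptions, not APA's implementation: the class name, the reduction of the multi-signal metrics to a single +1/0/-1 signal, and the default values (step of 2, three consecutive cycles, limits of 1 and 64) are all hypothetical. Cadence separation is not modeled; step() is assumed to be called once per 30-second vertical cycle.

```python
class ConcurrencyTuner:
    """Hypothetical target_concurrency tuner with the three safeguards."""

    def __init__(self, min_concurrency=1, max_concurrency=64,
                 max_step=2, stable_cycles=3):
        self.min_concurrency = min_concurrency
        self.max_concurrency = max_concurrency
        self.max_step = max_step            # cap on change per decision cycle
        self.stable_cycles = stable_cycles  # consecutive crossings required
        self._streak = 0                    # signed run length of same-direction signals

    def step(self, current: int, signal: int) -> int:
        """signal: +1 = metrics show headroom, -1 = overloaded, 0 = within thresholds."""
        if signal == 0:
            self._streak = 0
            return current
        same_direction = self._streak != 0 and (signal > 0) == (self._streak > 0)
        self._streak = self._streak + signal if same_direction else signal
        # Safeguard 1: act only after the crossing has held for several cycles.
        if abs(self._streak) < self.stable_cycles:
            return current
        self._streak = 0
        # Safeguard 2: bounded change per cycle.
        proposed = current + signal * self.max_step
        # Safeguard 3: min/max limits always enforced.
        return max(self.min_concurrency, min(self.max_concurrency, proposed))
```

With these defaults, a signal that flips between +1 and -1 on alternate cycles never changes target_concurrency, which is the oscillation the stability requirement is meant to prevent.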

Asymmetric Philosophy

  • Horizontal scale-up: aggressive (react to spikes in seconds)
  • Horizontal scale-down: conservative (wait ~5 minutes)
  • Vertical concurrency reduction: quick (protect latency)
  • Vertical concurrency increase: cautious (avoid oscillation)

"The cost of premature scale-down (a cold start at the worst possible moment) outweighs the cost of keeping a few idle replicas temporarily."

Performance

  • Scales from 10 to 10K QPS in under 60 seconds (depending on model load time)
  • Customers reported up to 5× reduction in queueing and 429s during spikes with the aggressive scale-up policy.

Relationship to Other Databricks Autoscalers

  • systems/databricks-serverless-autoscaler: two-axis autoscaler for Spark compute (horizontal, plus vertical via OOM-aware VM restart). APA is the model-serving sibling, with concurrency tuning as the vertical axis.
  • Model-units-based autoscaling (from the Axon/Dicer LLM router): operates at the LLM-specific tier with MU utilization as the signal; APA is model-agnostic.
