DATABRICKS 2026-06-10

AI Serving Platform That Adapts to Your Model

Summary

Databricks describes the architecture of Custom Model Serving, its fully managed real-time inference platform for any model packaged in MLflow. The post focuses on how the platform eliminates the "ML Stack Tax" (the operational cost of manually tuning serving infrastructure per model) through three structural mechanisms: a short isolated request path, automatic runtime selection, and an AutoPilot Pod Autoscaler (APA) that adapts to both the model's resource profile and its traffic simultaneously. The platform sustains 300K+ QPS at low latency across a wide variety of models, from 2 MB scikit-learn classifiers on a single CPU core to fine-tuned 70B LLMs on eight GPUs, without any customer-facing tuning knobs.

Key Takeaways

  1. The ML Stack Tax is structural: every new model and traffic shift requires re-profiling and re-tuning replica count, per-replica concurrency, and autoscaling thresholds. At scale this becomes a dedicated team whose "whole job is keeping models alive." The platform eliminates this by making the infrastructure adapt to the model, not the reverse.

  2. Short, isolated request path: every endpoint is a fully isolated Kubernetes deployment with its own pods and model-version-specific container image. One endpoint's failures cannot affect another's. The path is PoP proxy → shared load balancer → pod, deliberately minimal for latency.

  3. Automatic runtime selection: inside each pod, the model runs on the inference engine best suited to its type — async Gunicorn MLflow server for classic ML, and GPU-optimized engines (vLLM, Triton, or customer-provided) for large models — behind one uniform serving interface.

  4. Two-axis autoscaling (APA): the central innovation. Traditional autoscalers are either request-based (fast but wasteful) or resource-based (efficient but slow). APA combines both:

     • Horizontal scaling reacts to requests (active concurrent requests per endpoint): it adds or removes replicas the moment demand shifts. It decides every 5 seconds based on the last 20 seconds of traffic.
     • Model-aware vertical scaling reacts to model characteristics: it periodically adjusts target_concurrency (how many concurrent requests each pod accepts) based on hardware metrics (CPU/GPU utilization, memory, I/O wait), the latency/queue-depth profile, and GPU-specific metrics (memory bandwidth, FP16/BF16 FLOPS utilization). It decides every 30 seconds using historical metrics.

  5. The two axes are coupled: the target_concurrency value produced by vertical scaling is the denominator of the horizontal scaling formula. Horizontal ensures availability; vertical ensures efficiency. Together they avoid the "fast-but-wasteful vs efficient-but-slow" false choice.

  6. Asymmetric scale-up/scale-down (aggressive/conservative):

     • Scale-up is aggressive: it scrapes incoming requests every 1 second and decides every 5 seconds based on the past 20 seconds. It can go from 10 to 10K QPS in under 60 seconds (depending on model load time).
     • Scale-down is conservative: it also decides every 5 seconds, but considers traffic over the last ~5 minutes before removing replicas.
     • Rationale: "The cost of premature scale-down (a cold start at the worst possible moment) outweighs the cost of keeping a few idle replicas temporarily."

  7. Vertical concurrency direction is also asymmetric: quick to reduce concurrency when a pod shows stress (protect latency), slow to increase it (avoid oscillation). This loop runs on a 30-second interval, slower than the 5-second horizontal loop.

  8. Cold-start mitigation via warm node pools: a predictive algorithm maintains pre-provisioned nodes with the base runtime image already pulled. When APA adds a replica, it picks a node from this pool, so only the model download remains. Databricks doesn't charge for warm-pool capacity.

  9. Fast model download: model containers are stored in a hot cache layer and pulled in parallel chunks. Config-only changes (metadata, routing rules) apply without a pod restart.

  10. Provisioned concurrency: for latency-critical endpoints, a minimum concurrency floor keeps pods fully warm and loaded.

  11. Zero-downtime updates: all new-version pods are up and ready before traffic moves off old pods.
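
The asymmetric horizontal behaviour described above can be sketched in a few lines. The post does not publish the scaling formula, so everything here is an assumption for illustration: the `HorizontalScaler` class, averaging samples within each window, dividing demand by `target_concurrency` and rounding up, and taking the larger of the short-window and long-window demand so that replicas rise quickly but fall slowly. Only the 20-second and roughly 5-minute lookbacks come from the post.

```python
import math
from collections import deque

SCALE_UP_WINDOW_S = 20     # from the post: scale-up looks at the last 20 seconds
SCALE_DOWN_WINDOW_S = 300  # from the post: scale-down considers roughly 5 minutes


class HorizontalScaler:
    """Hypothetical sketch of an asymmetric request-based scaler.

    Not Databricks' implementation; the aggregation and the max() rule
    below are assumptions chosen to reproduce the described behaviour.
    """

    def __init__(self, target_concurrency: float, min_replicas: int = 1):
        self.target_concurrency = target_concurrency
        self.min_replicas = min_replicas
        # (timestamp_s, active_concurrent_requests), one entry per scrape
        self.samples = deque()

    def record(self, now_s: float, active_requests: float) -> None:
        """Store one scrape and drop samples older than the long window."""
        self.samples.append((now_s, active_requests))
        while self.samples and self.samples[0][0] <= now_s - SCALE_DOWN_WINDOW_S:
            self.samples.popleft()

    def _window_mean(self, now_s: float, window_s: float) -> float:
        values = [v for t, v in self.samples if t > now_s - window_s]
        return sum(values) / len(values) if values else 0.0

    def desired_replicas(self, now_s: float) -> int:
        """Replica count for one decision tick (every 5 seconds in the post)."""
        short = self._window_mean(now_s, SCALE_UP_WINDOW_S)
        long = self._window_mean(now_s, SCALE_DOWN_WINDOW_S)
        # A spike shows up in the short window at once (fast scale-up);
        # a drop must also pull down the long window (slow scale-down).
        demand = max(short, long)
        return max(self.min_replicas, math.ceil(demand / self.target_concurrency))
```

With one sample per second, a 20-second burst raises the replica count as soon as the short window fills, while the count only returns to its baseline once the burst has aged out of the long window.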

Operational Numbers

| Metric | Value |
| --- | --- |
| Peak QPS across the platform | 300K+ |
| Scale-up speed (10 → 10K QPS) | < 60 seconds |
| Cost savings vs DIY (some customers) | 90%+ |
| Latency improvement (p99 & p50) | up to 2× |
| Customer scale (some customers) | 100K+ QPS per endpoint |
| Availability | 99.99% |
| Horizontal scaling decision interval | every 5 seconds |
| Horizontal scale-up lookback | 20 seconds |
| Horizontal scale-down lookback | ~5 minutes |
| Vertical scaling decision interval | every 30 seconds |
| Request scrape interval | every 1 second |

Architecture Diagram (Conceptual)

Request → PoP Proxy → Auth → Shared LB → [Isolated K8s Deployment]
                                              ├── Pod 1 (model v2)
                                              ├── Pod 2 (model v2)
                                              └── ...
                                          [Observability Sidecar]

AutoPilot Pod Autoscaler (APA):
  ┌─────────────────┐     ┌──────────────────────┐
  │ Horizontal axis │ ←── │ target_concurrency    │ ←── Vertical axis
  │ (5s interval)   │     │ (from vertical tuning)│     (30s interval)
  │ Watches: active │     └──────────────────────┘     Watches: CPU/GPU
  │ concurrency     │                                   utilization,
  │ + queue depth   │                                   latency, queue,
  └─────────────────┘                                   memory bandwidth
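
To make the coupling in the diagram concrete, here is a minimal sketch of two control loops sharing one `target_concurrency` value. It is a simplification under stated assumptions: `replicas_needed`, `run_control_loops`, and the `tune_concurrency` callback are hypothetical names, the horizontal step uses the instantaneous request count rather than a lookback window, and the ceiling-division formula is inferred from the post's description of `target_concurrency` as a denominator. Only the 5-second and 30-second cadences come from the post.

```python
import math

HORIZONTAL_INTERVAL_S = 5   # from the post: horizontal decision cadence
VERTICAL_INTERVAL_S = 30    # from the post: vertical decision cadence


def replicas_needed(active_requests: float, target_concurrency: float) -> int:
    """Assumed formula: demand divided by per-pod concurrency, rounded up."""
    return max(1, math.ceil(active_requests / target_concurrency))


def run_control_loops(active_by_second, tune_concurrency, initial_target):
    """Simulate both loops over a per-second trace of active requests.

    tune_concurrency is a hypothetical callback (current_target, second)
    returning the new per-pod target; it stands in for the vertical axis.
    Returns a list of (second, target_concurrency, replicas) decisions.
    """
    target = initial_target
    decisions = []
    for second, active in enumerate(active_by_second, start=1):
        # The slow loop runs first, so the fast loop sees the updated target.
        if second % VERTICAL_INTERVAL_S == 0:
            target = tune_concurrency(target, second)
        if second % HORIZONTAL_INTERVAL_S == 0:
            decisions.append((second, target, replicas_needed(active, target)))
    return decisions
```

In this sketch, halving `target_concurrency` at constant traffic doubles the replica count on the next horizontal tick, which is the sense in which the vertical axis steers the horizontal one.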

Safeguards Against Metric Noise (Vertical Scaling)

  1. Concurrency adjusted only when a metric crosses a stable threshold (tuned per metric).
  2. Maximum change in concurrency capped per decision cycle.
  3. Min/max concurrency limits always enforced.
  4. Vertical cadence (30s) deliberately slower than horizontal (5s) — relies on historical metrics, not instantaneous traffic.
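
The safeguards above might combine as in the following sketch. The post names the safeguards but not their values or exact logic, so the `VerticalLimits` fields, the threshold numbers, the use of a single utilization metric, and the reading of "stable" as "sustained over several consecutive samples" are all assumptions. The larger step down than up mirrors the post's note that concurrency is reduced quickly and increased slowly.

```python
from dataclasses import dataclass


@dataclass
class VerticalLimits:
    """Hypothetical tuning values; none of these numbers come from the post."""
    min_concurrency: int = 1
    max_concurrency: int = 64
    max_step: int = 2                 # cap on change per decision cycle
    stress_threshold: float = 0.85    # utilization treated as pod stress
    headroom_threshold: float = 0.50  # utilization treated as spare capacity
    stable_samples: int = 3           # samples that must all agree


def next_target_concurrency(current, utilization_history, limits):
    """Return the per-pod concurrency target for the next cycle.

    utilization_history is a list of recent utilization readings in [0, 1],
    oldest first. A change is made only when every one of the most recent
    readings is on the same side of a threshold, which filters out noise.
    """
    recent = utilization_history[-limits.stable_samples:]
    if len(recent) < limits.stable_samples:
        return current  # not enough history to call the signal stable
    if all(u > limits.stress_threshold for u in recent):
        proposed = current - limits.max_step  # reduce quickly under stress
    elif all(u < limits.headroom_threshold for u in recent):
        proposed = current + 1                # increase slowly
    else:
        proposed = current                    # mixed signal: hold steady
    # Min/max limits are always enforced.
    return max(limits.min_concurrency, min(limits.max_concurrency, proposed))
```

A real system would presumably evaluate several metrics with per-metric thresholds, as the first safeguard states; a single metric keeps the sketch short.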

Caveats

  • Tier-3 source (Databricks Blog); product-engineering post with self-reported metrics.
  • "300K+ QPS" is a platform-aggregate figure and "90%+ cost savings" is a best-case-customer figure; both are workload-specific.
  • No disclosure of the predictive algorithm for warm-pool sizing.
  • No mechanism-level disclosure of how "parallel chunks" model download works.
  • No per-model-type benchmarks (classic ML vs LLM) — all treated as "the platform handles it."
  • The relationship between the prior Databricks Model Serving architecture (EDS + P2C + request_concurrency, from the Superhuman post) and the Custom Model Serving architecture described here is not spelled out; they appear to be sibling platforms or successive stages of one platform.
  • The article ends with a hiring link, so some marketing framing is present.
