Databricks Custom Model Serving
Definition
Databricks Custom Model Serving is the fully managed real-time inference platform for any model packaged in MLflow. Unlike the foundation-model-serving path (which optimises deeply for known architectures), Custom Model Serving handles heterogeneous model types — from 2 MB scikit-learn classifiers on a single CPU core to fine-tuned 70B LLMs on eight GPUs — on the same platform, without customer-facing tuning knobs.
Architecture
Three structural properties:
- Short, isolated request path: every endpoint is a fully isolated Kubernetes deployment with its own pods and a model-version-specific container image. A request travels PoP proxy → auth → shared LB → pod, and each pod carries an observability sidecar.
- Automatic runtime selection: an async Gunicorn MLflow server for classic ML; GPU-optimised engines (vLLM, Triton, or customer-provided) for large models. Both sit behind one uniform serving interface.
- AutoPilot Pod Autoscaler (APA): a custom Kubernetes controller implementing two-axis autoscaling, horizontal (request-based) plus vertical (model-aware concurrency tuning). This is the heart of the platform.
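To make the two-axis idea concrete, here is a minimal sketch of one reconcile step. It is not the APA implementation, which the source does not describe at code level. The names `PodState`, `desired_replicas`, `tune_concurrency`, and `reconcile`, the 70% target utilisation, and the additive-increase / halve-on-breach rule are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PodState:
    replicas: int      # horizontal axis: number of pods
    concurrency: int   # vertical axis: per-pod concurrent request limit


def desired_replicas(in_flight: int, concurrency: int,
                     target_utilisation: float = 0.7,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Horizontal axis (assumed rule): enough pods that in-flight requests
    fit within the per-pod concurrency limit at the target utilisation."""
    needed = math.ceil(in_flight / (concurrency * target_utilisation))
    return max(min_replicas, min(max_replicas, needed))


def tune_concurrency(concurrency: int, p95_latency_ms: float,
                     latency_budget_ms: float, max_concurrency: int = 64) -> int:
    """Vertical axis (assumed rule): probe upward by one while latency holds,
    halve the limit when the latency budget is breached."""
    if p95_latency_ms > latency_budget_ms:
        return max(1, concurrency // 2)
    return min(max_concurrency, concurrency + 1)


def reconcile(state: PodState, in_flight: int,
              p95_latency_ms: float, latency_budget_ms: float) -> PodState:
    """One control-loop step: tune concurrency first, then size the fleet."""
    concurrency = tune_concurrency(state.concurrency, p95_latency_ms,
                                   latency_budget_ms)
    replicas = desired_replicas(in_flight, concurrency)
    return PodState(replicas=replicas, concurrency=concurrency)
```

The point of the sketch is the coupling between the axes: the per-pod concurrency limit discovered on the vertical axis feeds directly into how many pods the horizontal axis requests, so a model that tolerates more concurrent requests needs fewer replicas for the same load.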
Operational Envelope (2026-06-10 disclosure)
- 300K+ QPS across the platform at low latency.
- Scale-up from 10 to 10K QPS in under 60 seconds (dependent on model load time).
- 99.99% availability in production.
- Customer-reported: 90%+ cost savings vs DIY, up to 2× latency improvement, 100K+ QPS per endpoint.
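As a rough way to read these figures, the arithmetic below converts the availability target into a yearly downtime budget and the scale-up claim into a growth factor. The helper names are hypothetical, the year is taken as 365 days, and "10K" is read as 10,000; nothing here comes from the disclosure beyond the headline numbers.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float,
                            window_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of unavailability allowed within the window."""
    return (1.0 - availability) * window_minutes


def capacity_doublings(start_qps: float, end_qps: float) -> int:
    """Number of capacity doublings needed to grow from start_qps to end_qps."""
    return math.ceil(math.log2(end_qps / start_qps))


budget = downtime_budget_minutes(0.9999)   # about 52.56 minutes per year
growth = 10_000 / 10                       # 1000x increase in QPS
doublings = capacity_doublings(10, 10_000) # 10 doublings cover a 1000x ramp
```

Read this way, 99.99% availability leaves under an hour of downtime per year, and the 10 to 10K QPS ramp is a 1000-fold increase, roughly ten doublings of capacity inside the 60-second window.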
Relationship to Databricks Model Serving
Databricks Model Serving is the broader platform disclosed at architecture depth in the 2026-05-08 Superhuman post (EDS + P2C load balancer, FP8 quantisation, multiprocessing runtime) and the 2026-05-27 reliable-inference post (Axon/Dicer router, model-units cost abstraction). Custom Model Serving appears to be the heterogeneous-model face of the same platform — where the autoscaler must discover each model's resource profile at runtime rather than being pre-tuned for a known architecture.
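For readers unfamiliar with the P2C (power-of-two-choices) load balancing mentioned above, the general technique can be sketched as follows. This is the textbook algorithm, not Databricks' implementation; the `P2CBalancer` class and its in-flight counter are hypothetical names used only for illustration.

```python
import random


class P2CBalancer:
    """Power-of-two-choices: sample two backends at random and route to the
    one with fewer in-flight requests."""

    def __init__(self, backends, rng=None):
        if len(backends) < 2:
            raise ValueError("P2C needs at least two backends")
        self.in_flight = {backend: 0 for backend in backends}
        self._rng = rng or random.Random()

    def pick(self):
        """Choose a backend for a new request and count it as in flight."""
        a, b = self._rng.sample(list(self.in_flight), 2)
        chosen = a if self.in_flight[a] <= self.in_flight[b] else b
        self.in_flight[chosen] += 1
        return chosen

    def release(self, backend):
        """Mark one request on the backend as finished."""
        self.in_flight[backend] -= 1
```

Comparing only two random candidates avoids scanning every backend on each request while still steering traffic away from the busiest pods, which is why the technique is common in high-QPS routers.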
Seen in
- sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — First architecture-level disclosure.