Databricks Custom Model Serving¶

Definition¶

Databricks Custom Model Serving is the fully managed real-time inference platform for any model packaged in MLflow. Unlike the foundation-model-serving path (which optimises deeply for known architectures), Custom Model Serving handles heterogeneous model types — from 2 MB scikit-learn classifiers on a single CPU core to fine-tuned 70B LLMs on eight GPUs — on the same platform, without customer-facing tuning knobs.
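To make "real-time inference" concrete, here is a minimal sketch of how a client might build a scoring request for a served model. The source does not describe the request format; the URL path and the `dataframe_split` payload follow the publicly documented Databricks serving REST convention, and the host, endpoint name, token, and column names below are hypothetical placeholders. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint, token, columns, rows):
    """Build (but do not send) a scoring request for a serving endpoint.

    Assumption: the endpoint accepts the `dataframe_split` input format,
    i.e. column names plus a list of rows.
    """
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        url=f"https://{host}/serving-endpoints/{endpoint}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace host, endpoint name, and feature columns.
request = build_invocation_request(
    host="example.cloud.databricks.com",
    endpoint="churn-classifier",
    token="TOKEN",
    columns=["age", "plan"],
    rows=[[42, "pro"]],
)
```

Passing `request` to `urllib.request.urlopen` would send it. The point of the sketch is that the client-side shape stays the same whether the model behind the endpoint is a small classifier or a large LLM.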

Architecture¶

Three structural properties:

Short, isolated request path: every endpoint is a fully isolated Kubernetes deployment with its own pods and a model-version-specific container image. A request travels PoP proxy → auth → shared LB → pod, and each pod runs an observability sidecar.
Automatic runtime selection: an async Gunicorn MLflow server for classic ML; GPU-optimised engines (vLLM, Triton, or customer-provided) for large models. All runtimes sit behind one uniform serving interface.
AutoPilot Pod Autoscaler (APA): a custom Kubernetes controller implementing two-axis autoscaling, horizontal (request-based) plus vertical (model-aware concurrency tuning). This is the heart of the platform.
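The two autoscaling axes can be illustrated with a small sketch. This is not APA's actual algorithm, which the source does not disclose. It assumes the horizontal axis sizes the replica count from in-flight requests, and the vertical axis adjusts per-pod concurrency with a simple additive-increase, multiplicative-decrease rule against a latency target. All names, limits, and the rule itself are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PodProfile:
    concurrency: int       # in-flight requests each pod currently accepts
    p95_latency_ms: float  # latency observed at that concurrency


def desired_replicas(in_flight, concurrency_per_pod, min_replicas=1, max_replicas=100):
    """Horizontal axis: enough pods to absorb the current in-flight load."""
    needed = math.ceil(in_flight / concurrency_per_pod)
    return max(min_replicas, min(max_replicas, needed))


def tune_concurrency(profile, latency_target_ms, max_concurrency=64):
    """Vertical axis (assumed rule): probe upward by one while latency is
    within target, halve concurrency once the target is exceeded."""
    if profile.p95_latency_ms <= latency_target_ms:
        return min(max_concurrency, profile.concurrency + 1)
    return max(1, profile.concurrency // 2)


replicas = desired_replicas(in_flight=250, concurrency_per_pod=8)      # ceil(250 / 8) = 32
raised = tune_concurrency(PodProfile(8, 40.0), latency_target_ms=50)   # under target: 9
lowered = tune_concurrency(PodProfile(8, 80.0), latency_target_ms=50)  # over target: 4
```

The two axes interact: a model that tolerates higher concurrency per pod needs fewer replicas for the same load, which is why the concurrency setting has to be discovered per model rather than fixed platform-wide.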

Operational Envelope (2026-06-10 disclosure)¶

300K+ QPS across the platform at low latency.
10 → 10K QPS in <60 seconds (model-load-time dependent).
99.99% availability in production.
Customer-reported: 90%+ cost savings vs DIY, up to 2× latency improvement, 100K+ QPS per endpoint.
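Two of these figures are easier to read after a little arithmetic. The sketch below converts 99.99% availability into a yearly downtime budget, and the 10 → 10K QPS ramp into an implied capacity doubling time. The second calculation assumes a steady exponential ramp over the full 60 seconds, which the source does not claim; it is only a way to get a feel for the scale-up rate.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability):
    """Minutes per year an endpoint may be unavailable at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def doubling_time_seconds(start_qps, end_qps, ramp_seconds):
    """Time per capacity doubling, assuming a steady exponential ramp."""
    doublings = math.log2(end_qps / start_qps)
    return ramp_seconds / doublings


budget = downtime_budget_minutes(0.9999)         # about 52.6 minutes per year
doubling = doubling_time_seconds(10, 10_000, 60)  # about 6.0 seconds
```

Under that assumption, a 1000× increase in under a minute means serving capacity doubles roughly every six seconds or faster, which is why the figure is qualified as dependent on model load time.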

Relationship to Databricks Model Serving¶

Databricks Model Serving is the broader platform disclosed at architecture depth in the 2026-05-08 Superhuman post (EDS + P2C load balancer, FP8 quantisation, multiprocessing runtime) and the 2026-05-27 reliable-inference post (Axon/Dicer router, model-units cost abstraction). Custom Model Serving appears to be the heterogeneous-model face of the same platform — where the autoscaler must discover each model's resource profile at runtime rather than being pre-tuned for a known architecture.
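For readers unfamiliar with the load-balancing term, the following is a minimal sketch of the idea, assuming "P2C" here refers to the standard power-of-two-choices technique. It is a generic illustration and not the Databricks implementation; the pod names and the in-flight load metric are hypothetical.

```python
import random


def pick_backend(in_flight, rng):
    """Power of two choices: sample two distinct backends at random and
    route to whichever currently has fewer in-flight requests."""
    if not in_flight:
        raise ValueError("no backends available")
    if len(in_flight) == 1:
        return next(iter(in_flight))
    first, second = rng.sample(sorted(in_flight), 2)
    return first if in_flight[first] <= in_flight[second] else second


rng = random.Random(0)  # seeded only to make the sketch reproducible
loads = {"pod-a": 5, "pod-b": 1, "pod-c": 9}
choices = [pick_backend(loads, rng) for _ in range(100)]
```

Because the busiest backend loses every pairwise comparison, `pod-c` is never chosen here, yet the balancer only ever inspects two backends per request instead of scanning the whole fleet.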

Seen in¶

sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model — First architecture-level disclosure.