SYSTEM Cited by 1 source

Databricks Custom Model Serving

Definition

Databricks Custom Model Serving is the fully managed real-time inference platform for any model packaged in MLflow. Unlike the foundation-model-serving path (which optimises deeply for known architectures), Custom Model Serving handles heterogeneous model types — from 2 MB scikit-learn classifiers on a single CPU core to fine-tuned 70B LLMs on eight GPUs — on the same platform, without customer-facing tuning knobs.

Architecture

Three structural properties:

  1. Short, isolated request path: every endpoint is a fully isolated Kubernetes deployment with its own pods and a model-version-specific container image. A request travels PoP proxy → auth → shared load balancer (LB) → pod, and each pod runs an observability sidecar.

  2. Automatic runtime selection: the platform picks an async Gunicorn MLflow server for classic ML models and a GPU-optimised engine (vLLM, Triton, or customer-provided) for large models, all behind one uniform serving interface.

  3. AutoPilot Pod Autoscaler (APA): a custom Kubernetes controller that implements two-axis autoscaling, horizontal (request-based) and vertical (model-aware concurrency tuning). This is the heart of the platform.
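
The runtime-selection and two-axis autoscaling ideas above can be sketched in a few lines of Python. This is an illustrative sketch only, not APA's actual logic: the `ModelProfile` fields, the runtime labels, the latency target, and the additive-increase / multiplicative-decrease rule for concurrency are all assumptions introduced here, since the source does not disclose how APA decides.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical description of a packaged model (not a Databricks type)."""
    size_gb: float
    needs_gpu: bool


def select_runtime(profile: ModelProfile) -> str:
    """Pick a runtime label; the GPU/CPU rule is an assumed simplification."""
    return "gpu-engine" if profile.needs_gpu else "gunicorn-mlflow"


def desired_replicas(in_flight: float, per_pod_concurrency: int,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Horizontal axis: enough pods to hold the current in-flight requests."""
    needed = math.ceil(in_flight / per_pod_concurrency)
    return max(min_replicas, min(max_replicas, needed))


def tune_concurrency(current: int, p95_latency_ms: float, target_ms: float,
                     floor: int = 1, ceiling: int = 64) -> int:
    """Vertical axis: probe per-pod concurrency against a latency target.

    Assumed rule: add one slot while latency has headroom, halve on a breach.
    """
    if p95_latency_ms > target_ms:
        proposed = current // 2  # multiplicative decrease
    else:
        proposed = current + 1   # additive increase
    return max(floor, min(ceiling, proposed))


if __name__ == "__main__":
    small = ModelProfile(size_gb=0.002, needs_gpu=False)
    print(select_runtime(small))                        # gunicorn-mlflow
    print(desired_replicas(250, per_pod_concurrency=8))  # 32
    print(tune_concurrency(8, p95_latency_ms=120, target_ms=100))  # 4
```

The two functions are deliberately independent: the vertical step finds how many concurrent requests one pod of a given model can sustain, and the horizontal step then divides observed load by that number. A real controller would also need smoothing and cooldown windows, which are omitted here.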

Operational Envelope (2026-06-10 disclosure)

  • 300K+ QPS across the platform at low latency.
  • Scale-up from 10 to 10K QPS in under 60 seconds (depending on model load time).
  • 99.99% availability in production.
  • Customer-reported: 90%+ cost savings versus do-it-yourself (DIY) serving, up to 2× latency improvement, and 100K+ QPS per endpoint.
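
To put two of these figures in perspective, the short calculation below converts the availability figure into a yearly downtime budget and the scale-up figure into a growth factor. It is plain arithmetic on the numbers quoted above; the function names are hypothetical, and the 365-day year is a simplifying assumption.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float,
                            window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of unavailability permitted by an availability fraction."""
    return (1.0 - availability) * window_minutes


def growth_factor(start_qps: float, end_qps: float) -> float:
    """How many times larger the final load is than the starting load."""
    return end_qps / start_qps


if __name__ == "__main__":
    # 99.99% availability leaves roughly 52.56 minutes of downtime per year.
    print(round(downtime_budget_minutes(0.9999), 2))
    # Going from 10 to 10K QPS is a 1000x increase, just under ten doublings.
    factor = growth_factor(10, 10_000)
    print(factor, round(math.log2(factor), 2))
```

Read this way, the scale-up claim means capacity has to roughly double ten times within a minute, which is why the source qualifies it as dependent on model load time.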

Relationship to Databricks Model Serving

Databricks Model Serving is the broader platform, described in architectural depth in the 2026-05-08 Superhuman post (EDS + P2C load balancer, FP8 quantisation, multiprocessing runtime) and the 2026-05-27 reliable-inference post (Axon/Dicer router, model-units cost abstraction). Custom Model Serving appears to be the heterogeneous-model face of the same platform: here the autoscaler must discover each model's resource profile at runtime rather than being pre-tuned for a known architecture.

