SYSTEM Cited by 5 sources

Databricks Model Serving

Databricks Model Serving is the managed real-time inference product on the Databricks Data Intelligence Platform. The 2026-05-08 joint Databricks / Superhuman post is the wiki's first canonical disclosure of the platform's internal architecture: specifically, the joint-engineering work required to host Superhuman's grammar-correction LLM on this platform at a peak of 200,000+ QPS, sub-1-second p99 latency, and 4-9's (99.99%) reliability.

Before this post, the wiki carried Databricks Model Serving only as a name-drop in product-marketing materials. After this post it is documented at platform-internals depth, in two layers: platform infrastructure (load balancing, autoscaling, container startup) and runtime engine (FP8 quantisation, multiprocessing, async scheduling).

Architecture (as disclosed)

The 2026-05-08 post canonicalises the platform as a two-layer co-engineered stack:

Platform layer — handles fleet-level scale

  • Endpoint Discovery Service (EDS) — lightweight control plane that watches the Kubernetes API for changes to Services and EndpointSlices. The same control plane Databricks built for intra-cluster load balancing (2025-10-01 Intelligent Kubernetes load balancing post), here promoted to the load-balancing substrate for managed external inference.
  • Custom Power-of-Two-Choices load balancer — drives client-side LB off EDS state. For each request, two candidate pods are sampled and traffic is routed to whichever has fewer active requests. Replaces the default Kubernetes round-robin that "degrades at higher QPS, with uneven request distribution creating hotspots that spike tail latency." Mitzenmacher's Power-of-Two-Choices is the cited algorithmic foundation. (See patterns/kubernetes-api-driven-custom-load-balancer.)
  • Autoscaler tracking request_concurrency averaged across pods, with per-pod concurrency targets "derived from benchmarking maximum sustainable RPS per replica." The scaling strategy is intentionally asymmetric: "scale-up is aggressive and responsive, while scale-down is conservative, to prevent the flapping that causes latency spikes." Tuned via "joint shadow testing between Superhuman and Databricks."
  • Container image acceleration: lazy-loading container filesystem backed by a block-device image format with 4 MB sectors. Adopted from Databricks' prior serverless-compute work ("Booting Databricks VMs 7× faster"). Cuts pod start time from "several minutes" to "a few seconds", letting the autoscaler add dozens of pods during a traffic ramp without users seeing latency spikes.
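
The Power-of-Two-Choices rule described in the load-balancer bullet above can be sketched in a few lines. This is a minimal illustration, not the Databricks implementation: the names `pick_pod` and `simulate` are hypothetical, the pod table is a plain dict standing in for EDS state, and the baseline is uniform random choice (a stand-in chosen for simplicity, not the round-robin setup the post criticises).

```python
import random


def pick_pod(active_requests, rng=random):
    """Sample two distinct pods and return the one with fewer active requests."""
    pods = list(active_requests)
    if len(pods) == 1:
        return pods[0]
    a, b = rng.sample(pods, 2)
    return a if active_requests[a] <= active_requests[b] else b


def simulate(num_pods=50, num_requests=5000, seed=0):
    """Compare the busiest pod under two-choice routing and uniform random routing.

    Simplification: requests never complete, so each counter is a running total.
    """
    rng = random.Random(seed)
    p2c = {f"pod-{i}": 0 for i in range(num_pods)}
    uniform = dict(p2c)
    for _ in range(num_requests):
        p2c[pick_pod(p2c, rng)] += 1
        uniform[rng.choice(list(uniform))] += 1
    return max(p2c.values()), max(uniform.values())


if __name__ == "__main__":
    busiest_p2c, busiest_uniform = simulate()
    print("busiest pod, two choices:", busiest_p2c)
    print("busiest pod, uniform random:", busiest_uniform)
```

Comparing only two sampled pods keeps the per-request cost constant while still steering traffic away from the most loaded pods, which is the property the post relies on to avoid tail-latency hotspots.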

Runtime layer — handles per-pod throughput

  • FP8 quantisation engine — supports attention projections (Q, K, V, output) and MLP projections on the FP8 path while leaving the KV-cache at higher precision. Single largest per-pod throughput win in the Superhuman migration ("up to 30% increase in per-pod QPS"). (See concepts/selective-fp8-quantization.)
  • Per-channel FP8 scaling kernels — separate scale factor per output channel of each linear layer, rather than the off-the-shelf per-tensor scaling. "Preserves dynamic range where it matters, keeps MLP-layer quantization error well below the threshold where it shows up in evals." (See concepts/per-channel-vs-per-tensor-fp8-scaling.)
  • Hybrid-precision serving engine: "designed… to support hybrid-precision inference from the start, so that if any layer group proved too quality-sensitive under quantization, we could keep it in higher precision without changing the overall serving architecture." The runtime ships with a flag to toggle attention quantisation on and off so customer ML teams can measure quality impact directly. (See patterns/toggleable-hybrid-precision-quantization.)
  • Multiprocessing RPC server — multiple CPU processes prepare and dispatch work to the GPU in parallel, eliminating the single-process serialisation bottleneck for the CPU-bound regime small fast LLMs hit. "Delivered another 20% additional throughput." (See patterns/multiprocessing-runtime-for-cpu-bound-serving.)
  • C++ tensor manipulation in the CUDA-graph decode step: "replaced Python-level tensor slicing, copying, and filling at the start of each CUDA graph decode step with a single C++ call. We also explored parallel strategies (ThreadPool, OpenMP) but single-threaded C++ was optimal due to CUDA synchronization overhead." Few-percentage-point gain.
  • Async CPU-GPU scheduler — CPU-side post-processing for batch N runs concurrently with the GPU forward pass for batch N+1. "Rather than finishing all post-processing for batch N before launching batch N+1, the scheduler dispatches N+1 immediately and handles N's post-processing in parallel." Few-percentage-point gain.
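
To make the per-channel versus per-tensor scaling distinction from the list above concrete, here is a small pure-Python simulation. It is a sketch under stated assumptions: the post does not say which FP8 format the kernels use, so E4M3 (largest finite value 448, 3 mantissa bits) is assumed; `round_e4m3` is a simplified rounding model with saturation and no NaN handling; and the weight matrix is invented so that one output channel is far smaller than the other.

```python
import math

E4M3_MAX = 448.0      # largest finite magnitude in the assumed E4M3 format
MIN_NORMAL_EXP = -6   # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_e4m3(x):
    """Round x to the nearest point on a simplified E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(math.floor(math.log2(mag)), MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def fake_quant(values, scale):
    """Quantise to the FP8 grid and immediately dequantise."""
    return [round_e4m3(v / scale) * scale for v in values]


def per_tensor(weights):
    """One scale factor shared by the whole matrix."""
    scale = (max(abs(v) for row in weights for v in row) / E4M3_MAX) or 1.0
    return [fake_quant(row, scale) for row in weights]


def per_channel(weights):
    """One scale factor per output channel (one row per channel here)."""
    return [fake_quant(row, (max(abs(v) for v in row) / E4M3_MAX) or 1.0)
            for row in weights]


def row_error(row, dequantised):
    return max(abs(a - b) for a, b in zip(row, dequantised))


# Hypothetical weights: channel 0 is large, channel 1 is tiny.
weights = [[100.0, -50.0, 25.0], [0.001, -0.0005, 0.00025]]
err_tensor = row_error(weights[1], per_tensor(weights)[1])
err_channel = row_error(weights[1], per_channel(weights)[1])
```

With a single shared scale, the small channel is pushed into the coarse subnormal end of the FP8 grid and loses most of its precision; a per-channel scale lets each row use the full range, which is one way to read the post's remark about preserving dynamic range.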

LLM-specific architecture (2026-05-27)

The 2026-05-27 Reliable LLM Inference at Scale post canonicalises a second, LLM-specific layer of the Databricks Model Serving stack that is structurally distinct from the EDS+P2C+request_concurrency shape disclosed for the Superhuman 200K-QPS workload. At a scale of 125T+ tokens/month across frontier open-source (Kimi, Qwen) and proprietary (OpenAI, Gemini, Claude) models, Databricks runs:

Data plane

  • Axon (named publicly for the first time) — the LLM data-plane router. Built on Dicer. Routes requests across replicas of the same model; load metric is server load measured in model units, not active-request count.
  • Stateful (sticky) sessions — workload requests pin to a Dicer-assigned subset of pods. Two purposes: (a) KV prefix-cache locality for coding agents and other latency-sensitive workloads; (b) bounded blast radius under failure. See patterns/stateful-llm-session-routing.
  • Inference runtime — open-source engines (vLLM and similar) and proprietary in-house engines on frontier GPUs.
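
The combination of sticky sessions and load measured in model units can be illustrated with a short sketch. Everything here is an assumption made for illustration: the post does not describe how Dicer assigns pods, so `assigned_subset` uses rendezvous hashing as a stand-in, and the function names, subset size, and MU figures are hypothetical.

```python
import hashlib


def assigned_subset(session_key, pods, subset_size):
    """Stand-in for a Dicer assignment: rank pods by a per-session hash
    (rendezvous hashing) and keep the first few."""
    def score(pod):
        return hashlib.sha256(f"{session_key}:{pod}".encode()).hexdigest()
    return sorted(pods, key=score)[:subset_size]


def route(session_key, load_mu, capacity_mu, subset_size=2):
    """Within the session's pinned subset, pick the pod with the lowest
    model-unit utilisation (load in MUs divided by capacity in MUs)."""
    subset = assigned_subset(session_key, sorted(load_mu), subset_size)
    return min(subset, key=lambda pod: load_mu[pod] / capacity_mu[pod])


if __name__ == "__main__":
    pods = ["pod-a", "pod-b", "pod-c", "pod-d"]
    load = {p: 10.0 for p in pods}
    capacity = {p: 100.0 for p in pods}
    print(assigned_subset("session-42", pods, 2))
    print(route("session-42", load, capacity))
```

Because a session only ever sees its own subset, repeated requests land where the KV prefix cache is likely warm, and a failing pod affects only the sessions pinned to it. Those are the two purposes the list above gives for sticky sessions.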

Control plane

  • Rate limiting at the data-plane edge (denominated in model units (MUs), not requests).
  • Capacity management allocating MU-budgets per workload — the unit-of-account that turns multi-tenant LLM serving into VM-equivalent guaranteed capacity instead of best-effort capacity. See concepts/multi-tenant-llm-capacity-allocation.
  • Autoscaler on model-unit utilisation ratio, averaged across pods. Scale-up when utilisation approaches the per-replica MU capacity; scale-down conservatively. The same control loop runs across every model deployment: "model-agnostic scaling infrastructure." >80% GPU savings vs static-peak provisioning on bursty workloads. See patterns/model-units-utilization-autoscaling.
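
The asymmetric behaviour of the autoscaler (fast scale-up, conservative scale-down) can be sketched as a small control loop. This is an illustration only: the class name, the 0.7 target utilisation, and the five-evaluation scale-down window are hypothetical values, and the "shrink only to the highest recent recommendation" rule is borrowed in spirit from the stabilisation window of the Kubernetes HorizontalPodAutoscaler rather than taken from the post.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AsymmetricAutoscaler:
    capacity_mu: float               # per-replica capacity in model units (assumed known)
    target_utilisation: float = 0.7  # hypothetical target
    scale_down_window: int = 5       # evaluations that must agree before shrinking
    replicas: int = 1
    _recent: list = field(default_factory=list)

    def desired(self, total_load_mu):
        """Replicas needed to keep utilisation at or below the target."""
        per_replica = self.capacity_mu * self.target_utilisation
        return max(1, math.ceil(total_load_mu / per_replica))

    def step(self, total_load_mu):
        want = self.desired(total_load_mu)
        self._recent = (self._recent + [want])[-self.scale_down_window:]
        if want > self.replicas:
            # Scale up immediately on the first evaluation that asks for more.
            self.replicas = want
        elif (len(self._recent) == self.scale_down_window
              and max(self._recent) < self.replicas):
            # Shrink only after a full window of lower demand, and only
            # down to the highest recommendation seen in that window.
            self.replicas = max(self._recent)
        return self.replicas


if __name__ == "__main__":
    scaler = AsymmetricAutoscaler(capacity_mu=100.0)
    for load in [650, 100, 100, 100, 100, 100]:
        print(load, "->", scaler.step(load))
```

A single burst raises the replica count at once, while a drop in load has to persist for the whole window before capacity is released, which is the kind of damping that prevents the flapping mentioned in the Superhuman post.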

Runtime reliability

  • Silent-hang detection via periodic minimal end-to-end black-box health checks; a failed check fails the Kubernetes liveness probe, which restarts the pod. <5-minute detect→kill→recover cycle.
  • Health-check priority scheduling — probes get the highest scheduling priority inside the engine, so they complete even under heavy load. False liveness-probe failures: several/week → zero.
  • Multimodal CPU bottleneck fixes: Torchvision over PIL image processor (10× preprocessing speedup) + OMP_NUM_THREADS fix (avoid container CPU throttling from thread oversubscription). Combined: >3× RPS jump on same hardware.
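
The OMP_NUM_THREADS fix above amounts to keeping the OpenMP thread pool no larger than the CPU quota the container is actually granted. The post does not say how the value is chosen, so the following is one possible sketch: it assumes a cgroup v2 container where the quota is exposed in `/sys/fs/cgroup/cpu.max`, and the helper names are hypothetical.

```python
import math
import os


def cpu_limit_from_cgroup(cpu_max_text):
    """Parse cgroup v2 cpu.max contents: '<quota> <period>' or 'max <period>'."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None  # no CPU quota configured
    return int(quota) / int(period)


def omp_threads(cpu_max_text, host_cpus):
    """Thread count that does not exceed the container's CPU quota."""
    limit = cpu_limit_from_cgroup(cpu_max_text)
    if limit is None:
        return host_cpus
    return max(1, min(host_cpus, math.floor(limit)))


def configure_omp(path="/sys/fs/cgroup/cpu.max"):
    """Set OMP_NUM_THREADS from the cgroup quota, falling back to the host count."""
    host_cpus = os.cpu_count() or 1
    try:
        with open(path) as f:
            threads = omp_threads(f.read(), host_cpus)
    except OSError:
        threads = host_cpus
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads
```

The variable has to be set before any library that initialises an OpenMP runtime is imported, since the runtime reads it at start-up. Without a cap, a pod limited to a few CPUs on a large host can spawn one thread per host core, and the resulting oversubscription is what leads to the throttling the list above describes.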

Why this layer is structurally distinct from the Superhuman face

The 2026-05-08 Superhuman face (EDS + P2C-with-active-requests + request_concurrency-averaged-across-pods autoscaler) is correct for CPU-style high-QPS small-model serving: Superhuman's grammar-correction LLM is small enough that the request-cost distribution is roughly uniform within the workload. The 2026-05-27 face is the LLM-specific layer: when the model is large enough and the request cost varies enough (long-context, multimodal, agentic), the cost-based routing + cost-based autoscaling primitives become structurally necessary. Databricks runs both faces; which one a deployment uses depends on the workload regime.

The post explicitly retires P2C-with-active-requests at LLM scale: "LLM latencies tend to be high, server counts are lower than scaled out CPU systems, and the cost of misrouting is severe. Therefore, LLM serving necessitates a different approach." See concepts/power-of-two-choices.

(Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale)

Operating envelope (Superhuman workload, disclosed)

Metric | Value
Peak QPS served | 200,000+
End-to-end p99 latency | < 1 second
Reliability target | 4 9's (99.99%)
Per-pod QPS, pre-optimisation (H100) | 750
Per-pod QPS, post-optimisation (H100) | 1,200 (+60%)
Hardware class (post-migration) | NVIDIA H100
Hardware class (pre-migration, customer's DIY stack) | NVIDIA L40S
Per-request shape (Superhuman) | ~50 input tokens + ~50 output tokens

The 200K QPS / sub-1s p99 / 4-9's claims apply to the Superhuman grammar-correction endpoint specifically, not Databricks Model Serving in general. The post is careful about this scope.
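
As a rough consistency check on the disclosed figures, the arithmetic below relates the per-pod numbers to the peak load and to the individual runtime gains. The pod counts are back-of-the-envelope lower bounds computed here, not numbers from the post, and they ignore headroom, redundancy, and autoscaling lag. Treating the 30% and 20% gains as multiplicative is also an assumption.

```python
import math

PEAK_QPS = 200_000
PRE_QPS_PER_POD = 750      # H100, before runtime optimisation
POST_QPS_PER_POD = 1_200   # H100, after runtime optimisation


def min_pods(peak_qps, per_pod_qps):
    """Smallest pod count whose combined throughput covers the peak."""
    return math.ceil(peak_qps / per_pod_qps)


pods_before = min_pods(PEAK_QPS, PRE_QPS_PER_POD)    # 267
pods_after = min_pods(PEAK_QPS, POST_QPS_PER_POD)    # 167

overall_gain = POST_QPS_PER_POD / PRE_QPS_PER_POD    # 1.6, i.e. +60%
fp8_and_multiprocessing = 1.30 * 1.20                # 1.56 if the gains compound
residual = overall_gain / fp8_and_multiprocessing    # about 1.026
```

Under these assumptions the FP8 and multiprocessing gains account for about 1.56× of the 1.6× total, leaving roughly 2.6% for the remaining optimisations, which is in line with the "few-percentage-point" gains attributed to the C++ decode-step change and the async scheduler.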

Division of responsibilities

The post canonicalises a clean platform/customer split:

"Using a managed inference service does not have to mean giving up control. Superhuman retains full ownership of model training, quantization, and quality standards, while Databricks maintains runtime performance and platform reliability. This division of responsibilities works well with shared SLOs, joint quality validation and progressive load testing when onboarding onto the Databricks platform."

Specifically:

  • Customer (Superhuman) owns: model training, FP8 prequantisation of the checkpoint (using vLLM's online quantisation library), quality-bar definition, internal eval harness.
  • Databricks owns: runtime engine, kernel implementation, serving infrastructure (LB, autoscaler, container runtime), platform reliability, joint shadow testing.
  • Shared: SLO definition (sub-1s p99, zero quality regression), joint quality validation, progressive load-test plan.

Seen in

  • sources/2026-05-13-databricks-the-rosetta-stone-of-cps-clarotys-ai-powered-library: Custom domain-specific embedding-endpoint face. Second canonical Model Serving face on the wiki: not just real-time LLM inference at 200K+ QPS sub-1s p99 (the Superhuman face) but the substrate for hosting domain-specific embedding models as custom endpoints when generic embeddings underspecify the domain. "To tackle the nuances of healthcare and OT, generic embeddings were insufficient for the level of precision we require. We identified that for the 'Universal Translator' to truly succeed, generic RAG architectures must evolve into domain-specific frameworks. We currently bridge this gap by deploying best-in-class medical embedding models as custom endpoints using Databricks Model Serving. However, as we look to the future, we see fine-tuning these models as the next logical step to ensure our agents understand the most obscure industrial dialects with deterministic accuracy." Two structural contributions: (a) Model Serving is positioned as a flexible custom-endpoint substrate, not just a managed-LLM endpoint; (b) the explicit fine-tuning roadmap names domain-specific fine-tuning as the next step to deterministic accuracy on long-tail industrial dialects. The Claroty source also surfaces the scale-to-zero gap for vector endpoints, a related but distinct cost-efficiency observation about Databricks vector search rather than Model Serving itself. Composes with patterns/hybrid-classical-er-plus-genai in the GenAI track of systems/claroty-cps-library.

  • sources/2026-05-27-databricks-reliable-llm-inference-at-scale: LLM-specific platform face. Third canonical Model Serving face after the Superhuman 200K-QPS face and the Claroty custom-embedding-endpoint face. Discloses Axon (the LLM data-plane router, named publicly for the first time), model units as the LLM load currency, cost-based routing on Dicer keyed on MUs, stateful (sticky) sessions for cache-locality + blast-radius, model-unit-utilisation autoscaling delivering >80% GPU savings on bursty workloads, prioritised black-box health checks for silent-hang detection (<5-min detect→recover, several-per-week → zero false positives), and the multimodal CPU bottleneck fixes (Torchvision over PIL, OMP_NUM_THREADS) delivering >3× RPS on same hardware. The post is the wiki's first canonical disclosure that P2C-with-active-requests is explicitly retired for LLM serving; the structural argument is high LLM latency × low server count × heavy misrouting cost.

Seen in (prior)

  • sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — first canonical wiki disclosure of Databricks Model Serving's internals at the platform-engineering (not product-marketing) altitude. Covers the EDS / P2C / autoscaler / image-acceleration platform layer and the FP8 / hybrid-precision / multiprocessing / async-scheduler runtime layer. Operating envelope: 200K+ QPS, sub-1s p99, 4-9's, 50/50-token shape, 750→1,200 QPS per H100 pod (+60%). Joint-engineering shadow testing and progressive load testing as the onboarding methodology. KV-cache quantisation explicitly off (quality vs throughput trade-off not worth pursuing for the Superhuman workload). Engine designed for hybrid-precision from the start so layer-group toggles ship as flags rather than architectural changes.

Prompt caching for foundation-model endpoints (2026-05-22)

Foundation-model endpoints on this platform, served via the FMAPI product layer, inherit the implicit prompt-caching capability rolled out 2026-05-22 to the open-weights model catalog (GPT-OSS, Gemma 3, Llama 3.1 / 3.3, PEFT-served fine-tuned variants). The caching sits at the FMAPI layer, with platform-substrate properties ("prompt caches are isolated, only reside in volatile memory and are never persisted"), and applies to batch-inference, pay-per-token, and provisioned-throughput workloads. Disclosed numbers from the GPT-OSS production batch-inference rollout: +2.5× per-replica input-token throughput, 3× P50 latency reduction at 30% cache hit ratio (Source: sources/2026-05-22-databricks-accelerating-llm-inference-with-prompt-caching-for-open-source-models).

Custom Model Serving (2026-06-10)

The 2026-06-10 AI Serving Platform That Adapts to Your Model post discloses the heterogeneous-model face of the Databricks serving platform — Custom Model Serving — handling any MLflow model (2 MB scikit-learn to 70B LLMs) via the AutoPilot Pod Autoscaler (APA), a custom Kubernetes controller implementing two-axis autoscaling (horizontal on active concurrency + vertical on model-aware concurrency tuning). APA discovers each model's optimal concurrency limit at runtime from hardware metrics and latency signals, rather than requiring benchmarking. Operating envelope: 300K+ QPS, 99.99% availability, 10→10K QPS in <60s. Cold-start mitigation via warm node pools, parallel model download, and provisioned concurrency. (Source: sources/2026-06-10-databricks-ai-serving-platform-that-adapts-to-your-model)
