Skip to content

DATABRICKS 2026-05-27 Tier 3

Read original ↗

Databricks — Reliable LLM Inference at Scale

Summary

Databricks' inference platform engineering team (Marius Seritan, Cyrielle Simeone, Andy Zhang, Yu Zhang, Nick Lanham) discloses the production architecture behind a multi-tenant LLM-serving substrate processing 125T+ tokens/month on top of frontier GPUs. The post canonicalises two structurally novel building blocks: an LLM-specific request-cost abstraction (model units) that replaces request-count load metrics across both load balancing and autoscaling, and a runtime-reliability layer built on prioritised black-box health checks plus surgical multimodal-pipeline fixes. The team explicitly retires P2C-with-active-requests as inadequate at LLM scale and migrates to load-aware routing on Dicer keyed on model units, with stateful sessions for prefix-cache locality and bounded blast radius. A new system name surfaces — Axon, the inference data-plane router — making this the wiki's first canonical disclosure of the named LLM router that sits between the rate-limiter and the inference runtime. Operating envelope: 125T+ tokens/mo, spiky-within-hours traffic, >80% GPU savings vs static-peak provisioning for bursty workloads, 5-minute full silent-hang detect-and-recover cycle, >3× RPS-per-server jump after multimodal CPU-bottleneck fixes, false liveness probe failures: several/week → zero after prioritising health-check scheduling.

Key takeaways

  1. Frontier GPUs are structurally less reliable than CPU systems under multi-tenant LLM load — and the standard distributed-systems tricks don't work. "Frontier performance requires the latest GPUs with high bandwidth interconnect for KV cache transfer. These compute setups are fundamentally less reliable than classical CPU systems, and they are expensive… all-to-all communication is required, [so] a single node's downtime requires reconfiguration for multiple other nodes in disaggregated prefill/decode setups. The highest bandwidth networking requires single-spine connectivity in a single physical rack (e.g. NVL72 systems). This means failures in specific systems within a single datacenter rack can create a wide-blast-radius outage. Standard tricks in distributed systems like multi-AZ or leveraging backup instance types mean keeping expensive backup GPUs idling, a cost-prohibitive option. Overprovisioning is another classic trick, but given compute supply is so constrained, it's extremely expensive and impractical. Thus, systems must remain operational under heavy strain." The standard reliability playbook (multi-AZ + idle backup + overprovisioning) is structurally infeasible on frontier GPUs.

  2. Latency-and-reliability are coupled, not orthogonal — because server health depends on the request mix. "Even healthy servers under heavier load process all requests more slowly, exposing a tradeoff between throughput (and thus cost efficiency) and the fastest latency that products need to handle. This can also manifest as a reliability problem, since servers can unexpectedly enter unhealthy states very quickly based on the mix of requests assigned to them." This is the load-bearing argument for load-aware routing in model units instead of request count — request shape changes server health.

  3. Model units replace request-count as the LLM load currency. "If we project that a replica can process a fixed number of model units per minute (e.g., 100), we can make the following assumptions: Requests with long input or output consume more model units, since fewer can complete in the same time window. Prefill and decode have different throughput characteristics, so requests with long output cost more than those with long input." Request cost is modelled as a multi-dimensional function cost ≈ α·input + β·output + γ·... with β > α (decode dominates), coefficients determined by automated benchmarking for each model on each hardware type. Adjusted for prefix caching and multi-modality. See concepts/model-units.

  4. The reason model units exist is multi-tenant capacity guarantees — not just routing accuracy. "Such estimations are structurally imperfect, but they serve as a way for us to break a multi-tenant system into something more manageable that resembles cloud VMs. VMs have the desirable property of offering predictable performance that can be allocated to specific customers. For production agentic workloads, it's important to offer guarantees around low latency and capacity, and without such allocation systems, the best we can do is offer 'best-effort' capacity that could be clawed back if too many customers use the system." Model units are the unit-of-account that lets a multi-tenant LLM platform sell predictable-VM-like capacity instead of best-effort. See concepts/multi-tenant-llm-capacity-allocation.

  5. Power-of-Two-Choices with active-request-counts is inadequate at LLM scale. "In general, load balancing tends to lean on statistical approaches like P2C (power of two choices), which estimate load based on queue size and leverage sampling to reduce the memory and latency overheads of understanding all the possible targets. However, LLM latencies tend to be high, server counts are lower than scaled out CPU systems, and the cost of misrouting is severe. Therefore, LLM serving necessitates a different approach." The wiki's first canonical retire-P2C-for-LLM datum, and the structural argument why: high latency × low server count × heavy misrouting cost. Replaced by Dicer-based dynamic routing keyed on model units, not active request count.

  6. The router is named: Axon. "To handle traffic across model deployments, the data plane runs a router, which we call Axon, that balances load among replicas of the same model, and an autoscaler that adjusts replica counts." First canonical wiki disclosure of the named LLM data-plane router. See systems/databricks-axon.

  7. Dicer + model units = cost-based routing AND stateful sessions. "We integrated model units with Dicer so that routing decisions are based on server load in model units rather than traditional request-based heuristics. Dicer also provides stateful sessions, making request routing sticky. A workload's requests go to only a subset of servers, which improves cache hit rates (crucial for latency-sensitive workloads like coding agents) and limits blast radius." Sticky routing serves two purposes simultaneously: KV prefix-cache locality (latency win) AND blast-radius bounding (reliability win). See patterns/cost-based-load-balancing-llm and patterns/stateful-llm-session-routing.

  8. Model unit utilisation ratio is the autoscaling signal — and it's model-agnostic. "Using model units, our autoscaler can decide whether to scale up or down based on the model unit utilization ratio. When the inference engine is running close to some percent of its maximum model units (determined by hardware type and workload shape), it's approaching peak throughput, which triggers scale-up. The reverse triggers scale-down. Rather than manually adjusting auto-scaling rules for each model, this approach allows for model-agnostic scaling infrastructure." The same autoscaling control loop runs across every served model — one infra knob, many models. See concepts/model-unit-utilization-ratio and patterns/model-units-utilization-autoscaling.

  9. Bursty-traffic autoscaling on model units captures >80% GPU savings vs static-peak. "Building autoscaling on top of LLM inference patterns saved us from always scaling to max replicas. For models with bursty traffic, autoscaling kept replica counts close to actual demand, translating to over 80% GPU savings compared to static provisioning at peak." Quantitative datum on the cost trajectory of LLM-tuned autoscaling vs naive provision-for-peak — applies to the spiky-within-hours workload shape disclosed in Figure 1.

  10. Silent hangs are a real LLM-server failure mode caught by prioritised black-box health checks. "One failure mode we encounter is silent hangs. Requests involving edge cases (structured output, multimodal inputs) can trigger unhandled errors in the multi-process architecture of inference engines, causing servers to stop responding without surfacing errors. We detect this with periodic black-box health checks: minimal end-to-end requests sent when no real requests have completed recently. If a health check fails, the Kubernetes liveness probe restarts the server." See concepts/silent-hang-llm-server.

  11. Health-check priority scheduling is the load-bearing fix against priority-inversion liveness-probe storms. "However, under high load, health checks themselves can time out, causing the liveness probe to kill servers that are actually healthy. This risks cascading failures. To solve this, we assign health check requests the highest scheduling priority, ensuring they complete even under heavy load. With prioritized health checks, the full cycle of detecting a hang, killing the unhealthy server, and recovering takes less than 5 minutes. False liveness probe failures dropped from several per week to zero." This is the canonical solution to the liveness-probe-as-DDoS-amplifier failure mode in inference engines. See patterns/prioritized-black-box-health-check.

  12. Multimodal request handling has a CPU bottleneck distinct from the GPU pipeline — and it's nearly invisible without profiling. "When large batches of multimodal requests arrived, we saw spikes in error rates and timeouts from a completely different source. Investigations revealed that requests weren't even reaching the inference engine's core processes. Serving image requests is more resource-expensive than text-only requests, not just from the additional vision encoder running on GPUs, but also from CPU-intensive image processing. For certain models, the image processing was extremely slow, blocking the event loop entirely." See concepts/multimodal-cpu-bottleneck.

  13. PIL vs Torchvision image processor: 10× CPU difference, default matters. "Among all CPU operations for images, image processing (resizing and normalization) is 10x slower than other operations like base64 decoding. Some Hugging Face models default to the PIL-based image processor, while others use the faster Torchvision-based processor." The default matters because the surrounding inference pipeline assumes a particular cost profile; a 10× slower preprocessor blocks the event loop. See patterns/torchvision-over-pil-image-processing.

  14. OMP_NUM_THREADS defaults to host vCPU count in containers, causing CPU throttling. *"In containerized environments, OMP_NUM_THREADS (which controls the number of OpenMP threads used by Torch for CPU operations) defaults to the number of vCPUs on the host machine. In multitenant setups, this is a poor default: a host might have 192 vCPUs, but a container only has access to

    1. The result is far more running threads than available cores. This drives CPU usage past the container's limit and triggers throttling."* The container-runtime environment-variable inheritance bug specific to OpenMP/Torch on multi-tenant Kubernetes hosts. See concepts/omp-num-threads-container-misconfiguration.
  15. Combined multimodal fixes delivered >3× RPS-per-server jump on the same hardware. "By switching to Torchvision-based image processors and properly configuring OMP_NUM_THREADS, we sustained much higher QPS and fully leveraged the GPUs. After the fix shipped, requests completed per second jumped > 3x with the same replicas and load. CPU throttling disappeared, and servers ran in a much healthier state." Quantitative anchor for how much headroom CPU-side bottlenecks can hide on a GPU-loaded inference server.

Architecture (as disclosed)

                                     ┌──────────────────────────┐
                                     │   CONTROL PLANE          │
                                     │                          │
                                     │  ┌────────────────────┐  │
  Request ──▶ Rate Limiting ─────────┼─▶│ Capacity Management│  │
                                     │  │  (model units, per │  │
                                     │  │   workload)        │  │
                                     │  └────────┬───────────┘  │
                                     │           │              │
                                     │           ▼              │
                                     │  ┌────────────────────┐  │
                                     │  │ Autoscaler         │  │
                                     │  │ (model-unit util)  │  │
                                     │  └────────┬───────────┘  │
                                     └───────────┼──────────────┘
                                                 │ replica counts
                                     ┌──────────────────────────┐
                                     │   DATA PLANE             │
                                     │                          │
                                     │  ┌────────────────────┐  │
                                     │  │ Axon (router)      │  │
                                     │  │  - Dicer-backed    │  │
                                     │  │  - load = MUs      │  │
                                     │  │  - sticky sessions │  │
                                     │  └────────┬───────────┘  │
                                     │           │              │
                                     │           ▼              │
                                     │  ┌────────────────────┐  │
                                     │  │ Inference Runtime  │  │
                                     │  │  (in-house engine  │  │
                                     │  │   or vLLM, etc.)   │  │
                                     │  │  on frontier GPUs  │  │
                                     │  └────────────────────┘  │
                                     └──────────────────────────┘

The control-plane / data-plane split here is canonical concepts/control-plane-data-plane-separation. Both control-plane components (capacity management, autoscaler) and the data-plane router (Axon) consume model units as the load currency — "so a small number of expensive long-context requests can trigger different routing and scaling decisions than many cheap short requests."

Operating numbers (disclosed)

Metric Value
Tokens/month served 125T+
Customers named Superhuman, YipitData, Fox Sports, others
Frontier OS models hosted Kimi, Qwen
Frontier proprietary models hosted OpenAI, Gemini, Claude
Hardware Frontier GPUs (e.g. NVL72-class single-spine racks)
Traffic shape Spiky within hours; agentic-workload coupled to working hours
GPU savings (autoscale vs static-peak, bursty workloads) >80%
Hang detect→kill→recover cycle <5 minutes
False liveness-probe failures (pre-prioritisation) several/week
False liveness-probe failures (post-prioritisation) zero
RPS-per-server (post multimodal fix vs pre) >3×

Caveats / open questions

  1. Coefficients of the model-unit cost function (α, β, γ) are not disclosed; "determined by automated benchmarking" but the benchmark harness, the failure modes, and how often coefficients are re-derived are not described.
  2. The exact load metric Dicer reads — is it instantaneous queue-depth-in-MUs, EWMA, prefix-cache-aware? Not disclosed.
  3. Stateful-session affinity policy — what fraction of a workload's requests are routed back to the same subset of servers, and how is that subset sized for cache-hit-rate vs blast-radius? Not quantified.
  4. Fast container start, safe rollouts, GPU capacity management across clouds and regions are explicitly named as topics "there's a lot more to the story" — not in this post. The wiki's adjacent disclosure on lazy-loading container filesystem (2026-05-08 Superhuman post) covers the cold-start piece, but multi-region / multi-cloud capacity allocation has no public disclosure yet.
  5. KV transfer mechanism for disaggregated prefill/decode under model-units routing is not specified — Cloudflare's PD disaggregation post covers a different stack (Workers AI / Mooncake) that may or may not be Databricks' substrate.
  6. No latency numbers (p95 TTFT, OPTS) — the post discusses the contract abstractly but never publishes target or attained p99 curves, even at the customer level. The 2026-05-08 Superhuman post (sub-1s p99, 200K+ QPS) is a per-customer datum, not a platform-wide one.
  7. Health-check priority-scheduling implementation altitude — whether the priority is enforced inside the inference engine (request scheduler), at the runtime layer above it (the Python server), or at the OS/container layer (cgroup nice / SCHED_FIFO) is not disclosed.
  8. What inference engines exhibit silent-hang on edge-case requests"open source and proprietary in-house engines" are both listed as substrates, but the specific edge cases (structured output, multimodal) suggest this is engine-agnostic. No bug trackers / pull requests linked.
  9. Multi-modality cost dimension — model units include a term for "features like multi-modality" but the per-image-token cost coefficient is not separated from the input-token coefficient.
  10. Article is single-direction — only describes how Databricks chose its current architecture, not what alternatives were rejected at decision points (e.g. why not request queueing instead of cost-based routing? why not pure sharding instead of sticky sessions?).

Source

Last updated · 542 distilled / 1,571 read