Databricks: How Superhuman and Databricks built a 200K QPS inference platform together

Databricks Engineering post (2026-05-08) co-authored by Databricks Model Serving and Superhuman engineering. Documents the joint migration of Superhuman's grammar-correction model off a self-managed vLLM-on-L40S DIY stack onto Databricks Model Serving on NVIDIA H100. The model serves "over 40 million daily users" across Superhuman's productivity suite (Superhuman, Coda, Superhuman Mail, Superhuman Go), running at peak 200,000+ QPS with sub-1-second end-to-end p99 and four-9's (99.99%) reliability. Per-request shape: ~50 input tokens + 50 output tokens.

This is the wiki's first canonical disclosure of:

Per-pod throughput on a managed inference platform quantified as 750 QPS → 1,200 QPS (60% improvement) on H100 through a stack of software-only optimisations.
CPU-bound serving regime for small, fast LLMs — the GPU completes its forward pass faster than a single CPU process can prepare the next batch, flipping the standard serving-engine bottleneck assumption.
Multiprocessing RPC server architecture as the structural fix for the CPU-bound regime (~20% additional throughput).
Per-channel FP8 scaling as the granularity choice that closes the quality gap on attention quantisation.
Hybrid-precision serving engine — toggle FP8 on/off per layer group via flag, with both teams measuring quality directly.
Block-device-based lazy-loading container image as Databricks' cold-start mitigation, ported from serverless-compute work; cuts pod start time from minutes to seconds.
Endpoint Discovery Service (EDS) powering a custom Power-of-Two-Choices load balancer driven directly off the Kubernetes API, with asymmetric autoscaling (aggressive up, conservative down) tracking per-pod request_concurrency.
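
The Power-of-Two-Choices routing named in the list above could be sketched roughly as follows. This is a minimal simulation and not the EDS implementation: `pick_pod`, `simulate`, and the pod and request counts are hypothetical, no request ever completes, and the baseline is uniformly random placement rather than the Kubernetes round-robin the post measured against.

```python
import random


def pick_pod(active_requests, rng=random):
    """Sample two distinct pods and return the index of the less loaded one."""
    a, b = rng.sample(range(len(active_requests)), 2)
    return a if active_requests[a] <= active_requests[b] else b


def random_choice(active_requests, rng=random):
    """Baseline for comparison: route to one uniformly random pod."""
    return rng.randrange(len(active_requests))


def simulate(n_pods, n_requests, policy, seed=0):
    """Assign requests one by one (none complete) and return the hottest pod's load."""
    rng = random.Random(seed)
    load = [0] * n_pods
    for _ in range(n_requests):
        load[policy(load, rng)] += 1
    return max(load)


# Hypothetical sizes: 100 pods, 10,000 in-flight requests (mean load 100 per pod).
p2c_max = simulate(100, 10_000, pick_pod)
rand_max = simulate(100, 10_000, random_choice)
print("hottest pod, two choices:", p2c_max)
print("hottest pod, one random choice:", rand_max)
```

Sampling only two pods keeps the per-request cost constant while still steering traffic away from hotspots, which is the property that matters for tail latency.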

One-paragraph summary

Superhuman's grammar-correction LLM serves real-time suggestions to 40M+ daily users at peaks above 200K QPS with sub-1s p99 and 4-9's reliability. The pre-migration DIY stack (vLLM on L40S GPUs with an internal ML-platform team) was costing the team months of manual performance tuning per model iteration plus an autoscaling and capacity-planning operational burden. Both teams agreed on shared SLOs (sub-second p99, zero quality regression on Superhuman's evals) and co-engineered the Databricks Model Serving migration in two layers. Platform layer: a custom Power-of-Two-Choices load balancer driven by a lightweight Endpoint Discovery Service (EDS) watching the Kubernetes API; asymmetric autoscaling on per-pod request_concurrency (aggressive scale-up, conservative scale-down to avoid flapping); and lazy-loading container images backed by a virtual block device that cuts pod start from minutes to seconds during traffic ramps. Runtime layer: a 60% per-pod throughput improvement (750 QPS → 1,200 QPS on H100) via FP8 quantisation with per-channel scaling (up to ~30% gain), multiprocessing RPC server processes for the CPU-bound regime that small fast models hit (~20% gain), and contributions of a few percent each from single-call C++ tensor manipulation and async CPU-GPU overlap in the scheduler. The engine was designed for hybrid-precision inference from the start, so any layer group can be toggled FP8-on/off without architectural change. Joint shadow testing was used to tune autoscaler thresholds and validate quality with both teams in the loop.
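
The asymmetric autoscaling policy summarised above might look something like the sketch below. The post discloses no numeric parameters, so `target_concurrency`, `scale_down_patience`, and `max_step_down` are hypothetical knobs chosen purely for illustration, as is the class itself.

```python
import math
from dataclasses import dataclass


@dataclass
class AsymmetricAutoscaler:
    target_concurrency: float      # hypothetical per-pod target from benchmarking
    scale_down_patience: int = 5   # hypothetical: consecutive low readings required
    max_step_down: int = 1         # hypothetical: pods removed per scale-down step
    _low_streak: int = 0

    def desired_replicas(self, current: int, avg_concurrency: float) -> int:
        # Replica count that would bring average per-pod concurrency back to target.
        needed = max(1, math.ceil(current * avg_concurrency / self.target_concurrency))
        if needed > current:
            self._low_streak = 0
            return needed  # scale up immediately, in a single jump
        if needed < current:
            self._low_streak += 1
            if self._low_streak >= self.scale_down_patience:
                self._low_streak = 0
                # Scale down slowly, only after sustained low load.
                return max(needed, current - self.max_step_down)
            return current
        self._low_streak = 0
        return current


scaler = AsymmetricAutoscaler(target_concurrency=8)
print(scaler.desired_replicas(10, 16.0))                      # load doubled
print([scaler.desired_replicas(20, 4.0) for _ in range(5)])   # load halved
```

Scale-up reacts to a single reading, while scale-down waits for a streak of low readings and then removes one pod at a time, which is one simple way to avoid the flapping the post warns about.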

Key takeaways

Sub-second p99 at 200K+ QPS with 4-9's is feasible on a managed inference platform if both vendor and customer co-engineer to a shared SLO. "Superhuman runs this model at peak traffic of over 200,000 queries per second, with end-to-end latency under 1 second at P99, and strict 4 9's reliability guarantees… Both teams defined target real-time latency SLOs upfront: sub second p99 latency and zero quality regression on Superhuman's internal evaluation harnesses."
The "managed serving" model does not have to give up control. Superhuman retains "full ownership of model training, quantization, and quality standards"; Databricks owns "runtime performance and platform reliability". This is the explicit division of responsibilities, working over shared SLOs, joint quality validation, and progressive load testing during onboarding.
Default Kubernetes round-robin load balancing degrades at high QPS. "While the default Kubernetes round robin load balancer is sufficient at low QPS, our tests revealed that this performance degrades at higher QPS, with uneven request distribution creating hotspots that spike tail latency." The fix: a custom Power-of-Two-Choices algorithm (Mitzenmacher) embedded in a lightweight Endpoint Discovery Service that watches the Kubernetes API for Services and EndpointSlices and drives client-side LB. Two candidate pods are sampled per request, and the request is routed to whichever has fewer active requests. (See patterns/power-of-two-choices, systems/databricks-endpoint-discovery-service, patterns/kubernetes-api-driven-custom-load-balancer.)
Autoscaling on request_concurrency averaged across pods, with per-pod targets derived from benchmarking. "The autoscaler tracks request_concurrency averaged across pods, with per-pod concurrency targets derived from benchmarking maximum sustainable RPS per replica." The strategy is intentionally asymmetric: "scale-up is aggressive and responsive, while scale-down is conservative, to prevent the flapping that causes latency spikes." Tuned via "joint shadow testing between Superhuman and Databricks… when to scale aggressively, when to hold steady, and how conservative to be on scale-down." (See concepts/request-concurrency-as-autoscaling-signal, patterns/asymmetric-aggressive-up-conservative-down-autoscaling.)
Lazy-loading container filesystem cuts pod start from minutes to seconds. "This lazy-loading container filesystem eliminates the need to download the entire container image before starting the application, reducing time to start container from several minutes to just a few seconds." Mechanism: at build time, convert the standard gzip-based image to a block-device-based format suitable for lazy loading; at pull time, retrieve only metadata (directory structure, file names, permissions), construct a virtual block device with 4MB sectors, mount it into the container so the application can start immediately. First file read issues a callback to the image fetcher which retrieves the actual block content from the remote registry, caches locally to "prevent repeated network round trips." Adopted from prior Databricks serverless-compute work ("Booting Databricks VMs 7× faster"). (See concepts/lazy-loading-container-filesystem, patterns/block-device-container-image-for-lazy-loading.)
FP8 was the single largest per-pod throughput win (up to ~30%) and was co-engineered between Superhuman's ML team (prequantising the checkpoint) and Databricks (loading and serving in FP8). "FP8 quantization was the single largest throughput improvement, achieving up to 30% increase in per-pod QPS." Final config: attention projections (Q, K, V, output) and MLP projections all ran through the FP8 path; KV-cache quantisation was disabled because "weight quantization was where the throughput wins came from and KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload." Quality measurement against Superhuman's internal evals showed "no measurable quality degradation" for attention quantisation. (See concepts/selective-fp8-quantization, patterns/toggleable-hybrid-precision-quantization.)
Per-channel FP8 scaling beats per-tensor scaling on accuracy at matched throughput. "Off-the-shelf kernels used per-tensor scaling (a single FP8 scale factor for an entire weight tensor). Databricks' kernels use per-channel scaling, computing a separate scale factor per output channel of each linear layer. This preserves dynamic range where it matters, keeps MLP-layer quantization error well below the threshold where it shows up in evals." Combined with kernel-level improvements, "per-channel quantization matched or exceeded other open source baselines at the same throughput." (See concepts/per-channel-vs-per-tensor-fp8-scaling.)
The serving engine was designed for hybrid-precision inference from the start so quantisation experiments could ship safely. "Databricks model serving had designed the serving engine to support hybrid-precision inference from the start, so that if any layer group proved too quality-sensitive under quantization, we could keep it in higher precision without changing the overall serving architecture. We shipped a flag that enabled us to toggle attention quantization on and off, so both teams could measure its impact directly." This is the engineering primitive that made the joint quality investigation cheap. (See patterns/toggleable-hybrid-precision-quantization.)
Small fast models flip the serving bottleneck from GPU to CPU. "For most model serving workloads, a single process is more than fast enough to keep the GPU saturated, since the GPU is the bottleneck, not the CPU. But with a small, fast model, the GPU completes its forward pass faster than a single process can prepare the next batch, flipping the bottleneck to the CPU." Fix: multiprocessing RPC server — "By having multiple CPU processes prepare and dispatch work to the GPU in parallel, we eliminated the single-process serialization bottleneck. This delivered another 20% additional throughput." (See concepts/cpu-bound-serving-small-fast-model, patterns/multiprocessing-runtime-for-cpu-bound-serving.)
Async scheduling overlaps CPU-side post-processing with the next GPU forward pass. "We moved CPU-side post-processing off the critical path so it runs concurrently with the next GPU forward pass. Rather than finishing all post-processing for batch N before launching batch N+1, the scheduler dispatches N+1 immediately and handles N's post-processing in parallel. Post-processing also iterates only over the relevant subset of requests rather than the full batch." And: "replaced Python-level tensor slicing, copying, and filling at the start of each CUDA graph decode step with a single C++ call. We also explored parallel strategies (ThreadPool, OpenMP) but single-threaded C++ was optimal due to CUDA synchronization overhead." Each contributes a few percentage points individually; collectively they round out the 60% per-pod improvement after FP8 + multiprocessing. (See concepts/async-cpu-gpu-pipelined-scheduling.)
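
The effect of overlapping post-processing for batch N with the forward pass for batch N+1 can be illustrated with a small timeline model. This is a sketch under simplifying assumptions, not the scheduler itself: the function names and the millisecond figures are hypothetical, and batch preparation time is ignored so the GPU is assumed never to wait on the CPU.

```python
def sequential_time(gpu_ms, cpu_ms):
    """Finish post-processing of batch N before launching batch N+1."""
    return sum(gpu_ms) + sum(cpu_ms)


def pipelined_time(gpu_ms, cpu_ms):
    """Launch batch N+1 on the GPU while the CPU post-processes batch N."""
    gpu_free = 0.0  # time at which the GPU finishes its current batch
    cpu_free = 0.0  # time at which the CPU finishes its current post-processing
    for g, c in zip(gpu_ms, cpu_ms):
        gpu_done = gpu_free + g              # GPU runs batches back to back
        cpu_start = max(gpu_done, cpu_free)  # needs the batch output and a free CPU
        cpu_free = cpu_start + c
        gpu_free = gpu_done
    return cpu_free


# Hypothetical per-batch costs: 4 ms forward pass, 3 ms post-processing, 5 batches.
gpu = [4] * 5
cpu = [3] * 5
print("sequential:", sequential_time(gpu, cpu), "ms")
print("pipelined: ", pipelined_time(gpu, cpu), "ms")
```

In this model the pipelined total approaches the cost of the slower stage alone, so the benefit is largest when GPU and CPU costs per batch are comparable, as they are for a small, fast model.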

Operational numbers

Metric	Value	Source
Daily users (Superhuman product family)	40M+	intro
Peak QPS	200,000+	intro
End-to-end p99 latency target	< 1 second	intro
Reliability target	4 9's (99.99%)	intro
Per-request input tokens	~50	"How Superhuman modernized…"
Per-request output tokens	~50	"How Superhuman modernized…"
Per-pod QPS, pre-optimisation (H100)	750	"Runtime optimizations"
Per-pod QPS, post-optimisation (H100)	1,200	"Runtime optimizations"
Per-pod QPS improvement	+60%	"Runtime optimizations"
Throughput gain from FP8 quantisation	up to +30%	"FP8 quantization"
Throughput gain from multiprocessing RPC	+20%	"Eliminating CPU-side bottlenecks"
Throughput gain from C++ tensor ops + async scheduling	"few percentage points" each	"Eliminating CPU-side bottlenecks"
Container start time, before image acceleration	"several minutes"	"image acceleration"
Container start time, after image acceleration	"few seconds"	"image acceleration"
Block device sector size (lazy-loading image)	4 MB	"image acceleration"
Pre-migration GPU class	NVIDIA L40S	"How Superhuman modernized"
Post-migration GPU class	NVIDIA H100	"Runtime optimizations" / final config
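
As a back-of-envelope check on how the figures in the table fit together, the arithmetic below assumes the individual gains compound multiplicatively and treats "up to +30%" as a full 30%. The pod counts are lower bounds that ignore headroom, redundancy, and traffic skew; the post does not disclose actual fleet sizes.

```python
import math

baseline_qps = 750      # per-pod QPS before optimisation (from the table)
optimised_qps = 1_200   # per-pod QPS after optimisation (from the table)
peak_qps = 200_000      # peak endpoint QPS (from the table)

total_gain = optimised_qps / baseline_qps   # overall per-pod speed-up
fp8_and_mp = 1.30 * 1.20                    # FP8 and multiprocessing, if they compound
residual = total_gain / fp8_and_mp          # share left for the smaller optimisations

# Assumption: perfectly even load and zero headroom, so these are lower bounds.
pods_before = math.ceil(peak_qps / baseline_qps)
pods_after = math.ceil(peak_qps / optimised_qps)

print(f"total gain: {total_gain:.2f}x, FP8 x multiprocessing: {fp8_and_mp:.2f}x")
print(f"residual for C++ tensor ops and async scheduling: {residual:.3f}x")
print(f"minimum pods at peak: {pods_before} before, {pods_after} after")
```

Under these assumptions the two large optimisations account for 1.56x of the 1.6x total, leaving a few percent for the remaining changes, which is consistent with the "few percentage points" wording.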

Architectural shape

                           +-------------------------+
                           | Superhuman client       |
                           | (real-time suggestions) |
                           +-----------+-------------+
                                       |
                                       v
                  +------------------------------------------+
                  | Databricks Model Serving ingress         |
                  +--+---------------------+-----------------+
                     | xDS / EDS endpoint  |
                     | metadata stream     |
                     v                     v
            +---------------------+    +---------------------+
            | Endpoint Discovery  |--->| Power-of-Two-Choices|
            | Service (EDS)       |    | client-side LB      |
            | watches K8s API:    |    | (sample 2 pods,     |
            |  Services +         |    |  pick fewer active  |
            |  EndpointSlices     |    |  requests)          |
            +---------------------+    +----------+----------+
                                                  |
                                                  v
                  +-------------------------------+-----------------+
                  |  GPU pod (H100)                                 |
                  |                                                 |
                  |  +-----------+   +----------------+             |
                  |  | RPC proc  |   | RPC proc       |  ...        |
                  |  +-----+-----+   +-------+--------+             |
                  |        | dispatch         | dispatch            |
                  |        v                  v                     |
                  |  +-------------------------------------+        |
                  |  | Single-threaded C++ tensor prep     |        |
                  |  | (CUDA graph decode step)            |        |
                  |  +------------------+------------------+        |
                  |                     |                           |
                  |                     v                           |
                  |  +-------------------------------------+        |
                  |  | GPU forward pass (FP8 attention +   |        |
                  |  | FP8 MLP; KV-cache stays higher prec)|        |
                  |  +------------------+------------------+        |
                  |                     | async post-process        |
                  |                     v   overlaps next batch     |
                  |  +-------------------------------------+        |
                  |  | Hybrid-precision toggle flag        |        |
                  |  | (per layer group on/off via config) |        |
                  |  +-------------------------------------+        |
                  +------+------------------+--------+--------------+
                         |                  |        |
                         v                  v        v
                  +-----------------------------------------+
                  | Autoscaler — request_concurrency target |
                  | aggressive scale-up, conservative down  |
                  | (asymmetric)                            |
                  +-----------------+-----------------------+
                                    |
                                    v
                  +------------------------------------------+
                  | Container runtime — lazy-loading image   |
                  | (4MB block device, image fetcher cache,  |
                  |  start in seconds, not minutes)          |
                  +------------------------------------------+
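
The lazy-loading image in the last box of the diagram can be approximated in user space as a read-only device that fetches a sector only on first access. This is an analogy under stated assumptions: the real mechanism is a virtual block device mounted into the container, and `LazyBlockDevice`, `fetch_sector`, and `fake_registry` are hypothetical names standing in for the image fetcher and the remote registry.

```python
SECTOR_SIZE = 4 * 1024 * 1024  # 4 MB sectors, as stated in the source


class LazyBlockDevice:
    """Read-only device that fetches sectors on first access and caches them."""

    def __init__(self, size, fetch_sector):
        self.size = size
        self._fetch = fetch_sector  # hypothetical callback: sector index -> bytes
        self._cache = {}
        self.fetch_count = 0

    def _sector(self, index):
        if index not in self._cache:
            self._cache[index] = self._fetch(index)  # remote round trip, once
            self.fetch_count += 1
        return self._cache[index]

    def read(self, offset, length):
        end = min(offset + length, self.size)
        out = bytearray()
        while offset < end:
            index, start = divmod(offset, SECTOR_SIZE)
            chunk = self._sector(index)[start:start + (end - offset)]
            if not chunk:  # short sector from the fetcher; stop rather than spin
                break
            out += chunk
            offset += len(chunk)
        return bytes(out)


def fake_registry(index):
    # Stand-in for the remote registry: each sector is filled with its own index.
    return bytes([index % 256]) * SECTOR_SIZE


dev = LazyBlockDevice(size=10 * SECTOR_SIZE, fetch_sector=fake_registry)
data = dev.read(SECTOR_SIZE - 2, 4)  # spans the boundary of sectors 0 and 1
print(data, dev.fetch_count)
```

Only the two sectors touched by the read are fetched; the other eight are never downloaded, which is why the application can start before the full image has arrived.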

Caveats

Only the Superhuman grammar-correction workload is described. The 200K QPS / sub-1s p99 / 4-9's claims are about this single endpoint, not Databricks Model Serving in general. Other workloads on the platform may have different latency / throughput / quality trade-offs.
Per-pod 1,200 QPS is at 50/50-token request shape on H100. Doubling input or output tokens roughly doubles compute on the forward pass and would not be expected to deliver 1,200 QPS at the same per-pod cost.
Quality validation is on Superhuman's internal evaluation harnesses, not a public benchmark. Both "no measurable quality degradation" (FP8 attention) and "matched or exceeded other open source baselines at the same throughput" (per-channel scaling) are measurements internal to this collaboration.
The hybrid-precision toggle is a serving-engine flag, not a per-tenant API. No claim is made that customers can toggle FP8 on/off at runtime; the flag was used during the joint engineering investigation.
Autoscaler parameters are not numerically disclosed. "When to scale aggressively, when to hold steady, and how conservative to be on scale-down" are tuned via shadow testing; specific concurrency targets, scale-up rates, and scale-down delays are not given.
The lazy-loading container image performs well "for the relatively small models we served for Superhuman." The post does not claim it generalises to multi-hundred-GB foundation models where weight loading itself dominates startup.
KV-cache quantisation is explicitly off the table for this workload — "weight quantization was where the throughput wins came from and KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload." Different workloads (longer context, decoder-heavy generation) may reach a different conclusion.
L40S → H100 is a hardware-class change, not just a software migration. Some of the throughput improvement attributed to software optimisations is enabled by the H100's faster Transformer Engine FP8 path; the L40S baseline is not separately re-tested with the new optimisations.
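
To make the per-channel versus per-tensor scaling distinction referenced in these caveats concrete, the sketch below simulates rounding to an FP8 E4M3 grid in pure Python and compares the two scaling granularities. It is an illustration, not Databricks' kernels: `round_to_fp8` and `quant_error` are hypothetical helpers, and the weight matrix is made up to exaggerate the gap between a large-magnitude and a small-magnitude output channel.

```python
import math

FP8_MAX = 448.0      # largest finite value in the FP8 E4M3 format
MIN_NORMAL_EXP = -6  # smallest normal exponent in E4M3
MANTISSA_BITS = 3


def round_to_fp8(x):
    """Round x to the nearest value on a simulated E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_MAX)
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)      # clamp so subnormals share one step size
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing between representable values
    return sign * min(round(mag / step) * step, FP8_MAX)


def quant_error(rows, per_channel):
    """Mean relative error after scale -> FP8 round -> unscale (weights non-zero)."""
    tensor_amax = max(abs(w) for row in rows for w in row)
    errors = []
    for row in rows:  # each row stands for one output channel
        amax = max(abs(w) for w in row) if per_channel else tensor_amax
        scale = FP8_MAX / amax
        for w in row:
            restored = round_to_fp8(w * scale) / scale
            errors.append(abs(restored - w) / abs(w))
    return sum(errors) / len(errors)


# Hypothetical 2x4 weight matrix: one large output channel, one very small one.
weights = [
    [96.0, -50.0, 23.0, 7.0],
    [0.0011, -0.0007, 0.0004, 0.0009],
]
per_tensor_err = quant_error(weights, per_channel=False)
per_channel_err = quant_error(weights, per_channel=True)
print(f"per-tensor mean relative error:  {per_tensor_err:.4f}")
print(f"per-channel mean relative error: {per_channel_err:.4f}")
```

With a single per-tensor scale, the small channel is pushed into the subnormal range of the format and loses most of its precision; a separate scale per output channel lets each channel use the full range, so the per-channel error comes out lower.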

Source

Original: https://www.databricks.com/blog/how-superhuman-and-databricks-built-200k-qps-inference-platform-together
Raw markdown: raw/databricks/2026-05-08-how-superhuman-and-databricks-built-a-200k-qps-inference-pla-f2df9d99.md
Power-of-Two-Choices reference (Mitzenmacher thesis): https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf
Companion Databricks LB post: Intelligent Kubernetes load balancing
Companion serverless-compute boot post: Booting Databricks VMs 7× faster
Companion fast-PEFT post: Fast PEFT serving at scale

systems/databricks-model-serving — the platform side of the partnership
systems/databricks-endpoint-discovery-service — extended with the 200K-QPS validation datum
systems/superhuman-grammar-correction-model — the workload as a named system
systems/vllm — the pre-migration engine; Superhuman used vLLM's online-quantisation library to prequantise the FP8 checkpoint
systems/nvidia-h100 — extended with per-pod 1,200 QPS small-fast-LLM datum
concepts/cpu-bound-serving-small-fast-model
concepts/request-concurrency-as-autoscaling-signal
concepts/per-channel-vs-per-tensor-fp8-scaling
concepts/lazy-loading-container-filesystem
concepts/async-cpu-gpu-pipelined-scheduling
patterns/block-device-container-image-for-lazy-loading
patterns/kubernetes-api-driven-custom-load-balancer
patterns/multiprocessing-runtime-for-cpu-bound-serving
patterns/asymmetric-aggressive-up-conservative-down-autoscaling
patterns/toggleable-hybrid-precision-quantization
patterns/power-of-two-choices — extended with the 200K-QPS production-validation datum
companies/databricks