Superhuman grammar-correction model
Superhuman's custom large language model that powers real-time AI communication assistance (correctness, clarity, tone, style suggestions) across the Superhuman productivity suite — Superhuman, Coda, Superhuman Mail, Superhuman Go — serving "over 40 million daily users across dozens of languages."
The wiki tracks this model as a named system because of the operating envelope the 2026-05-08 Databricks / Superhuman post discloses, which is rare in public engineering writing:
| Property | Value |
|---|---|
| Daily users (product family) | 40M+ |
| Peak QPS | 200,000+ |
| End-to-end p99 latency | < 1 second |
| Reliability target | Four nines (99.99%) |
| Input token shape | ~50 tokens / request |
| Output token shape | ~50 tokens / request |
| Pre-migration serving stack | DIY vLLM on L40S |
| Post-migration serving stack | Databricks Model Serving on H100 |
| Per-pod QPS post-migration (H100) | 1,200 (a 60% throughput improvement over 750) |
(All numbers from sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together.)
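The figures in the table can be combined into a rough capacity estimate. The sketch below is back-of-envelope arithmetic only: the pod counts and aggregate token rate are derived here, not disclosed in the post, and they assume every pod runs at its rated throughput with no headroom.

```python
import math

# Figures taken from the table above.
PEAK_QPS = 200_000       # peak queries per second
POD_QPS_AFTER = 1_200    # per-pod QPS after the throughput improvement
POD_QPS_BEFORE = 750     # per-pod QPS before the improvement
OUTPUT_TOKENS = 50       # approximate output tokens per request
AVAILABILITY = 0.9999    # four-nines reliability target


def pods_needed(peak_qps: int, pod_qps: int) -> int:
    """Minimum pods to serve peak_qps, ASSUMING zero headroom."""
    return math.ceil(peak_qps / pod_qps)


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target."""
    return (1.0 - availability) * 365 * 24 * 60


pods_after = pods_needed(PEAK_QPS, POD_QPS_AFTER)    # 167
pods_before = pods_needed(PEAK_QPS, POD_QPS_BEFORE)  # 267
improvement = POD_QPS_AFTER / POD_QPS_BEFORE - 1     # about 0.6, i.e. 60%
tokens_per_second = PEAK_QPS * OUTPUT_TOKENS         # 10_000_000 output tokens/s

print(pods_after, pods_before, round(improvement, 2), tokens_per_second)
print(round(downtime_minutes_per_year(AVAILABILITY), 2), "minutes of downtime per year")
```

Under these assumptions, the 60% per-pod gain is worth roughly 100 fewer pods at peak, and four nines leaves under an hour of downtime per year.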
Why it matters as a wiki entity
This is the canonical wiki instance of a small, fast LLM at massive QPS — and therefore the canonical instance of the CPU-bound serving regime. With a ~50/50 token shape on H100, the GPU completes its forward pass faster than a single CPU process can prepare the next batch, flipping the standard serving-engine bottleneck assumption from GPU to CPU. Without this datum, the regime would be theoretical; with it, the regime has named hardware, a named endpoint, and a named throughput target.
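The bottleneck flip can be illustrated with a toy step-time model. This is a sketch under stated assumptions, not a description of the Databricks runtime: the function names are hypothetical, and the millisecond figures are invented for illustration because the post gives no per-step timings.

```python
def step_time_ms(cpu_prep_ms: float, gpu_forward_ms: float, overlapped: bool = True) -> float:
    """Time per serving step in a toy two-stage pipeline.

    ASSUMPTION: if CPU batch preparation overlaps with the GPU forward pass,
    the slower stage sets the pace; otherwise the two run back to back.
    """
    if overlapped:
        return max(cpu_prep_ms, gpu_forward_ms)
    return cpu_prep_ms + gpu_forward_ms


def bottleneck(cpu_prep_ms: float, gpu_forward_ms: float) -> str:
    """Name the stage that limits throughput."""
    return "cpu" if cpu_prep_ms > gpu_forward_ms else "gpu"


def gpu_utilisation(cpu_prep_ms: float, gpu_forward_ms: float, overlapped: bool = True) -> float:
    """Fraction of each step during which the GPU is busy."""
    return gpu_forward_ms / step_time_ms(cpu_prep_ms, gpu_forward_ms, overlapped)


# HYPOTHETICAL timings, chosen only to show the two regimes.
small_fast = (4.0, 2.5)    # small model: GPU finishes before the CPU is ready
large_slow = (4.0, 20.0)   # large model: GPU dominates, the usual assumption

print(bottleneck(*small_fast), gpu_utilisation(*small_fast))   # cpu 0.625
print(bottleneck(*large_slow), gpu_utilisation(*large_slow))   # gpu 1.0
```

In the CPU-bound case the GPU idles for part of every step, so in this toy model further GPU-side speedups add nothing until batch preparation gets faster.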
It is also the canonical instance of:
- Hybrid-precision FP8 inference at production scale: the attention (Q, K, V, output) and MLP projections run in FP8 while the KV-cache stays at higher precision. The post explains that "weight quantization was where the throughput wins came from and KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload." (See concepts/selective-fp8-quantization.)
- Per-channel FP8 scaling matching or beating per-tensor scaling on quality at matched throughput. (See concepts/per-channel-vs-per-tensor-fp8-scaling.)
- vLLM as the prequantisation library, even after migration off vLLM as the serving engine — Superhuman's ML team prequantised the checkpoint to FP8 "using vLLM's online quantization library, producing a compressed-tensor format checkpoint that Databricks loaded for serving." The post separates engine (Databricks takes over) from toolchain (vLLM stays).
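To make the per-tensor versus per-channel distinction concrete, the sketch below simulates FP8 (E4M3) rounding in pure Python and compares reconstruction error under the two scaling granularities. It is an illustration under assumptions, not the vLLM or Databricks implementation: the weight matrix is synthetic, "channel" is taken to mean one output row, and weight reconstruction error is only a proxy for the model-quality measurements the post reports.

```python
import math
import random

E4M3_MAX = 448.0     # largest finite FP8 E4M3 value
E4M3_MIN_EXP = -6    # smallest normal exponent
MANTISSA_BITS = 3


def round_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)             # saturate instead of overflowing
    _, e = math.frexp(a)                  # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)        # subnormals share the minimum exponent
    step = 2.0 ** (exp - MANTISSA_BITS)   # grid spacing at this magnitude
    return math.copysign(round(a / step) * step, x)


def fake_quant_row(row, scale):
    """Quantise then dequantise one row using a single scale."""
    return [round_e4m3(v / scale) * scale for v in row]


def relative_error(original, restored):
    """Squared error of a row, normalised by the row's squared norm."""
    num = sum((a - b) ** 2 for a, b in zip(original, restored))
    den = sum(a * a for a in original)
    return num / den if den else 0.0


def per_tensor_error(weights):
    """One scale for the whole matrix."""
    scale = max(abs(v) for row in weights for v in row) / E4M3_MAX or 1.0
    errs = [relative_error(row, fake_quant_row(row, scale)) for row in weights]
    return sum(errs) / len(errs)


def per_channel_error(weights):
    """One scale per output row (ASSUMPTION: row == channel)."""
    errs = []
    for row in weights:
        scale = max(abs(v) for v in row) / E4M3_MAX or 1.0
        errs.append(relative_error(row, fake_quant_row(row, scale)))
    return sum(errs) / len(errs)


# SYNTHETIC weights: rows with very different magnitudes, to stress a shared scale.
rng = random.Random(0)
row_magnitudes = [1.0, 1e-3, 1e-6, 1e-9]
weights = [[rng.uniform(-1.0, 1.0) * m for _ in range(64)] for m in row_magnitudes]

tensor_err = per_tensor_error(weights)
channel_err = per_channel_error(weights)
print(f"per-tensor: {tensor_err:.4f}  per-channel: {channel_err:.4f}")
```

With a single shared scale, the smallest rows underflow to zero, while per-row scales keep every row inside the format's normal range. Real checkpoints rarely have rows this far apart, so the gap here is deliberately exaggerated.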
Workload shape and traffic pattern
- Traffic shows strong diurnal patterns with rapid ramps in certain periods, "often exceeding 200k QPS". The autoscaler is tuned with asymmetric policies (aggressive scale-up, conservative scale-down) specifically to absorb these ramps.
- 50 input tokens + 50 output tokens per request is the canonical Superhuman request shape. This is what makes the per-pod 1,200 QPS H100 number meaningful: longer inputs or outputs raise per-request forward-pass cost roughly in proportion, so the same pod would sustain well under 1,200 QPS.
- Real-time grammar/clarity/tone/style suggestions during user typing is the user-facing surface. Sub-second p99 is the felt-latency requirement for inline suggestion UX.
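An asymmetric scaling policy of the kind described above might look like the following sketch. Everything beyond the 1,200 QPS per-pod figure is an assumption made for illustration: the class name, the target utilisation, the cooldown, and the per-tick scale-down cap are hypothetical and are not the Databricks autoscaler's actual parameters.

```python
import math
from dataclasses import dataclass


@dataclass
class AsymmetricPolicy:
    pod_qps: float = 1200.0               # per-pod throughput from the post
    target_utilisation: float = 0.7       # ASSUMPTION: keep 30% headroom
    scale_down_cooldown_ticks: int = 5    # ASSUMPTION: wait after any scale-up
    max_scale_down_fraction: float = 0.1  # ASSUMPTION: shed at most 10% per tick


def desired_pods(policy: AsymmetricPolicy, qps: float) -> int:
    """Pods needed to serve qps at the target utilisation."""
    return max(1, math.ceil(qps / (policy.pod_qps * policy.target_utilisation)))


def next_pod_count(policy: AsymmetricPolicy, current: int, qps: float,
                   ticks_since_scale_up: int) -> int:
    """Scale up in one jump; scale down slowly and only after a cooldown."""
    want = desired_pods(policy, qps)
    if want > current:
        return want                        # aggressive: jump straight to demand
    if ticks_since_scale_up < policy.scale_down_cooldown_ticks:
        return current                     # conservative: hold during cooldown
    max_step = max(1, int(current * policy.max_scale_down_fraction))
    return max(want, current - max_step)   # conservative: bounded step down


policy = AsymmetricPolicy()
print(next_pod_count(policy, 100, 200_000, 0))   # ramp: 100 -> 239 in one tick
print(next_pod_count(policy, 239, 50_000, 0))    # cooldown: stays at 239
print(next_pod_count(policy, 239, 50_000, 10))   # slow descent: 239 -> 216
```

The asymmetry trades some idle capacity after a peak for protection against a second ramp arriving before pods could be brought back.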
Pre-migration vs post-migration
| | Pre-migration (DIY vLLM/L40S) | Post-migration (Databricks/H100) |
|---|---|---|
| Engine | vLLM | Databricks Model Serving runtime |
| GPU class | L40S | H100 |
| Operations | Internal Superhuman ML-platform team | Joint Databricks + Superhuman |
| Capacity planning | Self-managed | Platform-managed (autoscaler) |
| Performance tuning | "months of manual performance tuning to onboard each new model iteration" | Joint shadow testing |
| Quality validation | Internal eval harness | Same eval harness, joint validation |
| Onboarding model iteration | Months | (Implicitly faster — not numerically disclosed) |
The post frames the migration not as a managed-vs-DIY quality trade-off but as a team-capacity one: the lean Superhuman ML team "needed to focus on model quality and product innovations" rather than carrying the operational burden of capacity planning, performance tuning, and autoscaler ops at 200K QPS.
SLO model
Both teams agreed on the SLOs before migration:
- Sub-second p99 end-to-end latency.
- Zero quality regression on Superhuman's internal evaluation harnesses.
These are the load-bearing primitives that drove the joint-engineering investigation, including the layer-by-layer FP8 quantisation experiments and the hybrid-precision flag-toggling so quality could be measured directly.
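The two SLOs can be read as a single pass/fail gate on a candidate configuration. The sketch below is one possible formalisation, not Superhuman's evaluation harness: the metric names are hypothetical, scores are assumed to be higher-is-better, and "zero regression" is interpreted strictly as no metric falling below its baseline.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)   # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def meets_slos(latencies_ms, baseline_scores, candidate_scores,
               p99_budget_ms=1000.0):
    """True if p99 latency is sub-second and no quality metric regresses.

    ASSUMPTION: every score is higher-is-better; a metric missing from the
    candidate counts as a regression.
    """
    latency_ok = p99(latencies_ms) < p99_budget_ms
    quality_ok = all(
        candidate_scores.get(name, float("-inf")) >= score
        for name, score in baseline_scores.items()
    )
    return latency_ok and quality_ok


# HYPOTHETICAL metric names and values, for illustration only.
baseline = {"correctness": 0.91, "clarity": 0.88}
fp8_candidate = {"correctness": 0.91, "clarity": 0.89}
regressed = {"correctness": 0.90, "clarity": 0.89}

fast = [100.0] * 99 + [5000.0]        # one slow outlier in 100 requests
slow_tail = [100.0] * 98 + [5000.0] * 2

print(meets_slos(fast, baseline, fp8_candidate))       # True
print(meets_slos(fast, baseline, regressed))           # False: quality regressed
print(meets_slos(slow_tail, baseline, fp8_candidate))  # False: p99 over budget
```

A production gate would likely add a noise tolerance around each metric. The strict comparison here simply mirrors the "zero quality regression" wording.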
Caveats
- The model architecture (transformer family, parameter count, attention pattern, layer count) is not disclosed.
- The Superhuman ML team "prequantized the checkpoint to FP8 using vLLM's online quantization library, producing a compressed-tensor format checkpoint" — the original training precision (BF16? FP16?) is not stated.
- Quality validation is on Superhuman's internal evaluation harnesses, not a public benchmark.
- Multi-language support ("dozens of languages") is named but not quantified per language.
Seen in
- sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical wiki disclosure. The Superhuman grammar-correction model is the named workload that anchors every per-pod throughput number, every autoscaler shape, every cold-start datum, and every quantisation quality measurement in the post.
Related
- systems/databricks-model-serving — current serving platform.
- systems/vllm — pre-migration engine + post-migration prequantisation library.
- systems/nvidia-h100 — post-migration GPU class.
- systems/nvidia-l40s — pre-migration GPU class.
- concepts/cpu-bound-serving-small-fast-model — the regime this workload is the canonical instance of.
- concepts/per-channel-vs-per-tensor-fp8-scaling — quantisation granularity choice for this workload.
- concepts/selective-fp8-quantization — KV-cache quantisation off, attention+MLP on.
- companies/databricks — the platform partner.