SYSTEM Cited by 1 source

Superhuman grammar-correction model

Superhuman's custom large language model that powers real-time AI communication assistance (correctness, clarity, tone, style suggestions) across the Superhuman productivity suite — Superhuman, Coda, Superhuman Mail, Superhuman Go — serving "over 40 million daily users across dozens of languages."

The wiki tracks this model as a named system because of the operating envelope the 2026-05-08 Databricks / Superhuman post discloses, which is rare in public engineering writing:

| Property | Value |
|---|---|
| Daily users (product family) | 40M+ |
| Peak QPS | 200,000+ |
| End-to-end p99 latency | < 1 second |
| Reliability target | Four nines (99.99%) |
| Input token shape | ~50 tokens / request |
| Output token shape | ~50 tokens / request |
| Pre-migration serving stack | DIY vLLM on L40S |
| Post-migration serving stack | Databricks Model Serving on H100 |
| Per-pod QPS post-migration (H100) | 1,200 (after a 60% throughput improvement from 750) |

(All numbers from sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together.)
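The disclosed figures support some back-of-the-envelope arithmetic. The sketch below is illustrative only: the post does not state a fleet size, and the `pods_needed` and `downtime_budget_minutes` helpers, the zero-headroom default, and the 30-day window are assumptions made here, not figures from the source.

```python
import math

PEAK_QPS = 200_000          # peak traffic disclosed in the post
QPS_PER_POD_BEFORE = 750    # per-pod throughput before the improvement
QPS_PER_POD_AFTER = 1_200   # per-pod throughput after the improvement


def pods_needed(peak_qps: int, qps_per_pod: int, headroom: float = 0.0) -> int:
    """Smallest pod count that serves peak_qps plus a fractional headroom margin."""
    return math.ceil(peak_qps * (1.0 + headroom) / qps_per_pod)


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over a window of `days` days."""
    return (1.0 - availability) * days * 24 * 60


# 1,200 / 750 = 1.6, which is the 60% improvement quoted in the table.
improvement = QPS_PER_POD_AFTER / QPS_PER_POD_BEFORE - 1.0

# Pods required at exactly peak load, ignoring headroom and redundancy.
pods_before = pods_needed(PEAK_QPS, QPS_PER_POD_BEFORE)
pods_after = pods_needed(PEAK_QPS, QPS_PER_POD_AFTER)

print(f"improvement: {improvement:.0%}")
print(f"pods at 750 QPS/pod: {pods_before}, at 1,200 QPS/pod: {pods_after}")
print(f"99.99% over 30 days allows {downtime_budget_minutes(0.9999):.2f} min of downtime")
```

Under these assumptions the per-pod improvement cuts the pod count at peak from 267 to 167, and a four-nines target leaves roughly 4.3 minutes of downtime per 30 days. A real deployment would add headroom and regional redundancy on top of these floor values.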

Why it matters as a wiki entity

This is the canonical wiki instance of a small, fast LLM at massive QPS, and therefore the canonical instance of the CPU-bound serving regime. With roughly 50 input and 50 output tokens per request on H100, the GPU completes its forward pass faster than a single CPU process can prepare the next batch, which flips the standard serving-engine bottleneck assumption from GPU to CPU. Without this datum, the regime would be theoretical; with it, the regime has named hardware, a named endpoint, and a named throughput target.
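A toy model makes the bottleneck flip concrete. It assumes CPU batch preparation and the GPU forward pass are fully overlapped, so the steady-state step time is the slower of the two. All millisecond values are hypothetical and chosen only for illustration; the post does not disclose per-step timings.

```python
def step_time_ms(cpu_prep_ms: float, gpu_forward_ms: float) -> float:
    """Steady-state time per batch when CPU prep overlaps the GPU forward pass."""
    return max(cpu_prep_ms, gpu_forward_ms)


def bottleneck(cpu_prep_ms: float, gpu_forward_ms: float) -> str:
    """Name the stage that limits throughput."""
    return "cpu" if cpu_prep_ms > gpu_forward_ms else "gpu"


def gpu_idle_fraction(cpu_prep_ms: float, gpu_forward_ms: float) -> float:
    """Share of each step the GPU spends waiting for the next batch."""
    step = step_time_ms(cpu_prep_ms, gpu_forward_ms)
    return (step - gpu_forward_ms) / step


# Hypothetical large model: the forward pass dominates, the classic GPU-bound case.
large = {"cpu_prep_ms": 5.0, "gpu_forward_ms": 40.0}
# Hypothetical small model with short sequences: the forward pass finishes
# before the CPU has the next batch ready, so the CPU sets the pace.
small = {"cpu_prep_ms": 5.0, "gpu_forward_ms": 3.0}

for name, cfg in (("large", large), ("small", small)):
    print(name, bottleneck(**cfg), f"GPU idle {gpu_idle_fraction(**cfg):.0%}")
```

In the second configuration a faster GPU would not raise throughput at all; only reducing CPU-side work per batch, or spreading it over more processes, would.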

It is also the canonical instance of:

  • Hybrid-precision FP8 inference at production scale: the attention projections (Q, K, V, output) and the MLP projections run in FP8, while the KV cache stays at higher precision. The post states explicitly that "weight quantization was where the throughput wins came from and KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload." (See concepts/selective-fp8-quantization.)
  • Per-channel FP8 scaling matching or beating per-tensor scaling on quality at matched throughput. (See concepts/per-channel-vs-per-tensor-fp8-scaling.)
  • vLLM as the prequantisation library, even after migration off vLLM as the serving engine — Superhuman's ML team prequantised the checkpoint to FP8 "using vLLM's online quantization library, producing a compressed-tensor format checkpoint that Databricks loaded for serving." The post separates engine (Databricks takes over) from toolchain (vLLM stays).
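The difference between per-tensor and per-channel scaling can be sketched in plain Python. This is a simplified emulation, not the vLLM or Databricks code path: `fake_fp8_e4m3` is a hypothetical helper that rounds to an E4M3-like grid (3 mantissa bits, largest value 448, smallest normal 2^-6), and the weight matrix is made up to exaggerate the gap between channel magnitudes. Each row stands for one output channel.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite E4M3 value
MIN_NORMAL_EXP = -6    # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def fake_fp8_e4m3(x: float) -> float:
    """Round x to the nearest point of a simulated E4M3 grid (saturating)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)       # saturate instead of overflowing
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)      # below 2**-6 the grid is subnormal
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing between neighbouring values
    return math.copysign(round(mag / step) * step, x)


def quantize_row(row, scale):
    """Scale into FP8 range, round, and scale back (quantize then dequantize)."""
    return [fake_fp8_e4m3(w / scale) * scale for w in row]


def per_tensor(weights):
    """One scale for the whole matrix, set by its largest absolute value."""
    amax = max(abs(w) for row in weights for w in row)
    scale = amax / FP8_E4M3_MAX or 1.0
    return [quantize_row(row, scale) for row in weights]


def per_channel(weights):
    """One scale per row, set by that row's largest absolute value."""
    out = []
    for row in weights:
        scale = max(abs(w) for w in row) / FP8_E4M3_MAX or 1.0
        out.append(quantize_row(row, scale))
    return out


def mean_relative_error(weights, quantized):
    errs = [abs(w - q) / abs(w)
            for rw, rq in zip(weights, quantized)
            for w, q in zip(rw, rq)]
    return sum(errs) / len(errs)


# Made-up weights: one large-magnitude channel and one tiny-magnitude channel.
WEIGHTS = [[0.5, -0.25, 0.125, 0.3],
           [3e-6, -2e-6, 1e-6, 2.5e-6]]

err_tensor = mean_relative_error(WEIGHTS, per_tensor(WEIGHTS))
err_channel = mean_relative_error(WEIGHTS, per_channel(WEIGHTS))
print(f"per-tensor: {err_tensor:.3f}  per-channel: {err_channel:.3f}")
```

With a single tensor-wide scale, the small channel lands in the subnormal range and one of its weights rounds to zero. Per-channel scales give every row the full FP8 range. Whether this matters on a real checkpoint depends on how much channel magnitudes actually differ, which is why the post's result is an empirical one.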

Workload shape and traffic pattern

  • Traffic shows strong diurnal patterns, with rapid ramps in certain periods "often exceeding 200k QPS". The autoscaler is tuned with asymmetric policies (aggressive scale-up, conservative scale-down) specifically to handle these ramps.
  • 50 input tokens + 50 output tokens per request is the canonical Superhuman request shape. This is what makes the per-pod figure of 1,200 QPS on H100 meaningful: the number holds only for this shape. Doubling input or output tokens roughly doubles forward-pass cost, so the same pod would no longer deliver 1,200 QPS.
  • The user-facing surface is real-time grammar, clarity, tone, and style suggestions while the user types. Sub-second p99 is the felt-latency requirement that inline suggestion UX imposes.
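An asymmetric scaling policy of the kind described above can be sketched as follows. The post says only that scale-up is aggressive and scale-down conservative; the class name, the 70% utilisation target, the 10-tick cooldown, and the 10% shrink cap are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AsymmetricAutoscaler:
    qps_per_pod: float = 1200.0        # per-pod throughput quoted in the post
    target_utilization: float = 0.7    # assumed headroom target
    scale_down_cooldown: int = 10      # assumed: low-demand ticks required before shrinking
    max_scale_down_step: float = 0.1   # assumed: shrink at most 10% of the fleet per decision
    pods: int = 1
    _low_ticks: int = 0

    def desired(self, qps: float) -> int:
        """Pods needed to serve qps at the target utilisation."""
        return max(1, math.ceil(qps / (self.qps_per_pod * self.target_utilization)))

    def observe(self, qps: float) -> int:
        """Feed one traffic sample and return the resulting pod count."""
        want = self.desired(qps)
        if want > self.pods:
            # Aggressive scale-up: jump straight to the full target.
            self.pods = want
            self._low_ticks = 0
        elif want < self.pods:
            # Conservative scale-down: wait for sustained low demand,
            # then release only a bounded slice of the fleet.
            self._low_ticks += 1
            if self._low_ticks >= self.scale_down_cooldown:
                floor = math.ceil(self.pods * (1 - self.max_scale_down_step))
                self.pods = max(want, floor)
                self._low_ticks = 0
        else:
            self._low_ticks = 0
        return self.pods


scaler = AsymmetricAutoscaler(pods=10)
print(scaler.observe(200_000))                        # one sample at peak traffic
print([scaler.observe(50_000) for _ in range(10)])    # ten samples after traffic drops
```

The asymmetry trades idle capacity for latency safety: an under-provisioned fleet during a ramp breaks the p99 target immediately, while an over-provisioned one only costs money for a few more intervals.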

Pre-migration vs post-migration

| Dimension | Pre-migration (DIY vLLM/L40S) | Post-migration (Databricks/H100) |
|---|---|---|
| Engine | vLLM | Databricks Model Serving runtime |
| GPU class | L40S | H100 |
| Operations | Internal Superhuman ML-platform team | Joint Databricks + Superhuman |
| Capacity planning | Self-managed | Platform-managed (autoscaler) |
| Performance tuning | "months of manual performance tuning to onboard each new model iteration" | Joint shadow testing |
| Quality validation | Internal eval harness | Same eval harness, joint validation |
| Onboarding model iteration | Months | Implicitly faster; not numerically disclosed |

The post frames the migration not as a managed-vs-DIY quality trade-off but as a capacity one — the lean Superhuman ML team "needed to focus on model quality and product innovations" rather than carrying the operational burden of capacity planning, performance tuning, and autoscaler ops at 200K QPS.

SLO model

Both teams agreed on the SLOs before migration:

  • Sub-second p99 end-to-end latency.
  • Zero quality regression on Superhuman's internal evaluation harnesses.

These two SLOs are the load-bearing constraints that drove the joint-engineering investigation, including the layer-by-layer FP8 quantisation experiments and the hybrid-precision flag toggling that let quality be measured directly.
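The two SLOs combine naturally into a single pass/fail gate for a candidate configuration. The sketch below is an assumption about how such a gate could look, not the teams' actual tooling: the function names, the nearest-rank percentile, the metric names, and the "higher score is better" convention are all hypothetical.

```python
P99_BUDGET_S = 1.0   # sub-second end-to-end p99


def p99(latencies_s):
    """Nearest-rank 99th percentile of a non-empty list of latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = -(-99 * len(ordered) // 100)   # integer ceil(0.99 * n)
    return ordered[rank - 1]


def meets_slo(latencies_s, baseline_scores, candidate_scores):
    """True only if p99 is under budget and no eval metric drops below baseline."""
    latency_ok = p99(latencies_s) < P99_BUDGET_S
    quality_ok = all(
        candidate_scores.get(name, float("-inf")) >= baseline
        for name, baseline in baseline_scores.items()
    )
    return latency_ok and quality_ok


# Hypothetical shadow-test results for one quantisation configuration.
latencies = [0.2] * 990 + [1.5] * 10
baseline = {"grammar": 0.91, "tone": 0.88}
candidate = {"grammar": 0.91, "tone": 0.90}

print(p99(latencies), meets_slo(latencies, baseline, candidate))
```

Running a gate like this once per toggled layer group is one way to measure the quality impact of each FP8 choice in isolation, in the spirit of the layer-by-layer experiments the post describes.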

Caveats

  • The model architecture (transformer family, parameter count, attention pattern, layer count) is not disclosed.
  • The Superhuman ML team "prequantized the checkpoint to FP8 using vLLM's online quantization library, producing a compressed-tensor format checkpoint" — the original training precision (BF16? FP16?) is not stated.
  • Quality validation is on Superhuman's internal evaluation harnesses, not a public benchmark.
  • Multi-language support ("dozens of languages") is named but not quantified per language.

Seen in

  • sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together
