PATTERN Cited by 1 source
Toggleable hybrid-precision quantization¶
Pattern¶
Design the inference engine so that any layer group can be toggled on/off into FP8 (or any lower-precision format) independently, via a flag, without changing the overall serving architecture. The pattern combines:
- An engine that natively supports mixed precision at layer granularity — different layer groups can run in different precisions in the same forward pass.
- A runtime / config flag per quantisable layer group that flips precision on or off without redeployment of a different binary.
- Direct quality measurement with the flag in both states so the team can decide empirically whether each layer group's FP8 cost is acceptable.
The pattern's value is engineering optionality: it makes quantisation experiments cheap to run on production-shaped workloads with production-quality eval harnesses. Without it, every quantisation decision is a deploy-and-pray.
Mechanism¶
A serving engine that implements this pattern has three properties:
- Hybrid-precision execution path. The engine's forward-pass pipeline can consume a tensor in FP8 from one layer group and another in BF16 from a different layer group within the same model. Cast / dequantise / re-quantise is built into the inter-layer interfaces.
- Per-layer-group precision config. Layer groups are
addressable in the engine's config — typically by name (e.g.
attention.q_proj,attention.k_proj,attention.v_proj,attention.o_proj,mlp.up_proj,mlp.down_proj,kv_cache) — each with its own precision setting. - Flag-driven toggle. Switching a layer group's precision is a config change, not a code change; deployments can carry both precisions and pick at startup or at runtime.
Why design this in upfront¶
The Databricks post is explicit about the design choice:
"Databricks model serving had designed the serving engine to support hybrid-precision inference from the start, so that if any layer group proved too quality-sensitive under quantization, we could keep it in higher precision without changing the overall serving architecture. We shipped a flag that enabled us to toggle attention quantization on and off, so both teams could measure its impact directly. The experiment landed cleanly, quantizing the Q/K/V and output projections produced no measurable quality degradation on Superhuman's evals."
The pre-emptive engineering paid off: the team could run the attention-FP8 experiment once, with both precisions co-resident in the engine, and read off the quality result on Superhuman's internal eval harness. No engine fork, no deploy churn.
The same flag-driven structure also supports:
- Customer-specific precision policy. A different model could flip FP8 attention off if its eval shows quality regression.
- Incremental rollout. FP8 MLP first, attention later, KV-cache never (in this case) — each gated independently.
- Quick rollback. A regression in production can be reversed with a config flip rather than a binary rollback.
Canonical wiki disclosure: Superhuman 200K QPS LLM serving¶
The Superhuman result, in the canonical configuration:
| Layer group | Precision | Decision basis |
|---|---|---|
| Attention Q, K, V, output projections | FP8 | Toggle experiment showed "no measurable quality degradation" |
| MLP projections (up, down) | FP8 | Quantised from the start |
| KV cache | higher precision (kept) | "KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload" |
The hybrid-precision capability is what made the second-row attention decision a measurement, not a guess; and the third-row KV-cache decision a choice, not a constraint of the engine.
When to use¶
- Engine team and customer both care about quality validation. The flag is the meeting point for joint engineering.
- Multiple quantisable layer groups with different sensitivity profiles. Single-flag global FP8 doesn't need this; selective / per-layer FP8 does.
- Customers with high-quality internal eval harnesses that can detect small regressions. Without a sharp eval, the toggle has no signal to drive decisions.
- Production inference where rollback safety matters.
When not to use¶
- Single-precision deployments where quantisation is one binary choice. The flag adds complexity without payoff.
- Aggressive throughput-only optimisation with no quality budget. The flag's value is in surfacing the quality cost of each layer's quantisation; without a quality concern, the decision is trivial.
- Engines built around fixed kernel paths (e.g. naive per-tensor FP8 from off-the-shelf kernels) where adding hybrid precision is a major rewrite.
Sibling patterns¶
- concepts/selective-fp8-quantization — what to quantise: per-layer micro-benchmark-guided selection of which layers tolerate FP8 (Meta MARM canonical instance). Toggleable hybrid-precision is the engine-level engineering primitive that makes selective FP8 cheap to operate and update.
- patterns/selective-mixed-precision-quantization — the Pinterest / Meta-style pattern of running different precisions for different layers on the same inference call.
- patterns/weight-only-vs-activation-quantization — the orthogonal axis (which side of the MMA gets quantised); a toggleable engine can flip this too.
- patterns/hardware-native-quantization — the substrate (Tensor Core block-scale instructions) on which toggleable hybrid-precision actually executes.
- concepts/per-channel-vs-per-tensor-fp8-scaling — the granularity decision; orthogonal to the toggle but co-tunable on the same engine.
Engineering economics¶
The pattern has up-front cost — building hybrid-precision into the engine before it's strictly needed — paid back when:
- A new model needs different layer-precision policy than the one before. No engine change.
- A customer challenges a precision choice on quality grounds. Reverse the flag in their endpoint; measure; ship the right config.
- The hardware generation changes (FP8 paths get faster / add new layer support); flipping more layer groups becomes attractive without engine rewrite.
The Databricks post canonicalises this as a deliberate design investment that paid off during the Superhuman engagement.
Seen in¶
- sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical wiki instance. Databricks Model Serving's serving engine was "designed to support hybrid-precision inference from the start". The attention-FP8 toggle was used during the Superhuman engagement to measure quality impact directly; result shipped to production with attention + MLP in FP8 (per-channel scaled) and KV cache in higher precision. Combined with FP8's ~30% throughput gain on H100.
Caveats¶
- The post does not disclose how many layer groups are individually addressable in the engine config, nor the granularity at which the flag operates (per-decoder-block? per-layer? per-projection?).
- The flag is internal to Databricks Model Serving — not exposed as a customer-facing API.
- Quality measurement is on the customer's internal evals (Superhuman in this case); the engine flag exists, but the decision still depends on having a sharp eval.
Related¶
- concepts/quantization
- concepts/selective-fp8-quantization
- concepts/per-channel-vs-per-tensor-fp8-scaling
- concepts/low-bit-inference
- patterns/selective-mixed-precision-quantization
- patterns/weight-only-vs-activation-quantization
- patterns/grouped-linear-quantization
- patterns/hardware-native-quantization
- systems/databricks-model-serving