PATTERN Cited by 1 source

Toggleable hybrid-precision quantization

Pattern

Design the inference engine so that any layer group can be switched into or out of FP8 (or any other lower-precision format) independently, via a flag, without changing the overall serving architecture. The pattern combines:

  • An engine that natively supports mixed precision at layer granularity — different layer groups can run in different precisions in the same forward pass.
  • A runtime / config flag per quantisable layer group that flips precision on or off without redeployment of a different binary.
  • Direct quality measurement with the flag in both states, so the team can decide empirically whether the quality cost of running each layer group in FP8 is acceptable.

The pattern's value is engineering optionality: it makes quantisation experiments cheap to run on production-shaped workloads with production-quality eval harnesses. Without it, every quantisation decision becomes a deploy-and-pray exercise.

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

Mechanism

A serving engine that implements this pattern has three properties:

  1. Hybrid-precision execution path. The engine's forward-pass pipeline can consume a tensor in FP8 from one layer group and another in BF16 from a different layer group within the same model. Casting, dequantising, and re-quantising are built into the inter-layer interfaces.
  2. Per-layer-group precision config. Layer groups are addressable in the engine's config — typically by name (e.g. attention.q_proj, attention.k_proj, attention.v_proj, attention.o_proj, mlp.up_proj, mlp.down_proj, kv_cache) — each with its own precision setting.
  3. Flag-driven toggle. Switching a layer group's precision is a config change, not a code change; deployments can carry both precisions and pick at startup or at runtime.
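
The three properties above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Databricks engine: the `PrecisionConfig` class, the `fake_fp8` helper, and the group names `attention` and `mlp` are hypothetical, the "model" is just two stacked linear maps, BF16 is treated as full precision, and FP8 is simulated by rounding ordinary floats to a 4-bit significand (E4M3-like) rather than by running real FP8 kernels.

```python
import math
from dataclasses import dataclass, field

E4M3_MAX = 448.0  # largest finite value in the common FP8 E4M3 format


def fake_fp8(x: float) -> float:
    """Crude E4M3-like rounding: keep 4 significant bits, clamp the range.

    Ignores subnormals and NaN handling; for illustration only.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)    # x == mantissa * 2**exponent
    rounded = round(mantissa * 16) / 16   # 1 implicit + 3 explicit bits
    return max(-E4M3_MAX, min(E4M3_MAX, math.ldexp(rounded, exponent)))


@dataclass
class PrecisionConfig:
    """Hypothetical per-layer-group precision table (property 2)."""
    groups: dict = field(default_factory=dict)

    def set_precision(self, group: str, precision: str) -> None:
        # Property 3: a toggle is a config change, not a code change.
        if precision not in ("fp8", "bf16"):
            raise ValueError(f"unknown precision: {precision}")
        self.groups[group] = precision

    def cast(self, group: str, values: list) -> list:
        # Property 1: casting happens at the layer-group boundary.
        if self.groups.get(group, "bf16") == "fp8":
            return [fake_fp8(v) for v in values]
        return list(values)


def linear(weights: list, x: list) -> list:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def forward(cfg: PrecisionConfig, x: list, w_attn: list, w_mlp: list) -> list:
    # Each group casts its own inputs and weights according to the config,
    # so groups in different precisions coexist in one forward pass.
    h = linear([cfg.cast("attention", r) for r in w_attn],
               cfg.cast("attention", x))
    return linear([cfg.cast("mlp", r) for r in w_mlp], cfg.cast("mlp", h))


x = [0.3, -1.7]
w_attn = [[0.11, 0.52], [-0.73, 0.29]]
w_mlp = [[0.41, -0.36]]

cfg = PrecisionConfig({"attention": "bf16", "mlp": "bf16"})
reference = forward(cfg, x, w_attn, w_mlp)

cfg.set_precision("mlp", "fp8")  # flip one group; same code path
mixed = forward(cfg, x, w_attn, w_mlp)
print(reference, mixed)
```

The point of the sketch is structural: `forward` never changes, and the only difference between the two runs is one entry in the config table.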

Why design this in upfront

The Databricks post is explicit about the design choice:

"Databricks model serving had designed the serving engine to support hybrid-precision inference from the start, so that if any layer group proved too quality-sensitive under quantization, we could keep it in higher precision without changing the overall serving architecture. We shipped a flag that enabled us to toggle attention quantization on and off, so both teams could measure its impact directly. The experiment landed cleanly, quantizing the Q/K/V and output projections produced no measurable quality degradation on Superhuman's evals."

The pre-emptive engineering paid off: the team could run the attention-FP8 experiment once, with both precisions co-resident in the engine, and read off the quality result on Superhuman's internal eval harness. No engine fork, no deploy churn.
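
A minimal sketch of what "measure with the flag in both states" can look like in code. Everything here is hypothetical: `measure_toggle`, the config shape, the tolerance, and especially `stub_eval`, whose penalty numbers are invented so the example runs and are not Superhuman's or Databricks' results. In practice `run_eval` would be the customer's real eval harness.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def measure_toggle(run_eval: Callable[[Config], float],
                   base_config: Config,
                   group: str,
                   tolerance: float) -> dict:
    """Run the same eval with `group` in bf16 and in fp8 and compare scores."""
    scores = {}
    for precision in ("bf16", "fp8"):
        # Copy the config so only the group under test differs between runs.
        config = {**base_config, group: precision}
        scores[precision] = run_eval(config)
    regression = scores["bf16"] - scores["fp8"]
    return {
        "scores": scores,
        "regression": regression,
        "keep_fp8": regression <= tolerance,
    }


# Invented per-group quality penalties, used only to make the stub runnable.
STUB_FP8_PENALTY = {"attention": 0.001, "mlp": 0.001, "kv_cache": 0.03}


def stub_eval(config: Config) -> float:
    """Stand-in for a real eval harness; returns a made-up quality score."""
    penalty = sum(STUB_FP8_PENALTY[g] for g, p in config.items() if p == "fp8")
    return 0.90 - penalty


base = {"attention": "bf16", "mlp": "fp8", "kv_cache": "bf16"}
attention_result = measure_toggle(stub_eval, base, "attention", tolerance=0.005)
kv_result = measure_toggle(stub_eval, base, "kv_cache", tolerance=0.005)
print(attention_result["keep_fp8"], kv_result["keep_fp8"])
```

The decision rule is deliberately simple. The useful property is that both runs share one engine and one eval, so the score difference is attributable to the flag alone.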

The same flag-driven structure also supports:

  • Customer-specific precision policy. A different model could flip FP8 attention off if its eval shows quality regression.
  • Incremental rollout. FP8 MLP first, attention later, KV-cache never (in this case) — each gated independently.
  • Quick rollback. A regression in production can be reversed with a config flip rather than a binary rollback.
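
Incremental rollout and config-level rollback can be sketched as a small history of precision tables. The `PrecisionRollout` class and the group names are assumptions made for illustration; the post does not describe how the real flag is stored or flipped.

```python
class PrecisionRollout:
    """Hypothetical history of per-layer-group precision configs."""

    def __init__(self, initial: dict):
        self._history = [dict(initial)]

    @property
    def active(self) -> dict:
        # Return a copy so callers cannot mutate the history by accident.
        return dict(self._history[-1])

    def flip(self, group: str, precision: str) -> None:
        """Record a new config that differs from the active one in one group."""
        if group not in self._history[-1]:
            raise KeyError(f"unknown layer group: {group}")
        updated = self.active
        updated[group] = precision
        self._history.append(updated)

    def rollback(self) -> None:
        """Revert the most recent flip; the initial config is never removed."""
        if len(self._history) > 1:
            self._history.pop()


rollout = PrecisionRollout(
    {"mlp": "bf16", "attention": "bf16", "kv_cache": "bf16"}
)
rollout.flip("mlp", "fp8")        # stage 1: MLP first
rollout.flip("attention", "fp8")  # stage 2: attention later
rollout.rollback()                # regression seen: undo stage 2 only
print(rollout.active)
```

Because each stage changes exactly one group, a rollback undoes exactly one decision and leaves earlier, already-validated stages in place.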

Canonical wiki disclosure: Superhuman 200K QPS LLM serving

The Superhuman result, in the canonical configuration:

| Layer group | Precision | Decision basis |
|---|---|---|
| Attention Q, K, V, output projections | FP8 | Toggle experiment showed "no measurable quality degradation" |
| MLP projections (up, down) | FP8 | Quantised from the start |
| KV cache | Higher precision (kept) | "KV-cache quantization introduced its own quality tradeoffs that weren't worth pursuing for this workload" |

The hybrid-precision capability is what made the attention decision (first row) a measurement, not a guess, and the KV-cache decision (third row) a choice, not a constraint of the engine.

When to use

  • Engine team and customer both care about quality validation. The flag is the meeting point for joint engineering.
  • Multiple quantisable layer groups with different sensitivity profiles. Single-flag global FP8 doesn't need this; selective / per-layer FP8 does.
  • Customers with high-quality internal eval harnesses that can detect small regressions. Without a sharp eval, the toggle has no signal to drive decisions.
  • Production inference where rollback safety matters.

When not to use

  • Single-precision deployments where quantisation is one binary choice. The flag adds complexity without payoff.
  • Aggressive throughput-only optimisation with no quality budget. The flag's value is in surfacing the quality cost of each layer's quantisation; without a quality concern, the decision is trivial.
  • Engines built around fixed kernel paths (e.g. naive per-tensor FP8 from off-the-shelf kernels) where adding hybrid precision is a major rewrite.

Engineering economics

The pattern has up-front cost — building hybrid-precision into the engine before it's strictly needed — paid back when:

  • A new model needs different layer-precision policy than the one before. No engine change.
  • A customer challenges a precision choice on quality grounds. Reverse the flag in their endpoint; measure; ship the right config.
  • The hardware generation changes (FP8 paths get faster / add new layer support); flipping more layer groups becomes attractive without engine rewrite.

The Databricks post presents this as a deliberate design investment that paid off during the Superhuman engagement.

Caveats

  • The post does not disclose how many layer groups are individually addressable in the engine config, nor the granularity at which the flag operates (per-decoder-block? per-layer? per-projection?).
  • The flag is internal to Databricks Model Serving — not exposed as a customer-facing API.
  • Quality measurement is on the customer's internal evals (Superhuman in this case); the engine flag exists, but the decision still depends on having a sharp eval.