CONCEPT Cited by 1 source
Per-channel vs per-tensor FP8 scaling¶
Definition¶
FP8 scaling granularity is the choice of how many FP8-quantised elements share a single floating-point scale factor. Two ends of the spectrum:
- Per-tensor scaling — one scale factor for an entire weight tensor. Cheapest in metadata; coarsest in dynamic-range tracking. Off-the-shelf FP8 kernels typically default to this.
- Per-channel scaling — a separate scale factor per output channel of each linear layer. More metadata; preserves dynamic range per output column of the weight matrix.
Per-channel sits between per-tensor (1 scale per tensor) and grouped quantisation (1 scale per N-element block; see patterns/grouped-linear-quantization). For FP8 weight quantisation specifically, per-channel is the granularity at which each output column's dynamic range becomes manageable without paying the cost of indexing per-block scales inside the kernel.
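As a minimal sketch of the two granularities, assume a weight stored as a list of rows with shape [C_in][C_out] and the common absmax recipe (scale = largest magnitude divided by the largest FP8 value). The function names and the toy matrix are illustrative and not taken from the source; 448 is the largest finite value of the E4M3 format.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_scale(w):
    """One scale for the whole tensor: absmax over every element."""
    return max(abs(v) for row in w for v in row) / E4M3_MAX

def per_channel_scales(w):
    """One scale per output channel (column j of a [C_in][C_out] matrix)."""
    c_out = len(w[0])
    return [max(abs(row[j]) for row in w) / E4M3_MAX for j in range(c_out)]

# Toy weight with C_in = 2 and C_out = 3 (illustrative values only).
w = [[0.50, -8.0, 0.02],
     [-0.25, 4.0, 0.01]]

tensor_scale = per_tensor_scale(w)       # a single float
channel_scales = per_channel_scales(w)   # a list of C_out floats
```

The per-tensor scale equals the largest per-channel scale, so every other channel is quantised with a scale wider than it needs.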
Why granularity matters for FP8¶
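FP8 has far less dynamic range than BF16: 5 exponent bits (E5M2) or 4 (E4M3) versus BF16's 8. FP16 also has 5 exponent bits, but it carries 10 mantissa bits against FP8's 2 or 3. Because representable range grows exponentially with exponent bits, E4M3's largest finite value is only 448. When the output columns of a linear layer have very different typical magnitudes, a single per-tensor scale forces the smaller columns into lossy quantisation: the scale is picked for the largest column, so small-magnitude columns are pushed toward the bottom of the FP8 range, where they lose precision or underflow to zero.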
Per-channel scaling isolates each output column's dynamic range.
A channel with small values gets a tight scale; a channel with large
values gets a wider one. The MMA still runs at FP8; only the scale
metadata grows from O(1) to O(C_out) per linear layer.
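The effect can be illustrated with a simulated E4M3 rounding function written in plain Python. This is a sketch under stated assumptions: no FP8 hardware or library is involved, the rounding is saturating round-to-nearest with subnormals, and the two-column weight is deliberately exaggerated (a millionfold magnitude gap) so the failure is visible. All names are hypothetical.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 magnitude
E4M3_MIN_EXP = -6     # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3

def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                   # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)         # subnormals reuse the minimum exponent
    step = 2.0 ** (exp - MANTISSA_BITS)    # gap between neighbouring values
    return math.copysign(round(a / step) * step, x)

def fake_quant(w, scales):
    """Quantise then dequantise; scales[j] is the scale used for column j."""
    return [[round_to_e4m3(v / scales[j]) * scales[j] for j, v in enumerate(row)]
            for row in w]

def max_rel_error(w, w_hat, col):
    """Largest relative error over one column."""
    return max(abs(r[col] - q[col]) / abs(r[col]) for r, q in zip(w, w_hat))

# Column 0 is large; column 1 is a million times smaller (exaggerated on purpose).
w = [[1000.0, 0.001], [-750.0, -0.00075], [500.0, 0.0005], [250.0, 0.00025]]

absmax = [max(abs(row[j]) for row in w) for j in range(2)]
per_tensor = fake_quant(w, [max(absmax) / E4M3_MAX] * 2)
per_channel = fake_quant(w, [m / E4M3_MAX for m in absmax])

err_tensor = max_rel_error(w, per_tensor, col=1)    # small column underflows to zero
err_channel = max_rel_error(w, per_channel, col=1)  # limited to E4M3 rounding error
```

With the shared scale, every entry of the small column rounds to zero, so its relative error is 100%. With its own scale, the same column is quantised as accurately as the large one.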
Canonical wiki disclosure¶
The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first explicit disclosure of per-channel-vs-per-tensor as a production-grade FP8 design dimension for LLM serving:
"The other consideration was quantization granularity. Off-the-shelf kernels used per-tensor scaling (a single FP8 scale factor for an entire weight tensor). Databricks' kernels use per-channel scaling, computing a separate scale factor per output channel of each linear layer. This preserves dynamic range where it matters, keeps MLP-layer quantization error well below the threshold where it shows up in evals."
— (Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)
The reported result: "Combined with kernel-level improvements, per-channel quantization matched or exceeded other open source baselines at the same throughput."
How per-channel FP8 fits into the Superhuman serving stack¶
The full Superhuman FP8 configuration is:
- Quantised in FP8: attention projections (Q, K, V, output) + MLP projections — covering the dominant compute of the transformer.
- Per-channel scaling on those layers — one FP8 scale per output channel of each linear layer.
- Disabled: KV-cache FP8 quantisation (quality tradeoffs "not worth pursuing for this workload").
- Hybrid-precision toggle on the engine so any layer group can be raised back to higher precision (see patterns/toggleable-hybrid-precision-quantization).
The combination is selective + per-channel: which layers go FP8 is decided by quality measurement; how each FP8 layer scales is per-output-channel. The two granularity choices compose.
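One way to picture how the two decisions compose is a per-layer-group policy table. The sketch below is purely hypothetical: the group names, keys, the "high" precision label, and the `raise_precision` helper are illustrative and do not correspond to any disclosed Databricks or vLLM configuration.

```python
# Hypothetical policy table; keys and values are illustrative labels only.
policy = {
    "attention_proj": {"precision": "fp8", "scaling": "per_channel"},  # Q, K, V, output
    "mlp_proj": {"precision": "fp8", "scaling": "per_channel"},
    "kv_cache": {"precision": "high", "scaling": None},                # FP8 disabled
}

def raise_precision(policy, group):
    """Return a copy of the policy with one layer group back at higher precision."""
    updated = {name: dict(cfg) for name, cfg in policy.items()}
    updated[group] = {"precision": "high", "scaling": None}
    return updated

# An A/B variant that keeps MLP projections in FP8 but un-quantises attention.
ab_variant = raise_precision(policy, "attention_proj")
```

Layer selection changes which rows of the table say "fp8"; scaling granularity is a separate column, which is why the two choices can be tuned independently.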
Cost model¶
Per-channel scaling adds:
- Metadata: `C_out × sizeof(FP32 scale)` per linear layer. For an LLM with thousands of output channels per linear layer, this is a few KB to a few tens of KB per layer, negligible against the multi-MB weight tensor it scales.
- Kernel arithmetic: one extra fused multiply (the per-channel scale) on the dequant path. Modern Tensor-Core MMAs absorb this cheaply.
- Calibration cost: the scales have to be computed during prequantisation, e.g. by running a calibration pass on representative activations. Per-channel calibration is more expensive than per-tensor but still a one-time cost.
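To put rough numbers on the metadata item, here is a back-of-envelope calculation for a hypothetical 4096 × 4096 linear layer. The dimensions are an assumption chosen for illustration, not taken from the post; weights are counted at 1 byte (FP8) and scales at 4 bytes (FP32).

```python
def scale_overhead(c_in, c_out, weight_bytes=1, scale_bytes=4):
    """Return (total weight bytes, total scale bytes, scales / weights)."""
    weights = c_in * c_out * weight_bytes
    scales = c_out * scale_bytes          # one scale per output channel
    return weights, scales, scales / weights

weights, scales, frac = scale_overhead(c_in=4096, c_out=4096)
print(f"weights: {weights / 2**20:.0f} MiB, scales: {scales / 2**10:.0f} KiB")
print(f"scale metadata as a fraction of weights: {frac:.6f}")
```

For this shape the scales come to 16 KiB against 16 MiB of weights, about one part in a thousand.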
Per-channel does not change the FP8 throughput on the Tensor Core path — the compute is the same; only the scale metadata grows.
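The reason the matrix-multiply work is unchanged is algebraic: a per-channel scale is constant along the reduction dimension, so it factors out of each output's dot product and can be applied once to the accumulated result. The sketch below shows this with small integers standing in for FP8 codes, an assumption that keeps the arithmetic exact. It illustrates the general identity, not the Databricks kernels, whose implementation is not disclosed.

```python
def matvec_dequant_first(x, q, scales):
    """Reference path: dequantise every weight, then accumulate."""
    return [sum(x[k] * (q[k][j] * scales[j]) for k in range(len(x)))
            for j in range(len(scales))]

def matvec_scale_last(x, q, scales):
    """Accumulate on the raw codes, then one multiply per output channel."""
    return [scales[j] * sum(x[k] * q[k][j] for k in range(len(x)))
            for j in range(len(scales))]

x = [1.0, 2.0, -3.0]                  # activations, length C_in
q = [[4, -8], [2, 16], [-6, 1]]       # quantised weight codes, shape [C_in][C_out]
scales = [0.5, 0.25]                  # powers of two keep this example exact

a = matvec_dequant_first(x, q, scales)
b = matvec_scale_last(x, q, scales)
```

Both paths give the same outputs, but the second performs C_out scale multiplies in total instead of one per weight element.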
Relationship to grouped / hardware-native quantisation¶
patterns/grouped-linear-quantization takes the granularity even finer — one scale per 16-128-element block, used by AWQ / HQQ in software and MXFP / NVFP4 in hardware. For FP4 this finer grain is necessary because FP4's dynamic range is too narrow for per-channel to suffice. For FP8 weights Databricks reports per-channel is enough — the dynamic range available in E4M3 / E5M2 covers the per-column variation without further sub-channel grouping.
A defensible mental model:
| Bit width | Practical scaling granularity |
|---|---|
| BF16 / FP16 | None (the format itself has enough range) |
| FP8 | Per-channel sufficient for most production workloads |
| FP6 / FP4 | Per-block (16-32 elements) — see patterns/grouped-linear-quantization |
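To see how scale metadata grows down the table, the sketch below counts scale factors for a single weight matrix at each granularity. The 4096 × 4096 shape and the block size of 32 are illustrative assumptions, and blocks are taken along the input (reduction) dimension.

```python
import math

def num_scales(c_in, c_out, granularity, block=32):
    """Number of scale factors for a [c_in x c_out] weight at a given granularity."""
    if granularity == "per_tensor":
        return 1
    if granularity == "per_channel":
        return c_out
    if granularity == "per_block":
        return c_out * math.ceil(c_in / block)  # blocks run along the reduction axis
    raise ValueError(f"unknown granularity: {granularity}")

counts = {g: num_scales(4096, 4096, g)
          for g in ("per_tensor", "per_channel", "per_block")}
```

Per-block scaling multiplies the per-channel count by the number of blocks per column, which is where the extra kernel indexing comes from.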
Seen in¶
- sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together: first canonical wiki disclosure of per-channel FP8 scaling as the granularity choice that closes the quality gap on attention + MLP projection quantisation for a 200K QPS production LLM serving deployment on H100. Quote: "This preserves dynamic range where it matters, keeps MLP-layer quantization error well below the threshold where it shows up in evals." Result: matched or exceeded open-source baselines at the same throughput.
Caveats¶
- The Superhuman result is workload-specific. A different model family or task could find per-channel insufficient and need grouped quantisation; or find per-tensor sufficient and skip per-channel.
- Calibration data matters. Per-channel scales are only as representative as the activation distribution they're calibrated on; out-of-distribution inputs can still hit dynamic-range cliffs.
- The post does not disclose the calibration set, the quality-evaluation harness, or the exact quantisation kernel implementation.
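- Per-channel ≠ per-row scaling in every layout. Under the y = xW convention, with W of shape [C_in, C_out], output channels are the columns of the weight matrix; frameworks that store the transpose (PyTorch's nn.Linear keeps its weight as [out_features, in_features]) put them on the rows. The orientation matters for kernel implementation, so check which axis the scales are reduced over.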
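The orientation caveat can be made concrete. In the sketch below the same weights are stored in two layouts: [C_in][C_out], where outputs are columns, and its transpose [C_out][C_in], where outputs are rows. The helper names are illustrative, and 448 is the E4M3 maximum.

```python
def scales_outputs_in_columns(w, qmax=448.0):
    """w has shape [C_in][C_out]: reduce over rows, one scale per column."""
    return [max(abs(row[j]) for row in w) / qmax for j in range(len(w[0]))]

def scales_outputs_in_rows(w_t, qmax=448.0):
    """w_t has shape [C_out][C_in]: reduce within each row, one scale per row."""
    return [max(abs(v) for v in row) / qmax for row in w_t]

w = [[0.5, -8.0], [-0.25, 4.0], [0.1, 2.0]]   # [C_in=3][C_out=2]
w_t = [list(col) for col in zip(*w)]          # [C_out=2][C_in=3]

# Same scales either way, provided the reduction axis matches the layout.
assert scales_outputs_in_columns(w) == scales_outputs_in_rows(w_t)

# Reducing over the wrong axis yields one scale per *input* channel instead.
wrong = scales_outputs_in_rows(w)             # length C_in, not C_out
```

The mistake in the last line raises no error; it just produces scales of the wrong length and meaning, which is why the layout should be checked explicitly.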
Related¶
- concepts/quantization — the parent concept
- concepts/selective-fp8-quantization — the layer-selection decision (which layers go FP8) that composes with this granularity decision
- concepts/low-bit-inference — the broader inference family
- patterns/grouped-linear-quantization — the next-step-finer granularity (per-block) used at FP4 / FP6
- patterns/hardware-native-quantization — the Tensor Core path on which both per-tensor and per-channel FP8 ride
- patterns/weight-only-vs-activation-quantization — the orthogonal axis (which side of the MMA gets quantised)
- patterns/toggleable-hybrid-precision-quantization — the engineering primitive that made it cheap to A/B test attention quantisation while iterating on per-channel scaling
- systems/databricks-model-serving — canonical platform instance
- systems/nvidia-h100 — canonical FP8-capable hardware
- systems/vllm — Superhuman's preserved FP8 prequantisation path; the resulting compressed-tensor checkpoint is what Databricks loads under per-channel kernels