CONCEPT Cited by 1 source

Per-channel vs per-tensor FP8 scaling

Definition

FP8 scaling granularity is the choice of how many FP8-quantised elements share a single floating-point scale factor. Two ends of the spectrum:

  • Per-tensor scaling: one scale factor for an entire weight tensor. Cheapest in metadata; coarsest in dynamic-range tracking. Off-the-shelf FP8 kernels typically default to this.
  • Per-channel scaling: a separate scale factor per output channel of each linear layer. More metadata; preserves dynamic range per output column of the weight matrix.

Per-channel sits between per-tensor (1 scale per tensor) and grouped quantisation (1 scale per N-element block; see patterns/grouped-linear-quantization). For FP8 weight quantisation specifically, per-channel is the granularity at which the column-to-column spread in weight magnitudes becomes manageable without paying the per-block kernel-indexing cost.
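The difference between the two granularities can be sketched in a few lines. This is a minimal illustration, assuming absmax scaling (scale = largest magnitude / largest representable FP8 value) and a weight stored as (C_in, C_out) so that output channels are columns. The function names and the toy matrix are hypothetical and not taken from the source; 448 is the largest finite magnitude in the common E4M3 encoding.

```python
E4M3_MAX = 448.0  # largest finite magnitude in the common FP8 E4M3 encoding

def per_tensor_scale(w):
    """One scale for the whole matrix: absmax over every element."""
    absmax = max(abs(v) for row in w for v in row)
    return absmax / E4M3_MAX

def per_channel_scales(w):
    """One scale per output channel (assumed here to be the columns of w)."""
    n_out = len(w[0])
    return [max(abs(row[j]) for row in w) / E4M3_MAX for j in range(n_out)]

# Toy weight of shape (C_in=3, C_out=2): column 0 is small, column 1 is large.
w = [[0.01, 4.0],
     [-0.02, -8.0],
     [0.005, 2.0]]

s_tensor = per_tensor_scale(w)     # set entirely by the large column
s_channel = per_channel_scales(w)  # column 0 gets its own, much tighter scale
```

In this toy case the per-tensor scale equals the scale of the large column, while the small column's per-channel scale is 400 times smaller. An all-zero column would produce a zero scale, so a real implementation would need a floor on the absmax; that guard is omitted here.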

Why granularity matters for FP8

FP8 has far less dynamic range than BF16: 5 exponent bits (E5M2) or 4 (E4M3) against BF16's 8. FP16 also has 5 exponent bits, but 10 mantissa bits against FP8's 2 or 3. When the output columns of a linear layer have very different typical magnitudes, a single per-tensor scale is set by the largest column, so the smaller columns are pushed to the bottom of FP8's range, where values lose precision or flush to zero.

Per-channel scaling isolates each output column's dynamic range. A channel with small values gets a tight scale; a channel with large values gets a wider one. The MMA still runs at FP8; only the scale metadata grows from O(1) to O(C_out) per linear layer.
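To make the effect concrete, the sketch below simulates E4M3 rounding in software and compares the reconstruction error of a small-magnitude column under a shared scale and under its own scale. It is an illustration under stated assumptions, not the kernel described in the source: `round_to_e4m3` is a hypothetical helper modelling round-to-nearest-even with saturation for the E4M3 encoding (max 448, smallest normal 2**-6, 3 mantissa bits), and the column values are made up with an exaggerated spread so the effect is visible.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 magnitude
E4M3_MIN_EXP = -6    # exponent of the smallest normal value, 2**-6
MANTISSA_BITS = 3

def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)                 # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, E4M3_MIN_EXP)         # clamp small values into the subnormal binade
    step = 2.0 ** (e - MANTISSA_BITS)      # spacing of representable values near a
    return math.copysign(round(a / step) * step, x)

def max_rel_error(values, scale):
    """Worst relative error after quantise -> dequantise with the given scale."""
    recon = [round_to_e4m3(v / scale) * scale for v in values]
    return max(abs(r - v) / abs(v) for v, r in zip(values, recon))

small_col = [1e-4, -2e-4, 1.3e-5]   # hypothetical low-magnitude output channel
large_col = [4.0, -8.0, 2.0]        # hypothetical high-magnitude output channel

s_tensor = max(abs(v) for v in small_col + large_col) / E4M3_MAX
s_small = max(abs(v) for v in small_col) / E4M3_MAX

err_tensor = max_rel_error(small_col, s_tensor)   # shared scale, set by large_col
err_channel = max_rel_error(small_col, s_small)   # the column's own scale
```

With the shared scale, the smallest entry of `small_col` falls below half the smallest E4M3 subnormal and is flushed to zero. With its own scale, every entry stays in the normal range, where rounding error is bounded by half a mantissa step (at most 1/16 relative).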

Canonical wiki disclosure

The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first explicit disclosure of per-channel-vs-per-tensor as a production-grade FP8 design dimension for LLM serving:

"The other consideration was quantization granularity. Off-the-shelf kernels used per-tensor scaling (a single FP8 scale factor for an entire weight tensor). Databricks' kernels use per-channel scaling, computing a separate scale factor per output channel of each linear layer. This preserves dynamic range where it matters, keeps MLP-layer quantization error well below the threshold where it shows up in evals."

— (Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

The reported result: "Combined with kernel-level improvements, per-channel quantization matched or exceeded other open source baselines at the same throughput."

How per-channel FP8 fits into the Superhuman serving stack

The full Superhuman FP8 configuration is:

  • Quantised in FP8: attention projections (Q, K, V, output) + MLP projections — covering the dominant compute of the transformer.
  • Per-channel scaling on those layers — one FP8 scale per output channel of each linear layer.
  • Disabled: KV-cache FP8 quantisation (quality tradeoffs "not worth pursuing for this workload").
  • Hybrid-precision toggle on the engine so any layer group can be raised back to higher precision (see patterns/toggleable-hybrid-precision-quantization).

The combination is selective + per-channel: which layers go FP8 is decided by quality measurement; how each FP8 layer scales is per-output-channel. The two granularity choices compose.

Cost model

Per-channel scaling adds:

  • Metadata: C_out × sizeof(FP32 scale) per linear layer. For an LLM with thousands of output channels per linear layer, this is a few KB to a few tens of KB per layer, negligible against the multi-MB weight tensor it scales.
  • Kernel arithmetic: one extra multiply per output element (the per-channel scale) on the dequant path. Modern Tensor Core kernels absorb this cheaply.
  • Calibration cost: the scales have to be computed during prequantisation, e.g. by running a calibration pass on representative activations. Per-channel calibration is more expensive than per-tensor but still a one-time cost.
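The metadata term is simple arithmetic. The sketch below works it through for one hypothetical 4096 × 4096 linear layer with 1-byte FP8 weights and 4-byte FP32 scales; the layer shape is an assumption chosen for illustration, not a figure from the source.

```python
def scale_overhead(c_in, c_out, weight_bytes=1, scale_bytes=4):
    """Return (weight bytes, per-channel scale bytes, scale/weight ratio)."""
    weight_total = c_in * c_out * weight_bytes   # FP8 weights: 1 byte each
    scale_total = c_out * scale_bytes            # one FP32 scale per output channel
    return weight_total, scale_total, scale_total / weight_total

# Hypothetical layer shape, chosen only for illustration.
weights, scales, ratio = scale_overhead(c_in=4096, c_out=4096)
# weights is 16 MiB, scales is 16 KiB, and the ratio is 4 / c_in.
```

Because the ratio reduces to scale_bytes / (c_in × weight_bytes), the relative overhead shrinks as the input dimension grows and does not depend on the number of output channels.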

Per-channel does not change the FP8 throughput on the Tensor Core path — the compute is the same; only the scale metadata grows.
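The reason the matmul itself is untouched is that a per-channel scale is constant along the reduction axis, so it factors out of each dot product and can be applied once per output element afterwards. The sketch below checks that identity on a toy example. It assumes weights stored as (C_in, C_out), ignores any activation scale, and uses a plain Python loop as a stand-in for the FP8 MMA; all names and values are illustrative.

```python
def matmul(a, b):
    """Plain matmul standing in for the FP8 Tensor Core MMA."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def scaled_matmul(x_q, w_q, w_scales):
    """Multiply quantised values first, then apply one scale per output column."""
    acc = matmul(x_q, w_q)
    return [[acc[i][j] * w_scales[j] for j in range(len(w_scales))]
            for i in range(len(acc))]

def dequant_then_matmul(x_q, w_q, w_scales):
    """Reference path: expand the weights to real values, then multiply."""
    w = [[w_q[k][j] * w_scales[j] for j in range(len(w_scales))]
         for k in range(len(w_q))]
    return matmul(x_q, w)

x_q = [[1.0, 2.0, -3.0]]                       # one row of quantised activations
w_q = [[8.0, -16.0], [4.0, 32.0], [-2.0, 64.0]]  # quantised weights, (C_in=3, C_out=2)
w_scales = [0.5, 0.125]                        # one scale per output column

fast = scaled_matmul(x_q, w_q, w_scales)
reference = dequant_then_matmul(x_q, w_q, w_scales)
```

Both paths give the same result here. The toy values use power-of-two scales so the equality is exact in floating point; with arbitrary scales the two paths agree only up to rounding.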

Relationship to grouped / hardware-native quantisation

patterns/grouped-linear-quantization takes the granularity even finer — one scale per 16-128-element block, used by AWQ / HQQ in software and MXFP / NVFP4 in hardware. For FP4 this finer grain is necessary because FP4's dynamic range is too narrow for per-channel to suffice. For FP8 weights Databricks reports per-channel is enough — the dynamic range available in E4M3 / E5M2 covers the per-column variation without further sub-channel grouping.
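A rough sketch of what the finer grain means in practice, assuming absmax scales and blocks taken along the input (reduction) axis of each output column. The helper names and the layer shape are hypothetical; the value 6.0 used as the format maximum is the largest magnitude of the E2M1 FP4 encoding.

```python
def per_block_scales(column, block_size, fmt_max):
    """One absmax scale per contiguous block of block_size elements."""
    return [max(abs(v) for v in column[i:i + block_size]) / fmt_max
            for i in range(0, len(column), block_size)]

def number_of_scales(c_in, c_out, granularity, block_size=32):
    """How many scale factors one (c_in, c_out) weight needs."""
    if granularity == "per_tensor":
        return 1
    if granularity == "per_channel":
        return c_out
    if granularity == "per_block":
        blocks_per_column = -(-c_in // block_size)  # ceiling division
        return c_out * blocks_per_column
    raise ValueError(f"unknown granularity: {granularity}")

# One toy column split into two blocks of two elements each.
block_scales = per_block_scales([0.1, -0.2, 3.0, 6.0], block_size=2, fmt_max=6.0)

# Scale counts for a hypothetical 4096 x 4096 layer.
counts = {g: number_of_scales(4096, 4096, g)
          for g in ("per_tensor", "per_channel", "per_block")}
```

Unlike a per-channel scale, a per-block scale changes along the reduction axis, so it cannot be factored out of the whole dot product and has to be handled inside the kernel's inner loop or by the hardware.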

A defensible mental model:

Bit width     | Practical scaling granularity
BF16 / FP16   | None (the format itself has enough range)
FP8           | Per-channel sufficient for most production workloads
FP6 / FP4     | Per-block (16-32 elements); see patterns/grouped-linear-quantization

Seen in

  • sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together: first canonical wiki disclosure of per-channel FP8 scaling as the granularity choice that closes the quality gap on attention + MLP projection quantisation for a 200K QPS production LLM serving deployment on H100. Quote: "This preserves dynamic range where it matters, keeps MLP-layer quantization error well below the threshold where it shows up in evals." Result: matched or exceeded open-source baselines at the same throughput.

Caveats

  • The Superhuman result is workload-specific. A different model family or task could find per-channel insufficient and need grouped quantisation; or find per-tensor sufficient and skip per-channel.
  • Calibration data matters. Per-channel scales are only as representative as the activation distribution they're calibrated on; out-of-distribution inputs can still hit dynamic-range cliffs.
  • The post does not disclose the calibration set, the quality-evaluation harness, or the exact quantisation kernel implementation.
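  • Per-channel ≠ per-row scaling. Which axis indexes the output channels depends on how the weight is stored: for y = xW with W of shape (C_in, C_out), output channels are the columns of a row-major weight matrix; frameworks that store the transpose, as PyTorch's nn.Linear does with (out_features, in_features), put them on the rows. The orientation matters for kernel implementation.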
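The orientation caveat can be illustrated with a small sketch in which the caller states which axis indexes the output channels. It assumes absmax scaling against the E4M3 maximum of 448; the function name and the toy weights are hypothetical.

```python
def channel_scales(w, out_axis, fmt_max=448.0):
    """Absmax scale per output channel; out_axis says which axis indexes channels."""
    if out_axis == 1:        # w is (C_in, C_out): output channels are columns
        channels = zip(*w)
    elif out_axis == 0:      # w is (C_out, C_in): output channels are rows
        channels = w
    else:
        raise ValueError("out_axis must be 0 or 1")
    return [max(abs(v) for v in ch) / fmt_max for ch in channels]

w_in_out = [[0.01, 4.0], [-0.02, -8.0], [0.005, 2.0]]   # stored as (C_in=3, C_out=2)
w_out_in = [list(col) for col in zip(*w_in_out)]        # same weights, transposed

scales_a = channel_scales(w_in_out, out_axis=1)
scales_b = channel_scales(w_out_in, out_axis=0)
wrong = channel_scales(w_in_out, out_axis=0)   # reduces over the wrong axis
```

The first two calls agree because they describe the same output channels in two layouts. The third silently produces three scales, one per input row, which is the kind of mistake the caveat warns about.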