CONCEPT
Quantization
Definition
Quantization rescales tensor elements from a high-precision floating-point range into a smaller number of discrete levels represented with fewer bits: 8-bit quantization maps values to 256 bins, 4-bit to 16, and each tensor element is snapped to its nearest bin, approximating the original floating-point value. The memory footprint shrinks in proportion to the bit-width reduction. Going below 8 bits typically also requires bitpacking, because sub-byte formats aren't natively addressable by GPU load instructions (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
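As an illustration of the snapping and bitpacking described above, here is a minimal numpy sketch of symmetric per-tensor quantization; the function names are illustrative, not from any library:

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Snap each element to one of 2**bits evenly spaced levels via a single scale."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(x).max() / qmax                # per-tensor scale
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale           # x ≈ scale * q

x = np.random.randn(1024).astype(np.float32)
q8, s8 = quantize_symmetric(x, bits=8)
err = np.abs(dequantize(q8, s8) - x).max()        # at most scale / 2

# Sub-byte formats need bitpacking: two 4-bit codes share one byte,
# since GPU load instructions cannot address half a byte.
q4, s4 = quantize_symmetric(x, bits=4)
u = (q4 + 7).astype(np.uint8)                     # shift codes to unsigned [0, 14]
packed = (u[0::2] << 4) | u[1::2]                 # 512 bytes for 1024 weights
```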
Dimensions
Quantization is not one technique but a family along multiple orthogonal axes:
- Bit-width. FP16 → FP8 → FP4 → FP2/ternary/binary. Tensor Core FLOPS roughly double per halving of precision.
- Target. Weights only vs activations only vs both — see patterns/weight-only-vs-activation-quantization.
- Granularity.
- Per-tensor (one scale for the whole tensor)
- Per-channel (one scale per row or column)
- Per-group (one scale per N contiguous elements, commonly 32/64/128) — see patterns/grouped-linear-quantization
- Per-block (small tiles, each with its own scale — JetFire, DeepSeek V3)
- Symmetry.
- Symmetric: x ≈ scale × q
- Asymmetric: x ≈ scale × q + zero_point — the zero_point offset fuses nicely into modern GPU multiply-add instructions
- Representation.
- Linear — the base case; AWQ and HQQ are the canonical examples
- Non-linear — QuIP#, GPTVQ; higher accuracy at very low bits, but requires custom fused kernels and deep framework integration
- Calibration.
- Post-training (PTQ) — calibrate scales on a held-out dataset once
- Quantization-aware training (QAT) — training simulates quantization; essential for very low bit widths to preserve pre-training accuracy
- On-the-fly — HQQ can quantize to a linear 4-bit format on the fly without an offline pass, avoiding calibration complexity entirely
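To make the granularity and symmetry axes above concrete, here is a numpy sketch of per-group asymmetric quantization (group size 64, 4-bit), in the spirit of the grouped-linear pattern; the helper names are illustrative assumptions:

```python
import numpy as np

def quantize_grouped_asymmetric(w, group=64, bits=4):
    """One (scale, zero_point) pair per `group` contiguous elements:
    x ≈ scale * q + zero_point, with q an unsigned code in [0, 2**bits - 1]."""
    g = w.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)             # zero_point per group
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)           # each group spans its own range
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_grouped(q, scale, zero_point):
    return (q * scale + zero_point).reshape(-1).astype(np.float32)

w = np.random.randn(4096).astype(np.float32)
q, scale, zp = quantize_grouped_asymmetric(w)
w_hat = dequantize_grouped(q, scale, zp)
# Finer groups mean more scale metadata but tighter per-group ranges,
# hence lower error: the core granularity trade-off.
```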
Hardware constraint
Quantization formats are only as useful as the GPU matrix units' ability to consume them. Pre-MXFP formats ran into a wall: when activations and weights use different bit-widths, weights must be dequantized up to activation precision before the MMA, adding arithmetic overhead. Under compute-bound conditions this overhead can exceed the gain from narrower weights (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
The industry response is patterns/hardware-native-quantization via MXFP-class formats, where quantization metadata is consumed inside the MMA instruction (the block_scale modifier on tcgen05.mma / mma.sync) with no software dequantization step.
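The overhead the pre-MXFP path pays can be sketched in numpy (an assumption-laden illustration, not real kernel code): with mixed bit-widths, every weight is widened to activation precision by an explicit multiply before the matmul runs, which is exactly the arithmetic the block_scale path moves inside the MMA.

```python
import numpy as np

def matmul_pre_mxfp(a_fp16, w_q, scale):
    """Pre-MXFP path: low-bit weights are dequantized to activation
    precision in software before the MMA can consume them."""
    w = w_q.astype(np.float16) * scale            # explicit software widening pass
    return a_fp16 @ w                             # MMA runs at fp16 throughout

a = np.random.randn(8, 64).astype(np.float16)     # fp16 activations
w_q = np.random.randint(-8, 8, size=(64, 32)).astype(np.int8)  # 4-bit weight codes
y = matmul_pre_mxfp(a, w_q, scale=np.float16(0.05))

# MXFP-class hardware instead carries the per-block scale alongside the
# operands and applies it inside the MMA instruction itself, so the
# widening multiply above disappears from the kernel.
```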
Activation quantization regimes
8-bit activation quantization has two common realizations in modern inference (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai):
- Channel-wise. One scale per channel. Simple and efficient — rescaling fuses cheaply after the MMA. Default for on-the-fly activation quantization.
- Per-block. Small tiles of the tensor get independent scales, which limits outlier impact and reduces quantization error while still hitting practical Tensor Core speedups; particularly useful for QAT, where preserving pre-training accuracy is critical. Popularized by JetFire and DeepSeek V3.
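The difference between the two regimes is easiest to see in how the scales themselves are computed; a numpy sketch under assumed shapes (helper names are not from any library):

```python
import numpy as np

def scales_channelwise(x, qmax=127):
    """One scale per channel: cheap to compute and to fuse after the MMA."""
    return np.abs(x).max(axis=0, keepdims=True) / qmax

def scales_per_block(x, block=32, qmax=127):
    """One scale per block x block tile: an outlier only inflates its own
    tile's scale instead of an entire channel's."""
    r, c = x.shape
    tiles = x.reshape(r // block, block, c // block, block)
    return np.abs(tiles).max(axis=(1, 3)) / qmax

x = np.random.randn(128, 128).astype(np.float32)
x[0, 0] = 50.0                                    # a single activation outlier
cw = scales_channelwise(x)                        # whole channel 0 coarsens
pb = scales_per_block(x)                          # only tile (0, 0) coarsens
```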
Attention quantization
Quantization isn't limited to linear layers. Methods like FlashAttention-3 and SageAttention use 8-bit quantization inside the attention mechanism itself to accelerate attention-related matrix multiplications — useful at long context lengths, where attention dominates compute.
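A generic sketch of the idea (not FlashAttention-3's or SageAttention's actual algorithms, which add FP8 pipelining and smoothing respectively): quantize Q and K to int8, compute the score matmul in integer arithmetic, and rescale once afterwards.

```python
import numpy as np

def qk_logits_int8(q, k):
    """Attention logits with int8 Q and K, per-tensor scales, int32 accumulation."""
    sq = np.abs(q).max() / 127.0
    sk = np.abs(k).max() / 127.0
    q8 = np.round(q / sq).astype(np.int8)
    k8 = np.round(k / sk).astype(np.int8)
    logits = q8.astype(np.int32) @ k8.astype(np.int32).T  # integer MMA stand-in
    return logits.astype(np.float32) * (sq * sk) / np.sqrt(q.shape[-1])

q = np.random.randn(16, 64).astype(np.float32)
k = np.random.randn(16, 64).astype(np.float32)
approx = qk_logits_int8(q, k)                     # quantized scores
exact = (q @ k.T) / np.sqrt(64.0)                 # fp32 reference
```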
Seen in
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox's landscape survey of pre-MXFP and MXFP quantization formats; canonical source for the AWQ/HQQ/JetFire/DeepSeek-V3 taxonomy and the mapping from quantization format to Tensor Core performance behavior.
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — cross-reference: Cloudflare's Unweight is the lossless alternative to quantization for the same memory-bandwidth pressure on Hopper-class GPUs. Explicitly avoids quantization because "different 16-bit floating point values can be converted to the same 4-bit integer" and "for production inference serving diverse use cases, we knew we wanted something lossless that preserves exact model behaviour."