
CONCEPT

Quantization

Definition

Quantization rescales tensor elements from a high-precision floating-point range into a small number of discrete levels represented with fewer bits. 8-bit quantization maps values to 256 bins, 4-bit to 16; each tensor element is snapped to the nearest bin, approximating the original floating-point value. The memory footprint shrinks in proportion to the bit-width reduction. Going below 8 bits typically also requires bitpacking, because sub-byte formats aren't natively addressable by GPU load instructions (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
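The mapping can be sketched in NumPy. This is a minimal illustration, not code from the source: function names are mine, and it assumes symmetric per-tensor scaling for simplicity.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int):
    """Snap each element to the nearest of 2**bits evenly spaced levels
    (256 bins for 8-bit, 16 for 4-bit), using one symmetric scale for
    the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floating-point values."""
    return q.astype(np.float32) * scale

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Bitpack two 4-bit codes per byte: sub-byte values aren't natively
    addressable, so they travel in uint8 containers."""
    nib = (q.astype(np.int16) & 0xF).astype(np.uint8)  # two's-complement nibbles
    return nib[0::2] | (nib[1::2] << 4)

x = np.array([0.9, -0.6, 0.3, -1.0], dtype=np.float32)
q8, s8 = quantize_symmetric(x, bits=8)  # int8 codes: one full byte per value
q4, s4 = quantize_symmetric(x, bits=4)  # int4 codes, still one byte each here
packed = pack_int4(q4)                  # 4 values -> 2 bytes after bitpacking
```

Round-tripping through `dequantize` bounds the per-element error by half a bin width, which is why narrower bit-widths (larger bins) trade accuracy for memory.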

Dimensions

Quantization is not one technique but a family along multiple orthogonal axes:

  • Bit-width. FP16 → FP8 → FP4 → FP2/ternary/binary. Tensor Core FLOPS roughly double per halving of precision.
  • Target. Weights only vs activations only vs both — see patterns/weight-only-vs-activation-quantization.
  • Granularity.
      • Per-tensor (one scale for the whole tensor)
      • Per-channel (one scale per row or column)
      • Per-group (one scale per N contiguous elements, commonly 32/64/128) — see patterns/grouped-linear-quantization
      • Per-block (small tiles, each with its own scale — JetFire, DeepSeek V3)
  • Symmetry.
      • Symmetric: x ≈ scale × q
      • Asymmetric: x ≈ scale × q + zero_point — fuses nicely into modern GPU multiply-add instructions
  • Representation.
      • Linear — the base case; AWQ and HQQ are canonical examples
      • Non-linear — QuIP#, GPTVQ; higher accuracy at very low bits, but these require custom fused kernels and deep framework integration
  • Calibration.
      • Post-training (PTQ) — calibrate scales once on a held-out dataset
      • Quantization-aware training (QAT) — training simulates quantization; essential at very low bit widths to preserve pre-training accuracy
      • On-the-fly — HQQ can linearly quantize to 4-bit on the fly, with no offline calibration pass
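The granularity axis is the easiest to make concrete. The sketch below (illustrative names, a toy group size of 4 instead of the usual 32/64/128, 4-bit symmetric quantization) shows why per-group scales beat a single per-tensor scale when a tensor contains an outlier:

```python
import numpy as np

def scales_per_tensor(w, qmax=7):
    return np.array([np.abs(w).max() / qmax])          # one scale for everything

def scales_per_group(w, group=4, qmax=7):
    # One scale per N contiguous elements (commonly 32/64/128;
    # group=4 keeps the example readable).
    return np.abs(w).reshape(-1, group).max(axis=1) / qmax

def dequant_roundtrip(w, scale, qmax=7):
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # snap to 16 bins (4-bit)
    return q * scale

w = np.array([0.1, 0.2, -0.1, 0.15,    # well-behaved values
              0.1, 8.0, -0.2, 0.05],   # group containing a large outlier
             dtype=np.float32)

s_tensor = scales_per_tensor(w)        # dominated by the 8.0 outlier
s_groups = scales_per_group(w)         # outlier only affects its own group

clean = w[:4]
err_shared  = np.abs(dequant_roundtrip(clean, s_tensor[0]) - clean).max()
err_grouped = np.abs(dequant_roundtrip(clean, s_groups[0]) - clean).max()
# err_grouped is far smaller: the outlier no longer inflates this group's scale
```

With the shared scale, every small value in the clean group rounds to the same bin (or to zero); with its own group scale, the clean group keeps most of the 16 bins for itself.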

Hardware constraint

Quantization formats are only as useful as the GPU matrix units' ability to consume them. Pre-MXFP formats ran into a wall: when activations and weights use different bit-widths, weights must be dequantized up to activation precision before the MMA, adding arithmetic overhead. Under compute-bound conditions this overhead can exceed the gain from narrower weights (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

The industry response is patterns/hardware-native-quantization via MXFP-class formats where quantization metadata is consumed inside the MMA instruction (block_scale modifier on tcgen05.mma / mma.sync) — no software dequantization step.
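A toy NumPy model of the pre-MXFP penalty (illustrative only; in a real kernel both steps run inside fused GPU code, and the dequantization pass is what the hardware-native path eliminates):

```python
import numpy as np

def matmul_w4a16_software_dequant(x_fp16, q_w, w_scale):
    """Pre-MXFP path for mixed bit-widths: narrow integer weights are first
    expanded to the activations' precision (an extra elementwise pass over
    the whole weight tensor), and only then fed to the MMA."""
    w_fp16 = q_w.astype(np.float16) * np.float16(w_scale)  # software dequant pass
    return x_fp16 @ w_fp16                                  # the actual matmul

# With MXFP-class formats the per-block scale is a modifier consumed inside
# the MMA instruction itself, so the expansion pass above disappears.
```

When the matmul is compute-bound, that extra elementwise pass is pure overhead, which is how it can outweigh the bandwidth savings of narrower weights.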

Activation quantization regimes

8-bit activation quantization has two common realizations in modern inference (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai):

  • Channel-wise. One scale per channel. Simple and efficient — rescaling fuses cheaply after the MMA. Default for on-the-fly activation quantization.
  • Per-block. Small tiles of the tensor get independent scales. This limits outlier impact and reduces quantization error, which makes it particularly useful for QAT, where preserving pre-training accuracy is critical, while still delivering practical Tensor Core speedups. Popularized by JetFire and DeepSeek V3.
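The two regimes differ only in how the scales are shaped. A minimal sketch (function names mine; a 2×2 block stands in for the larger tiles used in practice):

```python
import numpy as np

def channelwise_scales(a, qmax=127):
    """One scale per channel (column): cheap to compute on the fly, and the
    per-channel rescale fuses easily into the epilogue after the MMA."""
    return np.abs(a).max(axis=0) / qmax              # shape: (channels,)

def per_block_scales(a, block=2, qmax=127):
    """One independent scale per block x block tile: an outlier only inflates
    the scale of its own tile (real tiles are larger than 2x2)."""
    t, c = a.shape
    tiles = a.reshape(t // block, block, c // block, block)
    return np.abs(tiles).max(axis=(1, 3)) / qmax     # shape: (t//block, c//block)
```

Channel-wise produces a vector of scales (one multiply per output channel after the MMA); per-block produces a small grid of scales, which costs more bookkeeping but confines each outlier to its own tile.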

Attention quantization

Quantization isn't limited to linear layers. Methods like FlashAttention-3 and SageAttention use 8-bit quantization inside the attention mechanism itself to accelerate attention-related matrix multiplications — useful at long context lengths, where attention dominates compute.
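A hedged sketch of the core idea for the score matmul, in NumPy rather than a fused kernel (the function name and the per-tensor scaling choice are mine, not from the source; the cited methods use finer granularity and other refinements):

```python
import numpy as np

def attention_scores_int8(q, k, qmax=127):
    """Quantize Q and K to int8, run the Q @ K^T matmul in integer arithmetic
    (int32 accumulation, as on int8 Tensor Cores), then undo both scales.
    Softmax and the rest of attention stay in floating point."""
    sq = float(np.abs(q).max()) / qmax
    sk = float(np.abs(k).max()) / qmax
    qi = np.round(q / sq).astype(np.int8)
    ki = np.round(k / sk).astype(np.int8)
    scores = qi.astype(np.int32) @ ki.astype(np.int32).T  # integer matmul
    return scores.astype(np.float32) * np.float32(sq * sk)
```

Because both scale factors are scalars here, the correction is a single multiply after the matmul; the integer Q·Kᵀ is exactly the piece that dominates compute at long context lengths.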
