Skip to content

CONCEPT Cited by 2 sources

Quantization

Definition

Quantization rescales tensor elements from a high-precision floating-point range into a smaller number of discrete levels represented with fewer bits. 8-bit quantization maps values to 256 bins; 4-bit to 16 bins; each tensor element is snapped to the nearest bin and approximates the original floating-point value. The memory footprint shrinks proportionally to bit-width reduction. Going below 8 bits typically also requires bitpacking because sub-byte formats aren't natively addressable by GPU load instructions (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Dimensions

Quantization is not one technique but a family along multiple orthogonal axes:

  • Bit-width. FP16 → FP8 → FP4 → FP2/ternary/binary. Tensor Core FLOPS roughly double per halving of precision.
  • Target. Weights only vs activations only vs both — see patterns/weight-only-vs-activation-quantization.
  • Granularity.
  • Per-tensor (one scale for the whole tensor)
  • Per-channel (one scale per row or column)
  • Per-group (one scale per N contiguous elements, commonly 32/64/128) — see patterns/grouped-linear-quantization
  • Per-block (small tiles, each with its own scale — JetFire, DeepSeek V3)
  • Symmetry.
  • Symmetric: x ≈ scale × q
  • Asymmetric: x ≈ scale × q + zero_point — fuses nicely into modern GPU multiply-add instructions
  • Representation.
  • Linear — the base case; AWQ + HQQ canonical
  • Non-linear — QuiP#, GPTVQ; higher accuracy at very low bits but require custom fused kernels and deep framework integration
  • Calibration.
  • Post-training (PTQ) — calibrate scales on a held-out dataset once
  • Quantization-aware training (QAT) — training simulates quantization; essential for very low bit widths to preserve pre-training accuracy
  • On-the-fly — HQQ can linearly quantize 4-bit on the fly without offline passes, avoiding calibration complexity

Hardware constraint

Quantization formats are only as useful as the GPU matrix units' ability to consume them. Pre-MXFP formats ran into a wall: when activations and weights use different bit-widths, weights must be dequantized up to activation precision before the MMA, adding arithmetic overhead. Under compute-bound conditions this overhead can exceed the gain from narrower weights (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

The industry response is patterns/hardware-native-quantization via MXFP-class formats where quantization metadata is consumed inside the MMA instruction (block_scale modifier on tcgen05.mma / mma.sync) — no software dequantization step.

Activation quantization regimes

8-bit activation quantization has two common realizations in modern inference (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai):

  • Channel-wise. One scale per channel. Simple and efficient — rescaling fuses cheaply after the MMA. Default for on-the-fly activation quantization.
  • Per-block. Small tiles of the tensor get independent scales. Limits outlier impact, reduces quantization error, particularly useful for QAT where preserving pre-training accuracy is critical, while still hitting practical Tensor Core speedups. Popularized by JetFire and DeepSeek V3.

Attention quantization

Quantization isn't limited to linear layers. Methods like Flash Attention 3 and Sage Attention use 8-bit quantization inside the attention mechanism itself to accelerate attention-related matrix multiplications — useful at long context lengths where attention dominates compute.

Seen in

  • sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox's landscape survey of pre-MXFP and MXFP quantization formats; canonical source for the AWQ/HQQ/JetFire/DeepSeek-V3 taxonomy and the mapping from quantization format to Tensor Core performance behavior.
  • sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — cross-reference: Cloudflare's Unweight is the lossless alternative to quantization for the same memory-bandwidth pressure on Hopper-class GPUs. Explicitly avoids quantization because "different 16-bit floating point values can be converted to the same 4-bit integer" and "for production inference serving diverse use cases, we knew we wanted something lossless that preserves exact model behaviour."
  • sources/2025-11-13-instacart-building-the-intent-engineInstacart's explicit production-deployment decision to not ship FP8 quantization despite its latency win. On their fine-tuned Llama-3-8B SRL student on H100, FP8 cut latency by "another 10%" on top of the adapter-merged + H100 baseline, but "with a slight drop in recall. We deployed the unquantized model to prioritize quality." Canonical example of the latency-vs-quality trade-off being resolved in favour of quality in a search-relevance context, even when the headroom existed to ship quantized.
Last updated · 542 distilled / 1,571 read