Grouped linear quantization

Pattern

Share a scale (and optionally a zero-point) across a contiguous group of tensor elements — typically 32, 64, or 128 — rather than one scale for the whole tensor (too coarse) or one per element (prohibitive metadata cost). Grouping substantially reduces quantization error at low bit widths by letting each group track its own dynamic range, while keeping metadata overhead bounded (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

The two forms

  • Symmetric: x ≈ scale · q (no zero-point; faster, smaller metadata)
  • Asymmetric: x ≈ scale · q + zero_point — the dequant is a multiply-add, which fuses efficiently on modern GPU hardware; this is why asymmetric dominates production despite the extra metadata cost per group
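The two forms can be sketched in a few lines of NumPy (illustrative only; the function names are hypothetical, not any library's API):

```python
import numpy as np

def quantize_group_symmetric(x, bits=4):
    # Symmetric: x ≈ scale · q with q in [-2^(b-1), 2^(b-1)-1]; no zero-point.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_group_asymmetric(x, bits=4):
    # Asymmetric: x ≈ scale · q + zero_point with q in [0, 2^b - 1].
    qmax = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax
    zero_point = lo
    q = np.clip(np.round((x - zero_point) / scale), 0, qmax).astype(np.uint8)
    return q, scale, zero_point

# One group's worth of weights (toy size for readability).
group = np.array([0.1, -0.4, 0.25, 0.9], dtype=np.float32)
q_s, s = quantize_group_symmetric(group)
q_a, s_a, zp = quantize_group_asymmetric(group)
x_hat_sym = s * q_s            # dequant: one multiply
x_hat_asym = s_a * q_a + zp    # dequant: a multiply-add
```

Note the dequant shapes: symmetric needs one multiply per element, asymmetric one multiply-add — the operation the text describes as fusing on GPU hardware.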

Canonical instances

Method             Bit-width  Notes
AWQ                A16W4      Activation-aware; preserves outlier-sensitive channels
HQQ (Dropbox OSS)  A16W4      Half-Quadratic; on-the-fly, no offline calibration pass
MXFP (OCP spec)    MXFP8/6/4  Fixed 32-element blocks, E8M0 scale, hardware-native
NVFP4              FP4        Smaller 16-element groups + E4M3 FP8 scale for better numerical stability

AWQ and HQQ are the pre-MXFP software-era instances — they rely on an explicit dequant step before the MMA. MXFP and NVFP are the hardware-era instances — grouping is baked into the instruction contract via block_scale on MMA (see patterns/hardware-native-quantization).

Why grouping works

Neural-network weight distributions are non-uniform — some regions of a tensor contain outlier values that would stretch a single per-tensor scale's dynamic range, forcing the rest of the tensor into lossy quantization. Per-channel scales help but are often still too coarse at 4-bit. Per-group scales isolate outliers: a group containing an outlier gets a wider scale; adjacent groups without outliers keep tighter precision. Metadata cost is one scale per group_size elements — roughly 1 / group_size of the original tensor size when scales match the original precision, e.g. one fp16 scale per 32-element fp16 group adds ~3%.
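The outlier-isolation effect is easy to demonstrate. A minimal sketch (the grouped_quantize helper is hypothetical, assuming symmetric per-group scales):

```python
import numpy as np

def grouped_quantize(x, group_size, bits=4):
    # One symmetric scale per contiguous group of `group_size` elements;
    # returns the dequantized reconstruction so we can measure error.
    qmax = 2 ** (bits - 1) - 1
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.02, size=256).astype(np.float32)
x[7] = 1.0  # a single outlier, landing in the first 32-element group

per_tensor = grouped_quantize(x, group_size=256)  # one scale, stretched by the outlier
per_group = grouped_quantize(x, group_size=32)    # only the outlier's group pays

err_tensor = np.abs(per_tensor - x)[32:].mean()  # mean error outside the outlier's group
err_group = np.abs(per_group - x)[32:].mean()
```

With one shared scale, the outlier stretches the quantization step for all 256 elements; with 32-element groups, the seven outlier-free groups keep tight scales and err_group comes out far below err_tensor.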

Picking group size is an engineering trade:

  • Smaller groups (16, 32) = better accuracy, more metadata, more kernel-side indexing complexity
  • Larger groups (64, 128) = lower accuracy, less metadata, simpler kernels

MXFP fixes the group size at 32; NVFP4 accepts higher metadata cost for better accuracy with 16-element groups.
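The overhead side of the trade is simple arithmetic — a sketch assuming one fp16 scale per group over an fp16 original (scale and element widths vary by scheme):

```python
def metadata_overhead(group_size, scale_bits=16, orig_bits=16):
    # One shared scale per group, as a fraction of the original tensor's bits.
    return scale_bits / (group_size * orig_bits)

for g in (16, 32, 64, 128):
    print(f"group={g:3d}: {metadata_overhead(g):.2%} of original tensor size")
# 32-element groups with fp16 scales over an fp16 original: 1/32 ≈ 3.1%
```

Halving the group size doubles the metadata, which is exactly the cost NVFP4 pays for its 16-element groups.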

Relationship to other patterns

Seen in

  • sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — canonical description of grouping as "assigning shared parameters to small blocks of tensor elements rather than individual values" with the concrete 32/64/128 sizes and AWQ/HQQ as the pre-MXFP software realization, MXFP fixed-32 blocks as the hardware realization, and NVFP4's smaller group-size + E4M3 scales as the accuracy refinement.