Grouped linear quantization¶
Pattern¶
Share a scale (and optionally a zero-point) across a contiguous group of tensor elements — typically 32, 64, or 128 — rather than per-tensor (one scale for all elements, too coarse) or per-element (one scale per element, prohibitive metadata cost). Grouping substantially reduces quantization error at low bit widths by letting each group track its own dynamic range, while keeping metadata overhead bounded (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
The two forms¶
- Symmetric: `x ≈ scale · q` (no zero-point; faster, smaller metadata)
- Asymmetric: `x ≈ scale · q + zero_point` — the multiply-add fuses efficiently onto modern GPU hardware, which is why asymmetric dominates production despite its extra metadata cost per group
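A minimal sketch of the two forms in plain Python (function names here are hypothetical, not any library's API), quantizing one group to 4-bit integers:

```python
# Illustrative only: one group of floats → 4-bit codes plus per-group metadata.

def quantize_symmetric(group, bits=4):
    # Signed 4-bit range is [-8, 7]; the scale maps the max magnitude onto qmax.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in group) / qmax or 1.0  # avoid scale == 0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in group]
    return scale, q

def dequantize_symmetric(scale, q):
    return [scale * v for v in q]          # x ≈ scale · q

def quantize_asymmetric(group, bits=4):
    # Unsigned range [0, 15]; zero_point shifts the group minimum onto code 0.
    qmax = 2 ** bits - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax or 1.0
    zero_point = lo
    q = [max(0, min(qmax, round((x - zero_point) / scale))) for x in group]
    return scale, zero_point, q

def dequantize_asymmetric(scale, zero_point, q):
    return [scale * v + zero_point for v in q]  # x ≈ scale · q + zero_point
```

Symmetric spends its codes on a range centered at zero; asymmetric spends an extra `zero_point` per group to cover skewed ranges, which is the metadata cost the paragraph above refers to.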
Canonical instances¶
| Method | Bit-width | Notes |
|---|---|---|
| AWQ | A16W4 | Activation-aware; preserves outlier-sensitive channels |
| HQQ (Dropbox OSS) | A16W4 | Half-Quadratic; on-the-fly, no offline calibration pass |
| MXFP (OCP spec) | MXFP8/6/4 | Fixed 32-element blocks, E8M0 scale, hardware-native |
| NVFP4 | FP4 | Smaller 16-element group + E4M3 FP8 scale for better numerical stability |
AWQ and HQQ are the pre-MXFP software-era instances — they rely on an explicit dequant step before the MMA. MXFP and NVFP4 are the hardware-era instances — grouping is baked into the instruction contract via `block_scale` on MMA (see patterns/hardware-native-quantization).
Why grouping works¶
Neural-network weight distributions are non-uniform — some regions of a tensor contain outlier values that would blow up a single per-tensor scale's dynamic range, forcing the rest of the tensor into lossy quantization. Per-channel scales help but are often still too coarse at 4-bit. Per-group scales isolate outliers: a group with an outlier gets a wider scale; adjacent groups without outliers keep tighter precision. The metadata overhead is one shared scale per group — 1 / group_size of the parameter count, e.g. ~3% extra parameters for 32-element groups of 4-bit weights.
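The outlier-isolation argument can be made concrete with a toy experiment (illustrative only, using a symmetric 4-bit scheme): one large value in a tensor of small values forces a huge shared scale, and per-group scales contain the damage.

```python
# Compare mean absolute quantization error: per-tensor scale vs per-group scales.
import random

def grouped_quant_error(x, group_size, bits=4):
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for i in range(0, len(x), group_size):
        g = x[i:i + group_size]
        scale = max(abs(v) for v in g) / qmax or 1.0
        err += sum(abs(v - scale * round(v / scale)) for v in g)
    return err / len(x)

random.seed(0)
x = [random.uniform(-0.1, 0.1) for _ in range(127)] + [50.0]  # one outlier
per_tensor = grouped_quant_error(x, group_size=len(x))  # one scale for everything
per_group = grouped_quant_error(x, group_size=32)       # outlier confined to one group
```

With a single scale, the outlier stretches the step size so far that every small value rounds to zero; with 32-element groups, only the outlier's group pays that price.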
Picking group size is an engineering trade:
- Smaller groups (16, 32) = better accuracy, more metadata, more kernel-side indexing complexity
- Larger groups (64, 128) = lower accuracy, less metadata, simpler kernels

MXFP fixes the group size at 32; NVFP4 accepts a higher metadata cost for better accuracy with 16-element groups.
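The metadata side of this trade is simple arithmetic. A sketch (the 8-bit scale width below is an assumption matching MXFP's E8M0; the parameter-count view does not depend on scale width):

```python
def scale_param_overhead(group_size):
    # One shared scale per group → 1/group_size extra parameters.
    return 1 / group_size

def scale_bit_overhead(group_size, elem_bits=4, scale_bits=8):
    # In stored bits the overhead also depends on the scale's width
    # (assumed here: an 8-bit E8M0-style scale over 4-bit elements).
    return scale_bits / (group_size * elem_bits)

for gs in (16, 32, 64, 128):
    print(f"group={gs:3d}  params=+{scale_param_overhead(gs):.1%}  "
          f"bits=+{scale_bit_overhead(gs):.2%}")
```

Halving the group size doubles both overheads, which is the cost NVFP4 pays for its 16-element groups.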
Relationship to other patterns¶
- Basis for patterns/hardware-native-quantization — the `block_scale` modifier on Tensor Core MMA instructions implements grouped quantization directly in hardware.
- Inside of concepts/low-bit-inference — grouping is the primary reason sub-8-bit quantization is production-viable at all.
- Independent of patterns/weight-only-vs-activation-quantization — grouping applies to both weight quantization (AWQ, HQQ, MXFP weights) and activation quantization (per-block activation quantization, JetFire, DeepSeek V3).
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — canonical description of grouping as "assigning shared parameters to small blocks of tensor elements rather than individual values" with the concrete 32/64/128 sizes and AWQ/HQQ as the pre-MXFP software realization, MXFP fixed-32 blocks as the hardware realization, and NVFP4's smaller group-size + E4M3 scales as the accuracy refinement.