Grouped linear quantization¶
Pattern¶
Share a scale (and optionally a zero-point) across a contiguous group of tensor elements — typically 32, 64, or 128 — rather than per-tensor (one scale for all elements, too coarse) or per-element (one scale per element, prohibitive metadata cost). Grouping substantially reduces quantization error at low bit widths by letting each group track its own dynamic range, while keeping metadata overhead bounded (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
The two forms¶
- Symmetric: `x ≈ scale · q` (no zero-point; faster, smaller metadata)
- Asymmetric: `x ≈ scale · q + zero_point` — the multiply-add fuses efficiently onto modern GPU hardware, which is why asymmetric dominates production despite its extra metadata cost per group
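A minimal sketch of the two forms in plain Python (function names here are hypothetical, not any library's API), quantizing one group to 4-bit integers:

```python
# Illustrative only: one group of floats → 4-bit codes plus per-group metadata.

def quantize_symmetric(group, bits=4):
    # Signed 4-bit range is [-8, 7]; the scale maps the max magnitude onto qmax.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in group) / qmax or 1.0  # avoid scale == 0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in group]
    return scale, q

def dequantize_symmetric(scale, q):
    return [scale * v for v in q]          # x ≈ scale · q

def quantize_asymmetric(group, bits=4):
    # Unsigned range [0, 15]; zero_point shifts the group minimum onto code 0.
    qmax = 2 ** bits - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / qmax or 1.0
    zero_point = lo
    q = [max(0, min(qmax, round((x - zero_point) / scale))) for x in group]
    return scale, zero_point, q

def dequantize_asymmetric(scale, zero_point, q):
    return [scale * v + zero_point for v in q]  # x ≈ scale · q + zero_point
```

Symmetric spends its codes on a range centered at zero; asymmetric spends an extra `zero_point` per group to cover skewed ranges, which is the metadata cost the paragraph above refers to.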
Canonical instances¶
| Method | Bit-width | Notes |
|---|---|---|
| AWQ | A16W4 | Activation-aware; preserves outlier-sensitive channels |
| HQQ (Dropbox OSS) | A16W4 | Half-Quadratic; on-the-fly, no offline calibration pass |
| MXFP (OCP spec) | MXFP8/6/4 | Fixed 32-element blocks, E8M0 scale, hardware-native |
| NVFP4 | FP4 | Smaller 16-element group + E4M3 FP8 scale for better numerical stability |
AWQ and HQQ are the pre-MXFP software-era instances — they rely on an explicit dequant step before the MMA. MXFP and NVFP4 are the hardware-era instances — grouping is baked into the instruction contract via `block_scale` on MMA (see patterns/hardware-native-quantization).
Why grouping works¶
Neural-network weight distributions are non-uniform — some regions of a tensor contain outlier values that would blow up a single per-tensor scale's dynamic range, forcing the rest of the tensor into lossy quantization. Per-channel scales help but are often still too coarse at 4-bit. Per-group scales isolate outliers: a group with an outlier gets a wider scale; adjacent groups without outliers keep tighter precision. The metadata overhead is one shared scale per group — 1 / group_size of the parameter count, e.g. ~3% extra parameters for 32-element groups of 4-bit weights.
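The outlier-isolation argument can be made concrete with a toy experiment (illustrative only, using a symmetric 4-bit scheme): one large value in a tensor of small values forces a huge shared scale, and per-group scales contain the damage.

```python
# Compare mean absolute quantization error: per-tensor scale vs per-group scales.
import random

def grouped_quant_error(x, group_size, bits=4):
    qmax = 2 ** (bits - 1) - 1
    err = 0.0
    for i in range(0, len(x), group_size):
        g = x[i:i + group_size]
        scale = max(abs(v) for v in g) / qmax or 1.0
        err += sum(abs(v - scale * round(v / scale)) for v in g)
    return err / len(x)

random.seed(0)
x = [random.uniform(-0.1, 0.1) for _ in range(127)] + [50.0]  # one outlier
per_tensor = grouped_quant_error(x, group_size=len(x))  # one scale for everything
per_group = grouped_quant_error(x, group_size=32)       # outlier confined to one group
```

With a single scale, the outlier stretches the step size so far that every small value rounds to zero; with 32-element groups, only the outlier's group pays that price.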
Picking group size is an engineering trade:
- Smaller groups (16, 32) = better accuracy, more metadata, more kernel-side indexing complexity
- Larger groups (64, 128) = lower accuracy, less metadata, simpler kernels

MXFP fixes the group size at 32; NVFP4 accepts a higher metadata cost for better accuracy with 16-element groups.
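The metadata side of this trade is simple arithmetic. A sketch (the 8-bit scale width below is an assumption matching MXFP's E8M0; the parameter-count view does not depend on scale width):

```python
def scale_param_overhead(group_size):
    # One shared scale per group → 1/group_size extra parameters.
    return 1 / group_size

def scale_bit_overhead(group_size, elem_bits=4, scale_bits=8):
    # In stored bits the overhead also depends on the scale's width
    # (assumed here: an 8-bit E8M0-style scale over 4-bit elements).
    return scale_bits / (group_size * elem_bits)

for gs in (16, 32, 64, 128):
    print(f"group={gs:3d}  params=+{scale_param_overhead(gs):.1%}  "
          f"bits=+{scale_bit_overhead(gs):.2%}")
```

Halving the group size doubles both overheads, which is the cost NVFP4 pays for its 16-element groups.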
Relationship to other patterns¶
- Basis for patterns/hardware-native-quantization — the `block_scale` modifier on Tensor Core MMA instructions implements grouped quantization directly in hardware.
- Inside of concepts/low-bit-inference — grouping is the primary reason sub-8-bit quantization is production-viable at all.
- Independent of patterns/weight-only-vs-activation-quantization — grouping applies to both weight quantization (AWQ, HQQ, MXFP weights) and activation quantization (per-block activation quantization, JetFire, DeepSeek V3).
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — canonical description of grouping as "assigning shared parameters to small blocks of tensor elements rather than individual values" with the concrete 32/64/128 sizes and AWQ/HQQ as the pre-MXFP software realization, MXFP fixed-32 blocks as the hardware realization, and NVFP4's smaller group-size + E4M3 scales as the accuracy refinement.