Bitpacking¶
Definition¶
Bitpacking is the practice of combining multiple sub-byte
elements into a native machine data type (uint8, int32, …) for
storage and transport, then unpacking them inside the kernel before
computation. It's required for quantization below 8 bits because
4-bit and smaller formats are not natively supported by GPU load
instructions: the memory subsystem can't address 4-bit scalars
directly (Source:
sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).
Why it's needed¶
Quantization reduces the per-element bit count — e.g. 4-bit weights represent 16 discrete levels per element. But GPU DRAM is addressed in bytes, caches in cache lines, and MMA units in tiles of native-sized scalars. A raw 4-bit value has no native container. To bridge:
- Two 4-bit elements are packed into a single uint8.
- Eight 4-bit elements fit in a single int32.
- The kernel issues a standard 8/32-bit load, then shifts + masks to recover per-element values for the downstream MMA or dequant step.
This unpack is computationally near-free in the pre-MMA pipeline, but it is a real engineering cost: kernel authors must design unpack layouts, register allocation, and per-element extraction paths that don't stall the matrix unit.
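The shift-and-mask recovery described above can be sketched in a few lines. This is an illustrative sketch, not a GPU kernel: the function names `pack_u4`/`unpack_u4` and the low-nibble-first layout are assumptions for the example, and Python integers stand in for a hardware int32 register.

```python
# Sketch: eight unsigned 4-bit elements packed into one 32-bit container,
# then recovered with shifts + masks (layout and names are illustrative).

def pack_u4(elements):
    """Pack eight 4-bit unsigned values (0..15) into a single 32-bit word."""
    assert len(elements) == 8
    word = 0
    for i, e in enumerate(elements):
        assert 0 <= e <= 0xF
        word |= e << (4 * i)  # element i occupies bits [4i, 4i+4)
    return word

def unpack_u4(word):
    """Extract the eight 4-bit elements back out of the container."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 5]
packed = pack_u4(vals)          # one standard-width value to load/store
assert unpack_u4(packed) == vals  # shifts + masks recover every element
```

On real hardware the equivalent of `unpack_u4` runs in registers after a single 32-bit load, which is why the memory traffic shrinks by the quantization factor while the arithmetic cost stays small.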
Where it sits in the pipeline¶
- Offline (quantization pass). Model weights are rescaled to 4-bit (or other sub-byte bit-width); each element becomes a small integer.
- Packing. Groups of 2 / 4 / 8 sub-byte elements are packed into uint8/int32 containers, stored contiguously on disk and in device memory.
- Kernel load. Standard-width load from DRAM → registers.
- Unpack. Bit shifts + masks extract individual sub-byte elements.
- Dequant or MMA consume. In pre-MXFP formats the unpacked values then go through explicit dequantization. In MXFP formats the Tensor Core instructions consume packed representations directly with block_scale metadata.
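The pipeline steps above can be sketched end to end for the pre-MXFP path. This is a minimal sketch under assumed details the source doesn't specify: symmetric per-tensor scaling, round-to-nearest, signed 4-bit values in [-8, 7], and a low-nibble-first pair layout; all function names are hypothetical.

```python
# Sketch of the offline-quantize -> pack -> load -> unpack -> dequant path.
# Scale choice, rounding, and layout are illustrative assumptions.

def quantize_4bit(weights, scale):
    # Offline pass: real weight -> signed 4-bit integer clamped to [-8, 7]
    return [max(-8, min(7, round(w / scale))) for w in weights]

def pack_pairs(q):
    # Packing: two 4-bit elements per uint8 (even index in the low nibble)
    out = []
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(out)

def unpack_and_dequant(packed, scale):
    # Kernel side: byte load, shift+mask unpack, then explicit dequantization
    def s4(n):  # reinterpret a nibble as a signed 4-bit value
        return n - 16 if n >= 8 else n
    out = []
    for b in packed:
        out.append(s4(b & 0xF) * scale)   # low nibble
        out.append(s4(b >> 4) * scale)    # high nibble
    return out

w = [0.5, -0.25, 0.75, -1.0]
scale = 0.125
q = quantize_4bit(w, scale)                        # [4, -2, 6, -8]
restored = unpack_and_dequant(pack_pairs(q), scale)  # round-trips exactly here
```

The packed `bytes` object plays the role of the contiguous on-disk / device-memory layout; the round trip is lossless in this example only because the weights happen to be exact multiples of the scale.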
Why MXFP still uses packing¶
MXFP formats (MXFP8 / MXFP6 / MXFP4 / MXINT8) also store multiple sub-byte elements per container. The difference is that the Tensor Core fuses unpack + scale + MMA into one hardware instruction — bitpacking is still the on-wire layout but no software is involved in the unpack.
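What the MXFP Tensor Core fuses into one instruction can be emulated in software to make the semantics concrete. This is a simplification, not the hardware encoding: real MXFP4 elements are tiny floats (E2M1), while this sketch uses signed-integer nibbles as a stand-in, and `mx_block_dot` with its per-block `scale_a` is a hypothetical name for the fused unpack + block-scale + accumulate.

```python
# Software emulation of the fused MXFP path: unpack the packed nibbles,
# apply the shared block scale, and accumulate a dot product in one pass.
# Integer nibbles stand in for real FP4 (E2M1) elements; names are illustrative.

def mx_block_dot(packed_a, scale_a, b_vals):
    def s4(n):  # signed 4-bit reinterpretation of a nibble
        return n - 16 if n >= 8 else n
    acc = 0.0
    for i, byte in enumerate(packed_a):
        acc += s4(byte & 0xF) * scale_a * b_vals[2 * i]      # low nibble
        acc += s4(byte >> 4) * scale_a * b_vals[2 * i + 1]   # high nibble
    return acc

# One byte holds elements (1, 2); with scale 0.5 against b = [2.0, 3.0]:
# 1*0.5*2.0 + 2*0.5*3.0 = 4.0
result = mx_block_dot(bytes([0x21]), 0.5, [2.0, 3.0])
```

The point of the emulation is that no intermediate dequantized tensor is ever materialized: the packed bytes plus the block scale are the only inputs, which is exactly the on-wire layout the hardware instruction consumes.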
Seen in¶
- sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox names bitpacking as the necessary precondition for sub-8-bit quantization to run on GPUs at all: "quantization to lower than 8 bits typically requires an additional process called bitpacking, where multiple low-bit elements are combined into a native data type such as uint8 or int32."