

Bitpacking

Definition

Bitpacking is the practice of combining multiple sub-byte elements into a native machine data type (uint8, int32, …) for storage and transport, then unpacking them inside the kernel before computation. It is required for quantization below 8 bits because 4-bit and smaller formats are not natively supported by GPU load instructions — the memory subsystem cannot address 4-bit scalars directly (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Why it's needed

Quantization reduces the per-element bit count — e.g. 4-bit weights represent 16 discrete levels per element. But GPU DRAM is addressed in bytes, caches in cache lines, and MMA units in tiles of native-sized scalars. A raw 4-bit value has no native container. To bridge the gap:

  • Two 4-bit elements are packed into a single uint8
  • Eight 4-bit elements fit in a single int32
  • The kernel issues a standard 8/32-bit load, then shifts + masks to recover per-element values for the downstream MMA or dequant step.
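The pack/unpack round trip above can be sketched in a few lines of Python — a host-side illustration of the shift-and-mask logic a kernel performs, not a real GPU kernel; the function names and low-nibble-first layout are hypothetical choices for the example:

```python
def pack_u4_pairs(vals):
    """Pack pairs of 4-bit values (0..15) into bytes, low nibble first."""
    assert len(vals) % 2 == 0
    return bytes((vals[i] & 0xF) | ((vals[i + 1] & 0xF) << 4)
                 for i in range(0, len(vals), 2))

def unpack_u4_pairs(packed):
    """Recover the 4-bit values with shifts and masks, as a kernel would
    after a standard byte-wide load."""
    out = []
    for b in packed:
        out.append(b & 0xF)         # low nibble
        out.append((b >> 4) & 0xF)  # high nibble
    return out
```

The same idea extends to int32 containers: eight 4-bit elements per word, each extracted by `(word >> (4 * j)) & 0xF`.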

At runtime this unpack is near-free in the pre-MMA pipeline, but it carries a real engineering cost: kernel authors must design unpack layouts, register allocation, and per-element extraction paths that don't stall the matrix unit.

Where it sits in the pipeline

  1. Offline (quantization pass). Model weights are rescaled to 4-bit (or other sub-byte bit-width); each element becomes a small integer.
  2. Packing. Groups of 2 / 4 / 8 sub-byte elements are packed into uint8 / int32 containers, stored contiguously on disk and in device memory.
  3. Kernel load. Standard-width load from DRAM → registers.
  4. Unpack. Bit shifts + masks extract individual sub-byte elements.
  5. Dequant or MMA consumption. In pre-MXFP formats the unpacked values then go through explicit dequantization; in MXFP formats the Tensor Core instructions consume packed representations directly, using block_scale metadata.
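The five steps above can be sketched end to end in Python. This is a minimal illustration, assuming a single hypothetical per-tensor scale (real quantizers use per-group or per-channel scales) and int32 containers holding eight 4-bit elements each:

```python
SCALE = 0.5  # hypothetical per-tensor scale produced by the offline pass

def quantize(weights):
    # Step 1 (offline): rescale each float to a 4-bit integer (0..15).
    return [max(0, min(15, round(w / SCALE))) for w in weights]

def pack_int32(q):
    # Step 2 (packing): eight 4-bit elements per 32-bit container.
    words = []
    for i in range(0, len(q), 8):
        word = 0
        for j, v in enumerate(q[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words

def unpack_dequant(words, n):
    # Steps 3-5 (kernel): standard 32-bit load, then shift + mask to
    # extract each element, then explicit dequantization.
    out = []
    for word in words:
        for j in range(8):
            out.append(((word >> (4 * j)) & 0xF) * SCALE)
    return out[:n]
```

A round trip — `unpack_dequant(pack_int32(quantize(w)), len(w))` — recovers the weights up to quantization error, which is zero here only because the example inputs are exact multiples of the scale.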

Why MXFP still uses packing

MXFP formats (MXFP8 / MXFP6 / MXFP4 / MXINT8) also store multiple sub-byte elements per container. The difference is that the Tensor Core fuses unpack + scale + MMA into a single hardware instruction — bitpacking remains the on-wire layout, but no software unpack step is involved.

Seen in

  • sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox names bitpacking as the necessary precondition for sub-8-bit quantization to run on GPUs at all: "quantization to lower than 8 bits typically requires an additional process called bitpacking, where multiple low-bit elements are combined into a native data type such as uint8 or int32."