

Bitpacking

Definition

Bitpacking is the practice of combining multiple sub-byte elements into a native machine data type (uint8, int32, …) for storage and transport, then unpacking them inside the kernel before computation. It is required for quantization below 8 bits because 4-bit and smaller formats are not natively supported by GPU load instructions — the memory subsystem cannot address 4-bit scalars directly (Source: sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai).

Why it's needed

Quantization reduces the per-element bit count — e.g. 4-bit weights represent 16 discrete levels per element. But GPU DRAM is addressed in bytes, caches in cache lines, and MMA units in tiles of native-sized scalars. A raw 4-bit value has no native container. To bridge the gap:

  • Two 4-bit elements are packed into a single uint8
  • Eight 4-bit elements fit in a single int32
  • The kernel issues a standard 8/32-bit load, then shifts + masks to recover per-element values for the downstream MMA or dequant step.
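The pack/unpack round trip above can be sketched in a few lines of Python — a host-side illustration of the shift-and-mask logic a kernel performs, not a real GPU kernel; the function names and low-nibble-first layout are hypothetical choices for the example:

```python
def pack_u4_pairs(vals):
    """Pack pairs of 4-bit values (0..15) into bytes, low nibble first."""
    assert len(vals) % 2 == 0
    return bytes((vals[i] & 0xF) | ((vals[i + 1] & 0xF) << 4)
                 for i in range(0, len(vals), 2))

def unpack_u4_pairs(packed):
    """Recover the 4-bit values with shifts and masks, as a kernel would
    after a standard byte-wide load."""
    out = []
    for b in packed:
        out.append(b & 0xF)         # low nibble
        out.append((b >> 4) & 0xF)  # high nibble
    return out
```

The same idea extends to int32 containers: eight 4-bit elements per word, each extracted by `(word >> (4 * j)) & 0xF`.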

At runtime this unpack is near-free in the pre-MMA pipeline, but it carries a real engineering cost: kernel authors must design unpack layouts, register allocation, and per-element extraction paths that don't stall the matrix unit.

Where it sits in the pipeline

  1. Offline (quantization pass). Model weights are rescaled to 4-bit (or other sub-byte bit-width); each element becomes a small integer.
  2. Packing. Groups of 2 / 4 / 8 sub-byte elements are packed into uint8 / int32 containers, stored contiguously on disk and in device memory.
  3. Kernel load. Standard-width load from DRAM → registers.
  4. Unpack. Bit shifts + masks extract individual sub-byte elements.
  5. Dequant or MMA consumption. In pre-MXFP formats the unpacked values then go through explicit dequantization; in MXFP formats the Tensor Core instructions consume packed representations directly, using block_scale metadata.
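The five steps above can be sketched end to end in Python. This is a minimal illustration, assuming a single hypothetical per-tensor scale (real quantizers use per-group or per-channel scales) and int32 containers holding eight 4-bit elements each:

```python
SCALE = 0.5  # hypothetical per-tensor scale produced by the offline pass

def quantize(weights):
    # Step 1 (offline): rescale each float to a 4-bit integer (0..15).
    return [max(0, min(15, round(w / SCALE))) for w in weights]

def pack_int32(q):
    # Step 2 (packing): eight 4-bit elements per 32-bit container.
    words = []
    for i in range(0, len(q), 8):
        word = 0
        for j, v in enumerate(q[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words

def unpack_dequant(words, n):
    # Steps 3-5 (kernel): standard 32-bit load, then shift + mask to
    # extract each element, then explicit dequantization.
    out = []
    for word in words:
        for j in range(8):
            out.append(((word >> (4 * j)) & 0xF) * SCALE)
    return out[:n]
```

A round trip — `unpack_dequant(pack_int32(quantize(w)), len(w))` — recovers the weights up to quantization error, which is zero here only because the example inputs are exact multiples of the scale.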

Why MXFP still uses packing

MXFP formats (MXFP8 / MXFP6 / MXFP4 / MXINT8) also store multiple sub-byte elements per container. The difference is that the Tensor Core fuses unpack + scale + MMA into a single hardware instruction — bitpacking remains the on-wire layout, but no software unpack step is involved.

Seen in

  • sources/2026-02-12-dropbox-how-low-bit-inference-enables-efficient-ai — Dropbox names bitpacking as the necessary precondition for sub-8-bit quantization to run on GPUs at all: "quantization to lower than 8 bits typically requires an additional process called bitpacking, where multiple low-bit elements are combined into a native data type such as uint8 or int32."