
Huffman coding

Definition

Huffman coding (Wikipedia) is a variable-length prefix code that assigns short bit codes to common symbols and long bit codes to rare symbols. It is the classic lossless entropy-coding primitive: a binary tree is built greedily bottom-up by merging the two lowest-frequency symbols until one root remains, and each symbol's code is its root-to-leaf path. For a given symbol distribution it achieves compression close to the information-theoretic floor.
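The greedy bottom-up construction can be sketched in a few lines of Python (a minimal illustration with a heap; the `tiebreak` counter is just there to keep heap tuples comparable):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} map."""
    # Greedy bottom-up: repeatedly merge the two lowest-frequency trees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):      # internal node: recurse, 0 left / 1 right
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
        else:                            # leaf: its code is the root-to-leaf path
            codes[tree] = path or "0"
    walk(heap[0][2], "")
    return codes

freqs = Counter("abracadabra")           # a:5, b:2, r:2, c:1, d:1
codes = huffman_codes(freqs)             # common 'a' gets the shortest code
```

The resulting code is prefix-free by construction: no symbol's code is a prefix of another's, so a decoder never needs a delimiter between codes.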

Why it reappears in LLM weights

BF16 exponent distributions in trained LLMs are sharply skewed: the top 16 exponent values (of 256 possible) cover >99% of weights in a typical layer, so the information-theoretic floor is ~2.6 bits per exponent, versus the 8 bits BF16 allocates. Huffman coding is the textbook answer for that distributional shape. Cloudflare's Unweight applies it to the exponent byte only; the sign and mantissa bits look random and don't compress meaningfully. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
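The gap between the 8-bit allocation and the entropy floor is just Shannon entropy, H(p) = -Σ pᵢ log₂ pᵢ. A quick check with an illustrative (made-up) skewed distribution — not Unweight's measured one — shows the effect:

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum(p_i * log2(p_i)): the floor, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform over all 256 exponent values: exactly 8 bits/symbol, BF16's allocation.
uniform = [1 / 256] * 256

# Illustrative skewed distribution: 99% of mass on 16 geometrically decaying
# values, the remaining 1% spread over the other 240 exponent values.
weights = [0.5 ** i for i in range(16)]
head = [0.99 * w / sum(weights) for w in weights]
tail = [0.01 / 240] * 240
skewed = head + tail
```

For this toy distribution the entropy lands near 2 bits/symbol, the same order as the ~2.6-bit floor the article reports for real layers; the exact figure depends on the measured exponent histogram.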

Inference-time constraints

Huffman decoding is inherently serial within a code stream: reading code n+1 requires knowing where code n ended. On a GPU that wants to decompress in parallel, two standard responses appear:

  • Fixed-segment boundaries — partition the stream into chunks with known byte offsets so many threads can start decoding independently.
  • Palette transcoding at load time — once the Huffman stream has been decoded to exponent palette indices, those indices are fixed-width and trivially parallel to consume. This is the substrate of the "direct palette" pipeline in Unweight's four execution paths.
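The fixed-segment idea can be sketched as follows (a minimal illustration with a toy code table, not Unweight's actual stream format): the encoder records each segment's starting bit offset, so every segment can be decoded independently — serially within itself, but in parallel across segments.

```python
def encode_segments(symbols, codes, segment_len=4):
    """Huffman-encode in fixed-size symbol segments, recording each segment's
    starting bit offset so decoders can begin at known boundaries."""
    bitstream, offsets = "", []
    for start in range(0, len(symbols), segment_len):
        offsets.append(len(bitstream))
        bitstream += "".join(codes[s] for s in symbols[start:start + segment_len])
    return bitstream, offsets

def decode_segment(bitstream, start, end, decode_table):
    """Serial decode of one segment: each code's start depends on where the
    previous code ended, but segments are mutually independent."""
    out, code = [], ""
    for bit in bitstream[start:end]:
        code += bit
        if code in decode_table:         # prefix-free, so first match is the symbol
            out.append(decode_table[code])
            code = ""
    return out

# Toy prefix-free code table (hypothetical): 'a' common, 'b' and 'c' rarer.
codes = {"a": "0", "b": "10", "c": "11"}
decode_table = {v: k for k, v in codes.items()}
symbols = list("abcabcaab")
bitstream, offsets = encode_segments(symbols, codes)
bounds = offsets + [len(bitstream)]
# Each (bounds[i], bounds[i+1]) slice could go to its own GPU thread; here we
# loop serially and concatenate, which must round-trip to the input.
decoded = [s for i in range(len(offsets))
           for s in decode_segment(bitstream, bounds[i], bounds[i + 1], decode_table)]
```

The trade-off is the offset table's overhead: smaller segments mean more parallelism but more boundary metadata per compressed byte.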

Memory footprint of the lookup table

A Huffman decode kernel on Hopper needs ~16 KB of shared memory for its lookup table — small enough that it nearly fits alongside Unweight's ~227 KB reconstructive matmul kernel in the 228 KB SMEM budget, but not quite. Since 227 + 16 = 243 KB exceeds the 228 KB budget, the two kernels cannot co-reside on one SM; that collision is the constraint forcing the SM-partitioning autotuner to measure empirically how to split SMs between decoding and matmul.
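The budget arithmetic is simple enough to state as a check (figures from the text; `coresident` is a hypothetical helper, not Unweight's autotuner):

```python
# Hopper per-SM shared memory budget and the two kernels' approximate
# SMEM footprints, as given in the text above.
SMEM_BUDGET_KB = 228
MATMUL_SMEM_KB = 227   # reconstructive matmul kernel (~227 KB)
HUFFMAN_LUT_KB = 16    # Huffman decode lookup table (~16 KB)

def coresident(a_kb, b_kb, budget_kb=SMEM_BUDGET_KB):
    """Can two kernels share one SM's shared memory budget?"""
    return a_kb + b_kb <= budget_kb

# 227 + 16 = 243 KB > 228 KB: the kernels must run on disjoint SM partitions,
# and only measurement can say what split of SMs maximizes throughput.
```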
