CONCEPT Cited by 1 source
Huffman coding¶
Definition¶
Huffman coding (Wikipedia) is a variable-length prefix code that assigns short bit codes to common symbols and long bit codes to rare symbols. Classic lossless entropy-coding primitive — builds a binary tree greedily bottom-up by repeatedly merging the two lowest-frequency nodes until one root remains; each symbol's code is its root-to-leaf path. Achieves near-information-theoretic compression (within one bit per symbol of the entropy floor) for a given symbol distribution.
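The greedy bottom-up construction can be sketched in a few lines of Python (a minimal illustration, not Unweight's implementation; the symbols and tie-breaking scheme are arbitrary):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: count} mapping.

    Greedy bottom-up merge: repeatedly pop the two lowest-frequency
    nodes and join them under a new parent until one root remains.
    """
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend a bit: left subtree gets '0', right subtree gets '1'.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes(Counter("aaaabbbccd"))
# The most common symbol gets the shortest code, the rarest the longest.
assert len(codes["a"]) < len(codes["d"])
```

The prefix property — no code is a prefix of another — is what lets a decoder recognize each symbol's end without delimiters, and it falls out of the tree construction for free: symbols live only at leaves.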
Why it reappears in LLM weights¶
BF16 exponent distributions in trained LLMs are sharply skewed — the top 16 exponents (of 256 possible) cover >99 % of weights in a typical layer, so the information-theoretic floor is ~2.6 bits vs. the 8 bits BF16 allocates to the exponent. Huffman coding is the textbook answer for that distributional shape. Cloudflare's Unweight applies it to the exponent byte only (the sign and mantissa bits look random and don't compress meaningfully). (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
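To make the floor concrete, here is the Shannon-entropy calculation over a made-up exponent histogram with the shape the source describes — 16 values carrying ~99.8 % of the mass. The probabilities are invented for illustration, not real layer statistics:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical skewed histogram: 16 exponent values carry ~99.8% of the
# probability mass; the other 240 of the 256 possible values split the rest.
top16 = [0.45, 0.22, 0.12, 0.07, 0.04, 0.025, 0.02, 0.015,
         0.01, 0.008, 0.006, 0.005, 0.004, 0.003, 0.001, 0.001]
rest = [(1 - sum(top16)) / 240] * 240
print(f"{entropy_bits(top16 + rest):.2f} bits/exponent vs 8 bits stored")
```

With this invented distribution the floor comes out around 2.5 bits — the same ballpark as the ~2.6-bit floor the source reports for real layers, and the gap to 8 stored bits is exactly the redundancy Huffman coding harvests.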
Inference-time constraints¶
Huffman decoding is serial within a code stream — reading code n+1 requires knowing where code n ended. On a GPU that wants to decompress in parallel, two responses appear:
- Fixed-segment boundaries — partition the stream into chunks with known byte offsets so many threads can start decoding independently.
- Palette transcoding at load time — once the Huffman stream has been decoded to exponent palette indices, those indices are fixed-width and trivially parallel to consume. This is the substrate of the "direct palette" pipeline in Unweight's four execution paths.
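A toy Python model of the fixed-segment scheme above (the segment size and codebook are invented for illustration; a real kernel works on packed bytes, not character strings):

```python
def encode_segmented(symbols, codebook, segment_len=4):
    """Huffman-encode in fixed-size symbol segments, recording where each
    segment's bitstream starts so decoders can begin independently."""
    bits, offsets = "", []
    for i in range(0, len(symbols), segment_len):
        offsets.append(len(bits))          # known entry point per segment
        bits += "".join(codebook[s] for s in symbols[i:i + segment_len])
    return bits, offsets

def decode_segment(bits, start, end, decodebook):
    """Serial decode within one segment: a code's end position is only
    known once the previous code has been read."""
    out, code = [], ""
    for b in bits[start:end]:
        code += b
        if code in decodebook:             # prefix property: unambiguous
            out.append(decodebook[code])
            code = ""
    return out

codebook = {"a": "0", "b": "10", "c": "110", "d": "111"}
decodebook = {v: k for k, v in codebook.items()}
data = list("abacadabcada")
bits, offsets = encode_segmented(data, codebook)
ends = offsets[1:] + [len(bits)]
# Each (start, end) pair could be handed to a separate GPU thread;
# only decoding *within* a segment stays serial.
decoded = [s for st, en in zip(offsets, ends)
           for s in decode_segment(bits, st, en, decodebook)]
assert decoded == data
```

The cost of the scheme is the offset table itself: segment boundaries are extra metadata, so segment size trades parallelism against overhead.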
Memory footprint of the lookup table¶
A Huffman decode kernel on Hopper needs ~16 KB of shared memory for its lookup table — small enough that it nearly fits alongside Unweight's ~227 KB reconstructive matmul kernel in the 228 KB SMEM budget, but not quite. The 227 + 16 > 228 collision is the constraint forcing the SM-partitioning autotuner to measure empirically how to split SMs between decoding and matmul.
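The budget arithmetic, plus a toy sketch of the search the autotuner performs (the cost model, workload numbers, and the H100 SM count used here are illustrative stand-ins, not Unweight's measurements):

```python
SMEM_PER_SM_KB = 228     # Hopper shared-memory budget per SM
MATMUL_KERNEL_KB = 227   # Unweight's reconstructive matmul kernel
HUFFMAN_TABLE_KB = 16    # Huffman decode lookup table

# The two cannot co-reside on one SM, so each SM runs one or the other:
assert MATMUL_KERNEL_KB + HUFFMAN_TABLE_KB > SMEM_PER_SM_KB

def autotune(total_sms, measure):
    """Try every split of SMs between decode and matmul; keep the one the
    measurement function rates fastest."""
    best = min(range(1, total_sms),
               key=lambda d: measure(d, total_sms - d))
    return best, total_sms - best

def toy_measure(decode_sms, matmul_sms, decode_work=20.0, matmul_work=112.0):
    # Stand-in for a real benchmark run: pipeline throughput is limited
    # by whichever stage is slower at this split.
    return max(decode_work / decode_sms, matmul_work / matmul_sms)

decode_sms, matmul_sms = autotune(132, toy_measure)  # 132 SMs on an H100
```

In the toy model the optimum is where the two stages finish together; the point of measuring empirically, as the source describes, is that real kernel throughput does not follow such a clean analytical curve.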
Seen in¶
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical wiki instance at inference time; Huffman on the BF16 exponent byte yields ~30 % MLP-exponent compression.
Related¶
- concepts/lossless-weight-compression — the problem class Huffman solves here.
- concepts/bf16-exponent-redundancy — the empirical distribution that makes Huffman win.
- concepts/shared-dictionary-compression / concepts/delta-compression-http — sibling lossless-compression primitives at a different layer (HTTP wire format vs. GPU weight memory).
- systems/unweight — the production deployment.