CONCEPT Cited by 1 source
Huffman coding¶
Definition¶
Huffman coding (Wikipedia) is a variable-length prefix code that assigns short bit codes to common symbols and long bit codes to rare symbols. Classic lossless entropy-coding primitive — builds a binary tree greedily bottom-up by repeatedly merging the two lowest-frequency nodes until one root remains; each symbol's code is its root-to-leaf path. Achieves near-information-theoretic compression (within one bit per symbol of the entropy floor) for a given symbol distribution.
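The greedy bottom-up construction can be sketched in a few lines of Python (a minimal illustration, not Unweight's implementation; the symbols and tie-breaking scheme are arbitrary):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: count} mapping.

    Greedy bottom-up merge: repeatedly pop the two lowest-frequency
    nodes and join them under a new parent until one root remains.
    """
    # Heap entries: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend a bit: left subtree gets '0', right subtree gets '1'.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes(Counter("aaaabbbccd"))
# The most common symbol gets the shortest code, the rarest the longest.
assert len(codes["a"]) < len(codes["d"])
```

The prefix property — no code is a prefix of another — is what lets a decoder recognize each symbol's end without delimiters, and it falls out of the tree construction for free: symbols live only at leaves.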
Why it reappears in LLM weights¶
BF16 exponent distributions in trained LLMs are sharply skewed — the top 16 exponents (of 256 possible) cover >99 % of weights in a typical layer, so the information-theoretic floor is ~2.6 bits vs. the 8 bits BF16 allocates to the exponent. Huffman coding is the textbook answer for that distributional shape. Cloudflare's Unweight applies it to the exponent byte only (the sign and mantissa bits look random and don't compress meaningfully). (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
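To make the floor concrete, here is the Shannon-entropy calculation over a made-up exponent histogram with the shape the source describes — 16 values carrying ~99.8 % of the mass. The probabilities are invented for illustration, not real layer statistics:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical skewed histogram: 16 exponent values carry ~99.8% of the
# probability mass; the other 240 of the 256 possible values split the rest.
top16 = [0.45, 0.22, 0.12, 0.07, 0.04, 0.025, 0.02, 0.015,
         0.01, 0.008, 0.006, 0.005, 0.004, 0.003, 0.001, 0.001]
rest = [(1 - sum(top16)) / 240] * 240
print(f"{entropy_bits(top16 + rest):.2f} bits/exponent vs 8 bits stored")
```

With this invented distribution the floor comes out around 2.5 bits — the same ballpark as the ~2.6-bit floor the source reports for real layers, and the gap to 8 stored bits is exactly the redundancy Huffman coding harvests.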
Inference-time constraints¶
Huffman decoding is serial within a code stream — reading code n+1 requires knowing where code n ended. On a GPU that wants to decompress in parallel, two responses appear:
- Fixed-segment boundaries — partition the stream into chunks with known byte offsets so many threads can start decoding independently.
- Palette transcoding at load time — once the Huffman stream has been decoded to exponent palette indices, those indices are fixed-width and trivially parallel to consume. This is the substrate of the "direct palette" pipeline in Unweight's four execution paths.
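A toy Python model of the fixed-segment scheme above (the segment size and codebook are invented for illustration; a real kernel works on packed bytes, not character strings):

```python
def encode_segmented(symbols, codebook, segment_len=4):
    """Huffman-encode in fixed-size symbol segments, recording where each
    segment's bitstream starts so decoders can begin independently."""
    bits, offsets = "", []
    for i in range(0, len(symbols), segment_len):
        offsets.append(len(bits))          # known entry point per segment
        bits += "".join(codebook[s] for s in symbols[i:i + segment_len])
    return bits, offsets

def decode_segment(bits, start, end, decodebook):
    """Serial decode within one segment: a code's end position is only
    known once the previous code has been read."""
    out, code = [], ""
    for b in bits[start:end]:
        code += b
        if code in decodebook:             # prefix property: unambiguous
            out.append(decodebook[code])
            code = ""
    return out

codebook = {"a": "0", "b": "10", "c": "110", "d": "111"}
decodebook = {v: k for k, v in codebook.items()}
data = list("abacadabcada")
bits, offsets = encode_segmented(data, codebook)
ends = offsets[1:] + [len(bits)]
# Each (start, end) pair could be handed to a separate GPU thread;
# only decoding *within* a segment stays serial.
decoded = [s for st, en in zip(offsets, ends)
           for s in decode_segment(bits, st, en, decodebook)]
assert decoded == data
```

The cost of the scheme is the offset table itself: segment boundaries are extra metadata, so segment size trades parallelism against overhead.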
Memory footprint of the lookup table¶
A Huffman decode kernel on Hopper needs ~16 KB of shared memory for its lookup table — small enough that it nearly fits alongside Unweight's ~227 KB reconstructive matmul kernel in the 228 KB SMEM budget, but not quite. The 227 + 16 > 228 collision is the constraint forcing the SM-partitioning autotuner to measure empirically how to split SMs between decoding and matmul.
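The budget arithmetic, plus a toy sketch of the search the autotuner performs (the cost model, workload numbers, and the H100 SM count used here are illustrative stand-ins, not Unweight's measurements):

```python
SMEM_PER_SM_KB = 228     # Hopper shared-memory budget per SM
MATMUL_KERNEL_KB = 227   # Unweight's reconstructive matmul kernel
HUFFMAN_TABLE_KB = 16    # Huffman decode lookup table

# The two cannot co-reside on one SM, so each SM runs one or the other:
assert MATMUL_KERNEL_KB + HUFFMAN_TABLE_KB > SMEM_PER_SM_KB

def autotune(total_sms, measure):
    """Try every split of SMs between decode and matmul; keep the one the
    measurement function rates fastest."""
    best = min(range(1, total_sms),
               key=lambda d: measure(d, total_sms - d))
    return best, total_sms - best

def toy_measure(decode_sms, matmul_sms, decode_work=20.0, matmul_work=112.0):
    # Stand-in for a real benchmark run: pipeline throughput is limited
    # by whichever stage is slower at this split.
    return max(decode_work / decode_sms, matmul_work / matmul_sms)

decode_sms, matmul_sms = autotune(132, toy_measure)  # 132 SMs on an H100
```

In the toy model the optimum is where the two stages finish together; the point of measuring empirically, as the source describes, is that real kernel throughput does not follow such a clean analytical curve.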
Seen in¶
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical wiki instance at inference time; Huffman on the BF16 exponent byte yields ~30 % MLP-exponent compression.
Related¶
- concepts/lossless-weight-compression — the problem class Huffman solves here.
- concepts/bf16-exponent-redundancy — the empirical distribution that makes Huffman win.
- concepts/shared-dictionary-compression / concepts/delta-compression-http — sibling lossless-compression primitives at a different layer (HTTP wire format vs. GPU weight memory).
- systems/unweight — the production deployment.