Unweight¶
Definition¶
Unweight is Cloudflare's lossless LLM weight compression system for NVIDIA H100 inference. It shrinks model footprint 15–22 % while preserving bit-exact outputs, without special hardware. The core technique is Huffman coding of the redundant BF16 exponent byte paired with a custom CUDA kernel that reconstructs weights in on-chip shared memory and feeds them directly to tensor cores — avoiding a round-trip through slow HBM. Announced 2026-04-17; GPU kernels open-sourced as unweight-kernels. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
Why it exists¶
On an H100 the tensor cores can consume data ~600× faster than HBM can deliver it. LLM token generation reads every weight from HBM once per token → the binding resource is memory bandwidth, not compute. "Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." Compressing the weights therefore translates directly into decode throughput, provided the decompression overhead is hidden.
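The bandwidth-vs-compute gap can be sanity-checked with back-of-envelope arithmetic; the constants below (8B BF16 params, ~3.35 TB/s HBM3, ~990 TFLOP/s dense BF16) are illustrative round numbers, not figures from the post:

```python
# Back-of-envelope model of memory-bandwidth-bound decode on an H100.
# All constants are assumed round numbers, not measurements from the post.
PARAMS = 8e9                  # Llama-3.1-8B-class model
BYTES_PER_WEIGHT = 2          # BF16
HBM_BW = 3.35e12              # bytes/s, H100 SXM5 HBM3
TENSOR_FLOPS = 990e12         # FLOP/s, dense BF16 tensor cores

weight_bytes = PARAMS * BYTES_PER_WEIGHT
t_mem = weight_bytes / HBM_BW              # stream every weight once per token
t_compute = (2 * PARAMS) / TENSOR_FLOPS    # one multiply-add per weight

print(f"memory-bound:  {t_mem * 1e3:.2f} ms/token")
print(f"compute-bound: {t_compute * 1e3:.3f} ms/token")
print(f"memory is the bottleneck by ~{t_mem / t_compute:.0f}x")
```

Under these assumptions streaming 16 GB of weights takes ~4.8 ms per token while the matmuls need well under 0.1 ms, so any byte shaved off the bus shows up in tokens/s.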
Prior work (ZipNN, Huff-LLM, ZipServ) fell short for Cloudflare's use case: CPU-side decompression, bespoke FPGA hardware, or consumer-GPU targeting. None combined lossless compression, inference-time decompression, Hopper datacenter GPUs, and integration with a Rust inference engine.
Design¶
Compression format¶
- Target: the 8-bit exponent byte of BF16 weights. Sign + mantissa look random and don't compress. (concepts/bf16-exponent-redundancy)
- Coding: Huffman on exponents. Top-16 exponents cover >99 % of the distribution; information-theoretic floor ~2.6 bits vs 8 allocated → ~30 % exponent compression.
- Row-level escape for rare exponents: weights grouped into rows of 64; any row whose weights include an out-of-palette exponent is stored verbatim. No per-element branching in the hot path — one decision per row.
- MLP-only: Unweight compresses gate / up / down MLP projections (~⅔ of model parameters, dominating decode memory traffic). Attention weights, embeddings, layer norms stay uncompressed.
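A minimal sketch of the palette-plus-row-escape classification, assuming the palette is simply the 16 most frequent exponent bytes and rows are 64 weights wide:

```python
# Sketch of Huffman-palette rows with a row-level escape, under assumed
# details: rows of 64 weights; a row is palette-coded only if every exponent
# byte is among the 16 most frequent, otherwise the row is stored verbatim.
from collections import Counter

ROW, PALETTE_SIZE = 64, 16

def plan_rows(exponents):
    """Classify each 64-exponent row as palette-coded or verbatim."""
    palette = {e for e, _ in Counter(exponents).most_common(PALETTE_SIZE)}
    rows = [exponents[i:i + ROW] for i in range(0, len(exponents), ROW)]
    # One decision per row -- no per-element branching in the hot path.
    return palette, ["palette" if set(row) <= palette else "verbatim"
                     for row in rows]

# Two rows drawn from 16 common exponents, one row tainted by a rare outlier.
exps = [e for e in range(112, 128) for _ in range(8)]  # rows 1-2, all common
exps += [120] * 63 + [200]                             # row 3, one rare exponent
palette, kinds = plan_rows(exps)
print(kinds)
```

The rare exponent 200 falls outside the top-16 palette, so only its row pays the verbatim penalty.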
Four execution pipelines¶
The runtime picks between four paths per (weight matrix, batch size):
| Pipeline | Preprocess step | Matmul kernel | Preprocess HBM writes | Matmul complexity |
|---|---|---|---|---|
| Full decode | Huffman → full BF16 in HBM | stock cuBLAS | largest | simplest |
| Exponent-only | Exponent → BF16 exponent in HBM | reconstructive | ½ of full | medium |
| Palette transcode | Transcode to 4-bit palette index in HBM | reconstructive | ¼ of full | medium |
| Direct palette | none (4-bit palette pre-baked at load) | reconstructive | 0 | heaviest |
Trade-off: less preprocessing → less data written to HBM → bus freed sooner, but more reconstruction work falls on the matmul. At small batches cuBLAS's low overhead + tiny matmul favour full decode. At large batches (256+ tokens) reconstruction overlaps with the big matmul and palette / exponent pipelines pull ahead. Different matrices within the same layer (gate / up / down have different shapes) can favour different pipelines. "There's no single best way." (patterns/autotuned-execution-pipeline-selection)
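The per-(matrix, batch-size) selection can be sketched as a benchmark-and-cache autotuner; the linear cost curves below are invented stand-ins that only echo the trade-off from the table (cheaper preprocess → heavier matmul):

```python
# Benchmark-and-cache autotuner over the four pipelines, keyed on
# (weight matrix, batch size). Cost curves are invented stand-ins.
PIPELINES = ["full_decode", "exponent_only", "palette_transcode", "direct_palette"]

COST = {  # fixed overhead + per-token cost (arbitrary units)
    "full_decode":       lambda b: 1.0 + 0.020 * b,  # stock cuBLAS, most HBM writes
    "exponent_only":     lambda b: 1.5 + 0.012 * b,
    "palette_transcode": lambda b: 2.0 + 0.008 * b,
    "direct_palette":    lambda b: 3.0 + 0.005 * b,  # no preprocess writes at all
}

def autotune(bench):
    """Measure each pipeline once per key, then cache the winner."""
    table = {}
    def pick(key):
        if key not in table:
            table[key] = min(PIPELINES, key=lambda p: bench(key, p))
        return table[key]
    return pick

pick = autotune(lambda key, p: COST[p](key[1]))
print(pick(("gate_proj", 1)))     # low fixed cost wins at tiny batches
print(pick(("gate_proj", 1024)))  # reconstruction hides inside the big matmul
```

Because each matrix gets its own key, gate / up / down projections in the same layer can land on different pipelines, matching "there's no single best way."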
Reconstructive matmul kernel¶
Three of the four pipelines use a custom Hopper-WGMMA kernel:
- Load compressed data from HBM to SMEM (via TMA).
- Reconstruct BF16 in SMEM.
- Feed the tensor cores (WGMMA) directly — the reconstructed weights never exist in main memory. (concepts/fused-decompression-matmul, concepts/hbm-vs-smem)
Thread groups inside the kernel split into:
- Producer group — drives HBM→SMEM transfer via TMA into a circular buffer; stages sign+mantissa bytes, exponent data (or palette indices), and verbatim-row data for rare-exponent rows.
- Consumer groups — reconstruct BF16 from exponent-plus-sign-plus-mantissa and feed WGMMA.
Variants differ in output tile width (wider tiles reuse data better at large batch) and circular-buffer depth (deeper buffers hide memory latency at small batch). Autotuner picks the variant. (patterns/sm-partitioning-producer-consumer)
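At the bit level, the consumer groups' reconstruction step amounts to splicing the palette exponent back between the sign and mantissa; a Python sketch, under the assumption that the compressed stream carries a packed sign+mantissa byte plus a 4-bit palette index:

```python
# Bit-level view of "reconstruct BF16 in SMEM": re-insert the exponent
# between the stored sign and mantissa. The packed sign+mantissa byte and
# 4-bit palette index are assumed details of the compressed layout.
import struct

def reconstruct_bf16(sign_mant: int, idx: int, palette: list) -> float:
    sign = (sign_mant >> 7) & 0x1        # bit 7: sign
    mant = sign_mant & 0x7F              # bits 0-6: mantissa
    exp = palette[idx]                   # the only Huffman-coded field
    bits16 = (sign << 15) | (exp << 7) | mant
    # BF16 is the top half of an IEEE-754 float32, so pad with 16 zero bits.
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

palette = [127, 126, 128, 125] + [0] * 12    # toy 16-entry exponent palette
print(reconstruct_bf16(0b01000000, 0, palette))   # 0x3FC0 -> 1.5
```

In the real kernel this happens per tile in SMEM right before the WGMMA issue; the point of the sketch is that reconstruction is a few shifts and ORs, cheap enough to hide behind the matmul.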
SM-level shared-memory contention¶
On Hopper each SM has 228 KB of SMEM. The reconstructive matmul needs ~227 KB for its pipeline buffer plus accumulator tiles; a Huffman decode kernel needs ~16 KB for its lookup table. Since 227 + 16 > 228, the two kernels cannot share the same SM. Every SM assigned to decoding is one fewer SM available to the matmul — the split is a tunable parameter the autotuner measures empirically.
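The SMEM arithmetic plus a toy rate model, over an assumed 132-SM H100 SXM5 with invented per-SM work rates, shows why the split is worth autotuning rather than fixing:

```python
# SMEM budget check from the text, plus a toy rate model over an assumed
# 132-SM H100 SXM5 showing why the decode/matmul SM split needs tuning.
SMEM_PER_SM_KB = 228
MATMUL_KB, DECODE_KB = 227, 16
assert MATMUL_KB + DECODE_KB > SMEM_PER_SM_KB   # kernels can't share one SM

TOTAL_SMS = 132                        # assumed SM count
DECODE_RATE, MATMUL_RATE = 1.0, 0.4    # invented per-SM work rates

def throughput(decode_sms):
    """End-to-end rate is capped by the slower of the two SM partitions."""
    return min(decode_sms * DECODE_RATE,
               (TOTAL_SMS - decode_sms) * MATMUL_RATE)

best = max(range(1, TOTAL_SMS), key=throughput)
print(f"best split: {best} decode SMs vs {TOTAL_SMS - best} matmul SMs")
```

The optimum sits wherever the two stages balance, and that point moves with batch size and matrix shape, which is why the runtime measures it rather than deriving it.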
Easy / hard layer alternation¶
Layers classify as:
- Hard — requires Huffman preprocessing at runtime.
- Easy — pre-transcoded palette at model load; matmul consumes directly.
The runtime alternates: while the GPU computes an easy layer (no preprocess needed), a separate CUDA stream decodes the next hard layer's weights. Double-buffered preprocess slots prevent overwrite while a slot's output is still being consumed. The down projection benefits most — consumed last in the MLP sequence (after gate, activation, up), so its decode has the longest runway.
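A toy scheduler illustrating the alternation; the two-slot double buffer and the event-trace format are simplifying assumptions, not the actual runtime's API:

```python
# Toy scheduler for easy/hard layer alternation: while an easy layer computes,
# a second stream decodes the next hard layer into one of two preprocess
# slots; a slot can't be reused until its layer has been consumed.
from collections import deque

def schedule(layers):
    trace, slots = [], deque()                     # slots: decoded, unconsumed
    hard = deque(name for kind, name in layers if kind == "hard")
    for kind, name in layers:
        if kind == "easy":
            if hard and len(slots) < 2:            # double-buffered slots
                slots.append(hard.popleft())
                trace.append(f"decode {slots[-1]} || compute {name}")
            else:
                trace.append(f"compute {name}")
        elif slots and slots[0] == name:           # decode already hidden
            slots.popleft()
            trace.append(f"compute {name}")
        else:                                      # prefetch missed: stall
            if hard and hard[0] == name:
                hard.popleft()
            trace.append(f"decode {name}; compute {name}")
    return trace

for event in schedule([("easy", "L0"), ("hard", "L1"),
                       ("easy", "L2"), ("hard", "L3")]):
    print(event)
```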
One format, two uses¶
The same Huffman-compressed bundle serves:
- Distribution — maximum compression (~22 % model-size reduction), reduces transfer time shipping models across the Cloudflare network.
- Inference — Huffman-encoded projections transcoded to the palette intermediate format at model load, enabling the most efficient runtime execution without constraining the distribution format.
The runtime still selects per-projection per-batch-size on the fly.
Measured results (Llama-3.1-8B, H100 SXM5)¶
- Model footprint reduction: ~13 % (inference, gate/up only) / ~22 % (distribution, all MLP).
- Absolute saved: ~3 GB per model instance. Extrapolated ~18–28 GB on Llama-70B depending on configuration.
- Throughput overhead: ~41 % at batch 1, narrowing to ~30 % at batch 1024.
- Bit-exact lossless by construction.
Three known sources of the overhead, all under active optimization:
- Small-batch fixed costs in the reconstructive matmul.
- Redundant weight-tile reconstruction at large batch sizes.
- Down projection still uncompressed (~⅓ of compressible weights).
Positioning vs. prior work¶
- ZipNN: distribution + storage only; CPU decode.
- Huff-LLM: proposes custom FPGA decode hardware.
- ZipServ: fuses decompression with GPU inference but targets consumer-grade GPUs, "which don't work with our H100 GPUs".
- Unweight: lossless inference-time decompression on Hopper datacenter GPUs, Rust-inference-engine integrable, four adaptive pipelines per batch size, ships open-source kernels + a technical paper.
Roadmap¶
- Down projection compression — different kernel variant due to transposed dimensions; closes the last ⅓ of compressible weights.
- Kernel optimization — small-batch fixed costs; redundant weight-tile reconstruction at large batches.
- More models — Llama-3.1-8B → larger Workers-AI-served models; SwiGLU-architecture exponent statistics consistent across scales.
- Mixture-of-Experts — where cold experts are fetched on demand, reduced storage would reduce fetch cost further.
Integration¶
- Runs on Workers AI as a VRAM-reduction lever beside Infire's activation-memory discipline. Unweight attacks weights; Infire attacks activations. The savings are additive and free up KV-cache headroom.
- Target hardware: NVIDIA H100 (Hopper) only at launch, using WGMMA + TMA.
- Open-source kernels: systems/unweight-kernels.
Seen in¶
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — launch post + design + Llama-3.1-8B numbers; honest overhead framing + roadmap.
Related¶
- Systems: systems/workers-ai, systems/infire, systems/nvidia-tensor-core, systems/unweight-kernels, systems/kimi-k2-5, systems/vllm.
- Concepts: concepts/lossless-weight-compression, concepts/huffman-coding, concepts/bf16-exponent-redundancy, concepts/memory-bandwidth-bound-inference, concepts/fused-decompression-matmul, concepts/hbm-vs-smem, concepts/quantization, concepts/kv-cache.
- Patterns: patterns/fused-decompress-tensor-core-matmul, patterns/autotuned-execution-pipeline-selection, patterns/sm-partitioning-producer-consumer, patterns/upstream-the-fix, patterns/measurement-driven-micro-optimization.
- Contrast: patterns/hardware-native-quantization — Dropbox / NVIDIA's lossy-compressed-operands-inside-the-MMA path; trades bit-exactness for higher compression.