
Unweight

Definition

Unweight is Cloudflare's lossless LLM weight compression system for NVIDIA H100 inference. It shrinks model footprint 15–22 % while preserving bit-exact outputs, without special hardware. The core technique is Huffman coding of the redundant BF16 exponent byte paired with a custom CUDA kernel that reconstructs weights in on-chip shared memory and feeds them directly to tensor cores — avoiding a round-trip through slow HBM. Announced 2026-04-17; GPU kernels open-sourced as unweight-kernels. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)

Why it exists

On an H100 the tensor cores can process data ~600× faster than HBM can deliver it. LLM token generation reads every weight from HBM for each token → the binding resource is memory bandwidth, not compute. "Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." Compressing the weights therefore translates directly into decode throughput, provided the decompression overhead is hidden.
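
The bandwidth bound turns into a throughput ceiling with one division. A rough roofline sketch — the bandwidth figure is the published H100 SXM5 spec, the model size assumes an 8B-parameter BF16 model, and both are illustrative:

```python
# Roofline-style estimate of decode throughput for a memory-bound LLM.
HBM_BANDWIDTH_GBPS = 3350        # H100 SXM5 HBM3, ~3.35 TB/s (spec figure)
WEIGHT_BYTES = 8e9 * 2           # 8B parameters x 2 bytes (BF16), illustrative

def decode_tokens_per_s(weight_bytes: float, bandwidth_gbps: float) -> float:
    """Each generated token must stream every weight byte from HBM once,
    so the throughput ceiling is bandwidth / model size."""
    return bandwidth_gbps * 1e9 / weight_bytes

base = decode_tokens_per_s(WEIGHT_BYTES, HBM_BANDWIDTH_GBPS)
# Shrinking the weights 22% raises the ceiling proportionally.
compressed = decode_tokens_per_s(WEIGHT_BYTES * 0.78, HBM_BANDWIDTH_GBPS)
print(f"ceiling: {base:.0f} tok/s -> {compressed:.0f} tok/s")
```

Any fractional reduction in weight bytes raises the ceiling by the same fraction — which is why hiding the decompression cost is the whole game.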

Prior work (ZipNN, Huff-LLM, ZipServ) fell short for Cloudflare's use case: CPU-side decompression, FPGA hardware, or consumer-grade-GPU targeting. None was simultaneously lossless, inference-time, Hopper-targeted, and integrable into a Rust inference engine.

Design

Compression format

  • Target: the 8-bit exponent byte of BF16 weights. Sign + mantissa look random and don't compress. (concepts/bf16-exponent-redundancy)
  • Coding: Huffman on exponents. Top-16 exponents cover >99 % of the distribution; information-theoretic floor ~2.6 bits vs 8 allocated → saving ~5.4 of every 16 weight bits, i.e. ~30 % per-weight compression.
  • Row-level escape for rare exponents: weights grouped into rows of 64; any row whose weights include an out-of-palette exponent is stored verbatim. No per-element branching in the hot path — one decision per row.
  • MLP-only: Unweight compresses gate / up / down MLP projections (~⅔ of model parameters, dominating decode memory traffic). Attention weights, embeddings, layer norms stay uncompressed.
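
The exponent-redundancy claim is easy to check empirically. A sketch on synthetic Gaussian weights — a stand-in for a real checkpoint; the σ = 0.02 and sample count are arbitrary choices, not from the source:

```python
import math
import random
import struct
from collections import Counter

def bf16_fields(x: float):
    """Split a float's BF16 encoding (the top 16 bits of its float32
    representation) into sign (1 bit), exponent (8 bits), mantissa (7 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

# Synthetic Gaussian weights: the exponent byte clusters tightly while
# sign and mantissa bits stay effectively incompressible.
random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(100_000)]
exponents = Counter(bf16_fields(w)[1] for w in weights)

n = len(weights)
entropy = -sum((c / n) * math.log2(c / n) for c in exponents.values())
top16 = sum(c for _, c in exponents.most_common(16)) / n
print(f"exponent entropy: {entropy:.2f} bits of 8; top-16 coverage: {top16:.2%}")
```

On real LLM weights the source reports the same shape: a ~2.6-bit entropy floor against the 8 bits BF16 allocates, with the top 16 exponent values covering >99 % of the mass.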

Four execution pipelines

The runtime picks between four paths per (weight matrix, batch size):

Pipeline          | Preprocess step                         | Matmul kernel  | Preprocess HBM writes | Matmul complexity
Full decode       | Huffman → full BF16 in HBM              | stock cuBLAS   | largest               | simplest
Exponent-only     | Huffman → BF16 exponent in HBM          | reconstructive | ½ of full             | medium
Palette transcode | transcode to 4-bit palette index in HBM | reconstructive | ¼ of full             | medium
Direct palette    | none (4-bit palette pre-baked at load)  | reconstructive | 0                     | heaviest

Trade-off: less preprocessing → less data written to HBM → bus freed sooner, but more reconstruction work falls on the matmul. At small batches cuBLAS's low overhead + tiny matmul favour full decode. At large batches (256+ tokens) reconstruction overlaps with the big matmul and palette / exponent pipelines pull ahead. Different matrices within the same layer (gate / up / down have different shapes) can favour different pipelines. "There's no single best way." (patterns/autotuned-execution-pipeline-selection)
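
The selection logic amounts to brute-force benchmarking with memoization. A minimal sketch — pipeline names and the toy cost models are hypothetical stand-ins for the real preprocess + matmul paths:

```python
import time
from typing import Callable, Dict, Tuple

def autotune(pipelines: Dict[str, Callable[[int], object]],
             matrices, batch_sizes, reps: int = 3):
    """Benchmark every pipeline for every (matrix, batch) pair and memoize
    the winner — mirroring 'no single best way': each weight matrix and
    batch size gets its own independent choice."""
    table: Dict[Tuple[str, int], str] = {}
    for m in matrices:
        for b in batch_sizes:
            timings = {}
            for name, run in pipelines.items():
                t0 = time.perf_counter()
                for _ in range(reps):
                    run(b)                      # stand-in for preprocess + matmul
                timings[name] = time.perf_counter() - t0
            table[(m, b)] = min(timings, key=timings.get)
    return table

pipelines = {
    # Toy cost models: full decode is cheap at small batch but its HBM
    # writes scale with work; direct palette has a heavy matmul whose
    # reconstruction amortizes over the batch.
    "full_decode":    lambda b: sum(range(200 * b)),
    "direct_palette": lambda b: sum(range(50_000 // b + 50 * b)),
}
choice = autotune(pipelines, matrices=["gate"], batch_sizes=[1, 1024])
print(choice)
```

With these invented costs the winner flips between batch 1 and batch 1024, which is the crossover behaviour the section describes.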

Reconstructive matmul kernel

Three of the four pipelines use a custom Hopper-WGMMA kernel:

  1. Load compressed data from HBM to SMEM (via TMA).
  2. Reconstruct BF16 in SMEM.
  3. Feed the tensor cores (WGMMA) directly — the reconstructed weights never exist in main memory. (concepts/fused-decompression-matmul, concepts/hbm-vs-smem)
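
The reconstruction step is plain bit surgery. A sketch of the BF16 reassembly the consumer side performs — done here on the CPU with Python's struct; the real kernel does the equivalent in SMEM:

```python
import struct

def reconstruct_bf16(sign_mant: int, exponent: int) -> float:
    """Reassemble a BF16 weight from its separately stored pieces: the
    sign+mantissa byte (1 sign bit + 7 mantissa bits) and the decoded
    8-bit exponent."""
    sign = (sign_mant >> 7) & 1
    mant = sign_mant & 0x7F
    bits16 = (sign << 15) | (exponent << 7) | mant
    # Widen BF16 to float32 by padding 16 zero bits.
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

def split_bf16(x: float):
    """Inverse: truncate a float to BF16 and split it into the
    sign+mantissa byte and the exponent byte."""
    bits16 = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    sign_mant = ((bits16 >> 15) << 7) | (bits16 & 0x7F)
    return sign_mant, (bits16 >> 7) & 0xFF

sm, e = split_bf16(0.15625)          # exactly representable in BF16
assert reconstruct_bf16(sm, e) == 0.15625
```

Because the split/reassemble round trip is pure bit manipulation, losslessness holds by construction — no rounding ever occurs.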

Thread groups inside the kernel split into:

  • Producer group — drives HBM→SMEM transfer via TMA into a circular buffer; stages sign+mantissa bytes, exponent data (or palette indices), and verbatim-row data for rare-exponent rows.
  • Consumer groups — reconstruct BF16 from exponent-plus-sign-plus-mantissa and feed WGMMA.

Variants differ in output tile width (wider tiles reuse data better at large batch) and circular-buffer depth (deeper buffers hide memory latency at small batch). Autotuner picks the variant. (patterns/sm-partitioning-producer-consumer)

SM-level shared-memory contention

On Hopper each SM has 228 KB of SMEM. The reconstructive matmul needs ~227 KB for its pipeline buffers and accumulator tiles; a Huffman decode kernel needs ~16 KB for its lookup table. Since 227 + 16 > 228, the two kernels cannot share an SM: every SM assigned to decoding is one fewer SM available to the matmul. The split is a tunable parameter the autotuner measures empirically.
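
The budget arithmetic and the resulting SM split can be sketched as follows — the SM count is the H100 SXM5 figure, but the cost model is an invented stand-in for the empirical measurement the autotuner actually runs:

```python
TOTAL_SMS = 132          # H100 SXM5 SM count
SMEM_PER_SM_KB = 228

# The two kernels cannot co-reside on one SM: the matmul's ~227 KB plus
# the decoder's ~16 KB exceeds the 228 KB budget.
assert 227 + 16 > SMEM_PER_SM_KB

def best_split(measure, total=TOTAL_SMS):
    """Empirically pick how many SMs to give the decode kernel;
    `measure(decode_sms)` is a hypothetical stand-in for timing the
    real layer at that split."""
    return min(range(1, total), key=measure)

# Invented cost model: decode time shrinks as it gets more SMs, matmul
# time grows as SMs are taken from it; they run concurrently, so the
# layer takes the max of the two.
cost = lambda d: max(4000 / d, 900 / (TOTAL_SMS - d))
print("decode SMs:", best_split(cost))
```

The real autotuner replaces the lambda with measured kernel timings; the search structure is the same.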

Easy / hard layer alternation

Layers classify as:

  • Hard — requires Huffman preprocessing at runtime.
  • Easy — pre-transcoded palette at model load; matmul consumes directly.

The runtime alternates: while the GPU computes an easy layer (no preprocess needed), a separate CUDA stream decodes the next hard layer's weights. Double-buffered preprocess slots prevent overwrite while a slot's output is still being consumed. The down projection benefits most — consumed last in the MLP sequence (after gate, activation, up), so its decode has the longest runway.
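
The alternation can be sketched as a toy scheduler — the event log, the slot count of two, and the layer labels are illustrative, not the runtime's actual API:

```python
from collections import deque

def schedule(layers):
    """Toy model of the easy/hard alternation: while the GPU computes an
    easy layer, a side stream Huffman-decodes the next hard layer into
    one of two preprocess slots (double buffering). Returns an event log."""
    slots = deque(maxlen=2)          # double-buffered preprocess outputs
    hard = deque(i for i, kind in enumerate(layers) if kind == "hard")
    log = []
    for i, kind in enumerate(layers):
        if kind == "easy":
            # Easy layer needs no preprocess: overlap the next hard
            # layer's decode on the side stream if a slot is free.
            if hard and len(slots) < slots.maxlen:
                j = hard.popleft()
                slots.append(j)
                log.append(f"decode L{j} || compute L{i}")
            else:
                log.append(f"compute L{i}")
        else:
            if i in slots:
                slots.remove(i)      # slot freed once its output is consumed
                log.append(f"compute L{i} (pre-decoded)")
            else:
                log.append(f"decode+compute L{i} (stall)")
    return log

print(schedule(["easy", "hard", "easy", "hard", "easy", "hard"]))
```

With alternating easy/hard layers every hard decode hides behind an easy compute and no stall appears — the steady state the runtime aims for, with the down projection getting the longest runway.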

One format, two uses

The same Huffman-compressed bundle serves:

  • Distribution — maximum compression (~22 % model-size reduction), reduces transfer time shipping models across the Cloudflare network.
  • Inference — Huffman-encoded projections transcoded to the palette intermediate format at model load, enabling the most efficient runtime execution without constraining the distribution format.

The runtime still selects per-projection per-batch-size on the fly.

Measured results (Llama-3.1-8B, H100 SXM5)

  • Model footprint reduction: ~13 % (inference, gate/up only) / ~22 % (distribution, all MLP).
  • Absolute saved: ~3 GB per model instance. Extrapolated ~18–28 GB on Llama-70B depending on configuration.
  • Throughput overhead: ~41 % at batch 1, narrowing to ~30 % at batch 1024.
  • Bit-exact lossless by construction.

Three known sources of the overhead, all under active optimization:

  1. Small-batch fixed costs in the reconstructive matmul.
  2. Redundant weight-tile reconstruction at large batch sizes.
  3. Down projection still uncompressed (~⅓ of compressible weights).

Positioning vs. prior work

  • ZipNN: distribution + storage only; CPU decode.
  • Huff-LLM: proposes custom FPGA decode hardware.
  • ZipServ: fuses decompression with GPU inference but targets consumer-grade GPUs, "which don't work with our H100 GPUs".
  • Unweight: lossless inference-time decompression on Hopper datacenter GPUs, Rust-inference-engine integrable, four adaptive pipelines per batch size, ships open-source kernels + a technical paper.

Roadmap

  • Down projection compression — different kernel variant due to transposed dimensions; closes the last ⅓ of compressible weights.
  • Kernel optimization — small-batch fixed costs; redundant weight-tile reconstruction at large batches.
  • More models — Llama-3.1-8B → larger Workers-AI-served models; SwiGLU-architecture exponent statistics consistent across scales.
  • Mixture-of-Experts — where cold experts are fetched on demand, reduced storage would reduce fetch cost further.

Integration

  • Runs on Workers AI as a VRAM-reduction lever beside Infire's activation-memory discipline. Unweight attacks weights; Infire attacks activations. Savings are additive into KV-cache headroom.
  • Target hardware: NVIDIA H100 (Hopper) only at launch, using wgmma + TMA.
  • Open-source kernels: systems/unweight-kernels.
