Unweight: how we compressed an LLM 22% without sacrificing quality¶
Summary¶
Cloudflare introduces Unweight, a lossless compression system for LLM weights that shrinks model footprint by 15–22 % while preserving bit-exact outputs — no hardware changes required. On Llama-3.1-8B, MLP weights compress ~30 %, translating to ~3 GB VRAM saved per model replica. The core engineering idea: decompress weights in fast on-chip shared memory and feed them directly to the tensor cores, avoiding an extra round-trip through slow main memory. A runtime autotuner picks between four execution pipelines per weight matrix per batch size, measuring actual throughput on the target hardware. The technical paper + GPU kernels are both public.
Key takeaways¶
- LLM inference is memory-bandwidth-bound, not compute-bound. On an H100 the tensor cores can process data ~600× faster than HBM can deliver it; per-token generation requires reading every weight from HBM. "Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." → concepts/memory-bandwidth-bound-inference (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
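The bandwidth-bound framing reduces to one division: tokens/sec can't exceed memory bandwidth over model bytes. A hedged back-of-envelope sketch (the ~3350 GB/s HBM3 figure for H100 SXM5 is an assumption of this sketch, not a number from the post):

```python
# Decode ceiling when every weight byte must cross the memory bus once per token.
def decode_token_ceiling(params_billions: float, bytes_per_param: int,
                         hbm_gb_per_s: float) -> float:
    """Upper bound on batch-1 tokens/sec for a bandwidth-bound decode."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return hbm_gb_per_s * 1e9 / model_bytes

# Assumed figures: 8B params in BF16 (2 bytes/param), ~3350 GB/s HBM bandwidth.
print(decode_token_ceiling(8, 2, 3350))   # 209.375 tokens/s ceiling
```

Shrinking the weights by 22 % raises this ceiling proportionally, which is why lossless compression pays off at all in this regime.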
- BF16 exponent bytes are highly redundant. Out of 256 possible exponent values, ~16 exponents cover >99 % of the weights in a typical layer. Information theory says ~2.6 bits suffice to represent the distribution vs the 8 bits allocated. Sign + mantissa look like random data and don't compress; the exponent is the entire savings surface. → concepts/bf16-exponent-redundancy, concepts/huffman-coding (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
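A minimal sketch of both facts (BF16 is the top 16 bits of FP32, so the fields are easy to extract; the skewed histogram below is illustrative, not the paper's measured distribution):

```python
import math
import struct

def bf16_fields(x: float):
    """BF16 is the top 16 bits of an FP32: 1 sign, 8 exponent, 7 mantissa bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

def entropy_bits(freqs):
    """Shannon entropy: the information-theoretic bits needed per symbol."""
    total = sum(freqs.values())
    return -sum(f / total * math.log2(f / total) for f in freqs.values() if f)

print(bf16_fields(1.0))     # (0, 127, 0): sign 0, exponent 127, mantissa 0
print(bf16_fields(-2.0))    # (1, 128, 0)

# Toy skewed histogram: each exponent half as common as the previous one.
# Entropy lands near 2 bits — far below the 8 bits the format allocates.
skewed = {127 - i: 2 ** (12 - i) for i in range(12)}
print(round(entropy_bits(skewed), 1))   # 2.0
```

The gap between the entropy and the 8 allocated bits is exactly the savings surface an entropy coder like Huffman can claim.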
- Verbatim-row escape for rare exponents. Rather than per-element branching, Unweight groups weights into rows of 64 — if any weight in the row has an exponent outside the top-16 palette, the entire row is stored verbatim. One decision per row, zero branches in the hot path. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
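A hedged sketch of the row-level escape (the palette contents and on-disk layout here are assumptions of this sketch; only the row size of 64 and the top-16 rule come from the post):

```python
ROW = 64   # escape decision granularity: one flag per 64-weight row

def encode_rows(exponents, palette):
    """Rows whose exponents all fall in the top-16 palette become palette
    indices; a row containing any rare exponent is stored verbatim, whole —
    one decision per row, no per-element branching on the decode path."""
    pal = set(palette)
    rows = []
    for i in range(0, len(exponents), ROW):
        row = exponents[i:i + ROW]
        if all(e in pal for e in row):
            rows.append(('palette', [palette.index(e) for e in row]))
        else:
            rows.append(('verbatim', row))   # full 8-bit bytes, unchanged
    return rows

palette = list(range(112, 128))                       # assumed top-16 exponents
encoded = encode_rows([120] * 64 + [120] * 63 + [3], palette)
print([kind for kind, _ in encoded])                  # ['palette', 'verbatim']
```

One rare exponent costs the whole 64-weight row its compression, but the decoder's hot path stays branch-free: it checks one flag per row, never per element.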
- Compressing only MLP weights, not attention / embeddings / layer norms. MLP weights (gate / up / down projections) are ~⅔ of parameters and dominate memory traffic during decode; attention weights + embeddings + layer norms stay uncompressed because compression benefit is marginal there. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
- Four execution pipelines, autotuned per workload. Different batch sizes + weight shapes favour different tradeoffs between preprocessing traffic and per-element reconstruction work: full decode (preprocess writes full BF16 back to HBM, hand to cuBLAS), exponent-only decode (halve preprocess traffic), palette transcode (quarter it), direct palette (skip preprocessing entirely, reconstruct inside the matmul kernel). "There's no single best way." The runtime picks per (weight matrix, batch size). → patterns/autotuned-execution-pipeline-selection (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
- The reconstructive matmul. Three of the four pipelines use a custom Hopper-WGMMA kernel that loads compressed data from HBM, reconstructs BF16 in shared memory, and feeds the tensor cores — the reconstructed weights never touch main memory. The kernel splits threads into a producer group (TMA-driven HBM→SMEM transfer into a circular buffer) and consumer groups (exponent + sign+mantissa → BF16 → WGMMA). → patterns/fused-decompress-tensor-core-matmul, patterns/sm-partitioning-producer-consumer (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
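The consumer stage's reconstruction step can be illustrated on the CPU (a sketch of the bit manipulation only, not the CUDA kernel; the two-stream layout is inferred from the exponent/sign+mantissa split the post describes):

```python
import struct

def reconstruct_bf16(exponents, sign_mantissa):
    """Recombine the two weight streams into bit-exact BF16 values.

    Each sign_mantissa byte packs 1 sign bit + 7 mantissa bits; the exponent
    byte is stored (and entropy-coded) separately because only it is redundant.
    """
    out = []
    for exp, sm in zip(exponents, sign_mantissa):
        sign, mant = sm >> 7, sm & 0x7F
        bits16 = (sign << 15) | (exp << 7) | mant     # BF16 bit layout
        # Widen BF16 to FP32 by appending 16 zero bits, then reinterpret.
        out.append(struct.unpack('>f', struct.pack('>I', bits16 << 16))[0])
    return out

print(reconstruct_bf16([127, 128], [0x00, 0x80]))   # [1.0, -2.0]
```

Because the recombination is pure bit assembly, losslessness is by construction: the round trip reproduces the original 16 bits exactly.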
- SM-level shared-memory contention forces a tunable split. On Hopper each SM has 228 KB of SMEM. The reconstructive matmul needs ~227 KB; a Huffman decode kernel needs ~16 KB. Since 227 + 16 > 228, the two kernels cannot share an SM; every SM assigned to decoding is an SM unavailable to the matmul. The autotuner measures the optimal split empirically. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
- Easy/hard layer alternation hides decompression cost. Unweight classifies layers as "hard" (requires runtime Huffman preprocess) vs "easy" (pre-transcoded palette, no preprocess). While the GPU computes an easy layer, a separate CUDA stream decodes the next hard layer's weights; double-buffered preprocess slots prevent overwrite. The down projection benefits most because it's consumed last in the MLP sequence. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
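The overlap pattern can be sketched as a toy schedule (the layer names and alternating easy/hard classification here are illustrative assumptions; real models interleave differently):

```python
# Toy layer sequence: which layers need a runtime Huffman preprocess ("hard")
# versus a pre-transcoded palette ("easy") is assumed for illustration.
layers = [('L0', 'easy'), ('L1', 'hard'), ('L2', 'easy'), ('L3', 'hard')]

def schedule(layers):
    """Pair each compute step with the decode work a side stream can overlap:
    while an easy layer runs (no preprocess needed), the second stream
    Huffman-decodes the next hard layer's weights into a spare buffer slot."""
    plan = []
    for i, (name, kind) in enumerate(layers):
        nxt = next((n for n, k in layers[i + 1:] if k == 'hard'), None)
        overlapped = f'decode {nxt}' if kind == 'easy' and nxt else 'idle'
        plan.append((f'compute {name}', overlapped))
    return plan

for compute, overlapped in schedule(layers):
    print(f'{compute} | {overlapped}')
```

Printed plan: each easy layer's compute shadows the next hard layer's decode (`compute L0 | decode L1`, then `compute L1 | idle`, and so on) — the decompression latency disappears behind work the GPU had to do anyway.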
- One compression format, two uses. The same Huffman-encoded bundle serves both distribution (max compression — ~22 % model-size reduction, reduces transfer time when shipping models across the Cloudflare network) and inference (transcoded to the palette intermediate format at load time, enabling the most efficient runtime execution without constraining the distribution format). (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
- Honest cost framing: not a free lunch. Llama-3.1-8B results: ~13 % footprint reduction (inference bundles, gate/up only) / ~22 % (distribution bundles, all MLP). Current throughput overhead 30–40 % end-to-end on H100 SXM5: ~41 % at batch 1, narrowing to ~30 % at batch 1024. Three known mitigations under active optimization: small-batch fixed costs, redundant weight-tile reconstruction at large batches, and the still-uncompressed down projection (~⅓ of compressible weights). Extrapolating to Llama-70B: ~18–28 GB saved depending on configuration. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
- Positioning vs. prior work. ZipNN compresses for distribution + storage (CPU decode). Huff-LLM proposes custom FPGA decode hardware. ZipServ does fuse decompression with GPU inference but targets consumer-grade GPUs. None deliver lossless inference-time decompression on datacenter Hopper GPUs integrable with a Rust inference engine — "none of these gave us what we needed." Prior work of the general form ~30 % full-model compression exists but targets consumer GPUs + research frameworks that "don't work at production scale." (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
Systems introduced / extended¶
- systems/unweight (new) — Cloudflare's lossless MLP-weight compression system for H100 inference; four-pipeline autotuned execution; Huffman-on-exponent-only; verbatim-row escape; MLP-only scope.
- systems/unweight-kernels (new) — open-source GPU kernels (github.com/cloudflareresearch/unweight-kernels).
- systems/workers-ai — gains Unweight as a memory-footprint-reduction lever on the inference tier that pairs with Infire's activation-memory discipline; enables fitting more models per GPU across the 330-city network.
- systems/infire — Cloudflare's proprietary Rust inference engine; the integration target Unweight plugs into. Prior posts framed Infire as the activation-memory-optimised engine; Unweight attacks the weights side of the same VRAM budget.
- systems/nvidia-tensor-core — canonical instance of the "tensor cores 600× faster than HBM" memory-bandwidth-bound framing; Hopper WGMMA + TMA + SMEM are the concrete primitives Unweight's reconstructive matmul uses.
- systems/kimi-k2-5 — referenced as the 1T-param model Workers AI serves on 8× H100 with >30 GiB KV room via Infire; future Unweight targets named in the post roadmap.
- systems/vllm — not directly compared but sits in the prior-art neighbourhood as a reference inference engine.
Concepts introduced / extended¶
- concepts/lossless-weight-compression (new) — the problem class: bit-exact weight reconstruction, distinct from quantization (lossy).
- concepts/huffman-coding (new) — variable-length prefix code assigning short codes to common symbols, long codes to rare ones; the specific entropy-coding primitive Unweight applies to BF16 exponent bytes.
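A minimal sketch of the primitive (a standard Huffman construction over a toy exponent histogram — the histogram values are illustrative, not Unweight's):

```python
import heapq

def huffman_code_lengths(freqs):
    """Compute Huffman code lengths per symbol: common symbols get short
    codes, rare ones long codes. Lengths are what a decode table needs."""
    # Heap entries: (total frequency, tiebreak counter, {symbol: depth}).
    heap = [(f, i, {sym: 0}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (fa + fb, tick, merged))
        tick += 1
    return heap[0][2]

# Toy exponent histogram: one dominant value, progressively rarer neighbours.
lengths = huffman_code_lengths({127: 60, 126: 25, 125: 10, 124: 5})
print(lengths)   # {124: 3, 125: 3, 126: 2, 127: 1}
```

The dominant exponent gets a 1-bit code while the rare ones pay 3 bits — the average code length approaches the distribution's entropy, which is where the ~2.6-bits-vs-8 gap turns into real bytes saved.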
- concepts/bf16-exponent-redundancy (new) — empirical fact that BF16 exponent distributions in trained LLMs are sharply skewed; the physical basis for Unweight's ~30 % MLP-exponent compression.
- concepts/memory-bandwidth-bound-inference (new) — the regime where per-token latency is gated by HBM→SMEM bytes, not FLOPs; dominates LLM decode on Hopper-class GPUs.
- concepts/fused-decompression-matmul (new) — loading compressed weights into SMEM, reconstructing there, feeding tensor cores without a round-trip through HBM.
- concepts/hbm-vs-smem (new) — the two-tier GPU memory hierarchy (large+slow HBM vs tiny+fast SMEM) that makes fused-decompression the correct pattern rather than decompress-to-HBM-then-matmul.
- concepts/quantization — extended with a "lossless alternative" Related reference framing. Unweight is explicitly not quantization ("different 16-bit floating point values can be converted to the same 4-bit integer" — the failure mode Cloudflare wanted to avoid for "production inference serving diverse use cases").
- concepts/kv-cache — referenced as the other GPU-memory consumer alongside weights; Unweight's VRAM savings translate directly into more KV-cache room (canonical example: 2× H200 fits Llama 4 Scout with >56 GiB KV room, cited in the prior Infire post).
Patterns introduced / extended¶
- patterns/fused-decompress-tensor-core-matmul (new) — custom GPU kernel that loads compressed data, reconstructs the uncompressed representation in SMEM, feeds tensor cores — the reconstructed representation never touches HBM. Producer / consumer thread-group split inside the kernel.
- patterns/autotuned-execution-pipeline-selection (new) — rather than picking one compressed-execution strategy, offer a spectrum (full decode / exponent-only / palette / direct palette) and let a runtime autotuner sweep candidate configurations per (weight matrix, batch size), informed by measured end-to-end throughput on the target hardware. Mirrors measurement-driven micro-optimization at the kernel-selection grain.
- patterns/sm-partitioning-producer-consumer (new) — inside a single GPU kernel, partition SMs (or thread groups within an SM) into dedicated producers (drive HBM→SMEM transfers via TMA into a circular buffer) and consumers (compute from SMEM). Depth of the circular buffer is itself an autotunable knob (wider tiles reuse data better at large batch; deeper buffers hide memory latency at small batch).
- patterns/upstream-the-fix — extended with an "open-source the specialised kernels" instance: Unweight ships both a technical paper and the CUDA kernels, explicitly framing this as "contributing to a growing corpus of research in compression and GPU efficiency" — same posture as the 2025-10 V8 / OpenNext / Node.js upstream-PR instance.
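The autotuned-selection pattern above can be sketched end to end (the four pipeline names follow the post, but their bodies here are hypothetical stand-in workloads that merely cost different amounts, not real kernels):

```python
import time

# Stand-in "pipelines": each burns a different amount of CPU work, standing in
# for the different preprocess-traffic/reconstruction-work tradeoffs.
def full_decode(n):       return sum(range(n * 4))
def exponent_only(n):     return sum(range(n * 2))
def palette_transcode(n): return sum(range(n))
def direct_palette(n):    return sum(range(n // 2))

PIPELINES = [full_decode, exponent_only, palette_transcode, direct_palette]

def measure(fn, n):
    start = time.perf_counter()
    fn(n)
    return time.perf_counter() - start

def autotune(workload, repeats=5):
    """Sweep every candidate on the actual workload and keep the fastest —
    the per-(weight matrix, batch size) selection the pattern describes."""
    timings = {fn.__name__: min(measure(fn, workload) for _ in range(repeats))
               for fn in PIPELINES}
    return min(timings, key=timings.get)

print(autotune(200_000))   # the cheapest stand-in wins on this toy workload
```

The key design choice the pattern encodes: selection is by measured wall-clock on the target hardware, not by an analytical cost model, so it automatically absorbs effects (cache behaviour, occupancy, SMEM contention) that a model would have to predict.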
Operational numbers¶
| Metric | Value | Notes |
|---|---|---|
| MLP weight compression (Llama-3.1-8B) | ~30 % | Exponent byte only |
| Overall model size reduction — inference bundle | ~13 % | gate + up MLP projections only |
| Overall model size reduction — distribution bundle | ~22 % | all MLP projections (gate/up/down) |
| Absolute VRAM saved (Llama-3.1-8B) | ~3 GB | Per instance |
| Extrapolated saved on Llama-70B | ~18–28 GB | Config-dependent |
| Top-k exponents covering >99 % of weights | 16 / 256 | BF16 8-bit exponent |
| Information-theoretic bits per exponent | ~2.6 | Vs 8 allocated |
| Row escape granularity | 64 weights | Whole-row verbatim on rare exponent |
| H100 tensor-core : HBM speed ratio | ~600× | Motivating the bandwidth-bound framing |
| Hopper SM shared memory | 228 KB | Fixed |
| Reconstructive matmul SMEM requirement | ~227 KB | Pipeline buffer + accumulator tiles |
| Huffman decode kernel SMEM requirement | ~16 KB | Lookup table |
| Throughput overhead (batch 1, H100 SXM5) | ~41 % | Current, being optimised |
| Throughput overhead (batch 1024, H100 SXM5) | ~30 % | Narrows at larger batches |
| Hardware target | NVIDIA H100 (Hopper) | WGMMA + TMA tensor-core primitives |
Caveats¶
- Throughput cost is real. 30–40 % end-to-end overhead at the current optimization level. Cloudflare frames this honestly and projects narrowing with named mitigations (small-batch fixed cost reduction, redundant reconstruction elimination at large batches, down-projection compression added). Still-higher cost than an uncompressed run.
- MLP-only. Attention + embeddings + layer norms stay uncompressed; the 22 % cap is a fraction of compressible weights, not the full model. Compression % will creep upward as down projection gets added + possibly attention later, but it won't approach 50 %.
- H100-only kernels today. Hopper WGMMA + TMA are used directly; porting to Blackwell or AMD MI-series is future work. Llama-3.1-8B is the only measured model.
- Quality guarantee is bit-exact by construction. The compression is lossless by design (Huffman-on-exponent + verbatim-row escape is information-theoretically reversible); no quality-regression benchmark disclosed because by construction there shouldn't be one.
- No competitive measurement. vs ZipNN / Huff-LLM / ZipServ is explicitly qualitative (scope-and-target differences), not benchmarked side-by-side.
- SwiGLU-architecture specificity. The "exponent statistics are consistent across model scales" claim is framed for SwiGLU-family models. Non-SwiGLU architectures are projected to generalise but not tested.
Source¶
- Original: https://blog.cloudflare.com/unweight-tensor-compression/
- Technical paper: research.cloudflare.com/nikulin2026
- GPU kernels (open source): github.com/cloudflareresearch/unweight-kernels
- Raw markdown:
raw/cloudflare/2026-04-17-unweight-how-we-compressed-an-llm-22-without-sacrificing-qua-2d17c53a.md
Related¶
- systems/unweight / systems/unweight-kernels — the system + open-source kernels this post launches.
- systems/workers-ai / systems/infire — the inference tier + engine Unweight plugs into.
- systems/nvidia-tensor-core — hardware substrate; the tensor-core-vs-HBM speed ratio is the economic engine.
- concepts/memory-bandwidth-bound-inference — why compression of any kind wins at all on Hopper-class GPUs for LLM decode.
- concepts/huffman-coding / concepts/bf16-exponent-redundancy — the entropy-coding primitive + the distributional fact that makes it win.
- concepts/fused-decompression-matmul / concepts/hbm-vs-smem — the GPU-memory-hierarchy argument for reconstructing in SMEM.
- concepts/quantization — lossy sibling Unweight is explicitly not; contrasts with hardware-native quantization which also consumes compressed operands inside the tensor-core instruction but trades bit-exactness for higher compression.
- patterns/fused-decompress-tensor-core-matmul / patterns/autotuned-execution-pipeline-selection / patterns/sm-partitioning-producer-consumer — the three new kernel-engineering patterns.
- patterns/upstream-the-fix — the "ship the paper + kernels public" contribution posture.
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — prior Workers AI serving-architecture post; Infire's activation-memory discipline is the complementary lever Unweight plugs alongside.
- companies/cloudflare — recurring hot-path performance engineering shape: flame-graph-driven micro-optimization, open the source, publish the methodology.