CLOUDFLARE 2026-04-17 Tier 1

Unweight: how we compressed an LLM 22% without sacrificing quality

Summary

Cloudflare introduces Unweight, a lossless compression system for LLM weights that shrinks model footprint by 15–22% while preserving bit-exact outputs — no hardware changes required. On Llama-3.1-8B, MLP weights compress by ~30%, translating to ~3 GB of VRAM saved per model replica. The core engineering idea: decompress weights in fast on-chip shared memory and feed them directly to the tensor cores, avoiding an extra round trip through slow main memory. A runtime autotuner picks between four execution pipelines per weight matrix per batch size, measuring actual throughput on the target hardware. Both the technical paper and the GPU kernels are public.

Key takeaways

Systems introduced / extended

  • systems/unweight (new) — Cloudflare's lossless MLP-weight compression system for H100 inference; four-pipeline autotuned execution; Huffman-on-exponent-only; verbatim-row escape; MLP-only scope.
  • systems/unweight-kernels (new) — open-source GPU kernels (github.com/cloudflareresearch/unweight-kernels).
  • systems/workers-ai — gains Unweight as a memory-footprint-reduction lever on the inference tier that pairs with Infire's activation-memory discipline; enables fitting more models per GPU across the 330-city network.
  • systems/infire — Cloudflare's proprietary Rust inference engine; the integration target Unweight plugs into. Prior posts framed Infire as the activation-memory-optimised engine; Unweight attacks the weights side of the same VRAM budget.
  • systems/nvidia-tensor-core — canonical instance of the "tensor cores 600× faster than HBM" memory-bandwidth-bound framing; Hopper wgmma + TMA + SMEM are the concrete primitives Unweight's reconstructive matmul uses.
  • systems/kimi-k2-5 — referenced as the 1T-param model Workers AI serves on 8× H100 with >30 GiB KV room via Infire; future Unweight targets named in the post roadmap.
  • systems/vllm — not directly compared but sits in the prior-art neighbourhood as a reference inference engine.

Concepts introduced / extended

  • concepts/lossless-weight-compression (new) — the problem class: bit-exact weight reconstruction, distinct from quantization (lossy).
  • concepts/huffman-coding (new) — variable-length prefix code assigning short codes to common symbols, long codes to rare ones; the specific entropy-coding primitive Unweight applies to BF16 exponent bytes.
  • concepts/bf16-exponent-redundancy (new) — empirical fact that BF16 exponent distributions in trained LLMs are sharply skewed; the physical basis for Unweight's ~30 % MLP-exponent compression.
  • concepts/memory-bandwidth-bound-inference (new) — the regime where per-token latency is gated by HBM→SMEM bytes, not FLOPs; dominates LLM decode on Hopper-class GPUs.
  • concepts/fused-decompression-matmul (new) — loading compressed weights into SMEM, reconstructing there, feeding tensor cores without a round-trip through HBM.
  • concepts/hbm-vs-smem (new) — the two-tier GPU memory hierarchy (large+slow HBM vs tiny+fast SMEM) that makes fused-decompression the correct pattern rather than decompress-to-HBM-then-matmul.
  • concepts/quantization — extended with a "lossless alternative" Related reference framing. Unweight is explicitly not quantization ("different 16-bit floating point values can be converted to the same 4-bit integer" — the failure mode Cloudflare wanted to avoid for "production inference serving diverse use cases").
  • concepts/kv-cache — referenced as the other GPU-memory consumer alongside weights; Unweight's VRAM savings translate directly into more KV-cache room (canonical example: 2× H200 fits Llama 4 Scout with >56 GiB KV room, cited in the prior Infire post).
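The Huffman-on-exponent and verbatim-row-escape concepts above can be sketched concretely: entropy-code only the BF16 exponent byte of each weight, and when a 64-weight row contains an exponent outside the code table, store that whole row verbatim. A minimal Python sketch, illustrative only — Unweight's real implementation is CUDA kernels, and everything here beyond the 64-weight row size and the exponent-only scope is invented for demonstration:

```python
import heapq
from collections import Counter

ROW = 64  # escape granularity from the post: whole 64-weight rows go verbatim

def huffman_code(freqs):
    """Build a prefix code (symbol -> bitstring) from symbol frequencies."""
    if len(freqs) == 1:                       # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    code = {s: "" for s in freqs}
    # (frequency, tiebreaker, symbols) triples; tiebreaker keeps tuples comparable
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1: code[s] = "0" + code[s]  # prepend bits as the tree grows
        for s in s2: code[s] = "1" + code[s]
        heapq.heappush(heap, (f1 + f2, tick, s1 + s2))
        tick += 1
    return code

def encode_exponents(exponents, code):
    """Entropy-code exponent bytes row by row; any row containing a rare
    (out-of-table) exponent is emitted verbatim instead."""
    blocks = []
    for i in range(0, len(exponents), ROW):
        row = exponents[i:i + ROW]
        if all(e in code for e in row):
            blocks.append(("huff", "".join(code[e] for e in row)))
        else:
            blocks.append(("raw", row))       # verbatim-row escape
    return blocks

def decode_exponents(blocks, code):
    """Bit-exact reconstruction of the original exponent stream."""
    inv = {bits: s for s, bits in code.items()}
    exps = []
    for kind, payload in blocks:
        if kind == "raw":
            exps.extend(payload)
        else:
            buf = ""
            for bit in payload:               # prefix property => greedy decode
                buf += bit
                if buf in inv:
                    exps.append(inv[buf])
                    buf = ""
    return exps
```

Because the code is a prefix code over a sharply skewed exponent distribution, common rows shrink well below 8 bits per exponent while the escape keeps rare exponents from blowing up code lengths — and the round trip is exact, which is the whole "lossless" guarantee.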

Patterns introduced / extended

  • patterns/fused-decompress-tensor-core-matmul (new) — custom GPU kernel that loads compressed data, reconstructs the uncompressed representation in SMEM, feeds tensor cores — the reconstructed representation never touches HBM. Producer / consumer thread-group split inside the kernel.
  • patterns/autotuned-execution-pipeline-selection (new) — rather than picking one compressed-execution strategy, offer a spectrum (full decode / exponent-only / palette / direct palette) and let a runtime autotuner sweep candidate configurations per (weight matrix, batch size), informed by measured end-to-end throughput on the target hardware. Mirrors measurement-driven micro-optimization at the kernel-selection grain.
  • patterns/sm-partitioning-producer-consumer (new) — inside a single GPU kernel, partition SMs (or thread groups within an SM) into dedicated producers (drive HBM→SMEM transfers via TMA into a circular buffer) and consumers (compute from SMEM). Depth of the circular buffer is itself an autotunable knob (wider tiles reuse data better at large batch; deeper buffers hide memory latency at small batch).
  • patterns/upstream-the-fix — extended with a "open-source the specialised kernels" instance: Unweight ships both a technical paper and the CUDA kernels, explicitly framing this as "contributing to a growing corpus of research in compression and GPU efficiency" — same posture as the 2025-10 V8 / OpenNext / Node.js upstream-PR instance.
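The autotuned-selection pattern above reduces to a small sweep loop: for each (weight matrix, batch size) pair, time every candidate (pipeline, circular-buffer depth) configuration on the actual hardware and keep the winner. A hedged Python sketch — the four pipeline names come from the post, but the timing harness, knob values, and `run_kernel` callable are hypothetical stand-ins for the real CUDA kernels:

```python
import time
from itertools import product

# The four execution pipelines named in the post.
PIPELINES = ["full_decode", "exponent_only", "palette", "direct_palette"]
# Circular-buffer depth is itself an autotunable knob (values invented here).
BUFFER_DEPTHS = [2, 3, 4]

def measure_throughput(pipeline, depth, matrix_shape, batch, run_kernel):
    """Time one candidate configuration; return a tokens-per-second-style score."""
    t0 = time.perf_counter()
    run_kernel(pipeline, depth, matrix_shape, batch)
    return batch / (time.perf_counter() - t0)

def autotune(matrix_shape, batch, run_kernel):
    """Pick the (pipeline, depth) with the best measured throughput for one
    (weight matrix, batch size) pair on the target hardware."""
    best, best_tps = None, 0.0
    for pipeline, depth in product(PIPELINES, BUFFER_DEPTHS):
        tps = measure_throughput(pipeline, depth, matrix_shape, batch, run_kernel)
        if tps > best_tps:
            best, best_tps = (pipeline, depth), tps
    return best
```

In practice the sweep result would be cached per (matrix, batch size) so the measurement cost is paid once, not per request — the point of the pattern is that the winner differs across matrices and batch sizes, so no single static choice is right.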

Operational numbers

| Metric | Value | Notes |
| --- | --- | --- |
| MLP weight compression (Llama-3.1-8B) | ~30% | Exponent byte only |
| Overall model size reduction — inference bundle | ~13% | Gate + up MLP projections only |
| Overall model size reduction — distribution bundle | ~22% | All MLP projections (gate/up/down) |
| Absolute VRAM saved (Llama-3.1-8B) | ~3 GB | Per instance |
| Extrapolated savings on Llama-70B | ~18–28 GB | Config-dependent |
| Top-k exponents covering >99% of weights | 16 / 256 | BF16 8-bit exponent |
| Information-theoretic bits per exponent | ~2.6 | vs 8 allocated |
| Row escape granularity | 64 weights | Whole row stored verbatim on rare exponent |
| H100 tensor-core : HBM speed ratio | ~600× | Motivates the bandwidth-bound framing |
| Hopper SM shared memory | 228 KB | Fixed |
| Reconstructive matmul SMEM requirement | ~227 KB | Pipeline buffer + accumulator tiles |
| Huffman decode kernel SMEM requirement | ~16 KB | Lookup table |
| Throughput overhead (batch 1, H100 SXM5) | ~41% | Current, being optimised |
| Throughput overhead (batch 1024, H100 SXM5) | ~30% | Narrows at larger batches |
| Hardware target | NVIDIA H100 (Hopper) | wgmma tensor cores + TMA |
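The ~30% MLP figure follows from the entropy number in the table: a BF16 weight is 16 bits, of which 8 are exponent; if the exponent carries only ~2.6 bits of information, an ideal entropy coder keeps sign and mantissa verbatim and stores roughly (8 + 2.6)/16 ≈ 66% of the original bits. A quick check of that arithmetic:

```python
# BF16 layout: 1 sign + 8 exponent + 7 mantissa bits = 16 bits per weight.
BF16_BITS = 16
EXP_BITS = 8
EXP_ENTROPY = 2.6  # measured information content per exponent (from the post)

# Sign + mantissa bits stay verbatim; only the exponent byte is entropy-coded.
compressed_bits = (BF16_BITS - EXP_BITS) + EXP_ENTROPY
savings = 1 - compressed_bits / BF16_BITS
print(f"ideal savings per weight: {savings:.1%}")
```

This gives an upper bound of ~33.8% per weight; the measured ~30% is slightly below it because a real Huffman code (integer-length codewords) plus the verbatim-row escape cannot quite reach the entropy limit.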

Caveats

  • Throughput cost is real. 30–40 % end-to-end overhead at the current optimization level. Cloudflare frames this honestly and projects narrowing with named mitigations (small-batch fixed-cost reduction, redundant-reconstruction elimination at large batches, down-projection compression added). For now it still costs more than an uncompressed run.
  • MLP-only. Attention + embeddings + layer norms stay uncompressed; the 22 % cap is a fraction of compressible weights, not the full model. Compression % will creep upward as down projection gets added + possibly attention later, but it won't approach 50 %.
  • H100-only kernels today. Hopper wgmma + TMA are used directly; porting to Blackwell or AMD MI-series is future work. Llama-3.1-8B is the only measured model.
  • Quality guarantee is bit-exact, measured by construction. The compression is lossless by design (Huffman-on-exponent + verbatim-row escape is information-theoretically reversible); no quality-regression benchmark disclosed because by construction there shouldn't be one.
  • No competitive measurement. vs ZipNN / Huff-LLM / ZipServ is explicitly qualitative (scope-and-target differences), not benchmarked side-by-side.
  • SwiGLU-architecture specificity. The "exponent statistics are consistent across model scales" claim is framed for SwiGLU-family models. Non-SwiGLU architectures are projected to generalise but not tested.
