
Unweight

Definition

Unweight is Cloudflare's lossless LLM weight compression system for NVIDIA H100 inference. It shrinks model footprint 15–22 % while preserving bit-exact outputs, without special hardware. The core technique is Huffman coding of the redundant BF16 exponent byte paired with a custom CUDA kernel that reconstructs weights in on-chip shared memory and feeds them directly to tensor cores — avoiding a round-trip through slow HBM. Announced 2026-04-17; GPU kernels open-sourced as unweight-kernels. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)

Why it exists

On an H100 the tensor cores can process data ~600× faster than HBM can deliver it. LLM token generation reads every weight from HBM for each token → the binding resource is memory bandwidth, not compute. "Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." Compressing the weights therefore translates directly into decode throughput, provided the decompression overhead is hidden.
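
The bandwidth bound turns into a throughput ceiling with one division. A rough roofline sketch — the bandwidth figure is the published H100 SXM5 spec, the model size assumes an 8B-parameter BF16 model, and both are illustrative:

```python
# Roofline-style estimate of decode throughput for a memory-bound LLM.
HBM_BANDWIDTH_GBPS = 3350        # H100 SXM5 HBM3, ~3.35 TB/s (spec figure)
WEIGHT_BYTES = 8e9 * 2           # 8B parameters x 2 bytes (BF16), illustrative

def decode_tokens_per_s(weight_bytes: float, bandwidth_gbps: float) -> float:
    """Each generated token must stream every weight byte from HBM once,
    so the throughput ceiling is bandwidth / model size."""
    return bandwidth_gbps * 1e9 / weight_bytes

base = decode_tokens_per_s(WEIGHT_BYTES, HBM_BANDWIDTH_GBPS)
# Shrinking the weights 22% raises the ceiling proportionally.
compressed = decode_tokens_per_s(WEIGHT_BYTES * 0.78, HBM_BANDWIDTH_GBPS)
print(f"ceiling: {base:.0f} tok/s -> {compressed:.0f} tok/s")
```

Any fractional reduction in weight bytes raises the ceiling by the same fraction — which is why hiding the decompression cost is the whole game.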

Prior work (ZipNN, Huff-LLM, ZipServ) fell short for Cloudflare's use case: CPU-side decompression, FPGA hardware, or consumer-grade-GPU targeting. None was simultaneously lossless, inference-time, Hopper-targeted, and integrable into a Rust inference engine.

Design

Compression format

  • Target: the 8-bit exponent byte of BF16 weights. Sign + mantissa look random and don't compress. (concepts/bf16-exponent-redundancy)
  • Coding: Huffman on exponents. Top-16 exponents cover >99 % of the distribution; information-theoretic floor ~2.6 bits vs 8 allocated → saving ~5.4 of every 16 weight bits, i.e. ~30 % per-weight compression.
  • Row-level escape for rare exponents: weights grouped into rows of 64; any row whose weights include an out-of-palette exponent is stored verbatim. No per-element branching in the hot path — one decision per row.
  • MLP-only: Unweight compresses gate / up / down MLP projections (~⅔ of model parameters, dominating decode memory traffic). Attention weights, embeddings, layer norms stay uncompressed.
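
The exponent-redundancy claim is easy to check empirically. A sketch on synthetic Gaussian weights — a stand-in for a real checkpoint; the σ = 0.02 and sample count are arbitrary choices, not from the source:

```python
import math
import random
import struct
from collections import Counter

def bf16_fields(x: float):
    """Split a float's BF16 encoding (the top 16 bits of its float32
    representation) into sign (1 bit), exponent (8 bits), mantissa (7 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

# Synthetic Gaussian weights: the exponent byte clusters tightly while
# sign and mantissa bits stay effectively incompressible.
random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(100_000)]
exponents = Counter(bf16_fields(w)[1] for w in weights)

n = len(weights)
entropy = -sum((c / n) * math.log2(c / n) for c in exponents.values())
top16 = sum(c for _, c in exponents.most_common(16)) / n
print(f"exponent entropy: {entropy:.2f} bits of 8; top-16 coverage: {top16:.2%}")
```

On real LLM weights the source reports the same shape: a ~2.6-bit entropy floor against the 8 bits BF16 allocates, with the top 16 exponent values covering >99 % of the mass.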

Four execution pipelines

The runtime picks between four paths per (weight matrix, batch size):

Pipeline          | Preprocess step                         | Matmul kernel  | Preprocess HBM writes | Matmul complexity
Full decode       | Huffman → full BF16 in HBM              | stock cuBLAS   | largest               | simplest
Exponent-only     | Huffman → BF16 exponent in HBM          | reconstructive | ½ of full             | medium
Palette transcode | transcode to 4-bit palette index in HBM | reconstructive | ¼ of full             | medium
Direct palette    | none (4-bit palette pre-baked at load)  | reconstructive | 0                     | heaviest

Trade-off: less preprocessing → less data written to HBM → bus freed sooner, but more reconstruction work falls on the matmul. At small batches cuBLAS's low overhead + tiny matmul favour full decode. At large batches (256+ tokens) reconstruction overlaps with the big matmul and palette / exponent pipelines pull ahead. Different matrices within the same layer (gate / up / down have different shapes) can favour different pipelines. "There's no single best way." (patterns/autotuned-execution-pipeline-selection)
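
The selection logic amounts to brute-force benchmarking with memoization. A minimal sketch — pipeline names and the toy cost models are hypothetical stand-ins for the real preprocess + matmul paths:

```python
import time
from typing import Callable, Dict, Tuple

def autotune(pipelines: Dict[str, Callable[[int], object]],
             matrices, batch_sizes, reps: int = 3):
    """Benchmark every pipeline for every (matrix, batch) pair and memoize
    the winner — mirroring 'no single best way': each weight matrix and
    batch size gets its own independent choice."""
    table: Dict[Tuple[str, int], str] = {}
    for m in matrices:
        for b in batch_sizes:
            timings = {}
            for name, run in pipelines.items():
                t0 = time.perf_counter()
                for _ in range(reps):
                    run(b)                      # stand-in for preprocess + matmul
                timings[name] = time.perf_counter() - t0
            table[(m, b)] = min(timings, key=timings.get)
    return table

pipelines = {
    # Toy cost models: full decode is cheap at small batch but its HBM
    # writes scale with work; direct palette has a heavy matmul whose
    # reconstruction amortizes over the batch.
    "full_decode":    lambda b: sum(range(200 * b)),
    "direct_palette": lambda b: sum(range(50_000 // b + 50 * b)),
}
choice = autotune(pipelines, matrices=["gate"], batch_sizes=[1, 1024])
print(choice)
```

With these invented costs the winner flips between batch 1 and batch 1024, which is the crossover behaviour the section describes.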

Reconstructive matmul kernel

Three of the four pipelines use a custom Hopper-WGMMA kernel:

  1. Load compressed data from HBM to SMEM (via TMA).
  2. Reconstruct BF16 in SMEM.
  3. Feed the tensor cores (WGMMA) directly — the reconstructed weights never exist in main memory. (concepts/fused-decompression-matmul, concepts/hbm-vs-smem)
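
The reconstruction step is plain bit surgery. A sketch of the BF16 reassembly the consumer side performs — done here on the CPU with Python's struct; the real kernel does the equivalent in SMEM:

```python
import struct

def reconstruct_bf16(sign_mant: int, exponent: int) -> float:
    """Reassemble a BF16 weight from its separately stored pieces: the
    sign+mantissa byte (1 sign bit + 7 mantissa bits) and the decoded
    8-bit exponent."""
    sign = (sign_mant >> 7) & 1
    mant = sign_mant & 0x7F
    bits16 = (sign << 15) | (exponent << 7) | mant
    # Widen BF16 to float32 by padding 16 zero bits.
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

def split_bf16(x: float):
    """Inverse: truncate a float to BF16 and split it into the
    sign+mantissa byte and the exponent byte."""
    bits16 = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    sign_mant = ((bits16 >> 15) << 7) | (bits16 & 0x7F)
    return sign_mant, (bits16 >> 7) & 0xFF

sm, e = split_bf16(0.15625)          # exactly representable in BF16
assert reconstruct_bf16(sm, e) == 0.15625
```

Because the split/reassemble round trip is pure bit manipulation, losslessness holds by construction — no rounding ever occurs.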

Thread groups inside the kernel split into:

  • Producer group — drives HBM→SMEM transfer via TMA into a circular buffer; stages sign+mantissa bytes, exponent data (or palette indices), and verbatim-row data for rare-exponent rows.
  • Consumer groups — reconstruct BF16 from exponent-plus-sign-plus-mantissa and feed WGMMA.

Variants differ in output tile width (wider tiles reuse data better at large batch) and circular-buffer depth (deeper buffers hide memory latency at small batch). Autotuner picks the variant. (patterns/sm-partitioning-producer-consumer)

SM-level shared-memory contention

On Hopper each SM has 228 KB of SMEM. The reconstructive matmul needs ~227 KB for its pipeline buffers and accumulator tiles; a Huffman decode kernel needs ~16 KB for its lookup table. Since 227 + 16 > 228, the two kernels cannot share an SM: every SM assigned to decoding is one fewer SM available to the matmul. The split is a tunable parameter the autotuner measures empirically.
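
The budget arithmetic and the resulting SM split can be sketched as follows — the SM count is the H100 SXM5 figure, but the cost model is an invented stand-in for the empirical measurement the autotuner actually runs:

```python
TOTAL_SMS = 132          # H100 SXM5 SM count
SMEM_PER_SM_KB = 228

# The two kernels cannot co-reside on one SM: the matmul's ~227 KB plus
# the decoder's ~16 KB exceeds the 228 KB budget.
assert 227 + 16 > SMEM_PER_SM_KB

def best_split(measure, total=TOTAL_SMS):
    """Empirically pick how many SMs to give the decode kernel;
    `measure(decode_sms)` is a hypothetical stand-in for timing the
    real layer at that split."""
    return min(range(1, total), key=measure)

# Invented cost model: decode time shrinks as it gets more SMs, matmul
# time grows as SMs are taken from it; they run concurrently, so the
# layer takes the max of the two.
cost = lambda d: max(4000 / d, 900 / (TOTAL_SMS - d))
print("decode SMs:", best_split(cost))
```

The real autotuner replaces the lambda with measured kernel timings; the search structure is the same.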

Easy / hard layer alternation

Layers classify as:

  • Hard — requires Huffman preprocessing at runtime.
  • Easy — pre-transcoded palette at model load; matmul consumes directly.

The runtime alternates: while the GPU computes an easy layer (no preprocess needed), a separate CUDA stream decodes the next hard layer's weights. Double-buffered preprocess slots prevent overwrite while a slot's output is still being consumed. The down projection benefits most — consumed last in the MLP sequence (after gate, activation, up), so its decode has the longest runway.
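
The alternation can be sketched as a toy scheduler — the event log, the slot count of two, and the layer labels are illustrative, not the runtime's actual API:

```python
from collections import deque

def schedule(layers):
    """Toy model of the easy/hard alternation: while the GPU computes an
    easy layer, a side stream Huffman-decodes the next hard layer into
    one of two preprocess slots (double buffering). Returns an event log."""
    slots = deque(maxlen=2)          # double-buffered preprocess outputs
    hard = deque(i for i, kind in enumerate(layers) if kind == "hard")
    log = []
    for i, kind in enumerate(layers):
        if kind == "easy":
            # Easy layer needs no preprocess: overlap the next hard
            # layer's decode on the side stream if a slot is free.
            if hard and len(slots) < slots.maxlen:
                j = hard.popleft()
                slots.append(j)
                log.append(f"decode L{j} || compute L{i}")
            else:
                log.append(f"compute L{i}")
        else:
            if i in slots:
                slots.remove(i)      # slot freed once its output is consumed
                log.append(f"compute L{i} (pre-decoded)")
            else:
                log.append(f"decode+compute L{i} (stall)")
    return log

print(schedule(["easy", "hard", "easy", "hard", "easy", "hard"]))
```

With alternating easy/hard layers every hard decode hides behind an easy compute and no stall appears — the steady state the runtime aims for, with the down projection getting the longest runway.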

One format, two uses

The same Huffman-compressed bundle serves:

  • Distribution — maximum compression (~22 % model-size reduction), reduces transfer time shipping models across the Cloudflare network.
  • Inference — Huffman-encoded projections transcoded to the palette intermediate format at model load, enabling the most efficient runtime execution without constraining the distribution format.

The runtime still selects per-projection per-batch-size on the fly.

Measured results (Llama-3.1-8B, H100 SXM5)

  • Model footprint reduction: ~13 % (inference, gate/up only) / ~22 % (distribution, all MLP).
  • Absolute saved: ~3 GB per model instance. Extrapolated ~18–28 GB on Llama-70B depending on configuration.
  • Throughput overhead: ~41 % at batch 1, narrowing to ~30 % at batch 1024.
  • Bit-exact lossless by construction.

Three known sources of the overhead, all under active optimization:

  1. Small-batch fixed costs in the reconstructive matmul.
  2. Redundant weight-tile reconstruction at large batch sizes.
  3. Down projection still uncompressed (~⅓ of compressible weights).

Positioning vs. prior work

  • ZipNN: distribution + storage only; CPU decode.
  • Huff-LLM: proposes custom FPGA decode hardware.
  • ZipServ: fuses decompression with GPU inference but targets consumer-grade GPUs, "which don't work with our H100 GPUs".
  • Unweight: lossless inference-time decompression on Hopper datacenter GPUs, Rust-inference-engine integrable, four adaptive pipelines per batch size, ships open-source kernels + a technical paper.

Roadmap

  • Down projection compression — different kernel variant due to transposed dimensions; closes the last ⅓ of compressible weights.
  • Kernel optimization — small-batch fixed costs; redundant weight-tile reconstruction at large batches.
  • More models — Llama-3.1-8B → larger Workers-AI-served models; SwiGLU-architecture exponent statistics consistent across scales.
  • Mixture-of-Experts — where cold experts are fetched on demand, reduced storage would reduce fetch cost further.

Integration

  • Runs on Workers AI as a VRAM-reduction lever beside Infire's activation-memory discipline. Unweight attacks weights; Infire attacks activations. Savings are additive into KV-cache headroom.
  • Target hardware: NVIDIA H100 (Hopper) only at launch, using wgmma + TMA.
  • Open-source kernels: systems/unweight-kernels.
