CONCEPT Cited by 1 source

Fused decompression + matmul

Definition

Fused decompression + matmul is the GPU-kernel pattern in which compressed weights are loaded from HBM into shared memory, reconstructed inside SMEM, and fed directly to the tensor cores (WGMMA on Hopper) without ever materialising the uncompressed representation in main memory. The compressed bytes that cross the HBM→SMEM bus are all the weight-bandwidth cost you pay. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
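The reconstruction step is ultimately bit reassembly: a BF16 value is 1 sign bit, 8 exponent bits, and 7 mantissa bits (the top half of an IEEE-754 float32). A minimal Python sketch of that reassembly — the standard BF16 layout is assumed here; the source does not spell out the field encoding:

```python
import struct

def bf16_from_parts(sign: int, exponent: int, mantissa: int) -> float:
    """Reassemble a BF16 value (1 sign / 8 exponent / 7 mantissa bits)
    from separately stored fields, then widen to float32 for inspection.
    BF16 is the top 16 bits of an IEEE-754 float32."""
    bits16 = (sign << 15) | (exponent << 7) | mantissa
    # Widen BF16 -> float32 by appending 16 zero bits.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# 1.0 in BF16: sign=0, biased exponent=127, mantissa=0
print(bf16_from_parts(0, 127, 0))            # 1.0
# -1.5: sign=1, exponent=127, mantissa=0b1000000 (fraction 0.5)
print(bf16_from_parts(1, 127, 0b1000000))    # -1.5
```

In the kernel this is a handful of shifts and ORs per weight, which is why the reconstruction can keep pace with the tensor cores.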

Why it's the correct pattern

The naive alternative — decompress to HBM, then run a standard matmul — helps with storage capacity (fit more models on the GPU) but does not help with bandwidth: every token still pulls the full uncompressed matrix across the bus to reach the tensor cores. Under bandwidth-bound decode on Hopper, only the bytes that cross the bus count. Fused decompression puts the reconstruction after the bus transfer, inside the tiny fast SMEM tier, so the bus only ever carries compressed bytes.
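The bandwidth argument reduces to simple arithmetic. A hypothetical sketch — the model size and H100 bandwidth figure below are illustrative assumptions; only the 22% compression ratio comes from the source headline:

```python
# Illustrative decode-bandwidth arithmetic. All numbers except the 22%
# compression figure (from the source headline) are assumptions.
params = 7e9                 # hypothetical 7B-parameter model
bytes_per_weight = 2         # BF16
compression = 0.22           # 22% fewer bytes after compression

uncompressed = params * bytes_per_weight      # bytes per token, naive path
fused = uncompressed * (1 - compression)      # bytes per token, fused kernel

hbm_bw = 3.35e12             # H100 SXM HBM3, roughly 3.35 TB/s
print(f"naive : {uncompressed / hbm_bw * 1e3:.2f} ms/token floor")
print(f"fused : {fused / hbm_bw * 1e3:.2f} ms/token floor")
# Decompress-to-HBM pays the uncompressed cost anyway: the matmul still
# reads the full-size matrix back, so only the fused path lowers the floor.
```

Since every decode step reads every weight once, the per-token latency floor scales directly with the bytes that cross the bus.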

Kernel shape

The Hopper realisation in Unweight splits the kernel's thread groups into two roles:

  • Producer — drives HBM→SMEM transfer via TMA into a circular buffer; stages sign+mantissa bytes + exponent palette (or palette indices) + verbatim-row data for rare-exponent rows. Runs ahead of the consumer so data is ready when needed.
  • Consumer — reconstructs BF16 from exponent + sign + mantissa and immediately feeds WGMMA — "the reconstructed weights go straight from assembly to computation without leaving shared memory."

See patterns/sm-partitioning-producer-consumer for the generalised pattern.
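The staging discipline can be sketched on the CPU. This is illustrative only — the real pattern runs as warp groups inside one CUDA kernel, with TMA transfers and WGMMA — and the "compression" here is per-tile int8 quantization, a stand-in for Unweight's sign/exponent/mantissa layout. The bounded queue plays the role of the circular SMEM buffer: the producer runs ahead and blocks when every stage is occupied.

```python
import threading
import queue
import numpy as np

STAGES = 4                                  # circular-buffer depth
buf = queue.Queue(maxsize=STAGES)

def producer(tiles):
    for t in tiles:                         # HBM -> SMEM transfers
        buf.put(t)                          # blocks if all stages are in use
    buf.put(None)                           # end-of-stream marker

def consumer(x, out):
    while (t := buf.get()) is not None:
        q, scale, c0 = t
        w = q.astype(np.float32) * scale    # reconstruct weights in "SMEM"
        out += w @ x[c0:c0 + q.shape[1]]    # feed the matmul immediately

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)

# "Compress" W into column tiles: (int8 values, per-tile scale, start column).
tiles = []
for c in range(0, 16, 4):
    block = W[:, c:c + 4]
    scale = float(np.abs(block).max()) / 127
    tiles.append((np.round(block / scale).astype(np.int8), scale, c))

out = np.zeros(8, dtype=np.float32)
p = threading.Thread(target=producer, args=(tiles,))
s = threading.Thread(target=consumer, args=(x, out))
p.start(); s.start(); p.join(); s.join()
print(float(np.abs(out - W @ x).max()))     # small int8 quantization error
```

The reconstructed tile `w` never outlives the loop iteration — the analogue of the reconstructed weights never leaving shared memory.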

Relationship to hardware-native quantization

patterns/hardware-native-quantization (MXFP / the block_scale modifier on tcgen05.mma / mma.sync) pushes the dequant step into the MMA instruction itself — the hardware consumes compressed operands directly. That is the lossy sibling of fused decompression + matmul; both share the principle that the uncompressed representation never materialises in HBM, but hardware-native quantization goes one step further (no reconstruction-in-SMEM step at all). Unweight's pattern is software-only: a pure CUDA kernel on a stock H100, with no special hardware support.
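For contrast, the dequant that block_scale folds into the MMA is simple in software terms. Per the OCP Microscaling (MX) formats, a block of 32 narrow-precision elements shares one 8-bit power-of-two scale (E8M0, bias 127); a sketch of that step, with the FP4/FP8 element decoding elided:

```python
import numpy as np

# Software sketch of the dequantization the block_scale MMA modifier applies
# in hardware for MX formats: 32 elements share one E8M0 power-of-two scale.
# (FP4/FP8 element decoding is elided; elements are shown as small floats.)

def mx_dequant(elements: np.ndarray, scale_e8m0: int) -> np.ndarray:
    """elements: a 32-element block; scale_e8m0: biased exponent (bias 127)."""
    assert elements.shape == (32,)
    scale = 2.0 ** (scale_e8m0 - 127)   # E8M0: exponent only, no mantissa
    return elements.astype(np.float32) * scale

block = np.array([1.0, -0.5] * 16, dtype=np.float32)
print(mx_dequant(block, 130)[:2])       # elements scaled by 2**3 = 8
```

Unweight's consumer warps do an analogous per-value reconstruction in software, which is what lets it stay lossless.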

The constraint that makes this hard

On Hopper each SM has 228 KB of shared memory. The reconstructive matmul needs ~227 KB for its pipeline buffer + accumulator tiles. A Huffman decode kernel needs ~16 KB for its lookup table. Since 227 + 16 > 228, decode and matmul cannot share the same SM — they compete for SMs at the GPU-wide level. This is why fused decompression for Huffman-compressed weights can't be a single kernel; it has to be two kernels on separate SMs with a tunable split.
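Because the two kernels run on disjoint SMs, the split is a balancing problem: give decode enough SMs that its output rate matches the matmul's consumption rate. A sketch of that arithmetic — the source says only that the split is tunable, so the per-SM rates below are invented for illustration:

```python
# Balancing the decode/matmul SM split. The per-SM rates are assumptions,
# not measured numbers; only the SM count is a hardware fact.
TOTAL_SMS = 132          # SM count on an H100 SXM
decode_rate = 4.0        # GB/s of weights a decode SM produces (assumed)
matmul_rate = 12.0       # GB/s of weights a matmul SM consumes (assumed)

# Steady state wants production to match consumption:
#   decode_sms * decode_rate == matmul_sms * matmul_rate
#   decode_sms + matmul_sms == TOTAL_SMS
decode_sms = round(TOTAL_SMS * matmul_rate / (decode_rate + matmul_rate))
matmul_sms = TOTAL_SMS - decode_sms
print(decode_sms, matmul_sms)   # 99 33
```

In practice the right split depends on the measured rates, which is exactly why it is exposed as a tunable.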
