CONCEPT Cited by 1 source

Fused decompression + matmul

Definition

Fused decompression + matmul is the GPU-kernel pattern in which compressed weights are loaded from HBM into shared memory, reconstructed inside SMEM, and fed directly to the tensor cores (WGMMA on Hopper) without ever materialising the uncompressed representation in main memory. The compressed bytes that cross the HBM→SMEM bus are all the weight-bandwidth cost you pay. (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
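The reconstruction step is ultimately bit reassembly: a BF16 value is 1 sign bit, 8 exponent bits, and 7 mantissa bits (the top half of an IEEE-754 float32). A minimal Python sketch of that reassembly — the standard BF16 layout is assumed here; the source does not spell out the field encoding:

```python
import struct

def bf16_from_parts(sign: int, exponent: int, mantissa: int) -> float:
    """Reassemble a BF16 value (1 sign / 8 exponent / 7 mantissa bits)
    from separately stored fields, then widen to float32 for inspection.
    BF16 is the top 16 bits of an IEEE-754 float32."""
    bits16 = (sign << 15) | (exponent << 7) | mantissa
    # Widen BF16 -> float32 by appending 16 zero bits.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# 1.0 in BF16: sign=0, biased exponent=127, mantissa=0
print(bf16_from_parts(0, 127, 0))            # 1.0
# -1.5: sign=1, exponent=127, mantissa=0b1000000 (fraction 0.5)
print(bf16_from_parts(1, 127, 0b1000000))    # -1.5
```

In the kernel this is a handful of shifts and ORs per weight, which is why the reconstruction can keep pace with the tensor cores.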

Why it's the correct pattern

The naive alternative — decompress to HBM, then run a standard matmul — helps with storage capacity (fit more models on the GPU) but does not help with bandwidth: every token still pulls the full uncompressed matrix across the bus to reach the tensor cores. Under bandwidth-bound decode on Hopper, only the bytes that cross the bus count. Fused decompression puts the reconstruction after the bus transfer, inside the tiny fast SMEM tier, so the bus only ever carries compressed bytes.
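The bandwidth argument reduces to simple arithmetic. A hypothetical sketch — the model size and H100 bandwidth figure below are illustrative assumptions; only the 22% compression ratio comes from the source headline:

```python
# Illustrative decode-bandwidth arithmetic. All numbers except the 22%
# compression figure (from the source headline) are assumptions.
params = 7e9                 # hypothetical 7B-parameter model
bytes_per_weight = 2         # BF16
compression = 0.22           # 22% fewer bytes after compression

uncompressed = params * bytes_per_weight      # bytes per token, naive path
fused = uncompressed * (1 - compression)      # bytes per token, fused kernel

hbm_bw = 3.35e12             # H100 SXM HBM3, roughly 3.35 TB/s
print(f"naive : {uncompressed / hbm_bw * 1e3:.2f} ms/token floor")
print(f"fused : {fused / hbm_bw * 1e3:.2f} ms/token floor")
# Decompress-to-HBM pays the uncompressed cost anyway: the matmul still
# reads the full-size matrix back, so only the fused path lowers the floor.
```

Since every decode step reads every weight once, the per-token latency floor scales directly with the bytes that cross the bus.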

Kernel shape

The Hopper realisation in Unweight splits the kernel's thread groups into two roles:

  • Producer — drives HBM→SMEM transfer via TMA into a circular buffer; stages sign+mantissa bytes + exponent palette (or palette indices) + verbatim-row data for rare-exponent rows. Runs ahead of the consumer so data is ready when needed.
  • Consumer — reconstructs BF16 from exponent + sign + mantissa and immediately feeds WGMMA — "the reconstructed weights go straight from assembly to computation without leaving shared memory."

See patterns/sm-partitioning-producer-consumer for the generalised pattern.
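The staging discipline can be sketched on the CPU. This is illustrative only — the real pattern runs as warp groups inside one CUDA kernel, with TMA transfers and WGMMA — and the "compression" here is per-tile int8 quantization, a stand-in for Unweight's sign/exponent/mantissa layout. The bounded queue plays the role of the circular SMEM buffer: the producer runs ahead and blocks when every stage is occupied.

```python
import threading
import queue
import numpy as np

STAGES = 4                                  # circular-buffer depth
buf = queue.Queue(maxsize=STAGES)

def producer(tiles):
    for t in tiles:                         # HBM -> SMEM transfers
        buf.put(t)                          # blocks if all stages are in use
    buf.put(None)                           # end-of-stream marker

def consumer(x, out):
    while (t := buf.get()) is not None:
        q, scale, c0 = t
        w = q.astype(np.float32) * scale    # reconstruct weights in "SMEM"
        out += w @ x[c0:c0 + q.shape[1]]    # feed the matmul immediately

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)

# "Compress" W into column tiles: (int8 values, per-tile scale, start column).
tiles = []
for c in range(0, 16, 4):
    block = W[:, c:c + 4]
    scale = float(np.abs(block).max()) / 127
    tiles.append((np.round(block / scale).astype(np.int8), scale, c))

out = np.zeros(8, dtype=np.float32)
p = threading.Thread(target=producer, args=(tiles,))
s = threading.Thread(target=consumer, args=(x, out))
p.start(); s.start(); p.join(); s.join()
print(float(np.abs(out - W @ x).max()))     # small int8 quantization error
```

The reconstructed tile `w` never outlives the loop iteration — the analogue of the reconstructed weights never leaving shared memory.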

Relationship to hardware-native quantization

patterns/hardware-native-quantization (MXFP / the block_scale modifier on tcgen05.mma / mma.sync) pushes the dequant step into the MMA instruction itself — the hardware consumes compressed operands directly. That is the lossy sibling of fused decompression + matmul; both share the principle that the uncompressed representation never materialises in HBM, but hardware-native quantization goes one step further (no reconstruction-in-SMEM step at all). Unweight's pattern is software-only: a pure CUDA kernel on a stock H100, with no special hardware support.
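For contrast, the dequant that block_scale folds into the MMA is simple in software terms. Per the OCP Microscaling (MX) formats, a block of 32 narrow-precision elements shares one 8-bit power-of-two scale (E8M0, bias 127); a sketch of that step, with the FP4/FP8 element decoding elided:

```python
import numpy as np

# Software sketch of the dequantization the block_scale MMA modifier applies
# in hardware for MX formats: 32 elements share one E8M0 power-of-two scale.
# (FP4/FP8 element decoding is elided; elements are shown as small floats.)

def mx_dequant(elements: np.ndarray, scale_e8m0: int) -> np.ndarray:
    """elements: a 32-element block; scale_e8m0: biased exponent (bias 127)."""
    assert elements.shape == (32,)
    scale = 2.0 ** (scale_e8m0 - 127)   # E8M0: exponent only, no mantissa
    return elements.astype(np.float32) * scale

block = np.array([1.0, -0.5] * 16, dtype=np.float32)
print(mx_dequant(block, 130)[:2])       # elements scaled by 2**3 = 8
```

Unweight's consumer warps do an analogous per-value reconstruction in software, which is what lets it stay lossless.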

The constraint that makes this hard

On Hopper each SM has 228 KB of shared memory. The reconstructive matmul needs ~227 KB for its pipeline buffer + accumulator tiles. A Huffman decode kernel needs ~16 KB for its lookup table. Since 227 + 16 > 228, decode and matmul cannot share the same SM — they compete for SMs at the GPU-wide level. This is why fused decompression for Huffman-compressed weights can't be a single kernel; it has to be two kernels on separate SMs with a tunable split.
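Because the two kernels run on disjoint SMs, the split is a balancing problem: give decode enough SMs that its output rate matches the matmul's consumption rate. A sketch of that arithmetic — the source says only that the split is tunable, so the per-SM rates below are invented for illustration:

```python
# Balancing the decode/matmul SM split. The per-SM rates are assumptions,
# not measured numbers; only the SM count is a hardware fact.
TOTAL_SMS = 132          # SM count on an H100 SXM
decode_rate = 4.0        # GB/s of weights a decode SM produces (assumed)
matmul_rate = 12.0       # GB/s of weights a matmul SM consumes (assumed)

# Steady state wants production to match consumption:
#   decode_sms * decode_rate == matmul_sms * matmul_rate
#   decode_sms + matmul_sms == TOTAL_SMS
decode_sms = round(TOTAL_SMS * matmul_rate / (decode_rate + matmul_rate))
matmul_sms = TOTAL_SMS - decode_sms
print(decode_sms, matmul_sms)   # 99 33
```

In practice the right split depends on the measured rates, which is exactly why it is exposed as a tunable.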
