PATTERN

Fused decompress + tensor-core matmul

Statement

When weight compression buys memory-bandwidth headroom on a bandwidth-bound inference workload, do not decompress the weights back to HBM before matmul. Instead, load compressed bytes from HBM to SMEM, reconstruct the uncompressed form inside SMEM, and feed the tensor cores directly. The reconstructed weights must never touch main memory.

When it applies

  • LLM (or any large-matmul) inference that is memory-bandwidth-bound on GPU (single-digit-batch decode is the canonical case on Hopper-class GPUs).
  • A lossless or lossy weight-compression scheme exists for the model in question (e.g. Huffman coding on BF16 exponents in Unweight; MXFP micro-scaling for hardware-native quantization).
  • The kernel author controls the full compute pipeline (custom CUDA) and can integrate reconstruction directly into the matmul kernel. Requires access to SMEM management, tensor-core instructions, and async HBM→SMEM transfer primitives (TMA on Hopper).

Problem

Naive decompression runs as a separate preprocessing kernel: HBM → kernel A → HBM (uncompressed) → kernel B (stock cuBLAS matmul) → HBM. The uncompressed weights traverse the bus twice, once written out and once read back in, which erases the bandwidth benefit of compression.

From Cloudflare's framing: "Most prior work decompresses entire weight matrices back into HBM, then runs a standard matrix multiplication. This helps with storage capacity but doesn't help with bandwidth because you still read the full uncompressed matrix from HBM for every token."

Solution

Reconstruct inside the tiny-but-fast SMEM tier. Compressed bytes cross the bus once. The kernel splits into two cooperating thread-group roles:

  • Producer — drives HBM→SMEM transfers via TMA into a circular buffer. Stages the compressed representation + any verbatim-row escapes ahead of the consumer.
  • Consumer — reconstructs BF16 from compressed fields (e.g. sign+mantissa bytes combined with decoded exponents) in SMEM and immediately feeds tensor cores (WGMMA on Hopper).

See patterns/sm-partitioning-producer-consumer for the generalised producer/consumer thread-group split.
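
As a rough illustration of the hand-off protocol (not Unweight's actual implementation), the producer/consumer split over a circular buffer can be sketched single-threaded in plain C++; `Ring`, `kDepth`, and the byte vectors standing in for SMEM stages are all hypothetical names:

```cpp
#include <array>
#include <cstdint>
#include <utility>
#include <vector>

constexpr int kDepth = 4;  // circular-buffer depth (itself an autotunable knob)

// Byte vectors stand in for SMEM stages holding compressed tiles.
struct Ring {
    std::array<std::vector<uint8_t>, kDepth> slots;
    int head = 0, tail = 0;  // monotonically increasing stage counters

    bool full() const { return head - tail == kDepth; }
    bool empty() const { return head == tail; }

    // Producer role: stage one compressed tile ahead of the consumer
    // (in the real kernel this is an async TMA copy from HBM).
    void produce(std::vector<uint8_t> tile) {
        slots[head++ % kDepth] = std::move(tile);
    }

    // Consumer role: take the oldest staged tile for in-SMEM
    // reconstruction (here we just hand the bytes back).
    std::vector<uint8_t> consume() {
        return std::move(slots[tail++ % kDepth]);
    }
};
```

The point of the depth-`kDepth` ring is that the producer can run up to `kDepth` tiles ahead, hiding HBM latency behind the consumer's reconstruction and matmul work.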

Autotunable knobs

  • Output tile width — wider tiles reuse data better at large batch sizes; narrower tiles do better at small.
  • Circular buffer depth — deeper hides memory latency at small batch; shallower frees SMEM for accumulator tiles at large batch.
  • Which compressed representation to consume — full BF16, exponent-only, palette-4-bit; each shifts the preprocessing-vs-matmul-work trade-off. See patterns/autotuned-execution-pipeline-selection.
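
A minimal sketch of what a sweep over these three knobs might look like; the candidate values, the `Config` struct, and the toy cost model inside `benchmark` are all invented for illustration (a real tuner times the kernel on hardware):

```cpp
#include <string>

// Hypothetical names throughout: Config, benchmark, autotune.
struct Config {
    int tile_width;
    int buffer_depth;
    std::string repr;  // "bf16", "exponent_only", or "palette4"
};

// Stand-in for actually timing the fused kernel under config `c`;
// this toy cost model only exists to make the sweep runnable.
double benchmark(const Config& c) {
    double cost = 1000.0 / c.tile_width + 5.0 * c.buffer_depth;
    cost += (c.repr == "palette4") ? 10.0 : 20.0;
    return cost;
}

// Exhaustive sweep: the knob space is small enough to enumerate.
Config autotune() {
    Config best{};
    double best_time = 1e30;
    for (int w : {64, 128, 256})
        for (int d : {2, 3, 4})
            for (const char* r : {"bf16", "exponent_only", "palette4"}) {
                Config c{w, d, r};
                double t = benchmark(c);
                if (t < best_time) { best_time = t; best = c; }
            }
    return best;
}
```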

Hopper-specific implementation

Unweight's reconstructive matmul uses:

  • WGMMA (warp-group matrix-multiply-accumulate) instructions for the tensor-core feed.
  • TMA (Tensor Memory Accelerator) for async HBM→SMEM.
  • Circular buffer in SMEM shared between producer and consumer.
  • Consumer reconstructs the exponent byte in SMEM and combines it with the sign+mantissa bytes to produce BF16 directly.
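
The last step is plain bit manipulation. The exact packing is an assumption (the post doesn't spell it out), but BF16 is 1 sign bit, 8 exponent bits, and 7 mantissa bits, so the sign and mantissa fit in one byte and the decoded exponent supplies the other:

```cpp
#include <cstdint>

// BF16 layout: bit [15] sign, bits [14:7] exponent, bits [6:0] mantissa.
// `sign_mant` packs the sign in its top bit and the mantissa in its low 7
// (a hypothetical packing, chosen for illustration).
inline uint16_t reconstruct_bf16(uint8_t sign_mant, uint8_t exponent) {
    uint16_t sign     = (sign_mant >> 7) & 0x1;
    uint16_t mantissa = sign_mant & 0x7F;
    return uint16_t((sign << 15) | (uint16_t(exponent) << 7) | mantissa);
}
```

For example, exponent byte 0x7F with a zero sign+mantissa byte yields 0x3F80, the BF16 encoding of 1.0.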

Hardware-native cousin

patterns/hardware-native-quantization (MXFP + block_scale modifier on tcgen05.mma / mma.sync) goes a step further: the MMA instruction itself consumes compressed operands, so even the reconstruction-in-SMEM step disappears. The Hopper-era software equivalent is this fused-decompress + matmul pattern. Blackwell sm_100 / sm_120 support MXFP natively, so hardware-native is the trajectory if bit-exactness isn't required.

Canonical wiki instance

Unweight (2026-04-17) — the reconstructive matmul kernel is the central architectural artifact of the post. Kernels open-sourced as systems/unweight-kernels. ~30 % fewer bytes cross the bus for MLP weight matrices on Llama-3.1-8B.

Trade-offs

  • Engineering complexity — two-role thread group split, SMEM budget management, TMA coordination, circular-buffer double-buffering. Not trivial.
  • Hopper-only at launch — Blackwell sm_100 and sm_120 have different MMA instructions; kernels compiled for one architecture are not portable to the other.
  • Can't share an SM with a co-running decode kernel if the SMEM budget collides — Unweight's 227 + 16 > 228 KB collision means Huffman decode and reconstructive matmul sit on different SMs, with the split autotuned.
  • Measured throughput overhead still significant — 30–40 % end-to-end on H100 SXM5 at Unweight's current optimization level, even with the pattern applied correctly.
