PATTERN
SM partitioning (producer / consumer thread groups)
Statement
Inside a single GPU kernel, split thread groups into dedicated producer roles (HBM→SMEM transfer via TMA into a circular buffer) and dedicated consumer roles (compute-from-SMEM feeding tensor cores). Tune the circular-buffer depth and the output-tile width empirically against measured throughput — the right values are workload-dependent.
When it applies
- Custom CUDA kernel where every cycle of tensor-core feeding matters.
- Workload benefits from overlapping HBM→SMEM transfer with compute on already-transferred data — i.e. any memory-bandwidth-bound matmul at sufficient size.
- Target hardware has dedicated async-copy primitives (Hopper TMA; earlier NVIDIA generations have weaker equivalents).
Problem
Single-role thread groups force a choice between:
- Compute-focused — all threads feed tensor cores; async copy runs on some threads as a side-effect, competing for registers + warp scheduler attention, and transfer parallelism is limited.
- Transfer-focused — all threads drive TMA; tensor cores starve when data is in-flight.
Neither fully overlaps HBM transfer with SMEM compute; tensor cores end up with idle cycles waiting on data.
Solution
Two dedicated roles within one kernel, cooperating through a circular buffer in SMEM:
- Producer group — drives HBM→SMEM transfers via TMA into the circular buffer. Runs ahead of the consumer so data is ready before it's needed. In fused-decompression kernels: stages compressed representations (sign+mantissa bytes + exponent data or palette indices + verbatim-row escapes).
- Consumer groups — compute from SMEM, feed tensor cores via MMA instructions (wgmma on Hopper). In fused-decompression: reconstruct the uncompressed representation in-register/SMEM and immediately pass it to WGMMA.
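The role split and the circular-buffer handshake can be sketched with a CPU analogy: a fixed-depth ring buffer shared by one producer thread (standing in for the TMA-driving group) and one consumer thread (standing in for the compute groups). The RingBuffer class and the doubling "compute" step are illustrative stand-ins, not the kernel's actual code:

```python
import threading
from collections import deque

class RingBuffer:
    """Fixed-depth circular buffer guarded by a condition variable —
    a CPU stand-in for the SMEM pipeline stages the producer fills."""
    def __init__(self, depth):
        self.depth = depth
        self.slots = deque()
        self.cv = threading.Condition()

    def put(self, item):
        # Producer blocks while every slot is full (SMEM backpressure).
        with self.cv:
            self.cv.wait_for(lambda: len(self.slots) < self.depth)
            self.slots.append(item)
            self.cv.notify_all()

    def get(self):
        # Consumer blocks while no slot is ready (starvation stall).
        with self.cv:
            self.cv.wait_for(lambda: len(self.slots) > 0)
            item = self.slots.popleft()
            self.cv.notify_all()
            return item

def producer(buf, tiles):
    for t in tiles:          # stands in for TMA HBM->SMEM copies
        buf.put(t)
    buf.put(None)            # sentinel: no more tiles

def consumer(buf, results):
    while (t := buf.get()) is not None:
        results.append(t * 2)  # stands in for reconstruct + feed WGMMA

buf = RingBuffer(depth=4)    # circular-buffer depth is the tunable knob
results = []
p = threading.Thread(target=producer, args=(buf, list(range(8))))
c = threading.Thread(target=consumer, args=(buf, results))
p.start(); c.start(); p.join(); c.join()
print(results)               # tiles 0..7, each "computed" into its double
```

The producer runs ahead until the buffer is full, exactly the lead the pattern calls for; shrinking `depth` shows the starvation failure mode in miniature.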
Autotunable knobs
Two knobs whose optimal values depend on workload:
- Circular-buffer depth — deeper buffers hide memory latency (useful at small batch where there's less compute to overlap with); shallower buffers free SMEM for accumulator tiles (useful at large batch where more accumulators are needed).
- Output tile width — wider tiles reuse data better at large batch sizes; narrower better at small.
The kernel ships in multiple variants along these two axes; the autotuner picks the best variant per workload.
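The variant selection can be sketched as a grid search over the two knobs. `run_variant` and its toy cost model are hypothetical stand-ins for launching a pre-built kernel variant and measuring its throughput; only the shape of the search reflects the pattern:

```python
import itertools

def run_variant(depth, tile_width, workload):
    """Hypothetical stand-in for launching one pre-built kernel variant
    and measuring its throughput on the given workload."""
    batch = workload["batch"]
    # Toy cost model mirroring the section's claims: deep buffers help
    # at small batch, wide tiles help at large batch.
    latency = 1.0 / depth if batch < 64 else depth / 8.0
    reuse = tile_width / 256 if batch >= 64 else 64 / tile_width
    return reuse - latency   # higher is better (throughput proxy)

def autotune(workload, depths=(2, 4, 8), tile_widths=(64, 128, 256)):
    """Measure every shipped (depth, tile_width) variant; keep the best."""
    return max(itertools.product(depths, tile_widths),
               key=lambda v: run_variant(*v, workload))

print(autotune({"batch": 8}))    # small batch -> deep buffer, narrow tile
print(autotune({"batch": 512}))  # large batch -> shallow buffer, wide tile
```

In a real deployment `run_variant` would launch the compiled kernel and time it, and the chosen variant would be cached per (projection, batch size).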
Why this is harder than it looks
- SMEM budget collisions across kernels. On Hopper each SM has 228 KB. The reconstructive matmul already needs ~227 KB for its pipeline buffer + accumulator tiles. Co-running a decode kernel (~16 KB SMEM for its lookup table) on the same SM would need 243 KB. Because 227 + 16 > 228, the two kernels cannot share an SM — they compete for SMs at the GPU-wide level. Every SM assigned to decode is an SM unavailable to matmul. The split itself becomes a tunable parameter the autotuner measures empirically.
- Producer must run ahead of consumer. Starving the consumer wastes tensor-core cycles (the whole point of the split); over-running eats SMEM budget without benefit. Circular-buffer depth tuning manages the lead.
- Double-buffered preprocess slots. If a separate preprocess kernel is also running (e.g. Huffman decoding the next layer's weights during the current layer's matmul), its output buffer in SMEM must not be overwritten while the matmul is still consuming the previous buffer.
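The SMEM budget arithmetic in the first point is a one-line feasibility check, sketched here with the section's figures (228 KB per SM, ~227 KB matmul, ~16 KB decode); the helper name is illustrative:

```python
# Figures from this section: Hopper gives each SM 228 KB of shared
# memory; the reconstructive matmul uses ~227 KB, and a co-running
# decode kernel would need ~16 KB for its lookup table.
SM_SMEM_KB = 228
MATMUL_SMEM_KB = 227
DECODE_SMEM_KB = 16

def can_coreside(*kernel_smem_kb):
    """True iff all kernels' SMEM requests fit on one SM at once."""
    return sum(kernel_smem_kb) <= SM_SMEM_KB

print(can_coreside(MATMUL_SMEM_KB, DECODE_SMEM_KB))  # False: 243 > 228
print(can_coreside(DECODE_SMEM_KB))                  # True: decode alone fits
```

Because the check fails, the SM split between the two kernels becomes a GPU-wide resource-allocation knob rather than a per-SM scheduling detail.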
Canonical wiki instance
Unweight's reconstructive matmul kernel (2026-04-17) — producer group loads compressed data via TMA into a circular buffer; consumer groups reconstruct BF16 from exponent + sign + mantissa and feed Hopper's WGMMA. Multiple variants across tile width + buffer depth are autotuned per (projection, batch size).
Related hardware context
- Hopper tensor cores — WGMMA instructions consume from SMEM.
- concepts/hbm-vs-smem — the two-tier memory hierarchy that makes this pattern necessary.
- TMA (Tensor Memory Accelerator) — Hopper's async HBM→SMEM hardware that makes the producer role cheap.
Trade-offs
- Complex kernel — warp-group synchronisation, circular-buffer management, SMEM bookkeeping. High engineering cost.
- Hopper-specific — TMA, WGMMA, and the 228 KB SMEM budget are all Hopper specifics. Porting to pre-Hopper or to Blackwell (sm_100/sm_120) requires significant rework.
- Starvation failure mode — if the producer gets delayed, the consumer stalls on empty buffer slots. Tuning the depth against real workloads is load-bearing.
Seen in
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical wiki instance; explicit producer / consumer role description + WGMMA feed + circular buffer + autotuned variants.
Related
- systems/unweight / systems/unweight-kernels — production deployment.
- patterns/fused-decompress-tensor-core-matmul — the higher-level kernel shape this pattern realises.
- patterns/autotuned-execution-pipeline-selection — the pattern that picks which variant of this kernel to run.
- concepts/hbm-vs-smem — hardware context.
- systems/nvidia-tensor-core — the consumer target.