
PATTERN

SM partitioning (producer / consumer thread groups)

Statement

Inside a single GPU kernel, split thread groups into dedicated producer roles (HBM→SMEM transfer via TMA into a circular buffer) and dedicated consumer roles (compute-from-SMEM feeding tensor cores). Tune the circular-buffer depth and the output-tile width empirically against measured throughput — the right values are workload-dependent.

When it applies

  • Custom CUDA kernel where every cycle of tensor-core feeding matters.
  • Workload benefits from overlapping HBM→SMEM transfer with compute on already-transferred data — i.e. any memory-bandwidth-bound matmul at sufficient size.
  • Target hardware has dedicated async-copy primitives (Hopper TMA; earlier NVIDIA generations have weaker equivalents).

Problem

Single-role thread groups force a choice between:

  • Compute-focused — all threads feed tensor cores; async copy runs on some threads as a side effect, competing for registers and warp-scheduler attention, and transfer parallelism is limited.
  • Transfer-focused — all threads drive TMA; tensor cores starve when data is in-flight.

Neither fully overlaps HBM transfer with SMEM compute; tensor cores end up with idle cycles waiting on data.

Solution

Two dedicated roles within one kernel, cooperating through a circular buffer in SMEM:

  • Producer group — drives HBM→SMEM transfers via TMA into the circular buffer. Runs ahead of the consumer so data is ready before it's needed. In fused-decompression kernels: stages compressed representations (sign+mantissa bytes + exponent data or palette indices + verbatim-row escapes).
  • Consumer groups — compute from SMEM, feed tensor cores via MMA instructions (wgmma on Hopper). In fused-decompression: reconstruct the uncompressed representation in-register/SMEM and immediately pass it to WGMMA.
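The overlap the two roles buy can be seen with a toy cost model — a minimal sketch, not real kernel code. The function, cost parameters, and cycle accounting below are all illustrative assumptions: each tile costs `load_cost` cycles to copy HBM→SMEM and `compute_cost` cycles to consume, and the producer may run up to `buffer_depth` tiles ahead of the consumer.

```python
def pipelined_matmul_sim(num_tiles, buffer_depth, load_cost, compute_cost):
    """Simulate producer/consumer overlap through a circular buffer.

    Returns (overlapped_cycles, serial_cycles) under a hypothetical
    cost model: load_cost cycles per HBM->SMEM tile copy,
    compute_cost cycles per tile of tensor-core work.
    """
    # Serial baseline: load then compute, tile by tile, no overlap.
    serial = num_tiles * (load_cost + compute_cost)

    load_done = [0] * num_tiles      # cycle at which tile i lands in SMEM
    compute_done = [0] * num_tiles   # cycle at which tile i is consumed
    for i in range(num_tiles):
        # Producer can refill slot i % buffer_depth only after the tile
        # that previously occupied it (i - buffer_depth) was consumed.
        slot_free = compute_done[i - buffer_depth] if i >= buffer_depth else 0
        prev_load = load_done[i - 1] if i > 0 else 0
        load_done[i] = max(slot_free, prev_load) + load_cost
        # Consumer needs tile i loaded and the previous tile finished.
        prev_compute = compute_done[i - 1] if i > 0 else 0
        compute_done[i] = max(load_done[i], prev_compute) + compute_cost
    return compute_done[-1], serial
```

With depth 1 the slot must drain before it refills, so the schedule degenerates to the serial baseline; any depth ≥ 2 lets loads hide behind the previous tile's compute, which is exactly what the producer group's lead provides.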

Autotunable knobs

Two knobs whose optimal values depend on workload:

  • Circular-buffer depth — deeper buffers hide memory latency (useful at small batch where there's less compute to overlap with); shallower buffers free SMEM for accumulator tiles (useful at large batch where more accumulators are needed).
  • Output tile width — wider tiles reuse data better at large batch sizes; narrower better at small.

The kernel ships in multiple variants along these two axes; the autotuner picks the best variant per workload.
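The selection step itself is simple: benchmark every shipped variant on the target workload and keep the fastest. A minimal sketch — the variant tuples, throughput numbers, and `autotune`/`benchmark` names are hypothetical, not Unweight's actual autotuner API.

```python
def autotune(variants, benchmark):
    """Return the (buffer_depth, tile_width) variant with the highest
    measured throughput. `benchmark` maps a variant to a throughput
    figure (e.g. TFLOP/s) measured on the real workload."""
    return max(variants, key=benchmark)

# Hypothetical measurements for one (projection, batch size) pair:
measured = {
    (2, 64): 410.0,   # shallow buffer, narrow tile
    (2, 128): 455.0,
    (4, 64): 470.0,   # deeper buffer wins at this batch size
    (4, 128): 430.0,  # wider tile costs SMEM better spent on the buffer
}
best = autotune(measured.keys(), measured.get)
```

The point is that the winner is not predictable from first principles — the table is re-measured per workload, and a different (projection, batch size) pair may crown a different variant.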

Why this is harder than it looks

  • SMEM budget collisions across kernels. On Hopper each SM has 228 KB. The reconstructive matmul already needs ~227 KB for its pipeline buffer + accumulator tiles. Co-running a decode kernel (~16 KB SMEM for lookup table) on the same SM would need 243 KB. Because 227 + 16 > 228, the two kernels cannot share an SM — they compete for SMs at the GPU-wide level. Every SM assigned to decode is an SM unavailable to matmul. The split itself becomes a tunable parameter the autotuner measures empirically.
  • Producer must run ahead of consumer. Starving the consumer wastes tensor-core cycles (the whole point of the split); over-running eats SMEM budget without benefit. Circular-buffer depth tuning manages the lead.
  • Double-buffered preprocess slots. If a separate preprocess kernel is also running (e.g. Huffman decoding the next layer's weights during the current layer's matmul), its output buffer in SMEM must not be overwritten while the matmul is still consuming the previous buffer.
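The SMEM budget collision above reduces to simple arithmetic, sketched here with the figures from the text (the helper name is illustrative):

```python
SMEM_PER_SM_KB = 228   # Hopper per-SM shared-memory budget
MATMUL_KB = 227        # pipeline buffer + accumulator tiles
DECODE_KB = 16         # decode kernel's lookup table

def can_share_sm(kernels_kb, budget_kb=SMEM_PER_SM_KB):
    """True iff the kernels' combined SMEM footprint fits one SM."""
    return sum(kernels_kb) <= budget_kb

# 227 + 16 = 243 > 228: the decode kernel cannot co-reside with the
# matmul, so the two compete for whole SMs at the GPU-wide level,
# and the SM split becomes another empirically tuned knob.
```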

Canonical wiki instance

Unweight's reconstructive matmul kernel (2026-04-17) — producer group loads compressed data via TMA into a circular buffer; consumer groups reconstruct BF16 from exponent + sign + mantissa and feed Hopper's WGMMA. Multiple variants across tile width + buffer depth are autotuned per (projection, batch size).

Trade-offs

  • Complex kernel — warp-group synchronisation, circular-buffer management, SMEM bookkeeping. High engineering cost.
  • Hopper-specific — TMA, WGMMA, the 228 KB SMEM budget are all Hopper specifics. Porting to pre-Hopper or to Blackwell (sm_100 / sm_120) requires significant rework.
  • Starvation failure mode — if the producer gets delayed, consumer stalls on empty buffer slots. Tuning the depth against real workloads is load-bearing.
