

HBM vs. SMEM (GPU memory hierarchy)

Definition

Modern NVIDIA GPUs have two relevant memory tiers for inference kernels:

  • High Bandwidth Memory (HBM) — large (80 GB on H100), but relatively slow to access. Model weights live here.
  • Shared Memory (SMEM) — tiny (228 KB per SM on Hopper), but extremely fast; this is where the GPU stages data immediately before doing math on it.

During inference, generating each token requires reading the full weight matrix from HBM through the memory bus — the bus is the bottleneck, not the math. "Fewer bytes across the bus = faster token generation." (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
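That claim can be made concrete with back-of-envelope arithmetic: if decode is bandwidth-bound, per-token latency is roughly (weight bytes read) divided by HBM bandwidth. A minimal sketch, assuming a 70B-parameter model in fp16 and the H100 SXM's ~3.35 TB/s HBM3 bandwidth (illustrative numbers, not measurements):

```python
# Decode is bandwidth-bound: every weight crosses the bus once per token,
# so token rate ~ HBM bandwidth / bytes of weights.
def tokens_per_second(param_count, bytes_per_param, hbm_bandwidth_bytes_s):
    bytes_per_token = param_count * bytes_per_param
    return hbm_bandwidth_bytes_s / bytes_per_token

H100_HBM = 3.35e12  # ~3.35 TB/s (H100 SXM spec, assumed here)

fp16_rate = tokens_per_second(70e9, 2.0, H100_HBM)
# 22% compression => 22% fewer bytes across the bus per token
compressed_rate = tokens_per_second(70e9, 2.0 * 0.78, H100_HBM)
print(f"fp16: {fp16_rate:.1f} tok/s, compressed: {compressed_rate:.1f} tok/s")
```

Under this model the speedup is exactly the reciprocal of the compression ratio: 22% fewer bytes yields ~1.28× more tokens per second.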

Why the two-tier structure matters for design

  • Any kernel can do math only on data in SMEM (or registers). Data on HBM has to be pulled in first.
  • HBM bandwidth is the bottleneck during decode: tensor cores can consume data roughly 600× faster than HBM can deliver it, and per-token decode reads every weight from HBM once.
  • SMEM size is the hard constraint on kernel design. Bigger pipeline buffers, bigger accumulator tiles, bigger lookup tables all compete for the same 228 KB.
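The ~600× figure can be sanity-checked against published specs. A minimal sketch, assuming H100 SXM numbers (~1979 TFLOPS dense FP8 tensor throughput, ~3.35 TB/s HBM3 bandwidth — assumptions, not the source's figures):

```python
# FLOPs the tensor cores can retire per byte the HBM bus can deliver.
tensor_flops_per_s = 1.979e15   # assumed H100 SXM spec, dense FP8
hbm_bytes_per_s = 3.35e12       # assumed H100 SXM spec, HBM3
flops_per_byte = tensor_flops_per_s / hbm_bytes_per_s
# A kernel must sustain roughly this many FLOPs per byte loaded to be
# compute-bound; per-token decode does ~2 FLOPs per weight byte, far below.
print(round(flops_per_byte))  # ~591 with these assumed numbers
```

This is the roofline intuition behind "fewer bytes across the bus = faster token generation": decode sits so deep in the bandwidth-bound regime that only reducing bytes moves the needle.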

Hopper-specific primitives

  • wgmma — warp-group matrix-multiply-accumulate that reads from SMEM, writes back to registers / SMEM.
  • TMA (Tensor Memory Accelerator) — dedicated hardware for async HBM→SMEM transfers with minimal thread overhead; enables the producer thread group to overlap transfer with the consumer's compute.
  • 228 KB SMEM per SM — the hard budget every Hopper kernel design works inside.
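The value of TMA's async copies can be shown with a toy timing model: when a producer warp group prefetches tile i+1 while the consumer computes on tile i, steady-state cost per tile drops from load + compute to max(load, compute). A sketch under that simplifying assumption (made-up tile costs):

```python
# Toy model of producer/consumer overlap enabled by async HBM->SMEM copies.
def total_time(n_tiles, load_t, compute_t, overlapped):
    if overlapped:
        # the first tile's load cannot be hidden; the rest overlap compute
        return load_t + n_tiles * max(load_t, compute_t)
    return n_tiles * (load_t + compute_t)

serial = total_time(100, 2.0, 3.0, overlapped=False)  # 100 * (2 + 3) = 500
piped = total_time(100, 2.0, 3.0, overlapped=True)    # 2 + 100 * 3  = 302
print(serial, piped)
```

Note the SMEM cost of the overlap: double-buffering the pipeline means holding two tiles' worth of staging buffers at once, which is exactly why buffers compete for the 228 KB budget.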

Design implication: decompress in SMEM, not HBM

Naive decompression writes the uncompressed form back to HBM before running matmul → the decompressed bytes traverse the bus again, erasing the compression win. The correct pattern is to load compressed bytes from HBM into SMEM, reconstruct in SMEM, and feed the tensor cores directly — see fused decompression matmul and Unweight's reconstructive kernel.
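The bus-traffic difference between the two patterns is easy to quantify. A minimal model, assuming the naive path reads the compressed bytes, writes the full-size tensor back to HBM, then reads it again for the matmul (illustrative, not Unweight's measured numbers):

```python
# HBM bus traffic per layer for W weight bytes compressed to ratio r (r < 1).
def bus_traffic(weight_bytes, r, fused):
    if fused:
        return r * weight_bytes  # only compressed bytes ever cross the bus
    # naive: read compressed + write decompressed + read decompressed
    return r * weight_bytes + 2 * weight_bytes

W = 1e9  # 1 GB of weights, arbitrary
naive = bus_traffic(W, 0.78, fused=False)
fused = bus_traffic(W, 0.78, fused=True)
print(naive / fused)
```

Under this model the naive path moves more bytes than skipping compression entirely (2.78 W vs 1.0 W), while the fused path keeps the full 22% saving.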

SMEM budget collisions

Unweight's concrete collision: reconstructive matmul needs ~227 KB of SMEM for pipeline buffer + accumulator tiles; a Huffman decode kernel needs ~16 KB for its lookup table. 227 + 16 > 228 → the two kernels can't share an SM. Every SM assigned to decoding is one fewer SM available to the matmul. The split becomes a tunable parameter the Unweight autotuner measures empirically.
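The collision and the resulting tunable can be sketched in a few lines. Assuming 132 SMs (H100) and made-up per-SM rates (the decode/matmul rates below are hypothetical, which is precisely why the autotuner measures the split empirically rather than computing it):

```python
SMEM_KB = 228
MATMUL_KB, HUFFMAN_KB = 227, 16
assert MATMUL_KB + HUFFMAN_KB > SMEM_KB  # cannot co-reside on one SM

def system_rate(total_sms, decode_sms, decode_rate, matmul_rate):
    # pipeline runs at the speed of its slower stage (simplifying assumption)
    return min(decode_sms * decode_rate, (total_sms - decode_sms) * matmul_rate)

# e.g. decode 4x faster per SM than matmul in this toy model
best_split = max(range(1, 132), key=lambda d: system_rate(132, d, 4.0, 1.0))
print(best_split)  # SMs to assign to decoding under these assumed rates
```

With different kernels, shapes, or compression ratios the optimum moves, so the split stays a runtime-tuned parameter rather than a constant.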
