Memory-bandwidth-bound inference

Definition

Memory-bandwidth-bound inference is the regime in which per-token latency is gated by bytes moved from HBM to the compute units, not by FLOPs executed once the data arrives. This regime dominates LLM autoregressive decode on Hopper-class GPUs: generating a single token requires reading every model weight from HBM, yet the tensor cores crunch numbers roughly 600× faster than HBM can deliver them. The bottleneck is the memory bus, not the math — "every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
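A back-of-envelope sketch of the definition: if every weight must cross the bus once per token, HBM bandwidth alone sets a floor on per-token latency. The model size and bandwidth figures below are illustrative assumptions (a 70B-parameter model in FP16 on an H100-class GPU), not measurements from the source.

```python
# Per-token decode latency floor set purely by HBM bandwidth.
# All numbers are illustrative assumptions, not measurements.
def decode_latency_floor_ms(n_params, bytes_per_weight, hbm_bw_bytes_per_s):
    """Time to stream every weight from HBM once — one token's lower bound."""
    bytes_moved = n_params * bytes_per_weight
    return bytes_moved / hbm_bw_bytes_per_s * 1e3

# Assumed: 70B params, FP16 (2 bytes/weight), ~3.35 TB/s HBM3.
latency = decode_latency_floor_ms(70e9, 2, 3.35e12)
print(f"{latency:.1f} ms/token")  # ≈ 41.8 ms just to move the weights
```

No amount of extra FLOPs helps here; only shrinking `bytes_moved` lowers the floor.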

Why it matters

In a memory-bandwidth-bound regime the tensor cores sit idle waiting for data most of the time. That unused compute capacity cannot simply be redirected to another task, because each GPU compute unit can only run one kernel at a time (a shared-memory constraint). Every optimization that cuts bytes-across-the-bus translates linearly into decode throughput; conversely, any work that adds bytes-across-the-bus directly adds latency, even if it technically runs in idle compute cycles.

Two canonical responses

  1. Quantization (lossy) — fewer bits per weight → fewer bytes across the bus. The dominant industry response; trades model quality for bandwidth.
  2. Lossless compression — compress the weight file so the bytes-per-weight count drops without changing model behaviour. Requires a decompression scheme that hides behind the bus savings — see fused-decompression matmul and Unweight.
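Since savings translate linearly into decode throughput, each response's speedup ceiling is just the ratio of bytes moved. A sketch comparing the two, where the 22% lossless ratio echoes the source's headline figure and the rest (70B params, FP16 baseline, INT4 quantization) are illustrative assumptions:

```python
# Bytes crossing the bus per generated token under each strategy.
# 22% lossless ratio is the article's headline figure; other numbers are assumptions.
N_PARAMS = 70e9

fp16_bytes     = N_PARAMS * 2.0           # baseline: 2 bytes/weight
int4_bytes     = N_PARAMS * 0.5           # lossy quantization: 4 bits/weight
lossless_bytes = fp16_bytes * (1 - 0.22)  # lossless compression, behaviour unchanged

for name, b in [("fp16", fp16_bytes), ("int4", int4_bytes), ("lossless -22%", lossless_bytes)]:
    print(f"{name:>13}: {b / 1e9:6.1f} GB/token, {fp16_bytes / b:.2f}x speedup ceiling")
```

Quantization buys a much larger byte reduction, but at a quality cost; lossless compression's smaller ratio comes with bit-exact model behaviour.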

Batch-size dependence

At small batch size (single-digit tokens, chat turn) decode is decidedly memory-bandwidth-bound — weights loaded once per token, little arithmetic intensity. At large batch size (256+ tokens) arithmetic intensity climbs (many tokens per weight load), the regime drifts toward compute-bound, and the math starts mattering again. See concepts/memory-bound-vs-compute-bound for the general roofline framing.
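The batch-size dependence can be framed as a roofline crossover: each weight load serves `batch` tokens, so arithmetic intensity grows linearly with batch size until it passes the hardware's ridge point. The peak-FLOPs and bandwidth figures below are illustrative H100-class assumptions, not sourced numbers.

```python
# Roofline-style estimate of where decode stops being bandwidth-bound.
# Illustrative H100-class numbers, not measurements.
PEAK_FLOPS = 990e12   # assumed BF16 tensor-core peak, FLOP/s
HBM_BW     = 3.35e12  # assumed HBM3 bandwidth, bytes/s
BYTES_PER_WEIGHT = 2  # FP16

def arithmetic_intensity(batch):
    # Each weight loaded once serves `batch` tokens, 2 FLOPs (mul+add) each.
    return 2 * batch / BYTES_PER_WEIGHT  # FLOPs per byte moved

ridge = PEAK_FLOPS / HBM_BW  # intensity at which the roofline bends (~296 FLOP/B)
for batch in (1, 8, 64, 256, 512):
    regime = "compute-bound" if arithmetic_intensity(batch) > ridge else "bandwidth-bound"
    print(f"batch {batch:>3}: {arithmetic_intensity(batch):6.1f} FLOP/B -> {regime}")
```

At batch 1 the intensity is a single FLOP per byte, hundreds of times below the ridge — which is the small-batch regime the Definition describes.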

Relationship to KV cache

KV cache bytes cross the bus too, and in long-context scenarios the KV cache can exceed weight bytes per token. Every byte Unweight saves on weight memory is a byte of headroom for additional KV cache — tight coupling between weight-compression and context-length-at-scale.
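A sketch of when KV-cache reads overtake weight reads per decode step. The architecture shape below (80 layers, 8 KV heads, head dim 128, FP16 — roughly a Llama-2-70B-like configuration) is an illustrative assumption, not taken from the source.

```python
# KV-cache bytes read per decode step vs. weight bytes, at varying context lengths.
# Architecture numbers are illustrative assumptions, not sourced.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2  # FP16

def kv_bytes_per_token(context_len):
    # K and V tensors, every layer, every cached position is read each step.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

weight_bytes = 70e9 * 2  # assumed 70B params, FP16
for ctx in (4_096, 131_072, 1_000_000):
    kv = kv_bytes_per_token(ctx)
    print(f"ctx {ctx:>9,}: KV read {kv / 1e9:6.1f} GB/token "
          f"({kv / weight_bytes:.2f}x weights)")
```

Under these assumptions the KV cache passes weight bytes somewhere in the hundreds of thousands of tokens of context, which is why weight-byte savings buy direct headroom for longer contexts.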
