CONCEPT Cited by 1 source
Memory-bandwidth-bound inference¶
Definition¶
Memory-bandwidth-bound inference is the regime in which per-token latency is gated by bytes moved from HBM to the compute units, not by the FLOPs executed once the data arrives. This regime dominates LLM autoregressive decode on Hopper-class GPUs: generating a single token requires reading every model weight from HBM, yet the tensor cores crunch numbers roughly 600× faster than HBM can deliver them. The bottleneck is the memory bus, not the math — "every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
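A back-of-the-envelope sketch of the ratio and the latency floor it implies, using illustrative Hopper-class spec-sheet figures (the hardware numbers and the 70B model size are assumptions for illustration, not from the source):

```python
# Illustrative roofline arithmetic for a Hopper-class GPU.
# Hardware numbers are ballpark spec-sheet figures, not measurements.
HBM_BANDWIDTH = 3.35e12   # bytes/s  (H100 SXM HBM3, approximate)
TENSOR_FLOPS  = 1979e12   # FLOP/s   (H100 FP8 dense tensor throughput, approximate)

# Compute-to-bandwidth ratio: FLOPs the tensor cores can retire
# per byte the memory system can deliver — the ~600× in the text.
machine_balance = TENSOR_FLOPS / HBM_BANDWIDTH   # ≈ 590 FLOP/byte

def decode_latency_floor(n_params, bytes_per_weight):
    """Per-token latency floor at batch 1: time to stream the weights once."""
    return n_params * bytes_per_weight / HBM_BANDWIDTH

# A hypothetical 70B-parameter model in FP16: ~140 GB streamed per token.
print(f"{decode_latency_floor(70e9, 2) * 1e3:.1f} ms/token")   # ≈ 41.8 ms
```

No amount of kernel cleverness beats this floor; only shrinking the bytes does.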
Why it matters¶
In a memory-bandwidth-bound regime the tensor cores sit idle most of the time, waiting for data. That unused compute capacity cannot simply be redirected to another task, because each GPU compute unit can only run one kernel at a time (shared-memory constraints). Every optimization that cuts bytes-across-the-bus translates linearly into decode throughput. Conversely, any work that adds bytes-across-the-bus directly adds latency, even if it technically runs in idle compute cycles.
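The linearity is easy to see in a sketch: bandwidth-bound decode throughput is just bandwidth divided by bytes moved per token, so the 22% byte reduction in the source's title buys a ~1.28× speedup (bandwidth and model-size figures below are illustrative assumptions):

```python
# In the bandwidth-bound regime, decode throughput is simply
# bandwidth / bytes-moved-per-token, so byte savings scale linearly.
HBM_BANDWIDTH = 3.35e12   # bytes/s, illustrative Hopper-class figure

def decode_tokens_per_sec(bytes_per_token):
    return HBM_BANDWIDTH / bytes_per_token

baseline   = decode_tokens_per_sec(140e9)          # 140 GB of FP16 weights
compressed = decode_tokens_per_sec(140e9 * 0.78)   # 22% fewer bytes (per the source)
print(f"speedup: {compressed / baseline:.3f}x")    # 1 / 0.78 ≈ 1.282x
```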
Two canonical responses¶
- Quantization (lossy) — fewer bits per weight → fewer bytes across the bus. The dominant industry response; trades model quality for bandwidth.
- Lossless compression — compress the weight file so the bytes-per-weight count drops without changing model behaviour. Requires a decompression scheme that hides behind the bus savings — see fused-decompression matmul and Unweight.
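The byte counts the two responses produce can be compared directly; the 7B model size below is a hypothetical example, and the 22% lossless ratio comes from the source's title:

```python
# Bytes-across-the-bus per token under each response,
# for a hypothetical 7B-parameter model.
n_params = 7e9

fp16_bytes = n_params * 2.0            # baseline FP16: 14 GB per token
int4_bytes = n_params * 0.5            # lossy 4-bit quantization: 3.5 GB
lossless   = fp16_bytes * (1 - 0.22)   # ~22% lossless compression (per the source)

print(f"fp16:     {fp16_bytes / 1e9:.2f} GB")
print(f"int4:     {int4_bytes / 1e9:.2f} GB")
print(f"lossless: {lossless / 1e9:.2f} GB")
```

Quantization saves far more bytes but changes the numbers the model computes with; lossless compression saves less but leaves behaviour bit-identical.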
Batch-size dependence¶
At small batch size (single-digit tokens, chat turn) decode is decidedly memory-bandwidth-bound — weights loaded once per token, little arithmetic intensity. At large batch size (256+ tokens) arithmetic intensity climbs (many tokens per weight load), the regime drifts toward compute-bound, and the math starts mattering again. See concepts/memory-bound-vs-compute-bound for the general roofline framing.
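The drift between regimes can be sketched with roofline arithmetic: at batch B, each weight element loaded from HBM feeds one multiply-add per batched token, so arithmetic intensity grows with B until it crosses the machine balance. Hardware figures below are illustrative Hopper-class assumptions, not from the source:

```python
# Roofline crossover sketch for FP16 decode on a Hopper-class GPU.
HBM_BANDWIDTH = 3.35e12   # bytes/s, illustrative
TENSOR_FLOPS  = 989e12    # FLOP/s, illustrative FP16 dense figure

machine_balance = TENSOR_FLOPS / HBM_BANDWIDTH   # ≈ 295 FLOP/byte

def arithmetic_intensity(batch, bytes_per_weight=2):
    # 2 FLOPs (multiply + add) per batched token per weight element loaded.
    return 2 * batch / bytes_per_weight

for batch in (1, 8, 256, 512):
    ai = arithmetic_intensity(batch)
    regime = "compute" if ai > machine_balance else "bandwidth"
    print(f"batch {batch:4d}: AI = {ai:6.1f} FLOP/byte -> {regime}-bound")
```

With these numbers the crossover sits around batch ~300; small chat-style batches are deep in bandwidth-bound territory.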
Relationship to KV cache¶
KV cache bytes cross the bus too, and in long-context scenarios the KV cache can exceed weight bytes per token. Every byte Unweight saves on weight memory is a byte of headroom for additional KV cache — tight coupling between weight-compression and context-length-at-scale.
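A rough sketch of when KV-cache reads overtake weight reads, with hypothetical Llama-7B-like shapes (layer counts, head sizes, and context lengths below are all illustrative assumptions):

```python
# KV-cache bytes read per decoded token vs. weight bytes,
# with illustrative 7B-class shapes (32 layers, 32 KV heads, head_dim 128, FP16).
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

def kv_bytes_per_token(context_len):
    # Each decode step reads K and V for every past position in every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * context_len

weight_bytes = 7e9 * dtype_bytes   # 14 GB of FP16 weights per token

for ctx in (4_096, 128_000):
    ratio = kv_bytes_per_token(ctx) / weight_bytes
    print(f"context {ctx:6d}: KV read = {kv_bytes_per_token(ctx) / 1e9:.2f} GB "
          f"({ratio:.0%} of weight bytes)")
```

At short contexts the KV cache is a rounding error next to the weights; at long contexts it dominates, which is why weight-byte savings buy context-length headroom.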
Seen in¶
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical wiki instance; ~600× tensor-core-vs-HBM ratio is the motivating number; "Every byte that crosses the memory bus is a byte that could have been avoided".
Related¶
- concepts/memory-bound-vs-compute-bound — general roofline regime taxonomy; the bandwidth-bound side applied specifically to LLM serving.
- concepts/hbm-vs-smem — the two-tier GPU memory hierarchy that makes reconstruction-in-SMEM the correct fused-decompression target.
- concepts/kv-cache — the second major per-token byte flow alongside weights.
- systems/unweight / systems/infire — weight-side vs activation-memory-side responses to the same bandwidth pressure.
- patterns/fused-decompress-tensor-core-matmul — how compression-for-bandwidth actually wins in a kernel.