CONCEPT Cited by 1 source
Memory-bandwidth-bound inference¶
Definition¶
Memory-bandwidth-bound inference is the regime in which per-token latency is gated by bytes moved from HBM to the compute units, not by the FLOPs executed once the data arrives. This regime dominates LLM autoregressive decode on Hopper-class GPUs: generating a single token requires reading every model weight from HBM, yet the tensor cores crunch numbers roughly 600× faster than HBM can deliver them. The bottleneck is the memory bus, not the math — "every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller." (Source: sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality)
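A back-of-the-envelope sketch of the ratio and the latency floor it implies, using illustrative Hopper-class spec-sheet figures (the hardware numbers and the 70B model size are assumptions for illustration, not from the source):

```python
# Illustrative roofline arithmetic for a Hopper-class GPU.
# Hardware numbers are ballpark spec-sheet figures, not measurements.
HBM_BANDWIDTH = 3.35e12   # bytes/s  (H100 SXM HBM3, approximate)
TENSOR_FLOPS  = 1979e12   # FLOP/s   (H100 FP8 dense tensor throughput, approximate)

# Compute-to-bandwidth ratio: FLOPs the tensor cores can retire
# per byte the memory system can deliver — the ~600× in the text.
machine_balance = TENSOR_FLOPS / HBM_BANDWIDTH   # ≈ 590 FLOP/byte

def decode_latency_floor(n_params, bytes_per_weight):
    """Per-token latency floor at batch 1: time to stream the weights once."""
    return n_params * bytes_per_weight / HBM_BANDWIDTH

# A hypothetical 70B-parameter model in FP16: ~140 GB streamed per token.
print(f"{decode_latency_floor(70e9, 2) * 1e3:.1f} ms/token")   # ≈ 41.8 ms
```

No amount of kernel cleverness beats this floor; only shrinking the bytes does.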
Why it matters¶
In a memory-bandwidth-bound regime the tensor cores sit idle most of the time, waiting for data. That unused compute capacity cannot simply be redirected to another task, because each GPU compute unit can only run one kernel at a time (shared-memory constraints). Every optimization that cuts bytes-across-the-bus translates linearly into decode throughput. Conversely, any work that adds bytes-across-the-bus directly adds latency, even if it technically runs in idle compute cycles.
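The linearity is easy to see in a sketch: bandwidth-bound decode throughput is just bandwidth divided by bytes moved per token, so the 22% byte reduction in the source's title buys a ~1.28× speedup (bandwidth and model-size figures below are illustrative assumptions):

```python
# In the bandwidth-bound regime, decode throughput is simply
# bandwidth / bytes-moved-per-token, so byte savings scale linearly.
HBM_BANDWIDTH = 3.35e12   # bytes/s, illustrative Hopper-class figure

def decode_tokens_per_sec(bytes_per_token):
    return HBM_BANDWIDTH / bytes_per_token

baseline   = decode_tokens_per_sec(140e9)          # 140 GB of FP16 weights
compressed = decode_tokens_per_sec(140e9 * 0.78)   # 22% fewer bytes (per the source)
print(f"speedup: {compressed / baseline:.3f}x")    # 1 / 0.78 ≈ 1.282x
```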
Two canonical responses¶
- Quantization (lossy) — fewer bits per weight → fewer bytes across the bus. The dominant industry response; trades model quality for bandwidth.
- Lossless compression — compress the weight file so the bytes-per-weight count drops without changing model behaviour. Requires a decompression scheme that hides behind the bus savings — see fused-decompression matmul and Unweight.
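The byte counts the two responses produce can be compared directly; the 7B model size below is a hypothetical example, and the 22% lossless ratio comes from the source's title:

```python
# Bytes-across-the-bus per token under each response,
# for a hypothetical 7B-parameter model.
n_params = 7e9

fp16_bytes = n_params * 2.0            # baseline FP16: 14 GB per token
int4_bytes = n_params * 0.5            # lossy 4-bit quantization: 3.5 GB
lossless   = fp16_bytes * (1 - 0.22)   # ~22% lossless compression (per the source)

print(f"fp16:     {fp16_bytes / 1e9:.2f} GB")
print(f"int4:     {int4_bytes / 1e9:.2f} GB")
print(f"lossless: {lossless / 1e9:.2f} GB")
```

Quantization saves far more bytes but changes the numbers the model computes with; lossless compression saves less but leaves behaviour bit-identical.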
Batch-size dependence¶
At small batch size (single-digit tokens, chat turn) decode is decidedly memory-bandwidth-bound — weights loaded once per token, little arithmetic intensity. At large batch size (256+ tokens) arithmetic intensity climbs (many tokens per weight load), the regime drifts toward compute-bound, and the math starts mattering again. See concepts/memory-bound-vs-compute-bound for the general roofline framing.
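The drift between regimes can be sketched with roofline arithmetic: at batch B, each weight element loaded from HBM feeds one multiply-add per batched token, so arithmetic intensity grows with B until it crosses the machine balance. Hardware figures below are illustrative Hopper-class assumptions, not from the source:

```python
# Roofline crossover sketch for FP16 decode on a Hopper-class GPU.
HBM_BANDWIDTH = 3.35e12   # bytes/s, illustrative
TENSOR_FLOPS  = 989e12    # FLOP/s, illustrative FP16 dense figure

machine_balance = TENSOR_FLOPS / HBM_BANDWIDTH   # ≈ 295 FLOP/byte

def arithmetic_intensity(batch, bytes_per_weight=2):
    # 2 FLOPs (multiply + add) per batched token per weight element loaded.
    return 2 * batch / bytes_per_weight

for batch in (1, 8, 256, 512):
    ai = arithmetic_intensity(batch)
    regime = "compute" if ai > machine_balance else "bandwidth"
    print(f"batch {batch:4d}: AI = {ai:6.1f} FLOP/byte -> {regime}-bound")
```

With these numbers the crossover sits around batch ~300; small chat-style batches are deep in bandwidth-bound territory.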
Relationship to KV cache¶
KV cache bytes cross the bus too, and in long-context scenarios the KV cache can exceed weight bytes per token. Every byte Unweight saves on weight memory is a byte of headroom for additional KV cache — tight coupling between weight-compression and context-length-at-scale.
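A rough sketch of when KV-cache reads overtake weight reads, with hypothetical Llama-7B-like shapes (layer counts, head sizes, and context lengths below are all illustrative assumptions):

```python
# KV-cache bytes read per decoded token vs. weight bytes,
# with illustrative 7B-class shapes (32 layers, 32 KV heads, head_dim 128, FP16).
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

def kv_bytes_per_token(context_len):
    # Each decode step reads K and V for every past position in every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * context_len

weight_bytes = 7e9 * dtype_bytes   # 14 GB of FP16 weights per token

for ctx in (4_096, 128_000):
    ratio = kv_bytes_per_token(ctx) / weight_bytes
    print(f"context {ctx:6d}: KV read = {kv_bytes_per_token(ctx) / 1e9:.2f} GB "
          f"({ratio:.0%} of weight bytes)")
```

At short contexts the KV cache is a rounding error next to the weights; at long contexts it dominates, which is why weight-byte savings buy context-length headroom.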
Seen in¶
- sources/2026-04-17-cloudflare-unweight-how-we-compressed-an-llm-22-percent-without-sacrificing-quality — canonical wiki instance; ~600× tensor-core-vs-HBM ratio is the motivating number; "Every byte that crosses the memory bus is a byte that could have been avoided".
Related¶
- concepts/memory-bound-vs-compute-bound — general roofline regime taxonomy; the bandwidth-bound side applied specifically to LLM serving.
- concepts/hbm-vs-smem — the two-tier GPU memory hierarchy that makes reconstruction-in-SMEM the correct fused-decompression target.
- concepts/kv-cache — the second major per-token byte flow alongside weights.
- systems/unweight / systems/infire — weight-side vs activation-memory-side responses to the same bandwidth pressure.
- patterns/fused-decompress-tensor-core-matmul — how compression-for-bandwidth actually wins in a kernel.