Memory-bound vs compute-bound (GPU inference)¶
Definition¶
A workload is memory-bound when its performance is limited by how fast data can be moved between GPU memory (HBM) and the compute units — latency per operation is dominated by memory-load time, not by arithmetic. It is compute-bound when arithmetic throughput (FLOPs) saturates the compute units and memory bandwidth is no longer the bottleneck. The same model, running on the same GPU, can live in either regime depending on batch size, sequence length, and the ratio of memory loads to arithmetic operations — the arithmetic intensity of the workload (Source: sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).
Roofline view¶
The roofline model plots performance (FLOPs/s achieved) vs arithmetic intensity (FLOPs per byte loaded from memory). Two ceilings:
- Memory-bandwidth ceiling — peak bytes/s × arithmetic intensity. In the memory-bound regime the workload rides this ceiling, limited by how fast data arrives.
- Peak FLOPs ceiling — the hardware's peak compute rate, reached only when arithmetic intensity is high enough that memory can keep compute fed.
The transition point between the two ceilings is the saturation point on a given GPU for a given workload — see saturation point for the operational form.
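The two ceilings and their meeting point can be sketched numerically. The peak figures below are illustrative, loosely A100-class assumptions, not vendor specifications:

```python
# Roofline model: attainable throughput is the lesser of the memory-bandwidth
# ceiling (peak bytes/s x arithmetic intensity) and the peak-FLOPs ceiling.
# PEAK_FLOPS and PEAK_BW are assumed, illustrative values.

PEAK_FLOPS = 312e12   # peak FP16 compute, FLOPs/s (assumed)
PEAK_BW = 2.0e12      # peak HBM bandwidth, bytes/s (assumed)

def attainable_flops(arithmetic_intensity: float) -> float:
    """Attainable FLOPs/s at a given arithmetic intensity (FLOPs/byte)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# The ridge: the arithmetic intensity where the two ceilings meet.
# Below it the workload is memory-bound; above it, compute-bound.
ridge = PEAK_FLOPS / PEAK_BW  # FLOPs per byte

for ai in (1.0, 10.0, ridge, 1000.0):
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:7.1f} FLOP/B -> {attainable_flops(ai)/1e12:6.1f} TFLOP/s ({regime})")
```

With these assumed peaks the ridge sits at 156 FLOPs/byte: at AI = 1 the GPU delivers only 2 TFLOP/s of its 312 TFLOP/s peak, which is the stranded-compute picture the rest of this note describes.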
Why short-request query inference is memory-bound¶
Voyage AI's 2025-12-18 post characterises its query-side embedding workload as structurally memory-bound:
"Queries are typically short, and their token-length distribution is highly skewed. As a result, query inference tends to be memory-bound rather than compute-bound."
Three compounding reasons:
- Low per-request sequence length — short sequences produce few activations, so attention + MLP FLOPs per forward pass are low relative to the model weights that must be loaded from HBM. Weight-loading cost dominates.
- Fixed per-request overheads — kernel launches, scheduling, attention-mask setup, pooling, normalisation — are roughly independent of sequence length and together dominate latency below the saturation point. "For small requests, fixed per-request overheads (like GPU scheduling, memory movement, pooling and normalization, etc.) dominate, and latency stays nearly constant."
- Low batch size without batching — serving one short request at a time means loading the whole model from HBM for each request, extracting tiny output activations. Arithmetic intensity is near zero; the GPU spends most of its time waiting for memory.
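The third point implies a hard latency floor set by memory bandwidth alone. A minimal sketch, with a hypothetical 0.5B-parameter fp16 embedding model and an assumed bandwidth figure (all numbers illustrative, not measurements):

```python
# Lower bound on per-request latency when serving one short request at a
# time: the full model streams from HBM each forward pass, regardless of
# how few tokens are processed. Hypothetical model size, assumed bandwidth.

N_PARAMS = 0.5e9        # hypothetical 0.5B-parameter embedding model
BYTES_PER_PARAM = 2     # fp16 weights
PEAK_BW = 2.0e12        # assumed HBM bandwidth, bytes/s

weight_bytes = N_PARAMS * BYTES_PER_PARAM
memory_floor_s = weight_bytes / PEAK_BW  # time just to stream the weights
print(f"weight-streaming floor ~ {memory_floor_s * 1e6:.0f} us per request")
```

Under these assumptions a 4-token query and a 64-token query pay the same ~500 µs weight-streaming cost, consistent with the "latency stays nearly constant" observation quoted above.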
Batching moves the workload from memory-bound to compute-bound¶
Combining many short requests into one forward pass raises arithmetic intensity: the model weights are loaded from HBM once, but many sequences' tokens are processed against them. The workload shifts up the roofline's memory-bandwidth ceiling toward the peak-FLOPs ceiling. At the saturation point the two ceilings meet and the workload becomes compute-bound. Further batching stays on the peak-FLOPs ceiling but adds linear latency.
Voyage AI names this effect directly:
"Batching short requests can move the inference from memory-bound to compute-bound."
This is the economic rationale for token-count-based batching: in its natural memory-bound state the workload wastes GPU compute, and batching converts that waste into throughput.
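The shift can be made concrete with a batch-size sweep. For a matmul-dominated forward pass, FLOPs ≈ 2 × params × tokens while weight traffic ≈ params × bytes-per-param per pass, so the parameter count cancels out of the arithmetic-intensity ratio. Peak figures and request length below are illustrative assumptions:

```python
# How batching short requests moves arithmetic intensity past the
# saturation point. Illustrative, loosely A100-class numbers (assumed).

PEAK_FLOPS = 312e12        # FLOPs/s (assumed)
PEAK_BW = 2.0e12           # HBM bytes/s (assumed)
TOKENS_PER_REQUEST = 16    # a short query (assumed)
BYTES_PER_PARAM = 2        # fp16 weights

ridge = PEAK_FLOPS / PEAK_BW  # saturation point, FLOPs/byte

def batch_ai(batch: int) -> float:
    """Arithmetic intensity when `batch` requests share one weight load.

    FLOPs ~ 2 * n_params * tokens * batch; bytes ~ n_params * BYTES_PER_PARAM,
    since the weights stream from HBM once per forward pass. n_params cancels.
    """
    return 2 * TOKENS_PER_REQUEST * batch / BYTES_PER_PARAM

for batch in (1, 4, 16, 64, 256):
    ai = batch_ai(batch)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"batch={batch:4d}  AI={ai:7.0f} FLOP/B  -> {regime}")
```

Under these assumptions a lone 16-token request sits at 16 FLOPs/byte, far below the ~156 FLOPs/byte ridge; around ten such requests batched together cross into the compute-bound regime, after which further batching adds throughput but also linear latency, as described above.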
Why you can't autoscale out of it¶
A naive response to bursty memory-bound workloads is horizontal scale-out: add more GPUs. But each new GPU is memory-bound too — sequential serving on the new GPU is exactly as inefficient as sequential serving on the old one. Scale-out multiplies capacity, not efficiency; the compute is still stranded in the memory-bound regime.
Token-count batching is the efficiency primitive; autoscaling is orthogonal to it and needed only once batched GPUs are themselves saturated.
Generalisation beyond embedding inference¶
The memory-bound vs compute-bound distinction is the same lens used to analyse:
- LLM decoding — token-by-token generation is memory-bound because each new token requires loading the full KV cache plus the model weights to do compute for a single position; this is why speculative decoding, continuous batching, and paged attention all aim to increase arithmetic intensity per byte loaded.
- Training vs inference — training's much larger batch sizes push training naturally toward compute-bound; inference's smaller, latency-constrained batches keep it memory-bound.
- Small-model / edge inference — the model fits in on-chip cache, so arithmetic intensity is higher and the workload is often compute-bound even at low batch size.
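The decoding case can be quantified with the same back-of-envelope method: per generated token, FLOPs ≈ 2 × params, while bytes ≈ weights plus KV cache. A sketch with a hypothetical 7B-parameter fp16 model and an assumed KV-cache size:

```python
# Why single-stream LLM decoding is memory-bound: each generated token
# reads all weights (plus the KV cache) for only ~2 FLOPs per parameter
# of work. Hypothetical 7B fp16 model; KV-cache size is assumed.

N_PARAMS = 7e9
BYTES_PER_PARAM = 2      # fp16
KV_BYTES = 0.5e9         # assumed KV-cache footprint at this context length

flops_per_token = 2 * N_PARAMS
bytes_per_token = N_PARAMS * BYTES_PER_PARAM + KV_BYTES
ai = flops_per_token / bytes_per_token  # FLOPs per byte
print(f"decode arithmetic intensity ~ {ai:.2f} FLOP/byte")
```

Under these assumptions decoding sits near 1 FLOP/byte, two orders of magnitude below a ~156 FLOPs/byte ridge, which is why the techniques listed above all work by amortising each memory load over more useful arithmetic.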
Seen in¶
- 2025-12-18 Voyage AI / MongoDB — Token-count-based batching — canonical wiki instance; query embedding inference is characterised as "memory-bound rather than compute-bound", with batching the primary lever to move inference toward compute-bound / near-saturation-point (sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference).