CONCEPT Cited by 1 source

CPU-bound serving regime for small fast models

Definition

The CPU-bound serving regime is the inference operating point at which a single CPU process can no longer prepare the next batch fast enough to keep the GPU saturated, flipping the bottleneck from GPU compute to CPU per-batch setup. It violates the standard serving-engine assumption that the GPU is the scarce resource.

It is a structural property of small, fast models on fast accelerators — not a tuning artefact. The same code on the same hardware would not be CPU-bound for a larger model.

The forcing condition

For a model whose forward pass on a given accelerator is fast enough that:

time_to_complete_forward_pass(model, batch) < time_to_prepare_next_batch(CPU)

the GPU finishes its decode step and waits idle for the CPU to assemble the next batch. The GPU duty cycle drops even though the GPU is the more expensive resource — a textbook misallocation.
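The inequality above can be turned into a rough duty-cycle estimate. The following is a minimal sketch, not a measurement tool: the function names and the millisecond figures in the usage note are hypothetical, and it assumes that batch preparation overlaps with the previous forward pass and parallelises perfectly across processes, both of which are idealisations.

```python
import math


def gpu_duty_cycle(t_forward_ms: float, t_prepare_ms: float, n_prep_processes: int = 1) -> float:
    """Steady-state fraction of time the GPU is busy (idealised model).

    Assumption: each cycle lasts as long as the slower of the forward pass
    and the (parallelised) batch preparation.
    """
    if t_forward_ms <= 0 or t_prepare_ms < 0 or n_prep_processes < 1:
        raise ValueError("times must be positive and n_prep_processes >= 1")
    effective_prepare_ms = t_prepare_ms / n_prep_processes
    cycle_ms = max(t_forward_ms, effective_prepare_ms)
    return t_forward_ms / cycle_ms


def processes_to_saturate(t_forward_ms: float, t_prepare_ms: float) -> int:
    """Smallest number of preparation processes that hides CPU setup behind the GPU."""
    if t_forward_ms <= 0 or t_prepare_ms < 0:
        raise ValueError("times must be positive")
    return max(1, math.ceil(t_prepare_ms / t_forward_ms))
```

With illustrative numbers, `gpu_duty_cycle(2.0, 5.0)` gives 0.4 (the GPU idles 60% of the time), while `gpu_duty_cycle(40.0, 5.0)` gives 1.0, which is the sense in which the same code is not CPU-bound for a larger model.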

Canonical wiki disclosure

The first explicit wiki canonicalisation of this regime is the 2026-05-08 Databricks Model Serving / Superhuman post:

"For small, fast models, performance is often bottlenecked by the CPU – not the GPU."

"For most model serving workloads, a single process is more than fast enough to keep the GPU saturated, since the GPU is the bottleneck, not the CPU. But with a small, fast model, the GPU completes its forward pass faster than a single process can prepare the next batch, flipping the bottleneck to the CPU."

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

The Superhuman grammar-correction model is the canonical instance: ~50 input + 50 output tokens per request, on H100, served at 200K+ QPS. The forward pass is fast enough that one CPU process cannot keep up.

Why this matters architecturally

Most off-the-shelf inference engines (including vLLM) were designed under the GPU-bound assumption — single-process scheduler, per-request Python tensor manipulation, sequential post-processing of the previous batch before launching the next. All three of those choices become bottlenecks under the small-fast-model regime.

A platform serving small fast models at high QPS has to redesign:

1. Batch dispatch: multiple CPU processes preparing batches in parallel (see patterns/multiprocessing-runtime-for-cpu-bound-serving).
2. Per-step preparation: replace Python-level tensor slicing / copying / filling with single C++ calls; CUDA synchronisation overhead means single-threaded C++ often beats ThreadPool / OpenMP parallel strategies.
3. Scheduler critical path: overlap CPU-side post-processing for batch N with the next GPU forward pass for batch N+1 instead of finishing N first (see concepts/async-cpu-gpu-pipelined-scheduling).
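The batch-dispatch item above can be illustrated with a small fan-out of preparation work across processes. This is a minimal sketch under stated assumptions, not the Databricks runtime: `prepare_batch`, `prep_worker` and `run_demo` are hypothetical names, padding token lists stands in for real per-batch setup, and the collecting loop stands in for the GPU. In the system described by the source, the worker processes dispatch to the GPU themselves rather than returning results to a single consumer.

```python
import multiprocessing as mp


def prepare_batch(requests):
    """Hypothetical CPU-side setup: pad token lists to a common length.

    Assumes `requests` is a non-empty list of token-id lists.
    """
    width = max(len(r) for r in requests)
    return [r + [0] * (width - len(r)) for r in requests]


def prep_worker(in_q, out_q):
    """Runs in its own process: prepare batches until a None sentinel arrives."""
    for batch_id, requests in iter(in_q.get, None):
        out_q.put((batch_id, prepare_batch(requests)))


def run_demo(raw_batches, n_workers=4):
    """Fan raw batches out to n_workers processes and return them in order."""
    in_q, out_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=prep_worker, args=(in_q, out_q)) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in enumerate(raw_batches):
        in_q.put(item)                      # (batch_id, requests)
    for _ in workers:
        in_q.put(None)                      # one sentinel per worker
    # Drain results before joining so workers never block on a full pipe.
    prepared = dict(out_q.get() for _ in raw_batches)
    for w in workers:
        w.join()
    return [prepared[i] for i in range(len(raw_batches))]
```

`run_demo` should be called from under an `if __name__ == "__main__":` guard, since some `multiprocessing` start methods re-import the main module in each child. Separate processes are used here instead of threads because pure-Python preparation work would otherwise serialise on the interpreter lock.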

Databricks reports that the per-step preparation and scheduler critical-path changes each contributed "a few percentage points" on top of the multiprocessing fix's ~20%; the multiprocessing fix is the dominant lever once the regime is recognised.
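The scheduler change (overlapping post-processing of batch N with the forward pass for batch N+1) has roughly the following loop shape. This is a hedged sketch: `fake_forward` and `postprocess` are hypothetical stand-ins, a single-worker thread pool plays the role of an asynchronous GPU stream, and batches are treated as independent, which real autoregressive decoding is not.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_forward(batch):
    """Stand-in for an asynchronous GPU forward pass (hypothetical)."""
    time.sleep(0.01)                 # pretend the accelerator is busy
    return [x * 2 for x in batch]


def postprocess(outputs):
    """Stand-in for CPU-side post-processing such as detokenisation."""
    return [f"tok{x}" for x in outputs]


def serve_pipelined(batches):
    """Post-process batch N on the CPU while batch N+1 runs on the 'GPU'."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:
        in_flight = None
        for batch in batches:
            next_pass = gpu.submit(fake_forward, batch)   # queue pass N+1 first
            if in_flight is not None:
                # Wait for pass N, then post-process it while N+1 executes.
                results.append(postprocess(in_flight.result()))
            in_flight = next_pass
        if in_flight is not None:
            results.append(postprocess(in_flight.result()))
    return results
```

For example, `serve_pipelined([[1, 2], [3]])` returns `[["tok2", "tok4"], ["tok6"]]`. The key design point is that the next forward pass is submitted before the previous batch's post-processing starts, so the CPU work leaves the critical path.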

concepts/memory-bandwidth-bound-inference is a different GPU-side regime where HBM bandwidth (not Tensor Core compute, not CPU) gates throughput. Together with the compute-bound case, that gives three distinct regimes, each with its own lever:

Compute-bound — Tensor Cores saturated; FP8 or other precision-narrowing is the lever.
Memory-bound — HBM bandwidth saturated; weight quantisation / smaller models / batch consolidation is the lever.
CPU-bound — neither GPU axis is the bottleneck; CPU per-batch setup is. Multiprocessing the runtime is the lever.

A small fast model running on H100 at hundreds-to-thousands of QPS per pod is the regime where CPU-bound dominates. A 70B-parameter model on the same hardware would be memory-bound long before the CPU became the issue.
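One way to read the three regimes is as "whichever per-step cost is largest wins". The sketch below encodes that reading; `classify_regime` and its three timing inputs are hypothetical, the levers are copied from the list above, and real systems overlap these costs in ways a simple maximum does not capture.

```python
# Lever per regime, taken from the regime list in this section.
LEVERS = {
    "compute-bound": "FP8 or other precision narrowing",
    "memory-bound": "weight quantisation, smaller models, or batch consolidation",
    "cpu-bound": "multiprocessing the runtime",
}


def classify_regime(t_compute_ms: float, t_hbm_ms: float, t_cpu_prep_ms: float):
    """Return (regime, lever) for the largest of the three per-step costs.

    Assumption: the three costs are measured or estimated independently
    for one batch step; ties resolve in the order listed below.
    """
    times = {
        "compute-bound": t_compute_ms,
        "memory-bound": t_hbm_ms,
        "cpu-bound": t_cpu_prep_ms,
    }
    regime = max(times, key=times.get)
    return regime, LEVERS[regime]
```

With made-up figures, `classify_regime(1.0, 1.5, 4.0)` reports the CPU-bound regime, whereas `classify_regime(10.0, 60.0, 4.0)` reports memory-bound, matching the small-model versus 70B-parameter contrast above.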

When to look for this regime

Strong signals during pod-level benchmarking:

GPU utilisation < 70% during sustained load despite continuous request flow.
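SM activity showing idle gaps between forward passes (nvidia-smi reports only coarsely sampled utilisation, so a timeline profiler such as Nsight Systems is needed to see the individual gaps).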
Adding more requests does not increase throughput — a hallmark of CPU-side serialisation.
Per-process CPU usage at 100% on the request-dispatching process while peer cores idle.

If all four hold, the system is in the small-fast-model CPU-bound regime and the multiprocessing fix is the canonical response.
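The four signals can be collected into a simple checklist over benchmark results. This is a sketch only: `PodBenchmark` and its field names are hypothetical, the 70% GPU-utilisation threshold comes from the list above, and the remaining thresholds (2% throughput gain, 95% dispatcher CPU, 50% peer CPU) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodBenchmark:
    gpu_util_pct: float          # mean GPU utilisation under sustained load
    idle_gaps_seen: bool         # idle gaps between forward passes in a profile
    throughput_gain_pct: float   # throughput change after raising offered load
    dispatcher_cpu_pct: float    # CPU usage of the request-dispatching process
    peer_cpu_pct: float          # mean CPU usage of the other cores


def cpu_bound_signals(b: PodBenchmark) -> list[str]:
    """Names of the signals that fire for this benchmark run."""
    checks = {
        "gpu_underutilised": b.gpu_util_pct < 70.0,            # from the list above
        "idle_gaps": b.idle_gaps_seen,
        "throughput_flat": b.throughput_gain_pct < 2.0,        # assumed threshold
        "dispatcher_pinned": b.dispatcher_cpu_pct >= 95.0      # assumed thresholds
        and b.peer_cpu_pct < 50.0,
    }
    return [name for name, hit in checks.items() if hit]


def looks_cpu_bound(b: PodBenchmark) -> bool:
    """True only when all four signals hold, as the text requires."""
    return len(cpu_bound_signals(b)) == 4
```

Returning the individual signal names rather than a single boolean makes partial matches visible, which is useful when a pod shows flat throughput for some other reason.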

Seen in

sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — first explicit wiki canonicalisation of the regime; the Superhuman grammar-correction LLM (50 in / 50 out tokens, H100, 200K+ QPS) is the worked instance. The fix delivered ~20% additional throughput on top of FP8's 30% gain — confirming the regime is structurally distinct from the FP8/GPU-throughput bottleneck the same post addresses separately. "By having multiple CPU processes prepare and dispatch work to the GPU in parallel, we eliminated the single-process serialization bottleneck. This delivered another 20% additional throughput."

Caveats

The regime is workload-dependent — switching to longer contexts or heavier decode shapes can move the same model on the same hardware out of CPU-bound back into GPU-bound.
Multiprocessing has a cost: process-creation overhead, IPC, and duplicated memory across worker processes are all non-zero. It is worth it only when the GPU duty-cycle gap is large enough.
Fast-PEFT serving is a related Databricks workload where similar CPU-side optimisations apply (referenced in the Superhuman post but not detailed here).

concepts/cuda-throughput-budget — the GPU-side budget under which CPU-bound only matters once GPU work is "cheap enough"
concepts/inference-vs-training-workload-shape — the workload-shape framing within which small/fast/serving sits as a distinct point
concepts/memory-bandwidth-bound-inference — the different GPU-side bottleneck regime
concepts/effective-batch-size — what CPU-side multiprocessing is ultimately trying to keep high
concepts/async-cpu-gpu-pipelined-scheduling — the per-batch overlap fix that complements multiprocessing
patterns/multiprocessing-runtime-for-cpu-bound-serving — the canonical fix
systems/databricks-model-serving — canonical platform instance
systems/nvidia-h100 — canonical hardware where the regime tends to appear for small LLMs
systems/vllm — the engine the original Superhuman stack used, designed under the GPU-bound assumption