PATTERN Cited by 1 source
Multiprocessing runtime for CPU-bound serving¶
Pattern¶
Run multiple RPC server processes per pod, each preparing and dispatching work to the GPU in parallel, instead of the standard single-process serving loop. The fix targets a specific regime — CPU-bound serving for small fast models — where the GPU finishes its forward pass faster than a single CPU process can prepare the next batch. Multiprocessing eliminates the single-process serialisation bottleneck.
When to use¶
The pattern applies when all of the following hold (see concepts/cpu-bound-serving-small-fast-model):
- The model's GPU forward pass is fast (small model, short sequences, or both).
- The GPU is observed to be idle between batches despite continuous request flow.
- A single inference process is at 100% CPU on its core while the GPU is below its target duty cycle.
- Process-creation cost and cross-process IPC overhead are acceptable in the pod's resource budget.
When all four hold, multiprocessing the serving runtime is the canonical first-line fix and the largest CPU-side lever: the 2026-05-08 Databricks Model Serving / Superhuman post reports a ~20% throughput gain from the multiprocessing fix alone, on top of the 30% from FP8 quantisation.
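The four conditions above can be read as a simple predicate over pod measurements. The following is a minimal sketch only: the `PodSample` fields, the function name, and the threshold defaults (`gpu_target`, `cpu_saturated`) are hypothetical and do not come from the post. The "fast forward pass" condition is approximated here as the GPU forward time being shorter than the CPU prep time.

```python
from dataclasses import dataclass


@dataclass
class PodSample:
    """Hypothetical steady-state measurements for one serving pod."""
    gpu_forward_ms: float        # GPU forward-pass time per batch
    cpu_prep_ms: float           # CPU time to prepare one batch
    gpu_busy_fraction: float     # 0.0-1.0 GPU duty cycle under continuous load
    cpu_core_utilisation: float  # 0.0-1.0 for the serving process's core
    spare_cpu_cores: int         # cores the pod can give to extra processes


def multiprocessing_applies(sample, gpu_target=0.9, cpu_saturated=0.95):
    """Return True only if all four 'when to use' conditions hold.

    The thresholds are illustrative assumptions, not values from the source.
    """
    fast_gpu = sample.gpu_forward_ms < sample.cpu_prep_ms      # GPU outpaces prep
    gpu_idle = sample.gpu_busy_fraction < gpu_target           # idle between batches
    cpu_pegged = sample.cpu_core_utilisation >= cpu_saturated  # one core saturated
    budget_ok = sample.spare_cpu_cores >= 1                    # room for more processes
    return fast_gpu and gpu_idle and cpu_pegged and budget_ok
```

A pod with a 4 ms forward pass, 6 ms prep, 60% GPU duty cycle, a saturated core, and spare cores would pass the check; a GPU-bound pod would not.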
Mechanism¶
Standard single-process serving:
+---------------------------+      +----------+
| Single Python/Scala proc  | ---> |   GPU    |
|  - request queue          |      |          |
|  - batch prep             |      +----------+
|  - dispatch               |           ^
|  - post-process           |           | idles between
+---------------------------+           | batches
Multiprocessing serving:
+----------+   +----------+   +----------+
|   RPC    |   |   RPC    |   |   RPC    |
|  proc 1  |   |  proc 2  |   |  proc 3  |  ...
|  prep +  |   |  prep +  |   |  prep +  |
| dispatch |   | dispatch |   | dispatch |
+----+-----+   +-----+----+   +-----+----+
     |               |              |
     +-------+-------+--------------+
             v
         +-------+
         |  GPU  |  ← saturated by parallel dispatchers
         +-------+
Multiple CPU processes prepare batches independently and dispatch work to the GPU in parallel. The GPU sees a continuous stream of batches and stays saturated; no single process is the rate-limiter.
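As a rough illustration of the shape described above, the sketch below fans request batches out to several worker processes, each of which prepares its batch and then dispatches it. Everything here is a stand-in: `prepare_batch`, `gpu_forward`, `prep_and_dispatch`, and `serve` are hypothetical names, the "GPU" is a plain Python function, and GPU context ownership is not modelled. The post does not describe its runtime at this level of detail.

```python
from concurrent.futures import ProcessPoolExecutor


def prepare_batch(requests):
    """CPU-bound prep stand-in: 'tokenise' each request and pad to equal length."""
    token_ids = [[ord(ch) for ch in text] for text in requests]
    width = max(len(ids) for ids in token_ids)
    return [ids + [0] * (width - len(ids)) for ids in token_ids]


def gpu_forward(batch):
    """Stand-in for the GPU forward pass: one output value per request."""
    return [sum(row) for row in batch]


def prep_and_dispatch(requests):
    """What one worker process does per batch: prepare, then dispatch."""
    return gpu_forward(prepare_batch(requests))


def serve(request_batches, num_procs):
    """Handle batches on num_procs worker processes; results keep input order."""
    with ProcessPoolExecutor(max_workers=num_procs) as pool:
        return list(pool.map(prep_and_dispatch, request_batches))
```

When trying this out, call `serve` from under an `if __name__ == "__main__":` guard so that worker start-up does not re-run the calling code on platforms that spawn fresh interpreters.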
Why multiprocessing, not multithreading¶
Two reasons disclosed in the same post:
- Python's GIL. For inference engines with Python in the batch-prep critical path, threads cannot run CPU-bound prep code in parallel; processes can. Even for non-Python engines the point holds for any runtime where per-request CPU code holds shared state.
- CUDA synchronisation overhead. The post makes this observation about its C++ tensor-manipulation work: "We also explored parallel strategies (ThreadPool, OpenMP) but single-threaded C++ was optimal due to CUDA synchronization overhead." Threads sharing a CUDA context pay sync overhead that erodes the parallelism gain. Separate processes either each have their own context (pay startup cost once, no per-call sync) or use lower-overhead IPC to a CUDA-owning process.
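The GIL point can be checked with a small timing harness. This is a sketch under assumptions: `cpu_prep` is a hypothetical pure-Python loop standing in for batch preparation, `timed_map` is an illustrative helper, and the behaviour described applies to standard CPython builds that have a GIL.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_prep(n):
    """Pure-Python CPU-bound loop standing in for batch preparation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(executor_cls, work_items, workers):
    """Run cpu_prep over work_items on the given executor class.

    Returns (results, elapsed_seconds). Pass ThreadPoolExecutor or
    ProcessPoolExecutor as executor_cls to compare the two.
    """
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(cpu_prep, work_items))
    return results, time.perf_counter() - start
```

Comparing `timed_map(ThreadPoolExecutor, [2_000_000] * 4, 4)` with `timed_map(ProcessPoolExecutor, [2_000_000] * 4, 4)` (the latter from under an `if __name__ == "__main__":` guard) should show the thread run taking roughly as long as serial execution on a GIL build, while the process run can use several cores. Exact timings depend on the machine.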
Canonical wiki disclosure¶
The 2026-05-08 Superhuman post is the wiki's first canonical disclosure of the multiprocessing-runtime fix at production LLM serving altitude:
"Specifically the team introduced a multiprocessing runtime server. For most model serving workloads, a single process is more than fast enough to keep the GPU saturated, since the GPU is the bottleneck, not the CPU. But with a small, fast model, the GPU completes its forward pass faster than a single process can prepare the next batch, flipping the bottleneck to the CPU."
"The team addressed this by running multiple RPC server processes. By having multiple CPU processes prepare and dispatch work to the GPU in parallel, we eliminated the single-process serialization bottleneck. This delivered another 20% additional throughput."
The Superhuman grammar-correction LLM (~50 in / 50 out tokens, on H100) is the worked instance. The same post notes Databricks applied related CPU-side optimisations earlier in their work on fast PEFT serving — small adapter-merged models hit the same regime.
Operational considerations¶
- Process count tuning. Too few processes → CPU-bound bottleneck remains; too many → process-creation + memory overhead exceeds the parallelism gain. Tune empirically per pod shape.
- GPU context sharing. Multiple processes touching one GPU must agree on context ownership: either each owns its own (memory cost), or they IPC to a dedicated GPU-owning process. Choice shapes the rest of the architecture.
- Memory duplication. Worker processes that load the model weights independently waste GPU memory; weight sharing via shared-memory regions or MPS-style multi-tenancy is the fix.
- Backpressure. Multiple parallel dispatchers can outrun the GPU under bursty load; coordinate via per-pod request_concurrency budgets (see concepts/request-concurrency-as-autoscaling-signal).
- Observability. Per-process metrics need aggregation; a single-process view of the pod is no longer the truth.
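For process count tuning, a back-of-envelope model can give a starting point before empirical tuning. The sketch below assumes an idealised pod in which prep time and GPU time per batch are constant and IPC, memory, and context-switch overheads are zero; both function names and the model itself are illustrative assumptions, not something the post discloses.

```python
import math


def pod_batches_per_second(num_procs, prep_ms, gpu_ms):
    """Idealised pod throughput: the slower of the CPU side and the GPU side."""
    cpu_rate = num_procs * 1000.0 / prep_ms  # batches/s all processes can prepare
    gpu_rate = 1000.0 / gpu_ms               # batches/s the GPU can execute
    return min(cpu_rate, gpu_rate)


def min_procs_to_saturate(prep_ms, gpu_ms):
    """Smallest process count at which the GPU becomes the bottleneck again."""
    return max(1, math.ceil(prep_ms / gpu_ms))
```

With 6 ms of prep and a 4 ms forward pass, one process is CPU-limited and two are enough to make the GPU the limit in this model. Real pods pay overheads the model ignores, so the result is a lower bound to start the empirical sweep from.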
Complementary optimisations¶
The multiprocessing fix sits in a stack with:
- concepts/async-cpu-gpu-pipelined-scheduling — the time-axis parallelism (overlap CPU post-processing of batch N with GPU forward pass of batch N+1). Adds a few percent on top.
- Single-call C++ tensor manipulation — replace per-step Python tensor slicing/copying/filling with one C++ call. Single-threaded; ThreadPool / OpenMP attempts hurt due to CUDA sync overhead. Adds a few percent.
- Iterate-only-active-subset post-processing — restrict per-step CPU work to requests that need it, not the full batch. Adds a few percent.
Combined with FP8 quantisation (+30%) and the ~20% from multiprocessing, these optimisations bring the per-pod throughput improvement reported in the Superhuman post to a headline +60% (750 QPS → 1,200 QPS on H100).
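The reported figures can be checked against each other with a little arithmetic. The post does not say whether the individual gains compose multiplicatively or additively, so the sketch below works out both readings; the variable names are illustrative.

```python
baseline_qps = 750
final_qps = 1200
fp8_gain = 0.30              # reported FP8 gain
multiprocessing_gain = 0.20  # reported multiprocessing gain

# Headline ratio implied by the before/after QPS figures.
headline = final_qps / baseline_qps

# Reading 1 (assumption): gains multiply.
two_big_levers = (1 + fp8_gain) * (1 + multiprocessing_gain)
residual_multiplicative = headline / two_big_levers - 1

# Reading 2 (assumption): gains add, all relative to the original baseline.
residual_additive = (headline - 1) - fp8_gain - multiprocessing_gain
```

Under the multiplicative reading the remaining optimisations account for roughly 2.6% in total; under the additive reading they account for about 10 percentage points. Either is compatible with each of the three adding "a few percent".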
When not to use¶
- GPU-bound workloads (large models, long contexts) where the GPU is already the bottleneck. Multiprocessing the runtime adds cost without benefit.
- Inference engines that are already multi-process (vLLM with TP, certain fast-PEFT serving configs), where the gain has already been captured.
- Memory-constrained pods where duplicate model weights from multiple worker processes don't fit. Address weight sharing first or pick a different fix.
Failure modes and mitigations¶
- Process crash → request loss. Per-process supervisor restarts; queue depth limits prevent over-acceptance during recovery.
- GPU memory fragmentation across processes. Mitigation: shared GPU-owning process; MIG / MPS partitioning.
- Cross-process context-switch latency under high request rate. Mitigation: bind workers to cores, isolate from system noise.
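The core-binding mitigation might look like the sketch below on Linux. The helper names are hypothetical, the round-robin assignment is just one possible policy, and `os.sched_setaffinity` is only available on some platforms, so the pinning call is guarded.

```python
import os


def assign_cores(num_workers, cores):
    """Map worker index -> core id, round-robin over the available cores."""
    ordered = sorted(cores)
    return {worker: ordered[worker % len(ordered)] for worker in range(num_workers)}


def pin_current_process(core):
    """Pin the calling process to one core where the platform supports it.

    Returns True if the affinity was set, False if the call is unavailable.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})  # pid 0 means the calling process
        return True
    return False
```

Each worker would call `pin_current_process` with its assigned core at start-up; where `os.sched_getaffinity` exists, `os.sched_getaffinity(0)` is one way to obtain the set of cores the pod is allowed to use.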
Sibling patterns¶
- patterns/oom-aware-vm-restart-autoscaling — also a serving-runtime resilience pattern, different failure mode (OOM not CPU saturation).
- patterns/multi-signal-workload-aware-gateway-routing — upstream of this pattern; the gateway routes to the pod, the multiprocessing runtime saturates the GPU within the pod.
Seen in¶
- sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — canonical wiki instance. Multiple RPC server processes per pod delivered ~20% additional throughput on top of FP8's 30% for the Superhuman grammar-correction LLM at 200K+ QPS, on H100. "By having multiple CPU processes prepare and dispatch work to the GPU in parallel, we eliminated the single-process serialization bottleneck."
Caveats¶
- The post does not disclose the process count used in production, GPU-context sharing strategy, or the tuning curve.
- Multiprocessing alone is not a complete fix — it is paired with async scheduling and single-call C++ tensor ops to round out the per-pod throughput improvement.
- Generalisability beyond the small-fast-model regime is not claimed.
Related¶
- concepts/cpu-bound-serving-small-fast-model — the regime this pattern targets
- concepts/async-cpu-gpu-pipelined-scheduling — the complementary scheduler optimisation
- concepts/effective-batch-size — the throughput axis
- concepts/cuda-throughput-budget — the GPU-side budget
- concepts/model-flops-utilization — the saturation metric this pattern defends
- systems/databricks-model-serving — canonical platform instance
- systems/nvidia-h100 — canonical hardware
- systems/vllm — Superhuman's pre-migration engine, designed under the GPU-bound assumption