CONCEPT Cited by 2 sources
Async CPU-GPU pipelined scheduling¶
Definition¶
Async CPU-GPU pipelined scheduling is the inference-scheduler discipline of moving CPU-side post-processing of batch N off the critical path so it runs concurrently with the next GPU forward pass for batch N+1. The scheduler dispatches the next batch immediately rather than serialising on completion of the previous batch's CPU work.
It is the natural complement to patterns/multiprocessing-runtime-for-cpu-bound-serving: multiprocessing parallelises across pods/processes, while async scheduling parallelises along the time axis within a process. Both target the same root cause: a small/fast model on a fast GPU leaves the GPU idle while it waits on CPU work that does not have to be on the critical path.
The classical (synchronous) scheduler¶
A synchronous serving scheduler processes one batch at a time:
[GPU forward pass batch N] → [CPU post-process batch N (full)]
→ [CPU prepare batch N+1]
→ [GPU forward pass batch N+1] → …
The GPU sits idle for the duration of both the CPU post-processing and prepare stages of every batch. For GPU-bound workloads (large LLMs, long sequences) the GPU stage dominates and idle GPU between batches is a small share of total time. For small/fast models the CPU stages can rival or exceed the GPU stage; idle GPU between batches becomes the dominant inefficiency.
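As a rough illustration of why the regime matters, the idle share of a synchronous step can be computed from the three stage durations. This is a minimal sketch; the function name and the millisecond figures are illustrative assumptions, not measurements from any source.

```python
def gpu_idle_fraction_sync(prepare_ms: float, forward_ms: float, post_ms: float) -> float:
    """Fraction of one synchronous step during which the GPU has nothing to run.

    Assumes the three stages run strictly back to back, as in the diagram above.
    """
    step_ms = prepare_ms + forward_ms + post_ms
    return (prepare_ms + post_ms) / step_ms


# Hypothetical stage timings, chosen only to contrast the two regimes.
small_fast = gpu_idle_fraction_sync(prepare_ms=1.0, forward_ms=2.0, post_ms=1.0)
large_llm = gpu_idle_fraction_sync(prepare_ms=1.0, forward_ms=48.0, post_ms=1.0)

print(f"small/fast model: {small_fast:.0%} of the step is GPU idle")
print(f"large model:      {large_llm:.0%} of the step is GPU idle")
```

With the same CPU cost per batch, shrinking the forward pass is what turns a negligible idle share into the dominant one.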
The async pipelined scheduler¶
The pipelined version reorganises:
[GPU forward pass batch N] → [GPU forward pass batch N+1]
            \                            \
             v                            v
   [CPU post-process N]         [CPU post-process N+1]
    (off critical path)          (off critical path)
The CPU post-processing of batch N runs concurrently with the GPU forward pass for batch N+1 — "the scheduler dispatches N+1 immediately and handles N's post-processing in parallel."
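The overlap can be sketched with a single background worker that takes over post-processing while the main loop moves on to the next forward pass. This is an assumption-laden toy, not the scheduler from the quoted post: `gpu_forward` and `cpu_post_process` are hypothetical stand-ins that sleep instead of doing real work, and real overlap depends on the forward call being asynchronous or releasing the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def gpu_forward(batch):
    """Hypothetical stand-in for the GPU forward pass (sleeps instead of computing)."""
    time.sleep(0.01)
    return [x * 2 for x in batch]


def cpu_post_process(batch_id, outputs):
    """Hypothetical stand-in for detokenisation and response formatting."""
    time.sleep(0.01)
    return batch_id, [f"tok{o}" for o in outputs]


def run_pipelined(batches):
    """Dispatch each forward pass without waiting for the previous batch's post-processing."""
    futures = []
    # A single worker keeps post-processing in submission order.
    with ThreadPoolExecutor(max_workers=1) as post_pool:
        for batch_id, batch in enumerate(batches):
            outputs = gpu_forward(batch)  # critical path
            # Hand batch N to the worker, then loop straight into batch N+1.
            futures.append(post_pool.submit(cpu_post_process, batch_id, outputs))
        return [f.result() for f in futures]


results = run_pipelined([[1, 2], [3]])
print(results)
```

The sketch places no bound on how many batches can be awaiting post-processing; a production scheduler would need one.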
A second optimisation in the same family: post-processing iterates only over the relevant subset of requests rather than the full batch. Many batches have a long tail of requests that don't need per-batch-iteration post-processing every step; restricting iteration to the active subset shrinks CPU work proportionally.
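A minimal sketch of the subset idea, assuming the "relevant" requests are those that produced a token in the current step. The post does not define the subset, so the `Request` type, the `active_ids` list, and the end-of-sequence convention here are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    tokens: list = field(default_factory=list)
    finished: bool = False


def post_process_active(requests, active_ids, new_tokens, eos_token=0):
    """Update only the requests listed in active_ids; return how many were touched."""
    touched = 0
    for rid in active_ids:  # loop length is the subset size, not the batch size
        req = requests[rid]
        tok = new_tokens[rid]
        req.tokens.append(tok)
        if tok == eos_token:
            req.finished = True
        touched += 1
    return touched


# Eight requests in the batch, but only two produced a token this step.
requests = {i: Request(i) for i in range(8)}
touched = post_process_active(requests, active_ids=[1, 5], new_tokens={1: 42, 5: 0})
print(touched, requests[1].tokens, requests[5].finished)
```

The CPU cost then scales with the number of active requests rather than with the batch size.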
Canonical wiki disclosure¶
The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first canonical disclosure of this scheduler shape for production LLM serving:
"We moved CPU-side post-processing off the critical path so it runs concurrently with the next GPU forward pass. Rather than finishing all post-processing for batch N before launching batch N+1, the scheduler dispatches N+1 immediately and handles N's post-processing in parallel. Post-processing also iterates only over the relevant subset of requests rather than the full batch. This resulted in the next forward pass starting sooner."
The reported gain is "a few percentage points" on top of the multiprocessing fix's ~20% — meaningful but not dominant. Async scheduling rounds out the CPU-side optimisations, and is most useful when multiprocessing has already removed the easy gains.
Why this works¶
The CPU and GPU are separate execution engines with separate resources. A request's lifecycle has stages naturally bound to one or the other:
- Prepare (CPU): tokenisation, embedding lookup, batching metadata, padding mask construction.
- Forward pass (GPU): the actual transformer compute.
- Post-process (CPU): detokenisation, response formatting, sampling decisions for next-token continuation, telemetry.
A synchronous scheduler treats the three as a single critical chain. The pipelined scheduler treats them as producer-consumer stages with the GPU as the dominant-cost-but-fixed-throughput resource that should never be allowed to idle.
Sibling optimisation: single-call C++ tensor manipulation¶
The same post discloses a complementary CPU-side optimisation that attacks per-batch overhead:
"We replaced Python-level tensor slicing, copying, and filling at the start of each CUDA graph decode step with a single C++ call. We also explored parallel strategies (ThreadPool, OpenMP) but single-threaded C++ was optimal due to CUDA synchronization overhead. This cut GPU idle slightly per forward pass."
The lesson: CUDA synchronisation overhead can make a parallel CPU implementation slower than single-threaded C++. The single-threaded path avoids per-thread CUDA sync calls; ThreadPool / OpenMP would gain CPU parallelism but lose more on the sync side. Both optimisations contribute "a few percentage points" each.
When the optimisation matters most¶
- Small / fast models where forward-pass time approaches CPU per-batch overhead (see concepts/cpu-bound-serving-small-fast-model).
- High-QPS serving where idle GPU between batches accumulates into measurable throughput loss.
- Models with significant per-step post-processing (sampling, beam-search bookkeeping, structured-output validation).
For workloads with large GPU stages relative to CPU stages (70B-parameter LLMs, long-context prefills) the async pipelining gain is small: the GPU stage already dominates each step, so the idle GPU time between batches is a small fraction of the total and there is little left to recover.
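The contrast between the two regimes can be put into numbers with a simple steady-state model. The model is an assumption made for illustration: only post-processing is overlapped, prepare stays on the critical path, the post-processing worker has its own core, and the stage timings are hypothetical.

```python
def step_time_sync(prepare_ms, forward_ms, post_ms):
    """All three stages on the critical path."""
    return prepare_ms + forward_ms + post_ms


def step_time_pipelined(prepare_ms, forward_ms, post_ms):
    """Post-processing overlapped with the next step.

    Steady-state step time is bounded by whichever is slower:
    the critical path (prepare + forward) or the post-processing worker.
    """
    return max(prepare_ms + forward_ms, post_ms)


def pipelining_speedup(prepare_ms, forward_ms, post_ms):
    return step_time_sync(prepare_ms, forward_ms, post_ms) / step_time_pipelined(
        prepare_ms, forward_ms, post_ms
    )


# Hypothetical timings: same CPU cost, very different forward-pass cost.
speedup_small = pipelining_speedup(prepare_ms=1.0, forward_ms=2.0, post_ms=1.0)
speedup_large = pipelining_speedup(prepare_ms=1.0, forward_ms=48.0, post_ms=1.0)

print(f"small/fast model: {speedup_small:.2f}x")
print(f"large model:      {speedup_large:.2f}x")
```

Under this model the achievable speedup is capped by the share of the step that post-processing occupied, which is why the gain shrinks as the forward pass grows.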
Seen in¶
- sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — first canonical wiki disclosure as one of the CPU-side optimisations under the small-fast-model regime. Each of "single C++ call" + "async scheduling" + "iterate-only-active-subset" contributes a few percent on top of the FP8 + multiprocessing gains. Combined with the dominant 30% (FP8) + 20% (multiprocessing) levers, they round out the headline 60% per-pod throughput improvement (750 → 1,200 QPS on H100).
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — same overlap shape at video-frame-decode altitude rather than LLM-batch-serving altitude. The Synthesia / AWS Asynchronous Frame Generation Pipeline overlaps chunk-N's D2H + host-side file write with chunk-N+1's GPU compute — the same producer-consumer-staging logic, but at the granularity of video frames instead of LLM batches, and using explicit dual CUDA streams + pinned host buffers + CUDA events + a dedicated worker thread instead of an LLM scheduler's batch-dispatch logic. Result: GPU kernel utilisation rises 82% → 99.9% on the Wan 2.2 14B VAE decoder on g7e.2xlarge. The two altitudes share the structural insight — the GPU is the dominant-cost-but-fixed-throughput resource that should never be allowed to idle while any other lane has work it could do — and differ only in granularity and scheduling primitive.
Caveats¶
- Reordering risk: async post-processing for batch N must complete before its responses can be returned. Care is needed to preserve per-request response ordering and to surface errors on the right request.
- Memory pressure: multiple in-flight batches require the scheduler to hold their intermediate state simultaneously, which is not free for memory-constrained serving.
- Sampling correctness: in autoregressive generation, each request's next token depends on its previously sampled tokens, so the pipelining must not launch a step for a request before the tokens that step needs are available.
- The post does not disclose the exact scheduler implementation or the buffering depth.
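The first two caveats can be illustrated with a bounded in-flight queue that returns results in batch order. This is a sketch under stated assumptions: `run_bounded`, `max_in_flight`, and the callables passed in are hypothetical names, the depth of 2 is arbitrary because the post does not disclose one, and the sketch does not address token-level sampling dependencies.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def run_bounded(batches, forward, post_process, max_in_flight=2):
    """Yield post-processed results in batch order.

    At most max_in_flight batches are held awaiting post-processing.
    """
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=1) as post_pool:
        for batch in batches:
            # Back-pressure: drain the oldest batch before admitting a new one.
            if len(in_flight) >= max_in_flight:
                # .result() re-raises any post-processing error for that batch.
                yield in_flight.popleft().result()
            outputs = forward(batch)
            in_flight.append(post_pool.submit(post_process, outputs))
        # Flush whatever is still pending, oldest first.
        while in_flight:
            yield in_flight.popleft().result()


ordered = list(
    run_bounded(
        [[1], [2], [3]],
        forward=lambda b: [x + 1 for x in b],  # stand-in for the GPU step
        post_process=lambda o: sum(o),         # stand-in for CPU post-processing
        max_in_flight=2,
    )
)
print(ordered)
```

Draining from the front of the queue keeps responses in dispatch order, and the depth limit caps how much intermediate state is held at once, at the cost of stalling the forward loop when post-processing falls behind.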
Related¶
- concepts/cpu-bound-serving-small-fast-model — the regime in which this optimisation matters
- concepts/effective-batch-size — the throughput axis the scheduler is trying to maximise
- concepts/batching-latency-tradeoff — the broader batching-scheduler design space
- concepts/cuda-throughput-budget — the underlying GPU duty-cycle metric this optimisation defends
- concepts/cuda-stream — explicit-stream primitive for the same overlap at finer granularity
- concepts/pinned-memory — required for fully-async D2H at video-frame altitude
- concepts/gpu-kernel-utilization — the saturation metric this concept's optimisations target (82% → 99.9% in the video-frame-altitude sibling)
- patterns/multiprocessing-runtime-for-cpu-bound-serving — the parallelism-axis sibling fix
- patterns/asynchronous-frame-generation-pipeline — the video-frame-altitude sibling pattern
- systems/databricks-model-serving — canonical platform instance
- systems/nvidia-h100 — canonical fast-GPU on which the regime appears
- systems/aws-ec2-g7e — canonical inference substrate for the video-frame-altitude sibling