CONCEPT Cited by 2 sources

Async CPU-GPU pipelined scheduling¶

Definition¶

Async CPU-GPU pipelined scheduling is the inference-scheduler discipline of moving CPU-side post-processing of batch N off the critical path so it runs concurrently with the next GPU forward pass for batch N+1. The scheduler dispatches the next batch immediately rather than serialising on completion of the previous batch's CPU work.

It is the natural complement to patterns/multiprocessing-runtime-for-cpu-bound-serving: multiprocessing parallelises across pods/processes, whereas async scheduling parallelises along the time axis within a process. Both target the same root cause: with a small/fast model on a fast GPU, the GPU sits idle waiting on CPU work that does not have to be on the critical path.

The classical (synchronous) scheduler¶

A synchronous serving scheduler processes one batch at a time:

[GPU forward pass batch N] → [CPU post-process batch N (full)]
                            → [CPU prepare batch N+1]
                            → [GPU forward pass batch N+1] → …

The GPU sits idle for the duration of both the CPU post-processing and prepare stages of every batch. For GPU-bound workloads (large LLMs, long sequences) the GPU stage dominates and idle GPU between batches is a small share of total time. For small/fast models the CPU stages can rival or exceed the GPU stage; idle GPU between batches becomes the dominant inefficiency.
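The effect of the stage ratio can be made concrete with a small model. The sketch below assumes each step is simply the sum of the three stages, and the millisecond figures are hypothetical values chosen for illustration, not measurements from any source.

```python
def sync_gpu_utilisation(gpu_ms: float, post_ms: float, prepare_ms: float) -> float:
    """Fraction of a synchronous step during which the GPU is busy."""
    step_ms = gpu_ms + post_ms + prepare_ms  # stages run strictly back to back
    return gpu_ms / step_ms


# Hypothetical large-model step: the GPU stage dominates.
large = sync_gpu_utilisation(gpu_ms=90.0, post_ms=2.0, prepare_ms=2.0)
# Hypothetical small/fast-model step: the CPU stages rival the GPU stage.
small = sync_gpu_utilisation(gpu_ms=4.0, post_ms=2.0, prepare_ms=2.0)

print(f"large model: {large:.0%} busy, small model: {small:.0%} busy")
# prints "large model: 96% busy, small model: 50% busy"
```

With identical CPU costs, the same scheduler leaves the GPU idle about 4% of the time in the first case and half the time in the second.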

The async pipelined scheduler¶

The pipelined version reorganises:

[GPU forward pass batch N]  →  [GPU forward pass batch N+1]
       \                            \
        v                            v
        [CPU post-process N]         [CPU post-process N+1]
        (off critical path)           (off critical path)

The CPU post-processing of batch N runs concurrently with the GPU forward pass for batch N+1 — "the scheduler dispatches N+1 immediately and handles N's post-processing in parallel."
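A minimal sketch of this dispatch loop follows. It is an illustration under stated assumptions, not the Databricks implementation, which the post does not disclose. `gpu_forward`, `cpu_postprocess` and `run_pipelined` are hypothetical names, and the forward pass is a plain Python stand-in so the example runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def gpu_forward(batch):
    """Hypothetical stand-in for the GPU forward pass: doubles each input."""
    return [x * 2 for x in batch]


def cpu_postprocess(raw_outputs):
    """Hypothetical stand-in for detokenisation and response formatting."""
    return [f"tok{value}" for value in raw_outputs]


def run_pipelined(batches):
    """Dispatch each forward pass without waiting for the previous batch's post-processing."""
    pending = []
    # A single worker keeps post-processing in submission order.
    with ThreadPoolExecutor(max_workers=1) as post_pool:
        for batch in batches:
            raw_outputs = gpu_forward(batch)  # critical path: the GPU stage
            # Hand batch N's outputs to the worker and loop straight on to batch N+1.
            pending.append(post_pool.submit(cpu_postprocess, raw_outputs))
        # Collect in dispatch order so results stay aligned with their batches.
        return [future.result() for future in pending]


print(run_pipelined([[1, 2], [3]]))
# prints [['tok2', 'tok4'], ['tok6']]
```

In CPython, real overlap depends on the forward call releasing the GIL while the GPU works, which the pure-Python stand-in here does not do. The sketch shows the control flow only.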

A second optimisation in the same family: post-processing iterates only over the relevant subset of requests rather than the full batch. On any given step, many requests in a batch need no post-processing; restricting iteration to the active subset shrinks CPU work in proportion to the share of requests skipped.
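The subset iteration might look like the sketch below. The post does not say what makes a request "relevant", so the `active_indices` list is an assumption: it stands for whatever bookkeeping the scheduler already maintains, so that finding the active requests does not itself require a full-batch scan.

```python
def postprocess_active(token_buffers, active_indices, new_tokens):
    """Append this step's token only for requests in active_indices; return the count touched."""
    for i in active_indices:
        token_buffers[i].append(new_tokens[i])
    return len(active_indices)


# Four requests in the batch; hypothetically only requests 0 and 2 need work this step.
token_buffers = [[11], [12], [13], [14]]
touched = postprocess_active(
    token_buffers, active_indices=[0, 2], new_tokens=[21, 22, 23, 24]
)

print(touched, token_buffers)
# prints 2 [[11, 21], [12], [13, 23], [14]]
```

The loop cost scales with the number of active requests rather than the batch size.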

Canonical wiki disclosure¶

The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first canonical disclosure of this scheduler shape for production LLM serving:

"We moved CPU-side post-processing off the critical path so it runs concurrently with the next GPU forward pass. Rather than finishing all post-processing for batch N before launching batch N+1, the scheduler dispatches N+1 immediately and handles N's post-processing in parallel. Post-processing also iterates only over the relevant subset of requests rather than the full batch. This resulted in the next forward pass starting sooner."

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

The reported gain is "a few percentage points" on top of the multiprocessing fix's ~20% — meaningful but not dominant. Async scheduling rounds out the CPU-side optimisations, and is most useful when multiprocessing has already removed the easy gains.

Why this works¶

The CPU and GPU are separate execution engines with separate resources. A request's lifecycle has stages naturally bound to one or the other:

Prepare (CPU): tokenisation, embedding lookup, batching metadata, padding mask construction.
Forward pass (GPU): the actual transformer compute.
Post-process (CPU): detokenisation, response formatting, sampling decisions for next-token continuation, telemetry.

A synchronous scheduler treats the three as a single critical chain. The pipelined scheduler treats them as producer-consumer stages with the GPU as the dominant-cost-but-fixed-throughput resource that should never be allowed to idle.

Sibling optimisation: single-call C++ tensor manipulation¶

The same post discloses a complementary CPU-side optimisation that attacks per-batch overhead:

"We replaced Python-level tensor slicing, copying, and filling at the start of each CUDA graph decode step with a single C++ call. We also explored parallel strategies (ThreadPool, OpenMP) but single-threaded C++ was optimal due to CUDA synchronization overhead. This cut GPU idle slightly per forward pass."

The lesson: CUDA synchronisation overhead can make a parallel CPU implementation slower than single-threaded C++. The post gives no further detail, but the implication is that the ThreadPool / OpenMP variants lost more to synchronisation than they gained from CPU parallelism. Like async scheduling, the single-call change contributes "a few percentage points".

When the optimisation matters most¶

Small / fast models where forward-pass time approaches CPU per-batch overhead (see concepts/cpu-bound-serving-small-fast-model).
High-QPS serving where idle GPU between batches accumulates into measurable throughput loss.
Models with significant per-step post-processing (sampling, beam-search bookkeeping, structured-output validation).

For workloads with large GPU stages relative to CPU stages (70B-parameter LLMs, long-context prefills) the async pipelining gain is small: the GPU stage already dominates each step, so the idle GPU time that pipelining can reclaim is a small share of the total.
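A back-of-envelope bound shows why. The sketch assumes perfect overlap, so a pipelined step costs the longer of the two stages while a synchronous step costs their sum. The timings are hypothetical, and real gains will be smaller than this upper bound.

```python
def pipelining_speedup(gpu_ms: float, cpu_ms: float) -> float:
    """Idealised speedup: synchronous step (gpu + cpu) over pipelined step (max of the two)."""
    return (gpu_ms + cpu_ms) / max(gpu_ms, cpu_ms)


# Hypothetical large-GPU-stage workload: little to reclaim.
print(f"{pipelining_speedup(gpu_ms=100.0, cpu_ms=2.0):.2f}x")  # prints "1.02x"
# Hypothetical small/fast model: CPU work is comparable to the GPU stage.
print(f"{pipelining_speedup(gpu_ms=4.0, cpu_ms=3.0):.2f}x")    # prints "1.75x"
```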

Seen in¶

sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together — first canonical wiki disclosure as one of the CPU-side optimisations under the small-fast-model regime. Each of "single C++ call" + "async scheduling" + "iterate-only-active-subset" contributes a few percent on top of the FP8 + multiprocessing gains. Combined with the dominant 30% (FP8) + 20% (multiprocessing) levers, they round out the headline 60% per-pod throughput improvement (750 → 1,200 QPS on H100).
sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — same overlap shape at video-frame-decode altitude rather than LLM-batch-serving altitude. The Synthesia / AWS Asynchronous Frame Generation Pipeline overlaps chunk-N's D2H + host-side file write with chunk-N+1's GPU compute — the same producer-consumer-staging logic, but at the granularity of video frames instead of LLM batches, and using explicit dual CUDA streams + pinned host buffers + CUDA events + a dedicated worker thread instead of an LLM scheduler's batch-dispatch logic. Result: GPU kernel utilisation rises 82% → 99.9% on the Wan 2.2 14B VAE decoder on g7e.2xlarge. The two altitudes share the structural insight — the GPU is the dominant-cost-but-fixed-throughput resource that should never be allowed to idle while any other lane has work it could do — and differ only in granularity and scheduling primitive.
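The Databricks figures in the first entry can be sanity-checked with simple arithmetic. The post does not state how the individual gains combine, so the sketch below tries two assumptions (compounding and additive) and computes what each leaves for the smaller CPU-side optimisations.

```python
baseline_qps = 750
final_qps = 1200

headline = final_qps / baseline_qps      # overall throughput ratio per pod
fp8_gain, multiprocessing_gain = 0.30, 0.20

# Assumption 1: gains compound multiplicatively.
compounded = (1 + fp8_gain) * (1 + multiprocessing_gain)
residual_compounded = headline / compounded - 1

# Assumption 2: gains add as percentage points of the baseline.
residual_additive = (headline - 1) - (fp8_gain + multiprocessing_gain)

print(f"headline: {headline - 1:.0%}")
print(f"left for smaller optimisations if compounding: {residual_compounded:.1%}")
print(f"left for smaller optimisations if additive: {residual_additive:.0%}")
```

Under either reading, the remainder is small next to the FP8 and multiprocessing levers, consistent with "a few percent" per optimisation. The source figures are rounded, so the exact split should not be over-interpreted.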

Caveats¶

Reordering risk: async post-processing for batch N must complete before its responses can be returned. Care is needed to preserve per-request response ordering and error paths.
Memory pressure: multiple in-flight batches require the scheduler to hold their intermediate state simultaneously, which is not free for memory-constrained serving.
Sampling correctness: in autoregressive generation, each request's next token depends on its previous tokens, so the pipelining must not break token-level data dependencies.
The post does not disclose the exact scheduler implementation or the buffering depth.
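Since the buffering depth is undisclosed, the sketch below shows one plausible way to address the first two caveats, not the actual scheduler. A bounded queue caps how many batches wait for post-processing, and a single worker keeps results and failures attached to their batch index. `run_bounded`, `max_waiting` and `flaky_post` are hypothetical names.

```python
import queue
import threading


def run_bounded(batches, gpu_forward, cpu_postprocess, max_waiting=2):
    """Pipeline with a bounded hand-off queue; returns (results, errors) keyed by batch index."""
    handoff = queue.Queue(maxsize=max_waiting)  # caps batches waiting for post-processing
    results, errors = {}, {}

    def worker():
        while True:
            item = handoff.get()
            if item is None:  # sentinel: no more batches
                return
            index, raw = item
            try:
                results[index] = cpu_postprocess(raw)
            except Exception as exc:  # keep the failure attached to its batch
                errors[index] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        for index, batch in enumerate(batches):
            # put() blocks once max_waiting batches are queued: back-pressure on dispatch.
            handoff.put((index, gpu_forward(batch)))
    finally:
        handoff.put(None)  # always release the worker, even if a forward pass raised
        thread.join()
    return results, errors


def flaky_post(raw):
    """Hypothetical post-processor that fails on a zero output."""
    if 0 in raw:
        raise ValueError("bad output")
    return sum(raw)


results, errors = run_bounded([[1, 2], [0], [3]], lambda batch: batch, flaky_post)
print(results, sorted(errors))
# prints {0: 3, 2: 3} [1]
```

A smaller `max_waiting` trades some overlap for lower memory held in flight. The third caveat, token-level dependencies, is not modelled here.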

Related¶

concepts/cpu-bound-serving-small-fast-model — the regime in which this optimisation matters
concepts/effective-batch-size — the throughput axis the scheduler is trying to maximise
concepts/batching-latency-tradeoff — the broader batching-scheduler design space
concepts/cuda-throughput-budget — the underlying GPU duty-cycle metric this optimisation defends
concepts/cuda-stream — explicit-stream primitive for the same overlap at finer granularity
concepts/pinned-memory — required for fully-async D2H at video-frame altitude
concepts/gpu-kernel-utilization — the saturation metric this concept's optimisations target (82% → 99.9% in the video-frame-altitude sibling)
patterns/multiprocessing-runtime-for-cpu-bound-serving — the parallelism-axis sibling fix
patterns/asynchronous-frame-generation-pipeline — the video-frame-altitude sibling pattern
systems/databricks-model-serving — canonical platform instance
systems/nvidia-h100 — canonical fast-GPU on which the regime appears
systems/aws-ec2-g7e — canonical inference substrate for the video-frame-altitude sibling