
CONCEPT Cited by 2 sources

Async CPU-GPU pipelined scheduling

Definition

Async CPU-GPU pipelined scheduling is the inference-scheduler discipline of moving CPU-side post-processing of batch N off the critical path so it runs concurrently with the next GPU forward pass for batch N+1. The scheduler dispatches the next batch immediately rather than serialising on completion of the previous batch's CPU work.

It is the natural complement to patterns/multiprocessing-runtime-for-cpu-bound-serving: multiprocessing parallelises across pods/processes, while async scheduling parallelises along the time axis within a process. Both target the same root cause: a small/fast model on a fast GPU leaves the GPU idle while it waits on CPU work that doesn't have to be on the critical path.

The classical (synchronous) scheduler

A synchronous serving scheduler processes one batch at a time:

[GPU forward pass batch N] → [CPU post-process batch N (full)]
                            → [CPU prepare batch N+1]
                            → [GPU forward pass batch N+1] → …

The GPU sits idle for the duration of both the CPU post-processing and prepare stages of every batch. For GPU-bound workloads (large LLMs, long sequences) the GPU stage dominates and idle GPU between batches is a small share of total time. For small/fast models the CPU stages can rival or exceed the GPU stage; idle GPU between batches becomes the dominant inefficiency.
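The effect of the stage ratio can be made concrete with a little arithmetic. The sketch below computes the share of wall-clock time the GPU is busy under the synchronous schedule drawn above. The function name and all stage times are hypothetical, chosen only to contrast the two regimes; they are not figures from any source.

```python
def gpu_utilisation_sync(t_gpu_ms: float, t_post_ms: float, t_prep_ms: float) -> float:
    """Fraction of each step the GPU is busy when the three stages run back to back."""
    step_ms = t_gpu_ms + t_post_ms + t_prep_ms
    return t_gpu_ms / step_ms


# Hypothetical stage times (milliseconds), for illustration only.
large_model = gpu_utilisation_sync(t_gpu_ms=200.0, t_post_ms=2.0, t_prep_ms=2.0)
small_model = gpu_utilisation_sync(t_gpu_ms=4.0, t_post_ms=2.0, t_prep_ms=2.0)

print(f"large model: {large_model:.1%} GPU busy")  # 98.0%
print(f"small model: {small_model:.1%} GPU busy")  # 50.0%
```

With identical CPU stages, the GPU-bound case loses about 2% of its time to idle GPU, while the small-model case loses half.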

The async pipelined scheduler

The pipelined version reorganises:

[GPU forward pass batch N]  →  [GPU forward pass batch N+1]
       \                            \
        v                            v
        [CPU post-process N]         [CPU post-process N+1]
        (off critical path)           (off critical path)

The CPU post-processing of batch N runs concurrently with the GPU forward pass for batch N+1 — "the scheduler dispatches N+1 immediately and handles N's post-processing in parallel."
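A minimal sketch of the two loop shapes, assuming a single worker thread for post-processing. The names `gpu_forward`, `cpu_post_process`, `run_sync`, and `run_pipelined` are hypothetical, and both stages are simulated with `time.sleep`; this is not the scheduler described in the Databricks post, whose implementation is not disclosed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def gpu_forward(batch):
    """Stand-in for a GPU forward pass (hypothetical; sleeps instead of computing)."""
    time.sleep(0.02)
    return [x * 2 for x in batch]


def cpu_post_process(raw):
    """Stand-in for CPU-side post-processing (hypothetical)."""
    time.sleep(0.02)
    return [f"tok{x}" for x in raw]


def run_sync(batches):
    # Batch N's post-processing finishes before batch N+1's forward pass starts.
    return [cpu_post_process(gpu_forward(b)) for b in batches]


def run_pipelined(batches):
    # One worker thread handles post-processing, so the main loop can start the
    # forward pass for batch N+1 while batch N is still being post-processed.
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = []
        for b in batches:
            raw = gpu_forward(b)                                # critical path
            futures.append(pool.submit(cpu_post_process, raw))  # off critical path
        # Collect in submission order so per-batch ordering is preserved.
        return [f.result() for f in futures]


batches = [[1, 2], [3, 4], [5, 6]]

start = time.perf_counter()
sync_out = run_sync(batches)
sync_s = time.perf_counter() - start

start = time.perf_counter()
pipe_out = run_pipelined(batches)
pipe_s = time.perf_counter() - start

print(f"sync: {sync_s:.3f}s  pipelined: {pipe_s:.3f}s  same output: {sync_out == pipe_out}")
```

On this toy workload the pipelined run should finish sooner, because only the last batch's post-processing remains exposed. The sleeps release Python's global interpreter lock, which is what lets the two stages overlap here; real CPU-bound post-processing written in pure Python would not overlap this cleanly inside one process.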

A second optimisation in the same family: post-processing iterates only over the relevant subset of requests rather than the full batch. Many batches have a long tail of requests that don't need per-batch-iteration post-processing every step; restricting iteration to the active subset shrinks CPU work proportionally.
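The subset idea can be sketched as follows. The post does not say what makes a request "relevant", so this example assumes it means "not yet finished"; the `Request` type, its `finished` flag, and both function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    finished: bool = False  # hypothetical flag: request needs no further work
    tokens: list = field(default_factory=list)


def post_process_full(batch, new_tokens):
    """Visits every slot in the batch, including slots with nothing to do."""
    visited = 0
    for req in batch:
        visited += 1
        if not req.finished:
            req.tokens.append(new_tokens[req.request_id])
    return visited


def post_process_active(active_ids, by_id, new_tokens):
    """Visits only the requests that still need work this step."""
    for rid in active_ids:
        by_id[rid].tokens.append(new_tokens[rid])
    return len(active_ids)


# A batch of 8 slots in which only the first 3 requests are still active.
batch = [Request(i, finished=(i >= 3)) for i in range(8)]
by_id = {r.request_id: r for r in batch}
# Built by a scan here for brevity; a scheduler would maintain it incrementally.
active_ids = [r.request_id for r in batch if not r.finished]
new_tokens = {i: f"t{i}" for i in range(8)}

full_visits = post_process_full(batch, new_tokens)
active_visits = post_process_active(active_ids, by_id, new_tokens)
print(full_visits, active_visits)  # 8 3
```

Both functions append the same tokens to the same requests; the second simply does not pay the per-slot loop cost for the five finished requests.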

Canonical wiki disclosure

The 2026-05-08 Databricks Model Serving / Superhuman post is the wiki's first canonical disclosure of this scheduler shape for production LLM serving:

"We moved CPU-side post-processing off the critical path so it runs concurrently with the next GPU forward pass. Rather than finishing all post-processing for batch N before launching batch N+1, the scheduler dispatches N+1 immediately and handles N's post-processing in parallel. Post-processing also iterates only over the relevant subset of requests rather than the full batch. This resulted in the next forward pass starting sooner."

(Source: sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together)

The reported gain is "a few percentage points" on top of the multiprocessing fix's ~20% — meaningful but not dominant. Async scheduling rounds out the CPU-side optimisations, and is most useful when multiprocessing has already removed the easy gains.

Why this works

The CPU and GPU are separate execution engines with separate resources. A request's lifecycle has stages naturally bound to one or the other:

  • Prepare (CPU): tokenisation, embedding lookup, batching metadata, padding mask construction.
  • Forward pass (GPU): the actual transformer compute.
  • Post-process (CPU): detokenisation, response formatting, sampling decisions for next-token continuation, telemetry.

A synchronous scheduler treats the three as a single critical chain. The pipelined scheduler treats them as producer-consumer stages with the GPU as the dominant-cost-but-fixed-throughput resource that should never be allowed to idle.

Sibling optimisation: single-call C++ tensor manipulation

The same post discloses a complementary CPU-side optimisation that attacks per-batch overhead:

"We replaced Python-level tensor slicing, copying, and filling at the start of each CUDA graph decode step with a single C++ call. We also explored parallel strategies (ThreadPool, OpenMP) but single-threaded C++ was optimal due to CUDA synchronization overhead. This cut GPU idle slightly per forward pass."

The lesson: CUDA synchronisation overhead can make a parallel CPU implementation slower than single-threaded C++. The single-threaded path avoids per-thread CUDA sync calls; ThreadPool / OpenMP would gain CPU parallelism but lose more on the sync side. Both optimisations contribute "a few percentage points" each.

When the optimisation matters most

  • Small / fast models where forward-pass time approaches CPU per-batch overhead (see concepts/cpu-bound-serving-small-fast-model).
  • High-QPS serving where idle GPU between batches accumulates into measurable throughput loss.
  • Models with significant per-step post-processing (sampling, beam-search bookkeeping, structured-output validation).

For workloads with large GPU stages relative to CPU stages (70B-parameter LLMs, long-context prefills) the async pipelining gain is small: the GPU stage already dominates each step, so the CPU work that pipelining would hide accounts for only a small share of total time.

Seen in

  • sources/2026-05-08-databricks-how-superhuman-and-databricks-built-a-200k-qps-inference-platform-together: first canonical wiki disclosure as one of the CPU-side optimisations under the small-fast-model regime. Each of "single C++ call" + "async scheduling" + "iterate-only-active-subset" contributes a few percent on top of the FP8 + multiprocessing gains. Combined with the dominant 30% (FP8) + 20% (multiprocessing) levers, they round out the headline 60% per-pod throughput improvement (750 → 1,200 QPS on H100).
  • sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances: the same overlap shape at video-frame-decode altitude rather than LLM-batch-serving altitude. The Synthesia / AWS Asynchronous Frame Generation Pipeline overlaps chunk N's device-to-host (D2H) copy and host-side file write with chunk N+1's GPU compute. The producer-consumer staging logic is the same, but it operates at the granularity of video frames instead of LLM batches, and it uses explicit dual CUDA streams + pinned host buffers + CUDA events + a dedicated worker thread instead of an LLM scheduler's batch-dispatch logic. Result: GPU kernel utilisation rises 82% → 99.9% on the Wan 2.2 14B VAE decoder on g7e.2xlarge. The two altitudes share the structural insight (the GPU is the dominant-cost-but-fixed-throughput resource that should never be allowed to idle while any other lane has work it could do) and differ only in granularity and scheduling primitive.
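The Databricks figures in the first entry can be sanity-checked with a few lines of arithmetic. The post does not state how the individual gains combine, so the sketch below shows two readings, multiplicative and additive; both are assumptions, and the variable names are illustrative.

```python
baseline_qps = 750.0
final_qps = 1200.0
total_gain = final_qps / baseline_qps  # 1.6, i.e. the headline 60%

fp8_gain = 0.30              # "30% (FP8)"
multiprocessing_gain = 0.20  # "20% (multiprocessing)"

# Assumption A: the levers compose multiplicatively.
residual_mult = total_gain / ((1 + fp8_gain) * (1 + multiprocessing_gain)) - 1

# Assumption B: the percentages are additive shares of the 60%.
residual_add = (total_gain - 1) - fp8_gain - multiprocessing_gain

print(f"total gain: {total_gain - 1:.0%}")
print(f"left for the smaller levers: {residual_mult:.1%} (multiplicative), "
      f"{residual_add:.0%} (additive)")
```

Under either reading, the share left for the single C++ call, async scheduling, and active-subset iteration together is small compared with FP8 and multiprocessing, which matches the "few percent each" description.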

Caveats

  • Reordering risk: async post-processing for batch N must complete before its responses can be returned. Care is needed to preserve per-request response ordering and error paths.
  • Memory pressure: multiple in-flight batches require the scheduler to hold their intermediate state simultaneously, which is not free for memory-constrained serving.
  • Sampling correctness: in autoregressive generation, each request's next token depends on its previously generated tokens, so the pipelining must not break these token-level data dependencies.
  • The post does not disclose the exact scheduler implementation or the buffering depth.
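
One way to address the first two caveats together is to cap the number of in-flight batches and return results in submission order. The sketch below is an illustration under stated assumptions, not the disclosed implementation: `MAX_IN_FLIGHT`, `run_bounded`, and the simulated stage functions are all hypothetical, and the depth of 2 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 2  # hypothetical buffering depth; the post does not disclose one


def gpu_forward(batch):
    time.sleep(0.01)  # stand-in for a GPU forward pass
    return [x + 1 for x in batch]


def cpu_post_process(raw):
    time.sleep(0.03)  # deliberately slower than the forward pass
    return sum(raw)


def run_bounded(batches, max_in_flight=MAX_IN_FLIGHT):
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def tracked(raw):
        nonlocal in_flight
        try:
            return cpu_post_process(raw)
        finally:
            with lock:
                in_flight -= 1
            slots.release()  # free the slot even if post-processing raised

    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for b in batches:
            slots.acquire()  # back-pressure: block while too many batches are pending
            with lock:
                in_flight += 1
                peak = max(peak, in_flight)
            raw = gpu_forward(b)
            futures.append(pool.submit(tracked, raw))
        # result() returns in submission order and re-raises any worker exception.
        results = [f.result() for f in futures]
    return results, peak


results, peak = run_bounded([[1, 2], [3, 4], [5, 6], [7, 8]])
print(results, peak)
```

The semaphore bounds how much intermediate state is held at once, at the cost of stalling the dispatch loop when post-processing falls behind.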