PATTERN Cited by 1 source
Dual CUDA stream compute and copy overlap¶
Definition¶
The dual CUDA stream compute and copy overlap pattern is the narrowest form of GPU pipeline overlap: split GPU work onto two CUDA streams — a Compute Stream for kernels and a Copy Stream for memory transfers — so the GPU's physically separate compute and copy engines can run them concurrently.
By default, frameworks like PyTorch enqueue all work onto a single default stream per device, which serialises compute and D2H transfers onto a single ordering edge. The single-stream serialisation leaves the copy engine idle while the compute engine runs (and vice versa). Issuing transfers on a dedicated Copy Stream lifts that constraint.
Structure¶
Compute Stream (default):  [kernel N] [kernel N+1] [kernel N+2] [kernel N+3] …
Copy Stream (dedicated):              [D2H N]      [D2H N+1]    [D2H N+2]    …
                           ──────────────────────────────────────────────────►
                                                  time
While kernel N+1 runs on the SMs, D2H N runs on the copy engine, so both kinds of useful GPU work share the same wall-clock window. In the wiki-attested Synthesia VAE-decoder benchmark, this move, combined with pinned host buffers, double buffering and a worker thread, lifted GPU kernel utilisation from 82% to 99.9%.
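A rough way to see what the overlap buys is a steady-state model in which every chunk costs a fixed kernel time and a fixed D2H time. The sketch below is illustrative only: the function name `kernel_utilisation` and the millisecond figures are invented for the example and are not taken from the Synthesia benchmark.

```python
def kernel_utilisation(kernel_ms: float, d2h_ms: float, overlapped: bool) -> float:
    """Fraction of wall-clock time the SMs are busy in steady state.

    Single stream: each chunk costs kernel + D2H back to back.
    Dual stream:   the slower of the two stages sets the pace.
    """
    per_chunk = max(kernel_ms, d2h_ms) if overlapped else kernel_ms + d2h_ms
    return kernel_ms / per_chunk


# Hypothetical numbers: 40 ms of kernels and 10 ms of D2H per chunk.
serial = kernel_utilisation(40.0, 10.0, overlapped=False)  # 40 / 50
dual = kernel_utilisation(40.0, 10.0, overlapped=True)     # 40 / 40
print(f"single stream: {serial:.0%}, dual stream: {dual:.0%}")
# prints: single stream: 80%, dual stream: 100%
```

The model ignores launch overhead and host I/O, so real utilisation will land somewhat below the idealised dual-stream figure.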
When to apply¶
- Per-iteration D2H — every chunk / batch / frame of output has to leave the GPU for downstream processing or storage.
- Visible GPU stalls in profiling — the timeline shows kernel gaps that line up with D2H + host-I/O windows.
- Sufficient compute to amortise the overlap — the GPU stage per chunk must be long enough to actually run during the D2H of the previous chunk. For very small kernels the overlap doesn't help.
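The third condition can be checked on paper before writing any stream code. The back-of-envelope sketch below estimates the D2H time of one chunk from its size and compares it with the kernel time. The helper name `d2h_ms`, the tensor shape, the 20 GB/s effective bandwidth and the 35 ms kernel time are all assumed values for illustration; substitute measurements from your own profile.

```python
from math import prod


def d2h_ms(shape: tuple[int, ...], bytes_per_elem: int, gb_per_s: float) -> float:
    """Estimated device-to-host time for one chunk, in milliseconds."""
    n_bytes = prod(shape) * bytes_per_elem
    return n_bytes / (gb_per_s * 1e9) * 1e3


# Assumed: 16 RGB frames of 720x1280 in fp16, 20 GB/s effective pinned bandwidth.
copy_ms = d2h_ms((16, 3, 720, 1280), bytes_per_elem=2, gb_per_s=20.0)
kernel_ms = 35.0  # assumed measured kernel time per chunk

# The copy is fully hidden only if the next chunk's kernels run at least as long.
copy_is_hidden = kernel_ms >= copy_ms
print(f"D2H ~ {copy_ms:.2f} ms, hidden behind compute: {copy_is_hidden}")
```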
Necessary co-primitives¶
Dual streams alone are not enough. To deliver real overlap:
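- Pinned host buffers for the D2H destination — a pageable destination is staged through a driver-side bounce buffer, and the copy call blocks the host thread until the transfer completes, so the host cannot enqueue the next kernels and the dual-stream split is defeated.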
- Double buffering — adjacent chunks must operate on distinct memory regions; otherwise the second chunk's compute aliases the first chunk's not-yet-D2H'd VRAM. See patterns/double-buffer-cuda-events-pipeline-overlap.
- CUDA events as cross-stream barriers — for safe handoff between the two streams.
Skip any of these and the dual-stream code runs but the GPU stalls anyway.
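To see why double buffering matters, here is a toy schedule model in plain Python. It assumes kernels and copies already run on separate in-order streams and only varies how many VRAM buffers are available. The function name `makespan` and all durations are hypothetical, and the model deliberately ignores pinned memory and host I/O.

```python
def makespan(n_chunks: int, kernel_ms: float, d2h_ms: float, num_buffers: int) -> float:
    """Finish time of the last D2H when chunk n reuses the VRAM buffer
    that chunk n - num_buffers wrote to."""
    kernel_end: list[float] = []
    d2h_end: list[float] = []
    for n in range(n_chunks):
        # The compute stream is in-order: wait for the previous kernel.
        start = kernel_end[-1] if kernel_end else 0.0
        if n >= num_buffers:
            # Buffer reuse: wait until the old contents have left the GPU.
            start = max(start, d2h_end[n - num_buffers])
        kernel_end.append(start + kernel_ms)
        # The copy stream is in-order too, and needs this chunk's kernel output.
        copy_start = max(kernel_end[n], d2h_end[-1] if d2h_end else 0.0)
        d2h_end.append(copy_start + d2h_ms)
    return d2h_end[-1]


one = makespan(100, kernel_ms=40.0, d2h_ms=10.0, num_buffers=1)
two = makespan(100, kernel_ms=40.0, d2h_ms=10.0, num_buffers=2)
print(f"one buffer: {one:.0f} ms, two buffers: {two:.0f} ms")
# prints: one buffer: 5000 ms, two buffers: 4010 ms
```

With a single buffer the two streams still take turns, because each kernel has to wait for the previous copy to release the buffer. With two buffers only the final copy is left exposed.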
PyTorch sketch¶
import torch

compute_stream = torch.cuda.default_stream()
copy_stream = torch.cuda.Stream()
# Elided arguments: decoder output shape and dtype; the VRAM buffers live on the CUDA device.
vram_buf = [torch.empty(...), torch.empty(...)]  # two VRAM bufs
host_buf = [torch.empty(..., pin_memory=True),
            torch.empty(..., pin_memory=True)]  # two pinned host bufs
done_decode = [torch.cuda.Event(), torch.cuda.Event()]
done_d2h = [torch.cuda.Event(), torch.cuda.Event()]

for n, latent in enumerate(latents):
    i = n % 2  # double-buffer index
    with torch.cuda.stream(compute_stream):
        if n >= 2:
            # vram_buf[i] is being reused: wait until its previous D2H has finished
            compute_stream.wait_event(done_d2h[i])
        decode(latent, out=vram_buf[i])
        done_decode[i].record(stream=compute_stream)
    with torch.cuda.stream(copy_stream):
        copy_stream.wait_event(done_decode[i])  # do not copy before decode has written
        host_buf[i].copy_(vram_buf[i], non_blocking=True)
        done_d2h[i].record(stream=copy_stream)
    # worker thread waits on done_d2h[i] before writing host_buf[i] to disk
This is a simplified sketch — the full reference implementation adds correctness for end-of-stream draining, error handling, and worker-thread coordination.
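One possible shape for the worker-thread and draining pieces is sketched below. It is not the reference implementation. The names `writer_loop`, `run_with_drain` and `write` are hypothetical, and the event objects are duck-typed: anything with a blocking `synchronize()` method works, which `torch.cuda.Event` provides.

```python
import queue
import threading


def writer_loop(jobs: queue.Queue, write) -> None:
    """Consume (event, buffer) pairs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:      # sentinel: the producer has no more chunks
            break
        event, buf = job
        event.synchronize()  # blocks this thread, not the GPU, until the D2H is done
        write(buf)           # now safe to read the pinned host buffer


def run_with_drain(pairs, write) -> None:
    """Feed (event, buffer) pairs to a writer thread, then drain it."""
    jobs: queue.Queue = queue.Queue()
    worker = threading.Thread(target=writer_loop, args=(jobs, write))
    worker.start()
    for pair in pairs:
        jobs.put(pair)
    jobs.put(None)           # end-of-stream marker
    worker.join()            # returns only after the last buffer is written
```

Here `pairs` could be a generator wrapping the loop above that yields `(done_d2h[i], host_buf[i])` after each `record`. The sketch does not stop the producer from reusing `host_buf[i]` before its write has finished; a full implementation needs a per-buffer "free" signal for that, plus error handling.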
Hardware-side why-it-works¶
NVIDIA GPUs have:
- One or more streaming multiprocessor (SM) clusters that execute kernels — the "compute engine".
- One or more DMA engines (typically called "copy engines") that move bytes between host and device — separate hardware from the SMs.
These engines can run in parallel as long as the host code makes their work paths independent. Dual CUDA streams is the software-side primitive for declaring "this work is on the SMs, that work is on a copy engine — go in parallel".
The wiki-attested RTX PRO 6000 Blackwell exemplifies this — its compute and copy engines are physically separate, which is what makes the dual-stream pattern actually deliver overlap.
Generalisation¶
The pattern composes upward into:
- patterns/asynchronous-frame-generation-pipeline — full three-way overlap (compute + D2H + host I/O) that adds the worker-thread and double-buffer pieces on top of dual streams.
It composes sideways with:
- Multi-stream patterns that add a third stream for H2D (e.g. uploading next-batch inputs concurrent with compute and download).
- Multi-process patterns (concepts/async-cpu-gpu-pipelined-scheduling) where independent host processes share a GPU and overlap at the process level rather than the stream level.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. Compute Stream (default) + Copy Stream (dedicated) is the half of the Asynchronous Frame Generation Pipeline that decouples kernel launches from D2H copies on the GPU; combined with pinned memory + double buffering + worker thread, lifts kernel utilisation from 82% to 99.9% on the Wan 2.2 14B VAE decoder on g7e.2xlarge.
Related¶
- concepts/cuda-stream — primitive underneath this pattern.
- concepts/pinned-memory — required co-primitive.
- concepts/device-to-host-transfer — operation moved to the Copy Stream.
- concepts/gpu-kernel-utilization — saturation metric improved by the dual-stream split.
- patterns/asynchronous-frame-generation-pipeline — umbrella pattern this composes into.
- patterns/double-buffer-cuda-events-pipeline-overlap — buffer + barrier composition that complements dual streams.
- concepts/async-cpu-gpu-pipelined-scheduling — same overlap shape at LLM-batch-serving altitude.