
PATTERN Cited by 1 source

Dual CUDA stream compute and copy overlap

Definition

The dual CUDA stream compute and copy overlap pattern is the narrowest form of GPU pipeline overlap: split GPU work onto two CUDA streams — a Compute Stream for kernels and a Copy Stream for memory transfers — so the GPU's physically separate compute and copy engines can run them concurrently.

By default, frameworks like PyTorch enqueue all work onto a single default stream per device, so kernels and device-to-host (D2H) transfers execute strictly in issue order. That serialisation leaves the copy engine idle while the compute engine runs (and vice versa). Issuing transfers on a dedicated Copy Stream lifts that constraint.

Structure

Compute Stream (default):  [kernel N] [kernel N+1] [kernel N+2] [kernel N+3] …
Copy Stream    (dedicated):           [D2H N]      [D2H N+1]    [D2H N+2]    …

                            ──────────────────────────────────────────────────►
                            time

While kernel N+1 runs on the SMs, D2H N runs on the copy engine, so both engines do useful work in the same wall-clock window. The wiki-attested Synthesia VAE-decoder benchmark shows this move, combined with pinned host buffers and double buffering, lifting GPU kernel utilisation from 82% to 99.9%.

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)

When to apply

  • Per-iteration D2H — every chunk / batch / frame of output has to leave the GPU for downstream processing or storage.
  • Visible GPU stalls in profiling — the timeline shows kernel gaps that line up with D2H + host-I/O windows.
  • Sufficient compute to amortise the overlap: the kernel for chunk N+1 must run long enough to hide the D2H of chunk N. When kernels are much shorter than the transfers, the copy engine sets the pace and the overlap gains little.
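
The last condition can be made concrete with a back-of-the-envelope model. The sketch below is an illustration only: it assumes a fixed kernel time and a fixed D2H time per chunk, ignores launch overhead and host I/O, and the function name `steady_state` and the sample timings are hypothetical rather than taken from the benchmark.

```python
def steady_state(kernel_ms: float, copy_ms: float) -> dict:
    """Per-chunk steady-state cost for one stream versus two streams."""
    serial = kernel_ms + copy_ms          # one stream: the copy queues behind the kernel
    overlapped = max(kernel_ms, copy_ms)  # two streams: the slower stage sets the pace
    return {
        "speedup": serial / overlapped,
        "util_serial": kernel_ms / serial,          # fraction of time the SMs are busy
        "util_overlapped": kernel_ms / overlapped,
    }


# Hypothetical timings: compute-bound, balanced, and copy-bound chunks.
for kernel_ms, copy_ms in [(10.0, 2.0), (5.0, 5.0), (0.1, 5.0)]:
    r = steady_state(kernel_ms, copy_ms)
    print(f"kernel={kernel_ms}ms copy={copy_ms}ms "
          f"speedup={r['speedup']:.2f}x "
          f"kernel utilisation {r['util_serial']:.0%} -> {r['util_overlapped']:.0%}")
```

Under this model the speedup peaks at 2x when the two stages take equal time and approaches 1x when either stage dominates, which is why profiling the kernel-to-copy ratio first is worthwhile.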

Necessary co-primitives

Dual streams alone are not enough. To deliver real overlap:

  • Pinned host buffers for the D2H destination: with a pageable destination the driver stages the copy through its own bounce buffer and the nominally asynchronous copy blocks the issuing host thread, defeating the dual-stream split.
  • Double buffering — adjacent chunks must operate on distinct memory regions; otherwise the second chunk's compute aliases the first chunk's not-yet-D2H'd VRAM. See patterns/double-buffer-cuda-events-pipeline-overlap.
  • CUDA events as cross-stream barriers — for safe handoff between the two streams.

Skip any of these and the dual-stream code still runs, but the GPU stalls anyway or, without the event barriers, a copy can read a buffer that a kernel is still writing.
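
To see why double buffering matters even when both streams and the events are in place, the following pure-Python timeline model may help. It is a sketch under stated assumptions, not a measurement: kernel and copy durations are constant, pinned-memory effects are not modelled, and `simulate` and its parameters are hypothetical names introduced here.

```python
def simulate(num_chunks: int, kernel_ms: float, copy_ms: float,
             num_buffers: int) -> float:
    """Return total wall time (ms) of a two-stream pipeline.

    Ordering rules modelled:
      * kernels run in issue order on the compute stream;
      * copy n waits for kernel n (the "decode done" event);
      * kernel n waits for copy n - num_buffers, which frees the
        buffer it is about to overwrite (the "D2H done" event).
    """
    kernel_end: list[float] = []
    copy_end: list[float] = []
    for n in range(num_chunks):
        start = kernel_end[n - 1] if n >= 1 else 0.0
        if n >= num_buffers:
            start = max(start, copy_end[n - num_buffers])
        kernel_end.append(start + kernel_ms)

        copy_start = kernel_end[n]
        if n >= 1:
            copy_start = max(copy_start, copy_end[n - 1])
        copy_end.append(copy_start + copy_ms)
    return copy_end[-1]


single = simulate(100, kernel_ms=10.0, copy_ms=4.0, num_buffers=1)
double = simulate(100, kernel_ms=10.0, copy_ms=4.0, num_buffers=2)
print(f"one buffer: {single} ms, two buffers: {double} ms")
```

With these hypothetical timings a single buffer degenerates to fully serial execution (1400 ms for 100 chunks), while two buffers hide every copy except the last one (1004 ms).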

PyTorch sketch

import torch

compute_stream = torch.cuda.default_stream()
copy_stream    = torch.cuda.Stream()

vram_buf       = [torch.empty(...), torch.empty(...)]                  # two VRAM bufs
host_buf       = [torch.empty(..., pin_memory=True),
                  torch.empty(..., pin_memory=True)]                   # two pinned host bufs
done_decode    = [torch.cuda.Event(), torch.cuda.Event()]
done_d2h       = [torch.cuda.Event(), torch.cuda.Event()]

for n, latent in enumerate(latents):
    i = n % 2  # double-buffer index

    with torch.cuda.stream(compute_stream):
        if n >= 2:
            # do not overwrite vram_buf[i] until chunk n-2 has left the GPU
            compute_stream.wait_event(done_d2h[i])
        decode(latent, out=vram_buf[i])
        done_decode[i].record(stream=compute_stream)

    with torch.cuda.stream(copy_stream):
        # the copy may only start once the kernel has finished writing vram_buf[i]
        copy_stream.wait_event(done_decode[i])
        host_buf[i].copy_(vram_buf[i], non_blocking=True)
        done_d2h[i].record(stream=copy_stream)

    # worker thread waits on done_d2h[i] before writing host_buf[i] to disk;
    # host_buf[i] must not be reused until that write has finished

This is a simplified sketch. The full reference implementation adds end-of-stream draining, error handling, and worker-thread coordination.
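
As a rough idea of what the host-side coordination could look like, here is a minimal pure-Python sketch. It is not the reference implementation: `threading.Event` stands in for waiting on the D2H CUDA event, a `bytearray` stands in for a pinned host buffer, appending to a list stands in for the disk write, and the names `writer` and `run` are hypothetical. Error handling is deliberately left out.

```python
import queue
import threading


def writer(jobs: queue.Queue, free_slots: queue.Queue, sink: list) -> None:
    """Consume (slot, ready, host_buf) jobs until the None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:                  # sentinel: the producer has finished
            break
        slot, ready, host_buf = job
        ready.wait()                     # stand-in for waiting on the D2H event
        sink.append(bytes(host_buf))     # stand-in for writing the chunk to disk
        free_slots.put(slot)             # the slot may now be overwritten


def run(chunks, num_slots: int = 2) -> list:
    jobs: queue.Queue = queue.Queue()
    free_slots: queue.Queue = queue.Queue()
    for s in range(num_slots):
        free_slots.put(s)
    host_bufs = [bytearray() for _ in range(num_slots)]
    sink: list = []

    worker = threading.Thread(target=writer, args=(jobs, free_slots, sink))
    worker.start()
    for chunk in chunks:
        slot = free_slots.get()          # blocks until the writer releases a slot
        ready = threading.Event()
        host_bufs[slot][:] = chunk       # stand-in for decode + async D2H copy
        ready.set()                      # stand-in for the D2H event completing
        jobs.put((slot, ready, host_bufs[slot]))
    jobs.put(None)                       # end-of-stream: ask the writer to drain
    worker.join()                        # returns once every chunk is written
    return sink


print(run([b"aaaa", b"bbbb", b"cccc"]))
```

The free-slot queue is what prevents the producer from overwriting a host buffer the writer has not finished with, and the sentinel plus `join()` is one simple way to drain the pipeline at end of stream.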

Hardware-side why-it-works

NVIDIA GPUs have:

  • One or more streaming multiprocessor (SM) clusters that execute kernels — the "compute engine".
  • One or more DMA engines (typically called "copy engines") that move bytes between host and device — separate hardware from the SMs.

These engines can run in parallel as long as the host code makes their work paths independent. Dual CUDA streams is the software-side primitive for declaring "this work is on the SMs, that work is on a copy engine — go in parallel".

The wiki-attested RTX PRO 6000 Blackwell exemplifies this — its compute and copy engines are physically separate, which is what makes the dual-stream pattern actually deliver overlap.

Generalisation

The pattern composes upward into:

It composes sideways with:

  • Multi-stream patterns that add a third stream for H2D (e.g. uploading next-batch inputs concurrent with compute and download).
  • Multi-process patterns (concepts/async-cpu-gpu-pipelined-scheduling) where independent host processes share a GPU and overlap at the process level rather than the stream level.

Seen in

Last updated · 542 distilled / 1,571 read