CONCEPT Cited by 1 source

CUDA stream

Definition

A CUDA stream is an in-order queue of GPU operations (kernels, memory copies, events). Operations enqueued on the same CUDA stream execute strictly in the order they were issued. Operations on different CUDA streams may execute concurrently if the GPU has the hardware engines to run them in parallel — most notably, NVIDIA GPUs have separate compute engines (SMs) and copy engines that can run kernels and memory transfers simultaneously.

By default, frameworks like PyTorch schedule everything onto a single default stream per device. That default-stream behaviour serialises compute and memory transfers even though the hardware would allow them to overlap.
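
As a minimal sketch of the difference, assuming PyTorch with CUDA support and a visible GPU, the following enqueues one matrix multiply on the current stream and one on a dedicated stream. The helper name `enqueue_on_two_streams` is hypothetical and exists only for illustration.

```python
def enqueue_on_two_streams(n=1024):
    """Enqueue one matmul on the current stream and one on a side stream.

    Hypothetical helper for illustration; needs PyTorch and a CUDA GPU.
    """
    import torch  # lazy import so this file still loads without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("this sketch needs a CUDA-capable GPU")

    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")

    main = torch.cuda.current_stream()  # the per-device default stream unless changed
    side = torch.cuda.Stream()          # a second, independent in-order queue

    # a and b were produced on `main`; the side stream must not read them early.
    side.wait_stream(main)

    x = a @ b                           # enqueued on `main`
    with torch.cuda.stream(side):
        y = b @ a                       # enqueued on `side`; may overlap with x

    # y was allocated on `side` but will be consumed on `main` from here on.
    y.record_stream(main)
    main.wait_stream(side)              # later work on `main` waits for `side`
    torch.cuda.synchronize()            # block the host until both streams drain
    return x, y
```

Whether the two multiplies actually overlap depends on whether either one leaves SMs free; the streams only give the hardware permission to overlap them.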

Why dual streams are the load-bearing primitive for inference overlap

A naive inference loop on the default stream looks like:

default_stream:  [kernel N] [D2H copy N] [kernel N+1] [D2H copy N+1]

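The in-order contract means kernel N+1 cannot start until D2H copy N has finished, even though the copy engine could perform that transfer while the compute engine runs the next kernel. The copy engine in turn sits idle while each kernel runs, so the compute engine and the copy engine are never busy at the same time.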

Splitting the work onto two streams unlocks the hardware:

Compute Stream (default):  [kernel N] [kernel N+1] [kernel N+2] …
Copy Stream    (dedicated): [D2H copy N] [D2H copy N+1] [D2H copy N+2] …

Now kernel N+1 and D2H copy N can run concurrently — one on the compute engine, one on the copy engine. This is what AWS / Synthesia's patterns/dual-cuda-stream-compute-and-copy-overlap exploits to lift the GPU kernel utilisation of the Wan 2.2 14B VAE decoder from 82% to 99.9%. (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
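
The effect on utilisation can be worked through with a toy timeline model in plain Python. The durations used here are illustrative assumptions, not measurements from the source, and the function names are hypothetical.

```python
def single_stream_makespan(kernel_ms, copy_ms, chunks):
    """One in-order queue: every kernel and every copy runs back to back."""
    return chunks * (kernel_ms + copy_ms)


def dual_stream_makespan(kernel_ms, copy_ms, chunks):
    """Kernels on one stream, D2H copies on another.

    Copy i starts only after kernel i has finished (data dependency) and after
    copy i-1 has finished (the copy stream is itself in-order). Assumes every
    chunk has its own output buffer, so kernels never wait on copies.
    """
    kernel_done = 0.0
    copy_done = 0.0
    for _ in range(chunks):
        kernel_done += kernel_ms  # the compute stream never stalls
        copy_done = max(copy_done, kernel_done) + copy_ms
    return copy_done


def compute_busy_fraction(kernel_ms, chunks, makespan):
    """Share of wall-clock time during which some kernel is running."""
    return chunks * kernel_ms / makespan
```

With an assumed 8 ms kernel and 2 ms copy over 100 chunks, the single-stream schedule takes 1000 ms with the compute engine busy 80% of the time, while the dual-stream schedule takes 802 ms with the compute engine busy about 99.75% of the time. Only the final copy remains exposed.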

Stream synchronisation — CUDA events as cross-stream barriers

Streams alone are unsafe for shared buffers: chunk N+1's compute on the Compute Stream could overwrite chunk N's bytes before the D2H on the Copy Stream finishes reading them.

The synchronisation primitive is the CUDA event. An event is recorded into one stream behind some operation and completes once everything enqueued before it on that stream has finished ("chunk N decode done"). Other streams, or the host thread, wait on the event before proceeding ("start D2H of chunk N only after the event clears").

In the AWS / Synthesia pipeline:

  • After decode chunk N kernels finish on Compute Stream → an event is recorded.
  • The Copy Stream waits on that event before starting the D2H of chunk N's frames.
  • After D2H chunk N finishes on Copy Stream → another event is recorded.
  • The worker thread waits on that event before reading chunk N from the pinned host buffer.

This event-mediated handoff is what makes the patterns/double-buffer-cuda-events-pipeline-overlap pattern safe to compose with the dual-stream pattern.
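
The four-step handoff above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the AWS / Synthesia implementation: `decode_with_overlap` and `decode_fn` are hypothetical names, `chunks` is assumed to be a list of CUDA tensors, and a fresh pinned buffer is allocated per chunk for simplicity.

```python
def decode_with_overlap(chunks, decode_fn):
    """Decode each chunk on the GPU while earlier results copy to the host.

    Hypothetical helper: `decode_fn` maps a CUDA tensor to a CUDA tensor.
    Needs PyTorch and a CUDA GPU.
    """
    import torch  # lazy import so this file still loads without PyTorch

    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()
    host_bufs, copy_events = [], []

    for chunk in chunks:
        out = decode_fn(chunk)                  # kernels on the compute stream
        decoded = torch.cuda.Event()
        decoded.record(compute_stream)          # step 1: "chunk N decode done"

        # Pinned destination so the D2H copy can be truly asynchronous.
        host = torch.empty(out.shape, dtype=out.dtype, pin_memory=True)
        with torch.cuda.stream(copy_stream):
            copy_stream.wait_event(decoded)     # step 2: copy waits for decode
            host.copy_(out, non_blocking=True)  # D2H enqueued on the copy stream
            copied = torch.cuda.Event()
            copied.record(copy_stream)          # step 3: "chunk N is on the host"
        out.record_stream(copy_stream)          # keep `out` alive until the copy ran

        host_bufs.append(host)
        copy_events.append(copied)

    results = []
    for host, copied in zip(host_bufs, copy_events):
        copied.synchronize()                    # step 4: host waits, then reads
        results.append(host)
    return results
```

A longer-running pipeline would more likely preallocate a small set of pinned buffers and reuse them, which is where the double-buffer pattern and its second event come in.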

Composition rules

For dual CUDA streams to actually deliver overlap:

  • Pinned host destination buffers are required (see concepts/pinned-memory): with pageable memory the D2H copy is staged through a pinned bounce buffer and is no longer fully asynchronous, so the issuing host thread stalls instead of enqueuing the next kernel.
  • The dedicated Copy Stream must be created with the non-blocking flag (cudaStreamNonBlocking in the CUDA runtime); otherwise it implicitly synchronises with the legacy default stream.
  • Adjacent chunks need distinct memory regions (i.e. double buffering): if chunk N+1's compute writes to the address that chunk N's D2H is still reading, the events that keep the data correct also force the two to run one after the other.
  • Events must be recorded and waited on correctly to maintain data dependency semantics.

Skip any of the first three and the dual-stream code will run but not overlap. Skip the last and it may overlap but read or overwrite the wrong data.
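
The cost of breaking the double-buffering rule can be seen in a toy schedule written in plain Python. The durations are illustrative assumptions and `pipeline_makespan` is a hypothetical name. Kernel i writes buffer `i % buffers`, and a kernel that reuses a buffer must wait for that buffer's previous copy to finish.

```python
def pipeline_makespan(kernel_ms, copy_ms, chunks, buffers):
    """Finish time of the last D2H copy for `chunks` >= 1 chunks."""
    kernel_done = [0.0] * chunks
    copy_done = [0.0] * chunks
    for i in range(chunks):
        # The compute stream is in-order: wait for the previous kernel.
        start = kernel_done[i - 1] if i > 0 else 0.0
        if i >= buffers:
            # Buffer reuse: wait until its previous contents reached the host.
            start = max(start, copy_done[i - buffers])
        kernel_done[i] = start + kernel_ms
        # The copy stream is in-order and also waits for this chunk's kernel.
        prev_copy = copy_done[i - 1] if i > 0 else 0.0
        copy_done[i] = max(prev_copy, kernel_done[i]) + copy_ms
    return copy_done[-1]
```

With an assumed 8 ms kernel and 2 ms copy over 100 chunks, one shared buffer gives 1000 ms, the same as a single stream, while two buffers give 802 ms.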

Beyond compute + copy — multi-stream patterns

The wiki-attested pattern uses two streams (Compute + Copy). NVIDIA GPUs typically have multiple copy engines (one in each direction: H2D and D2H), so it's possible to extend to:

  • Compute Stream + D2H Copy Stream + H2D Copy Stream: useful when uploading new inputs and downloading prior outputs concurrently with compute.
  • Multiple Compute Streams: useful when independent compute workloads, possibly issued from different host threads, can run their kernels concurrently on the same GPU.

The wiki currently canonicalises only the two-stream Compute+Copy case; extend this page if a future source attests the three-stream case.
