CONCEPT Cited by 1 source
CUDA stream¶
Definition¶
A CUDA stream is an in-order queue of GPU operations (kernels, memory copies, events). Operations enqueued on the same CUDA stream execute strictly in the order they were issued. Operations on different CUDA streams may execute concurrently if the GPU has the hardware engines to run them in parallel — most notably, NVIDIA GPUs have separate compute engines (SMs) and copy engines that can run kernels and memory transfers simultaneously.
By default, frameworks like PyTorch schedule everything onto a single default stream per device. That default-stream behaviour serialises compute and memory transfers even though the hardware would allow them to overlap.
Why dual streams are the load-bearing primitive for inference overlap¶
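A naive inference loop on the default stream looks like:
Default Stream: [kernel N] [D2H copy N] [kernel N+1] [D2H copy N+1] …
Even if the GPU has a free copy engine while kernel N+1 runs, the in-order contract of the single stream leaves it idle: D2H copy N had to finish before kernel N+1 could start, and D2H copy N+1 cannot start until kernel N+1 finishes. The compute engine and the copy engine are never busy at the same time.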
Splitting the work onto two streams unlocks the hardware:
Compute Stream (default): [kernel N] [kernel N+1] [kernel N+2] …
Copy Stream (dedicated): [D2H copy N] [D2H copy N+1] [D2H copy N+2] …
Now kernel N+1 and D2H copy N can run concurrently, one on the compute engine and one on the copy engine. This is what AWS / Synthesia's patterns/dual-cuda-stream-compute-and-copy-overlap exploits to lift the GPU kernel utilisation of the Wan 2.2 14B VAE decoder from 82% to 99.9%. (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
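The effect of the split can be estimated with a small timeline model. The sketch below is illustrative only: it assumes one compute engine, one copy engine and fixed per-chunk durations, and the names `makespan_ms` and `kernel_utilisation` are hypothetical. The durations in the usage lines are made up and are not the Synthesia measurements.

```python
def makespan_ms(n_chunks, kernel_ms, copy_ms, dual_stream):
    """Total time to decode and download n_chunks under a two-engine model."""
    compute_free = 0.0  # time at which the compute engine is next idle
    copy_free = 0.0     # time at which the copy engine is next idle
    for _ in range(n_chunks):
        if not dual_stream:
            # Single in-order queue: kernel i cannot start until copy i-1 is done.
            compute_free = max(compute_free, copy_free)
        kernel_end = compute_free + kernel_ms
        compute_free = kernel_end
        # On either layout, copy i needs kernel i's output and a free copy engine.
        copy_free = max(copy_free, kernel_end) + copy_ms
    return copy_free


def kernel_utilisation(n_chunks, kernel_ms, copy_ms, dual_stream):
    """Fraction of the total time during which the compute engine is busy."""
    total = makespan_ms(n_chunks, kernel_ms, copy_ms, dual_stream)
    return n_chunks * kernel_ms / total


# Hypothetical workload: 4 chunks, 10 ms of kernels and 3 ms of D2H per chunk.
print(makespan_ms(4, 10, 3, dual_stream=False))                 # 52.0
print(makespan_ms(4, 10, 3, dual_stream=True))                  # 43.0
print(f"{kernel_utilisation(4, 10, 3, dual_stream=False):.1%}")  # 76.9%
print(f"{kernel_utilisation(4, 10, 3, dual_stream=True):.1%}")   # 93.0%
```

In this model only the final copy remains exposed once the streams are split, so utilisation approaches 100% as the number of chunks grows, provided each copy is shorter than a kernel.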
Stream synchronisation — CUDA events as cross-stream barriers¶
Streams alone are unsafe for shared buffers. With no ordering between the two streams, the D2H on the Copy Stream could start before chunk N's kernels have finished writing, and chunk N+1's compute on the Compute Stream could overwrite chunk N's bytes before the D2H finishes reading them.
The synchronisation primitive is the CUDA event. An event is recorded into one stream and completes once all work enqueued on that stream before it has finished ("chunk N decode done"). Other streams, or the host, wait on the event before proceeding ("start D2H of chunk N only after the event clears").
In the AWS / Synthesia pipeline:
- After decode chunk N kernels finish on Compute Stream → an event is recorded.
- The Copy Stream waits on that event before starting the D2H of chunk N's frames.
- After D2H chunk N finishes on Copy Stream → another event is recorded.
- The worker thread waits on that event before reading chunk N from the pinned host buffer.
This event-mediated handoff is what makes the patterns/double-buffer-cuda-events-pipeline-overlap pattern safe to compose with the dual-stream pattern.
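The four steps above can be sketched in PyTorch as follows. This is a minimal sketch, not the AWS / Synthesia implementation: `decode_chunks`, `decode` and `latents` are hypothetical names, the host side is a plain loop rather than a separate worker thread, and every chunk is assumed to produce an output of the same shape.

```python
from collections import deque

try:
    import torch
except ImportError:  # lets the sketch be imported where PyTorch is not installed
    torch = None


def decode_chunks(decode, latents):
    """Decode each chunk on the compute stream and download it on a copy stream.

    `decode` and `latents` are hypothetical stand-ins for the real model and inputs.
    Returns a list of CPU tensors, or None when no CUDA device is available.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    compute = torch.cuda.current_stream()  # Compute Stream (default unless changed)
    copy = torch.cuda.Stream()             # dedicated Copy Stream
    pending = deque()                      # (copy-done event, pinned buffer), oldest first
    free = []                              # pinned host buffers ready for reuse
    outputs = []

    def drain():
        done, host = pending.popleft()
        done.synchronize()                 # host waits until this chunk's D2H has finished
        outputs.append(host.clone())       # read the chunk out of the pinned buffer
        free.append(host)

    for z in latents:
        frames = decode(z)                 # kernels are enqueued on the Compute Stream
        decoded = torch.cuda.Event()
        decoded.record(compute)            # event: "chunk N decode done"

        if len(pending) == 2:              # both pinned buffers are in flight
            drain()
        host = free.pop() if free else torch.empty(
            frames.shape, dtype=frames.dtype, pin_memory=True)

        copy.wait_event(decoded)           # Copy Stream waits for the decode event
        with torch.cuda.stream(copy):
            host.copy_(frames, non_blocking=True)  # D2H enqueued on the Copy Stream
        frames.record_stream(copy)         # stop the allocator reusing `frames` too early
        copied = torch.cuda.Event()
        copied.record(copy)                # event: "chunk N D2H done"
        pending.append((copied, host))

    while pending:
        drain()
    return outputs
```

The host blocks in `drain` only once two copies are already in flight, so the Compute Stream keeps receiving work while earlier chunks download.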
Composition rules¶
For dual CUDA streams to actually deliver overlap:
- Pinned host destination buffers are required (see concepts/pinned-memory). With pageable memory the D2H stages through a pinned bounce buffer and the copy call blocks the host thread until it completes, so the next kernel cannot be enqueued in the meantime.
- The dedicated Copy Stream must be created with the non-blocking flag (cudaStreamNonBlocking); otherwise it implicitly synchronises with the legacy default stream.
- Distinct memory regions for adjacent chunks (i.e. double buffering). If chunk N+1 writes to the same address that chunk N's D2H is reading, its kernel must wait for that copy to finish, which serialises the pipeline regardless of streams.
- Events must be recorded and waited on correctly to maintain data dependency semantics.
Skip any of the first three and the dual-stream code will run but not overlap; skip the fourth and it may overlap but read data that is not ready.
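Some of these rules can be checked before the pipeline starts. The helper below is a rough sketch with a hypothetical name, `overlap_problems`; it inspects only the host buffers and the identity of the copy stream. It cannot tell whether the non-blocking flag was set or whether events are used correctly, and the pointer comparison only catches buffers that start at the same address.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported where PyTorch is not installed
    torch = None


def overlap_problems(host_buffers, copy_stream=None):
    """Return a list of reasons why a dual-stream setup may fail to overlap."""
    problems = []

    # Rule: destination buffers must be pinned host memory.
    for i, buf in enumerate(host_buffers):
        if buf.device.type != "cpu" or not buf.is_pinned():
            problems.append(f"host buffer {i} is not pinned host memory")

    # Rule: adjacent chunks need distinct regions (double buffering).
    if len(host_buffers) < 2:
        problems.append("fewer than two host buffers, so no double buffering")
    pointers = [buf.data_ptr() for buf in host_buffers]
    if len(set(pointers)) != len(pointers):
        problems.append("two host buffers start at the same address")

    # Rule: copies need a stream of their own, separate from the default stream.
    if copy_stream is not None and copy_stream == torch.cuda.default_stream():
        problems.append("copy stream is the default stream")

    return problems
```

An empty list means only that these particular checks passed; it does not prove that copies and kernels overlap, which still has to be confirmed with a profiler.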
Beyond compute + copy — multi-stream patterns¶
The wiki-attested pattern uses two streams (Compute + Copy). NVIDIA GPUs typically have multiple copy engines (one in each direction: H2D and D2H), so it's possible to extend to:
- Compute Stream + D2H Copy Stream + H2D Copy Stream: useful when new inputs are uploaded concurrently with compute and with the download of prior outputs.
- Multiple Compute Streams: useful when independent compute workloads, possibly issued from different host threads, can run their kernels concurrently on the same GPU.
The wiki currently canonicalises only the two-stream Compute+Copy case; extend this page if a future source attests the three-stream case.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation as a distinct named primitive. Compute Stream (default) + Copy Stream (dedicated) overlap compute kernels with D2H transfers in the Asynchronous Frame Generation Pipeline; CUDA events used as cross-stream barriers for safe handoff. 82% → 99.9% kernel utilisation on the Wan 2.2 14B VAE decoder.
Related¶
- concepts/pinned-memory — required for D2H on a dedicated Copy Stream to actually be async.
- concepts/device-to-host-transfer — the operation moved to the dedicated Copy Stream.
- concepts/gpu-kernel-utilization — saturation metric improved by the dual-stream split.
- concepts/async-cpu-gpu-pipelined-scheduling — same overlap shape at LLM-serving altitude using async batch dispatch instead of explicit streams.
- patterns/dual-cuda-stream-compute-and-copy-overlap — the pattern this concept underwrites.
- patterns/double-buffer-cuda-events-pipeline-overlap — the buffer + barrier composition that makes dual streams safe.
- patterns/asynchronous-frame-generation-pipeline — the umbrella pattern.