PATTERN Cited by 1 source
Double-buffer + CUDA events for pipeline overlap¶
Definition¶
The double-buffer + CUDA events for pipeline overlap pattern is the safety half of GPU pipeline overlap, sibling to patterns/dual-cuda-stream-compute-and-copy-overlap. It duplicates the buffers used at each stage of the pipeline so that adjacent chunks operate on distinct memory regions, and it uses CUDA events as cross-stream barriers to enforce data-dependency ordering across streams and threads.
Without this pattern, dual CUDA streams are unsafe: chunk N+1's compute could overwrite chunk N's bytes before the D2H of chunk N completes, and the worker thread could read a half-written buffer. Without dual streams, this pattern is unnecessary. The two patterns are typically used together as the buffer + barrier and streaming halves of an overlapped pipeline.
Structure¶
Two buffers per side:
- VRAM buffers: vram_buf[0], vram_buf[1]. Chunk N's compute writes into one, chunk N+1's compute writes into the other; the in-flight D2H reads from whichever was just written.
- Pinned host buffers: host_buf[0], host_buf[1]. Chunk N's D2H lands in one, chunk N+1's D2H lands in the other; the worker thread reads whichever buffer the most recent D2H has finished writing.
CUDA events as barriers:
- done_decode[i] (done_dec[i] in the timeline below): recorded on the Compute Stream after the kernels for chunk N finish, where i = N mod 2 is the buffer slot. The Copy Stream waits on this event before issuing the D2H.
- done_d2h[i]: recorded on the Copy Stream after the D2H of chunk N finishes. The worker thread waits on this event before reading the pinned host buffer.
- done_host_io[i] (done_io[i] in the timeline below): recorded by the worker thread after the file write finishes. The next decode that reuses buffer i waits on this event before overwriting the buffer.
Iteration  Compute Stream        Copy Stream           Worker thread
─────────  ────────────────────  ────────────────────  ──────────────────────
N=0        decode → buf[0]
           record done_dec[0]
N=1        decode → buf[1]       wait done_dec[0]
           record done_dec[1]    D2H buf[0]→host[0]
                                 record done_d2h[0]
N=2        wait done_io[0]       wait done_dec[1]      wait done_d2h[0]
           decode → buf[0]       D2H buf[1]→host[1]    write host[0] to file
           record done_dec[0]    record done_d2h[1]    record done_io[0]
N=3        wait done_io[1]       wait done_dec[0]      wait done_d2h[1]
           decode → buf[1]       D2H buf[0]→host[0]    write host[1] to file
           …                     …                     …
Once the pipeline reaches steady state (N ≥ 2), all three lanes (Compute, Copy, Worker) are busy concurrently on distinct buffers, which is exactly the overlap the pattern is designed for.
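The handoff in the timeline above can be approximated on the host with a small simulation. This is a minimal sketch under stated assumptions, not the AWS / Synthesia implementation: `threading.Event` stands in for CUDA events, plain Python lists stand in for `vram_buf` and `host_buf`, `str.upper` stands in for the decode kernels, and the name `run_pipeline` is hypothetical. For simplicity it allocates one event per chunk, whereas the pattern re-records one event per buffer slot.

```python
import threading

NUM_BUFFERS = 2  # double buffering


def run_pipeline(chunks):
    """Host-only simulation: three lanes hand chunks across two buffer slots."""
    n_chunks = len(chunks)
    vram_buf = [None] * NUM_BUFFERS  # stand-in for device buffers
    host_buf = [None] * NUM_BUFFERS  # stand-in for pinned host buffers
    # One event per chunk and stage (assumption made for simplicity).
    done_decode = [threading.Event() for _ in range(n_chunks)]
    done_d2h = [threading.Event() for _ in range(n_chunks)]
    done_host_io = [threading.Event() for _ in range(n_chunks)]
    written = []  # stand-in for the output file

    def compute_lane():
        for n, chunk in enumerate(chunks):
            slot = n % NUM_BUFFERS
            if n >= NUM_BUFFERS:
                # Slot reuse: wait until the chunk that last used it is consumed.
                done_host_io[n - NUM_BUFFERS].wait()
            vram_buf[slot] = chunk.upper()  # "decode"
            done_decode[n].set()

    def copy_lane():
        for n in range(n_chunks):
            slot = n % NUM_BUFFERS
            done_decode[n].wait()  # do not copy a half-written buffer
            host_buf[slot] = vram_buf[slot]  # "D2H"
            done_d2h[n].set()

    def worker_lane():
        for n in range(n_chunks):
            slot = n % NUM_BUFFERS
            done_d2h[n].wait()  # do not read the host buffer mid-copy
            written.append(host_buf[slot])  # "file write"
            done_host_io[n].set()

    lanes = [threading.Thread(target=f)
             for f in (compute_lane, copy_lane, worker_lane)]
    for t in lanes:
        t.start()
    for t in lanes:
        t.join()
    return written
```

Each `wait()` call corresponds to one of the three barriers; the compute lane skips the wait for the first two chunks because both slots start out free.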
Why each barrier is necessary¶
- done_decode[i]: Without it, the Copy Stream could start the D2H before the Compute Stream has finished the kernels writing buffer i, leaving garbage in host memory.
- done_d2h[i]: Without it, the worker thread could read the pinned host buffer mid-D2H, producing torn output.
- done_host_io[i]: Without it, the next decode pass could overwrite the VRAM buffer (or its associated pinned host buffer slot) before the worker has finished consuming the prior chunk, causing silent data loss.
CUDA events are the right primitive here because they:
- Cross stream boundaries — record on one stream, wait on another.
- Cross thread boundaries (any host thread, including the worker thread, can wait on a CUDA event).
- Are non-blocking to record (recording is enqueued onto the stream, not synchronous on the host).
- Have a clear "has this happened?" semantic — the AWS post describes events as "a barrier that clears if it can answer closed questions such as: Has decoding of chunk N completed?".
When to apply¶
- Together with patterns/dual-cuda-stream-compute-and-copy-overlap — this pattern is what makes that pattern safe.
- Whenever pipeline overlap is desired across distinct hardware lanes (compute engine + copy engine + host I/O thread), with shared memory regions that need ordering guarantees.
For a single-stream, single-thread pipeline, plain CUDA stream ordering already gives the necessary guarantees. This pattern becomes necessary only when overlap demands sharing buffers across stream + thread boundaries.
Trade-offs¶
- 2× the buffer footprint per stage — for the wiki-attested Wan 2.2 14B VAE decoder, this is 2× chunk-sized buffers in VRAM and 2× chunk-sized pinned host buffers. Acceptable because chunk size is bounded.
- Triple-buffering or larger can hide more variance (e.g. if worker-thread file I/O latency is highly variable). The AWS / Synthesia post uses double buffering; the right number is profile-driven.
- Correctness depends on getting all three barriers right. Missing any one barrier creates a silent race that is difficult to reproduce in production but corrupts output.
- Pinned (page-locked) host buffers are a scarce resource because they cannot be paged out (see concepts/pinned-memory). Pre-allocate them once and reuse them; never allocate per request.
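To make the double-versus-triple-buffering trade-off concrete, the sketch below computes, for a given chunk index, which buffer slot is used and which event each lane would wait on before handling that chunk. The helper name `lane_waits` and its return shape are hypothetical; the dependency rules are taken from the barrier descriptions in Structure.

```python
def lane_waits(n, num_buffers=2):
    """Return the buffer slot for chunk n and the (event, chunk) each lane waits on."""
    slot = n % num_buffers
    reused_from = n - num_buffers  # chunk that previously occupied this slot
    return {
        "slot": slot,
        # Compute only waits when the slot has been used before.
        "compute": ("done_host_io", reused_from) if reused_from >= 0 else None,
        "copy": ("done_decode", n),
        "worker": ("done_d2h", n),
    }


print(lane_waits(2))                 # double buffering: compute waits on chunk 0
print(lane_waits(2, num_buffers=3))  # triple buffering: slot 2 is still free
```

With `num_buffers=3` the compute lane waits on the host I/O of chunk n-3 rather than n-2, so one slow file write may be absorbed without stalling decode, at the cost of a third buffer per side.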
Generalisation¶
The pattern is the GPU-pipeline instance of the more general N-buffer + barrier handoff pattern that recurs in:
- Producer-consumer queues with bounded slots.
- Lock-free ring buffers between threads.
- Async I/O pipelines (e.g. io_uring SQE/CQE rings).
- Hardware DMA pipelines outside of GPUs.
The CUDA-event part is the GPU-specific sync primitive; the double-buffer part is universal.
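As an illustration of the universal half, the first item in the list (a producer-consumer queue with bounded slots) can be sketched with Python's standard `queue.Queue`. The function name `bounded_handoff` is hypothetical, and the queue bound plays the role that the buffer count plays in the GPU pipeline; this is an analogy, not a description of the source's implementation.

```python
import queue
import threading


def bounded_handoff(items, slots=2):
    """Producer/consumer handoff where the queue bound acts as the buffer count."""
    q = queue.Queue(maxsize=slots)  # put() blocks while all slots are occupied
    sentinel = object()             # marks the end of the stream
    out = []

    def producer():
        for item in items:
            q.put(item)  # back-pressure: waits for a free slot
        q.put(sentinel)

    def consumer():
        while True:
            item = q.get()
            if item is sentinel:
                break
            out.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Here the blocking `put()` is the counterpart of waiting on done_host_io[i]: the producer cannot reuse a slot until the consumer has drained it.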
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. Two VRAM + two pinned host buffers used as the buffer + barrier half of the Asynchronous Frame Generation Pipeline. CUDA events are described as "a barrier that clears if it can answer closed questions such as: Has decoding of chunk N completed?".
Related¶
- concepts/cuda-stream — partner primitive.
- concepts/pinned-memory — required for the host-side buffers.
- concepts/device-to-host-transfer — operation between buffers.
- concepts/gpu-kernel-utilization — metric whose improvement this pattern enables (alongside dual streams).
- patterns/asynchronous-frame-generation-pipeline — umbrella pattern.
- patterns/dual-cuda-stream-compute-and-copy-overlap — paired sibling pattern.