PATTERN Cited by 1 source

Double-buffer + CUDA events for pipeline overlap¶

Definition¶

The double-buffer + CUDA events for pipeline overlap pattern is the safety half of GPU pipeline overlap and the sibling of patterns/dual-cuda-stream-compute-and-copy-overlap. It duplicates the buffers used at each stage of the pipeline so that adjacent chunks operate on distinct memory regions, and it uses CUDA events as cross-stream barriers to enforce data-dependency ordering across streams and threads.

Without this pattern, dual CUDA streams are unsafe: chunk N+1's compute could overwrite chunk N's bytes before the D2H of chunk N completes, and the worker thread could read a half-written buffer. Without dual streams, this pattern is unnecessary. The two patterns are typically used together as the buffer + barrier and streaming halves of an overlapped pipeline.

Structure¶

Two buffers per side:

VRAM buffers: vram_buf[0], vram_buf[1] — chunk N's compute writes into one, chunk N+1's compute writes into the other; the in-flight D2H reads from whichever was just written.
Pinned host buffers: host_buf[0], host_buf[1] — chunk N's D2H lands in one, chunk N+1's D2H lands in the other; the worker thread reads whichever the in-flight D2H just finished writing.

CUDA events as barriers:

done_decode[i]: recorded on the Compute Stream after the kernels that decode a chunk into vram_buf[i] finish. The Copy Stream waits on this event before issuing the D2H from that buffer.
done_d2h[i]: recorded on the Copy Stream after the D2H from vram_buf[i] into host_buf[i] finishes. The worker thread waits on this event before reading the pinned host buffer.
done_host_io[i]: recorded by the worker thread after the file write from host_buf[i] finishes. The next decode that reuses buffer i waits on this event before clobbering the buffer.

Iteration  Compute Stream     Copy Stream         Worker thread
─────────  ─────────────────  ──────────────────  ──────────────────
N=0        decode → buf[0]
           record done_dec[0]
N=1        decode → buf[1]    wait done_dec[0]
           record done_dec[1] D2H buf[0]→host[0]
                              record done_d2h[0]
N=2        wait done_io[0]    wait done_dec[1]    wait done_d2h[0]
           decode → buf[0]    D2H buf[1]→host[1]  write host[0] to file
           record done_dec[0] record done_d2h[1]  record done_io[0]
N=3        wait done_io[1]    wait done_dec[0]    wait done_d2h[1]
           decode → buf[1]    D2H buf[0]→host[0]  write host[1] to file
           …                  …                   …

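Once steady state is reached (N ≥ 2), all three lanes (Compute, Copy, Worker) are busy concurrently on distinct buffers, which is exactly the overlap the pattern is designed for. In the timeline, buf and host stand for vram_buf and host_buf, and done_dec and done_io abbreviate done_decode and done_host_io.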
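
The order of waits and records in the timeline can be approximated on the host alone. The following is a minimal sketch, not the pipeline from the source: it replaces CUDA streams with Python threads and CUDA events with `threading.Semaphore` objects, and plain assignments stand in for decode, D2H, and the file write. The names `run_pipeline`, `compute_lane`, `copy_lane`, and `worker_lane` are hypothetical.

```python
import threading

NUM_BUFS = 2  # double buffering


def run_pipeline(num_chunks):
    """Simulate the three lanes; return chunks in the order they were 'written'."""
    vram_buf = [None] * NUM_BUFS  # stand-ins for device buffers
    host_buf = [None] * NUM_BUFS  # stand-ins for pinned host buffers
    written = []                  # stand-in for the output file

    # One barrier per buffer slot. Semaphores are an assumption of this
    # sketch: they let the same slot be signalled once per reuse.
    done_decode = [threading.Semaphore(0) for _ in range(NUM_BUFS)]
    done_d2h = [threading.Semaphore(0) for _ in range(NUM_BUFS)]
    done_host_io = [threading.Semaphore(0) for _ in range(NUM_BUFS)]

    def compute_lane():
        for n in range(num_chunks):
            i = n % NUM_BUFS
            if n >= NUM_BUFS:
                done_host_io[i].acquire()  # wait: slot i fully consumed
            vram_buf[i] = f"chunk-{n}"     # "decode" into vram_buf[i]
            done_decode[i].release()       # record done_decode[i]

    def copy_lane():
        for n in range(num_chunks):
            i = n % NUM_BUFS
            done_decode[i].acquire()       # wait done_decode[i]
            host_buf[i] = vram_buf[i]      # "D2H" vram_buf[i] -> host_buf[i]
            done_d2h[i].release()          # record done_d2h[i]

    def worker_lane():
        for n in range(num_chunks):
            i = n % NUM_BUFS
            done_d2h[i].acquire()          # wait done_d2h[i]
            written.append(host_buf[i])    # "write host_buf[i] to file"
            done_host_io[i].release()      # record done_host_io[i]

    lanes = [threading.Thread(target=f)
             for f in (compute_lane, copy_lane, worker_lane)]
    for t in lanes:
        t.start()
    for t in lanes:
        t.join()
    return written


if __name__ == "__main__":
    print(run_pipeline(6))
```

Because every reuse of slot i is gated by done_host_io[i], the chunks reach the output in order however the three threads are scheduled.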

Why each barrier is necessary¶

done_decode[i]: Without it, the Copy Stream could start D2H before the Compute Stream finishes the kernels writing buffer i → garbage in host memory.
done_d2h[i]: Without it, the worker thread could read the pinned host buffer mid-D2H → torn output.
done_host_io[i]: Without it, the next decode pass could overwrite the VRAM buffer (or its associated pinned host buffer slot) before the worker has finished consuming the prior chunk → silent data loss.
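
To make the third failure mode concrete, the sketch below replays one adversarial schedule deterministically in plain Python, with no GPU involved. It assumes the compute lane runs ahead through every chunk without waiting on done_host_io[i], and that the copy and worker lanes drain the buffers only afterwards. The function name is hypothetical.

```python
def drain_without_reuse_barrier(num_chunks, num_bufs=2):
    """Replay a schedule in which decode never waits for done_host_io[i]."""
    vram_buf = [None] * num_bufs

    # Compute lane races ahead: each decode clobbers slot n % num_bufs
    # even though the previous occupant has not been consumed yet.
    for n in range(num_chunks):
        vram_buf[n % num_bufs] = f"chunk-{n}"

    # Copy and worker lanes run late and read whatever is in each slot now.
    return [vram_buf[n % num_bufs] for n in range(num_chunks)]


if __name__ == "__main__":
    print(drain_without_reuse_barrier(4))
    # ['chunk-2', 'chunk-3', 'chunk-2', 'chunk-3']
```

Chunks 0 and 1 never reach the output and no error is raised, which is the silent data loss described above.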

CUDA events are the right primitive here because they:

Cross stream boundaries: record on one stream, wait on another.
Cross thread boundaries: any host thread, such as the worker thread, can wait on a CUDA event regardless of which thread recorded it.
Are non-blocking to record (recording is enqueued onto the stream, not synchronous on the host).
Have a clear "has this happened?" semantic — the AWS post describes events as "a barrier that clears if it can answer closed questions such as: Has decoding of chunk N completed?".

When to apply¶

Together with patterns/dual-cuda-stream-compute-and-copy-overlap — this pattern is what makes that pattern safe.
Whenever pipeline overlap is desired across distinct hardware lanes (compute engine + copy engine + host I/O thread) that hand buffers to one another and therefore need ordering guarantees on those buffers.

For a single-stream, single-thread pipeline, plain CUDA stream ordering already gives the necessary guarantees. This pattern becomes necessary only when overlap demands sharing buffers across stream + thread boundaries.

Trade-offs¶

2× the buffer footprint per stage: for the wiki-attested Wan 2.2 14B VAE decoder, this means two chunk-sized buffers in VRAM and two chunk-sized pinned host buffers. This is acceptable because the chunk size is bounded.
Triple-buffering or larger can hide more variance (e.g. if worker-thread file I/O latency is highly variable). The AWS / Synthesia post uses double buffering; the right number is profile-driven.
Correctness depends on getting all three barriers right. Missing any one barrier creates a silent race that is difficult to reproduce in production but corrupts output.
Pinned host buffers are a scarce operating-system resource, because page-locked memory cannot be swapped out (see concepts/pinned-memory). Pre-allocate them once and reuse them; never allocate per request.
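
The footprint trade-off is simple arithmetic. The helper below is an illustrative sketch; the function name and the 256 MiB chunk size are assumptions, not figures from the source.

```python
def buffer_footprint_bytes(chunk_bytes, num_buffers=2):
    """Return (vram_bytes, pinned_host_bytes) for one N-buffered stage."""
    if chunk_bytes < 0 or num_buffers < 1:
        raise ValueError("chunk_bytes must be >= 0 and num_buffers >= 1")
    total = chunk_bytes * num_buffers
    # The same number of chunk-sized slots exists on each side.
    return total, total


if __name__ == "__main__":
    chunk = 256 * 2**20  # hypothetical chunk size: 256 MiB
    print(buffer_footprint_bytes(chunk, 2))  # (536870912, 536870912)
    print(buffer_footprint_bytes(chunk, 3))  # (805306368, 805306368)
```

Under that assumed chunk size, moving from double to triple buffering costs one more chunk on each side: 256 MiB of VRAM and 256 MiB of pinned host memory.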

Generalisation¶

The pattern is the GPU-pipeline instance of the more general N-buffer + barrier handoff pattern that recurs in:

Producer-consumer queues with bounded slots.
Lock-free ring buffers between threads.
Async I/O pipelines (e.g. io_uring SQE/CQE rings).
Hardware DMA pipelines outside of GPUs.

The CUDA-event part is the GPU-specific sync primitive; the double-buffer part is universal.
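
The universal part can be sketched without any GPU. The following is a minimal single-producer, single-consumer slot ring in Python, assuming `threading.Semaphore` as the barrier primitive. `SlotRing` and `transfer` are hypothetical names, and with `num_slots=2` the ring behaves as a double buffer.

```python
import threading


class SlotRing:
    """Bounded N-slot handoff for one producer and one consumer."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.free = threading.Semaphore(num_slots)  # slots safe to overwrite
        self.filled = threading.Semaphore(0)        # slots ready to read
        self.head = 0  # next slot to write; touched only by the producer
        self.tail = 0  # next slot to read; touched only by the consumer

    def put(self, item):
        self.free.acquire()      # wait until the consumer released a slot
        self.slots[self.head] = item
        self.head = (self.head + 1) % len(self.slots)
        self.filled.release()    # signal: slot is ready

    def get(self):
        self.filled.acquire()    # wait until the producer filled a slot
        item = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.free.release()      # signal: slot may be reused
        return item


def transfer(items, num_slots=2):
    """Push items through a SlotRing from a producer thread; return them in order."""
    ring = SlotRing(num_slots)

    def produce():
        for item in items:
            ring.put(item)

    producer = threading.Thread(target=produce)
    producer.start()
    received = [ring.get() for _ in items]  # consumer runs on the calling thread
    producer.join()
    return received


if __name__ == "__main__":
    print(transfer(list(range(5))))  # [0, 1, 2, 3, 4]
```

The `filled` semaphore plays the role of done_decode and done_d2h (data is ready), while `free` plays the role of done_host_io (the slot may be overwritten).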

Seen in¶

sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. Two VRAM + two pinned host buffers used as the buffer + barrier half of the Asynchronous Frame Generation Pipeline. CUDA events are described as "a barrier that clears if it can answer closed questions such as: Has decoding of chunk N completed?".

Related¶

concepts/cuda-stream — partner primitive.
concepts/pinned-memory — required for the host-side buffers.
concepts/device-to-host-transfer — operation between buffers.
concepts/gpu-kernel-utilization — metric whose improvement this pattern enables (alongside dual streams).
patterns/asynchronous-frame-generation-pipeline — umbrella pattern.
patterns/dual-cuda-stream-compute-and-copy-overlap — paired sibling pattern.