PATTERN Cited by 1 source

Double-buffer + CUDA events for pipeline overlap

Definition

The double-buffer + CUDA events pattern is the safety half of GPU pipeline overlap and the sibling of patterns/dual-cuda-stream-compute-and-copy-overlap. It duplicates the buffers used at each stage of the pipeline so that adjacent chunks operate on distinct memory regions, and it uses CUDA events as cross-stream barriers to enforce data-dependency ordering across streams and threads.

Without this pattern, dual CUDA streams are unsafe: chunk N+1's compute could overwrite chunk N's bytes before the device-to-host copy (D2H) of chunk N completes, and the worker thread could read a half-written buffer. Conversely, without dual streams this pattern is unnecessary. The two patterns are therefore typically used together, as the buffer + barrier half and the streaming half of an overlapped pipeline.

Structure

Two buffers per side:

  • VRAM buffers: vram_buf[0], vram_buf[1] — chunk N's compute writes into one, chunk N+1's compute writes into the other; the in-flight D2H reads from whichever was just written.
  • Pinned host buffers: host_buf[0], host_buf[1] — chunk N's D2H lands in one, chunk N+1's D2H lands in the other; the worker thread reads whichever the in-flight D2H just finished writing.

CUDA events as barriers (one event of each kind per buffer slot i; with two slots, chunk N uses slot i = N mod 2):

  • done_decode[i] — recorded on the Compute Stream after chunk N's kernels finish. The Copy Stream waits on this event before issuing the D2H.
  • done_d2h[i] — recorded on the Copy Stream after the D2H of chunk N finishes. The worker thread waits on this event before reading the pinned host buffer.
  • done_host_io[i] — recorded by the worker thread after the file write finishes. The next iteration of decode that re-uses buffer i waits on this event before clobbering the buffer.

Timeline (in the table, done_dec and done_io abbreviate done_decode and done_host_io; buf and host abbreviate vram_buf and host_buf):

Iteration  Compute Stream     Copy Stream         Worker thread
─────────  ─────────────────  ──────────────────  ──────────────────
N=0        decode → buf[0]
           record done_dec[0]
N=1        decode → buf[1]    wait done_dec[0]
           record done_dec[1] D2H buf[0]→host[0]
                              record done_d2h[0]
N=2        wait done_io[0]    wait done_dec[1]    wait done_d2h[0]
           decode → buf[0]    D2H buf[1]→host[1]  write host[0] to file
           record done_dec[0] record done_d2h[1]  record done_io[0]
N=3        wait done_io[1]    wait done_dec[0]    wait done_d2h[1]
           decode → buf[1]    D2H buf[0]→host[0]  write host[1] to file
           …                  …                   …

In steady state (N ≥ 2), all three lanes (Compute, Copy, Worker) are busy concurrently on distinct buffers, which is exactly the overlap the pattern is designed for.
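
The timeline above follows a simple rule: chunk N uses slot N mod 2, the Copy Stream trails the Compute Stream by one iteration, and the worker thread trails it by two. As a minimal sketch, the plain-Python function below regenerates the rows of the table from that rule. It is a bookkeeping model only and makes no GPU calls; the name lane_ops and the string labels are illustrative assumptions, not taken from the AWS post.

```python
def lane_ops(step, num_chunks, slots=2):
    """Operations each lane issues at pipeline iteration `step` (model only)."""
    ops = {"compute": [], "copy": [], "worker": []}

    chunk = step  # the Compute Stream decodes chunk `step`
    if 0 <= chunk < num_chunks:
        s = chunk % slots
        if chunk >= slots:
            # Slot s was last used by chunk - slots; wait for its file write.
            ops["compute"].append(f"wait done_io[{s}]")
        ops["compute"] += [f"decode -> buf[{s}]", f"record done_dec[{s}]"]

    chunk = step - 1  # the Copy Stream trails by one iteration
    if 0 <= chunk < num_chunks:
        s = chunk % slots
        ops["copy"] += [
            f"wait done_dec[{s}]",
            f"D2H buf[{s}] -> host[{s}]",
            f"record done_d2h[{s}]",
        ]

    chunk = step - 2  # the worker thread trails by two iterations
    if 0 <= chunk < num_chunks:
        s = chunk % slots
        ops["worker"] += [
            f"wait done_d2h[{s}]",
            f"write host[{s}] to file",
            f"record done_io[{s}]",
        ]
    return ops


if __name__ == "__main__":
    # 4 chunks need 4 + 2 iterations before the worker has drained the pipeline.
    for step in range(6):
        print(step, lane_ops(step, num_chunks=4))
```

Passing a larger slots value gives the corresponding schedule for triple buffering or deeper.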

Why each barrier is necessary

  • done_decode[i]: Without it, the Copy Stream could start D2H before the Compute Stream finishes the kernels writing buffer i → garbage in host memory.
  • done_d2h[i]: Without it, the worker thread could read the pinned host buffer mid-D2H → torn output.
  • done_host_io[i]: Without it, the next decode pass could overwrite the VRAM buffer (or its associated pinned host buffer slot) before the worker has finished consuming the prior chunk → silent data loss.
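
The three barriers listed above can be exercised without a GPU. The sketch below is a CPU-only model, not CUDA code: three Python threads play the Compute Stream, the Copy Stream, and the worker thread, and semaphores stand in for CUDA events that are re-recorded on every iteration. Starting done_host_io at 1 (every slot is free before its first use) is a modelling assumption; the name run_pipeline and the string payloads are hypothetical.

```python
import threading


def run_pipeline(num_chunks, slots=2):
    """CPU-only model: three lanes hand chunks through `slots` buffers."""
    vram = [None] * slots  # stands in for vram_buf[i]
    host = [None] * slots  # stands in for host_buf[i]
    written = []           # stands in for the output file

    # Semaphores model events that are re-recorded every iteration.
    done_decode = [threading.Semaphore(0) for _ in range(slots)]
    done_d2h = [threading.Semaphore(0) for _ in range(slots)]
    # Starts at 1: every slot is free before its first use (assumption).
    done_host_io = [threading.Semaphore(1) for _ in range(slots)]

    def compute():
        for n in range(num_chunks):
            i = n % slots
            done_host_io[i].acquire()  # do not clobber a slot still in use
            vram[i] = f"chunk-{n}"     # "decode" into the slot
            done_decode[i].release()

    def copy():
        for n in range(num_chunks):
            i = n % slots
            done_decode[i].acquire()   # the decode must have finished
            host[i] = vram[i]          # "D2H" into the pinned slot
            done_d2h[i].release()

    def worker():
        for n in range(num_chunks):
            i = n % slots
            done_d2h[i].acquire()      # the D2H must have finished
            written.append(host[i])    # "file write"
            done_host_io[i].release()  # the slot may now be reused

    lanes = [threading.Thread(target=fn) for fn in (compute, copy, worker)]
    for t in lanes:
        t.start()
    for t in lanes:
        t.join()
    return written


if __name__ == "__main__":
    print(run_pipeline(6))
```

With all three barriers in place the chunks come out complete and in order. Removing any one acquire call from the model reintroduces the corresponding race described in the list above.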

CUDA events are the right primitive here because they:

  • Cross stream boundaries — record on one stream, wait on another.
  • Cross thread boundaries (any host thread, including the worker thread, can block on an event with cudaEventSynchronize).
  • Are non-blocking to record (recording is enqueued onto the stream, not synchronous on the host).
  • Have a clear "has this happened?" semantic — the AWS post describes events as "a barrier that clears if it can answer closed questions such as: Has decoding of chunk N completed?".
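
These four properties can be illustrated with a toy model. The sketch below is plain Python rather than CUDA: ModelStream and ModelEvent are hypothetical stand-ins whose methods mirror the roles of the real runtime calls cudaEventRecord (record), cudaStreamWaitEvent (stream_wait), cudaEventQuery (query), and cudaEventSynchronize (synchronize). It assumes each event is recorded only once, which is simpler than real CUDA events.

```python
import queue
import threading


class ModelStream:
    """Toy stand-in for a CUDA stream: runs enqueued callables in FIFO order."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            fn = self._work.get()
            if fn is None:  # shutdown sentinel
                return
            fn()

    def enqueue(self, fn):
        self._work.put(fn)  # returns immediately; the caller never blocks here

    def close(self):
        self._work.put(None)
        self._thread.join()


class ModelEvent:
    """Toy stand-in for a CUDA event (single-use in this sketch)."""

    def __init__(self):
        self._flag = threading.Event()

    def record(self, stream):
        stream.enqueue(self._flag.set)   # completes when the stream reaches it

    def stream_wait(self, stream):
        stream.enqueue(self._flag.wait)  # holds back later work on `stream`

    def query(self):
        return self._flag.is_set()       # "has this happened?" without blocking

    def synchronize(self):
        self._flag.wait()                # blocks the calling host thread


def demo():
    compute, copy = ModelStream(), ModelStream()
    gate = threading.Event()             # keeps the fake kernel "running"
    log = []
    done_decode, done_d2h = ModelEvent(), ModelEvent()

    compute.enqueue(gate.wait)
    compute.enqueue(lambda: log.append("decode"))
    done_decode.record(compute)          # enqueued, not executed yet
    done_decode.stream_wait(copy)        # cross-stream barrier
    copy.enqueue(lambda: log.append("d2h"))
    done_d2h.record(copy)

    pending = not done_decode.query()    # the fake kernel has not finished
    gate.set()                           # let the fake kernel finish
    done_d2h.synchronize()               # host thread waits on the event
    compute.close()
    copy.close()
    return pending, log


if __name__ == "__main__":
    print(demo())
```

In the demo, record returns while the fake kernel is still running, query reports that the event is still pending, and the copy lane cannot log its D2H until the decode has been logged.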

When to apply

  • Together with patterns/dual-cuda-stream-compute-and-copy-overlap — this pattern is what makes that pattern safe.
  • Whenever pipeline overlap is desired across distinct hardware lanes (compute engine + copy engine + host I/O thread), with shared memory regions that need ordering guarantees.

For a single-stream, single-thread pipeline, plain CUDA stream ordering already gives the necessary guarantees. This pattern becomes necessary only when overlap demands sharing buffers across stream + thread boundaries.

Trade-offs

  • 2× the buffer footprint per stage — for the wiki-attested Wan 2.2 14B VAE decoder, this is 2× chunk-sized buffers in VRAM and 2× chunk-sized pinned host buffers. Acceptable because chunk size is bounded.
  • Triple-buffering or larger can hide more variance (e.g. if worker-thread file I/O latency is highly variable). The AWS / Synthesia post uses double buffering; the right number is profile-driven.
  • Correctness depends on getting all three barriers right. Omitting any one of them creates a silent race that is hard to reproduce on demand yet corrupts output in production.
  • Pinned host buffers are scarce OS-kernel resources because they are page-locked memory (see concepts/pinned-memory). Pre-allocate them, reuse them, and never allocate per request.
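
The footprint and pre-allocation points can be made concrete with a small model. In the sketch below, SlotPool is a hypothetical helper and each bytearray merely stands in for a chunk-sized pinned host buffer (real code would obtain page-locked memory from the CUDA runtime instead). The buffers are allocated once, reused by slot index, and the total footprint scales linearly with the buffering depth.

```python
class SlotPool:
    """Fixed set of chunk-sized buffers, allocated once and reused by slot."""

    def __init__(self, chunk_bytes, depth=2):
        # All allocation happens here, never on the per-chunk path.
        self._slots = [bytearray(chunk_bytes) for _ in range(depth)]

    def slot_for(self, chunk_index):
        # Chunk N always maps to slot N mod depth.
        return self._slots[chunk_index % len(self._slots)]

    @property
    def footprint(self):
        return sum(len(s) for s in self._slots)


if __name__ == "__main__":
    chunk = 1024  # hypothetical chunk size in bytes, not a figure from the source
    for depth in (2, 3):
        pool = SlotPool(chunk, depth)
        print(depth, "slots ->", pool.footprint, "bytes")
```

Going from double to triple buffering raises the footprint from 2× to 3× the chunk size on each side, which is the cost to weigh against the extra latency variance it can absorb.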

Generalisation

The pattern is the GPU-pipeline instance of the more general N-buffer + barrier handoff pattern that recurs in:

  • Producer-consumer queues with bounded slots.
  • Lock-free ring buffers between threads.
  • Async I/O pipelines (e.g. io_uring SQE/CQE rings).
  • Hardware DMA pipelines outside of GPUs.

The CUDA-event part is the GPU-specific sync primitive; the double-buffer part is universal.
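
As a hedged illustration of the universal half, the same bounded-slot handoff can be written with nothing but a standard-library queue. In this sketch, queue.Queue(maxsize=2) plays the role of the two buffers, and its blocking put and get play the role of the barriers; the function name bounded_handoff and the doubling "work" are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def bounded_handoff(items, slots=2):
    """Producer-consumer handoff through a queue with `slots` bounded slots."""
    q = queue.Queue(maxsize=slots)
    out = []

    def consumer():
        while True:
            item = q.get()        # blocks until a slot has been filled
            if item is _DONE:
                return
            out.append(item * 2)  # stand-in for the downstream stage

    t = threading.Thread(target=consumer)
    t.start()
    for item in items:
        q.put(item)               # blocks while all slots are occupied
    q.put(_DONE)
    t.join()
    return out


if __name__ == "__main__":
    print(bounded_handoff([1, 2, 3, 4, 5]))
```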

Seen in