PATTERN Cited by 1 source

Asynchronous Frame Generation Pipeline

Definition

The Asynchronous Frame Generation Pipeline is the umbrella pattern AWS and Synthesia Research Engineering designed for chunked latent-diffusion video inference (specifically the VAE decoder stage). It overlaps three operations that, in a naive synchronous pipeline, would serialise:

  1. GPU compute — VAE-decoder kernels for chunk N+1.
  2. Device-to-host (D2H) copy — moving chunk N's decoded pixels from VRAM to host RAM.
  3. Host-side I/O — committing chunk N's pixels from host RAM to file or downstream stage.

The pattern is the composition of two sub-patterns plus one software-side primitive:

  • patterns/dual-cuda-stream-compute-and-copy-overlap — splits GPU work onto two streams so compute and D2H run on the separate compute and copy engines concurrently.
  • patterns/double-buffer-cuda-events-pipeline-overlap — duplicates VRAM and pinned-host buffers so adjacent chunks operate on distinct memory regions, with CUDA events as cross-stream barriers.
  • A dedicated worker CPU thread — drains pinned host buffers to disk so the main Python thread is free to launch CUDA kernels and schedule D2H transfers.

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances; named after the post's own term.)

When to apply

Apply when all of these hold:

  • Inference is chunked — output produced in batches, where each batch must leave the GPU before being consumable.
  • Each chunk's output is transferred device-to-host between consecutive kernel launches.
  • GPU kernel utilisation is visibly below 100% in profiling, with gaps coinciding with per-chunk D2H + I/O windows.
  • The host-side I/O step (file write, network send, downstream hand-off) is non-trivial — i.e. blocks the main Python thread enough to delay the next kernel launch.

If any of these conditions does not hold, a simpler pattern suffices. In particular, if kernel utilisation is already 99%+, this pattern adds complexity for no gain.

Architecture

Main Python thread             Compute Stream      Copy Stream      Worker thread
─────────────────────────────────────────────────────────────────────────────────
launch decode chunk N  ───►  [decode N kernels]
                                    ▼ (event: "decode N done")
                             [decode chunk N+1]    [D2H copy N]
                             on Compute Stream      on Copy Stream
                                    │                     │
                                    │                     ▼ (event: "D2H N done")
                                    │              (frees VRAM-buf-A and
                                    │               wakes worker thread)
                             [decode chunk N+2]                       [worker drains
                             on Compute Stream                         pinned-host-buf-A
                             (uses VRAM-buf-A,                         to file]
                              now freed)
                                                                              ▼ (event:
                                                                       "host I/O N done")
                                                                       (frees pinned-host-buf-A
                                                                        for the next D2H)

Three consequences of this layout:

  • Compute kernels run uninterrupted on the Compute Stream while D2H transfers and host-side file writes happen on the Copy Stream and Worker thread respectively.
  • Two pinned host buffers + two VRAM buffers are needed — one set "in flight" (currently being used by compute / D2H), one set "draining" (currently being read by worker / freed after worker finishes).
  • CUDA events mediate cross-stream + cross-thread synchronisation ("decoding of chunk N completed?", "D2H of chunk N completed?", "host I/O of chunk N completed?").
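The layout above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the code from the AWS sample: `decode_chunks_async`, `decode_fn` and `write_fn` are hypothetical names, every chunk is assumed to decode to the same shape, and the "host I/O N done" signal is modelled with a `threading.Event` rather than a CUDA event. For brevity the two VRAM slots hold whatever tensor `decode_fn` returns instead of preallocated buffers.

```python
import queue
import threading


def decode_chunks_async(decode_fn, latent_chunks, write_fn):
    """Overlap decode (compute stream), D2H (copy stream) and host I/O (worker).

    decode_fn(latents) -> CUDA tensor and write_fn(cpu_tensor) are hypothetical
    callables supplied by the caller.
    """
    import torch  # imported lazily so this module loads without PyTorch

    compute, copy = torch.cuda.Stream(), torch.cuda.Stream()
    # Inputs produced on the caller's stream must be ready before we decode.
    compute.wait_stream(torch.cuda.current_stream())

    vram = [None, None]                                # two device output slots
    pinned = [None, None]                              # two page-locked host buffers
    decoded = [torch.cuda.Event() for _ in range(2)]   # "decode N done"
    copied = [torch.cuda.Event() for _ in range(2)]    # "D2H N done"
    drained = [threading.Event() for _ in range(2)]    # "host I/O N done"
    for ev in drained:
        ev.set()                                       # both host buffers start free
    jobs = queue.Queue()

    def worker():
        while (slot := jobs.get()) is not None:
            copied[slot].synchronize()                 # blocks this thread, not the GPU
            write_fn(pinned[slot])
            drained[slot].set()                        # host buffer may be reused

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    for n, latents in enumerate(latent_chunks):
        slot = n % 2
        with torch.cuda.stream(compute):
            # Do not touch this VRAM slot until chunk n-2 has been copied out.
            compute.wait_event(copied[slot])
            vram[slot] = decode_fn(latents)
            decoded[slot].record(compute)

        # Back-pressure: wait until the worker has drained this host buffer.
        drained[slot].wait()
        drained[slot].clear()
        if pinned[slot] is None:
            pinned[slot] = torch.empty(
                vram[slot].shape, dtype=vram[slot].dtype, pin_memory=True
            )
        with torch.cuda.stream(copy):
            copy.wait_event(decoded[slot])             # D2H starts after decode n
            pinned[slot].copy_(vram[slot], non_blocking=True)
            copied[slot].record(copy)
        jobs.put(slot)                                 # hand chunk n to the worker

    jobs.put(None)                                     # sentinel: no more chunks
    thread.join()
```

Because the worker passes `write_fn` the reusable pinned buffer itself, `write_fn` has to finish with the data (or copy it) before returning. The main thread only blocks on `drained[slot].wait()`, and it does so after the next decode has already been queued on the compute stream.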

Wiki-attested results

On a g7e.2xlarge instance running the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder against a 41-latent-frame test video (10 consecutive cycles after warmup):

Metric                    Synchronous    Asynchronous
Mean latency (s/video)    21.99          20.17
P99 latency (s/video)     22.01          20.20
GPU kernel utilisation    82%            99.9%
Real Time Factor          3.21           2.95

  • Latency reduction: 8.2%.
  • GPU stalls: visibly absent in profiling traces (Fig. 7 of the AWS post) compared to the synchronous pipeline (Fig. 6) where stalls are clearly visible between chunks.
  • Theoretical saving: ~$896 per 1,000 hours of decoded video per GPU at the published g7e.2xlarge price ($3.36 / GPU-hour in us-east-2).

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances, Table 1 + the conclusion paragraph.)
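As a sanity check, the derived figures can be recomputed from the rounded table values. The sketch below assumes Real Time Factor means GPU processing time divided by video duration (so 1,000 hours of video cost 1,000 × RTF GPU-hours); that definition and the function names are assumptions, not taken from the post. The results land close to, but not exactly on, the quoted 8.2% and ~$896, which presumably were computed from unrounded measurements.

```python
def latency_reduction(sync_s, async_s):
    """Fractional drop in mean latency when moving from sync to async."""
    return (sync_s - async_s) / sync_s


def saving_per_1000_video_hours(sync_rtf, async_rtf, usd_per_gpu_hour):
    """USD saved per 1,000 hours of decoded video, assuming RTF is
    GPU-seconds of processing per second of video."""
    gpu_hours_saved = (sync_rtf - async_rtf) * 1000
    return gpu_hours_saved * usd_per_gpu_hour


# Rounded values from the table and the published price.
reduction = latency_reduction(21.99, 20.17)
saving = saving_per_1000_video_hours(3.21, 2.95, 3.36)
print(f"latency reduction: {reduction:.1%}")
print(f"saving per 1,000 video-hours: ${saving:.0f}")
```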

Why each piece is necessary

Drop any one component and the pattern doesn't work:

  • Without dual streams — D2H serialises with compute on the default stream → GPU stalls anyway.
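  • Without pinned host buffers — a D2H copy into pageable memory is staged through a driver-managed bounce buffer and is no longer truly asynchronous: the call returns only once the copy has completed, so the launching thread blocks → GPU stalls anyway.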
  • Without double buffering — chunk N+1's compute aliases chunk N's VRAM buffer → either correctness break or implicit serialisation through buffer reuse.
  • Without CUDA events — worker thread reads half-written buffer / chunk N+1's compute corrupts chunk N's bytes → correctness break.
  • Without dedicated worker thread — main Python thread is busy writing files → not launching CUDA kernels → GPU sits idle waiting for the next kernel launch, defeating the dual-stream overlap.
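The buffer-reuse hazard in the double-buffering and events bullets can be reproduced without a GPU. The following standard-library sketch models only the host-side half of the pattern: a producer stands in for D2H copies landing in two reusable buffers, a worker thread stands in for the file writer, and `threading.Event` objects stand in for the "host I/O N done" signal. All names, including the `gated` switch, are hypothetical.

```python
import queue
import threading
import time


def run_pipeline(num_chunks, chunk_size=4, gated=True):
    """Fill two reusable buffers in turn while a worker thread drains them.

    With gated=False the producer skips the 'drained' events and may
    overwrite a buffer the worker has not read yet.
    """
    buffers = [bytearray(chunk_size), bytearray(chunk_size)]
    drained = [threading.Event(), threading.Event()]
    for ev in drained:
        ev.set()                                  # both buffers start free
    jobs = queue.Queue()
    written = []

    def worker():
        while (slot := jobs.get()) is not None:
            time.sleep(0.01)                      # stand-in for a slow file write
            written.append(bytes(buffers[slot]))
            drained[slot].set()                   # buffer may be reused

    thread = threading.Thread(target=worker)
    thread.start()
    for n in range(num_chunks):
        slot = n % 2
        if gated:
            drained[slot].wait()                  # wait until chunk n-2 is written
            drained[slot].clear()
        buffers[slot][:] = bytes([n]) * chunk_size    # stand-in for D2H of chunk n
        jobs.put(slot)
    jobs.put(None)                                # sentinel: no more chunks
    thread.join()
    return written


expected = [bytes([n]) * 4 for n in range(6)]
print(run_pipeline(6) == expected)                # gated run preserves every chunk
print(run_pipeline(6, gated=False) == expected)   # ungated run typically does not
```

In the ungated run the producer usually finishes all six writes before the worker has read anything, so the worker sees only the last two chunks repeated. The exact outcome depends on thread scheduling.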

Trade-offs

  • Memory cost: 2× VRAM and 2× pinned host RAM for chunk buffers — a small fixed overhead that scales with the chunk size, not the total video length. Acceptable for chunked workloads where the per-chunk size is bounded.
  • Code complexity: dual streams, two buffer pools, CUDA events, a worker thread, and explicit synchronisation. The reference implementation is on the order of hundreds of lines of careful code; correctness depends on getting all five pieces right.
  • Profile-driven, not auto-tuned: the right number of buffers (2 vs 3 vs more) and the right worker-thread pool size depend on profiling per workload.
  • Compiler / fused kernel interaction: AWS notes that the kernel-utilisation gain will be larger on optimised / compiled models. Fused kernels shorten per-chunk GPU compute time while the per-chunk D2H and host I/O time stays the same, so in a synchronous pipeline the stall takes up a larger share of wall-clock time. The pattern therefore becomes more important on optimised models, not less.
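The last point can be made concrete with an idealised timing model. The sketch below assumes every chunk costs the same and that the asynchronous pipeline is paced purely by its slowest stage (the textbook pipeline formula: fill cost plus one bottleneck interval per remaining chunk). The function names and the per-chunk times are illustrative, not measurements from the post.

```python
def sync_time(n, compute, d2h, io):
    """Synchronous pipeline: every chunk pays all three stages back to back."""
    return n * (compute + d2h + io)


def async_time(n, compute, d2h, io):
    """Idealised overlapped pipeline: the first chunk pays all stages once,
    then each further chunk costs one interval of the slowest stage."""
    return (compute + d2h + io) + (n - 1) * max(compute, d2h, io)


def relative_gain(n, compute, d2h, io):
    """Fraction of synchronous wall-clock time removed by overlapping."""
    return 1 - async_time(n, compute, d2h, io) / sync_time(n, compute, d2h, io)


# Illustrative per-chunk times in seconds: same D2H and I/O, faster compute.
baseline = relative_gain(n=10, compute=2.0, d2h=0.2, io=0.2)
fused = relative_gain(n=10, compute=1.0, d2h=0.2, io=0.2)
print(f"gain with 2.0 s compute per chunk: {baseline:.1%}")
print(f"gain with 1.0 s compute per chunk: {fused:.1%}")
```

Under these assumptions, halving compute time raises the relative gain from about 15% to about 26%, because the fixed D2H and I/O cost is a larger fraction of each synchronous chunk.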

Generalisation

The pattern applies to any chunked inference pipeline that transfers output to host between chunks:

  • Latent-diffusion video (the wiki-attested case).
  • Image-batch generation at batch granularity.
  • Streaming audio generation at audio-frame granularity.
  • LLM batch serving at the post-processing boundary — see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-altitude framing (post-process batch N while GPU runs batch N+1, parallelism along the time axis within a process).
  • Real-time video processing outside generative AI — same shape applies to e.g. on-device inference for real-time video filters where output frames must be drained to display while next frames decode.

AWS is explicit: the pattern is not specific to the Wan architecture, nor to the specific GPU utilised.

Reference implementation

aws-samples/sample-asynchronous-video-decoding applies the pattern to the Hugging Face Diffusers Wan 2.2 14B model in PyTorch:

  • 1-benchmark-video-decoding.ipynb — synchronous vs asynchronous benchmark.
  • 2-profile-video-decoding.ipynb — Nsight-style profile comparison showing kernel-stall removal.
