PATTERN Cited by 1 source

Asynchronous Frame Generation Pipeline¶

Definition¶

The Asynchronous Frame Generation Pipeline is the umbrella pattern AWS and Synthesia Research Engineering designed for chunked latent-diffusion video inference (specifically the VAE decoder stage). It overlaps three operations that, in a naive synchronous pipeline, would serialise:

GPU compute — VAE-decoder kernels for chunk N+1.
Device-to-host (D2H) copy — moving chunk N's decoded pixels from VRAM to host RAM.
Host-side I/O — committing chunk N's pixels from host RAM to file or downstream stage.

The pattern is the composition of two sub-patterns plus one software-side primitive:

patterns/dual-cuda-stream-compute-and-copy-overlap — splits GPU work onto two streams so compute and D2H run on the separate compute and copy engines concurrently.
patterns/double-buffer-cuda-events-pipeline-overlap — duplicates VRAM and pinned-host buffers so adjacent chunks operate on distinct memory regions, with CUDA events as cross-stream barriers.
A dedicated worker CPU thread — drains pinned host buffers to disk so the main Python thread is free to launch CUDA kernels and schedule D2H transfers.

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances; named after the post's own term.)

When to apply¶

Apply when all of these hold:

Inference is chunked — output produced in batches, where each batch must leave the GPU before being consumable.
Each chunk's output is transferred device-to-host between consecutive kernel launches.
GPU kernel utilisation is visibly below 100% in profiling, with gaps coinciding with per-chunk D2H + I/O windows.
The host-side I/O step (file write, network send, downstream hand-off) is non-trivial: it blocks the main Python thread long enough to delay the next kernel launch.

If any of these don't hold, simpler patterns suffice. If kernel utilisation is already 99%+, this pattern adds complexity for no gain.
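
Whether the last two conditions hold can be estimated from per-chunk stage timings before taking on the added complexity. The sketch below is an idealised model, not something taken from the AWS post: it assumes every chunk spends the same time in each stage and that the stages overlap perfectly once the pipeline is full. The function names and the timings in the usage example are hypothetical.

```python
def synchronous_time(n_chunks, compute_s, d2h_s, io_s):
    """Wall-clock time when each chunk runs compute, D2H and I/O back to back."""
    return n_chunks * (compute_s + d2h_s + io_s)


def overlapped_time(n_chunks, compute_s, d2h_s, io_s):
    """Idealised wall-clock time when the three stages overlap across chunks.

    The first chunk still passes through all three stages; every later chunk
    adds only the slowest stage, which is the bottleneck of the pipeline.
    """
    if n_chunks == 0:
        return 0.0
    bottleneck = max(compute_s, d2h_s, io_s)
    return (compute_s + d2h_s + io_s) + (n_chunks - 1) * bottleneck


def kernel_utilisation(n_chunks, compute_s, wall_clock_s):
    """Fraction of wall-clock time the GPU spends in compute kernels."""
    return n_chunks * compute_s / wall_clock_s


if __name__ == "__main__":
    # Hypothetical per-chunk timings in seconds; measure your own with a profiler.
    n, compute, d2h, io = 10, 1.00, 0.15, 0.05
    sync = synchronous_time(n, compute, d2h, io)
    overlap = overlapped_time(n, compute, d2h, io)
    print(f"synchronous: {sync:.2f} s, utilisation {kernel_utilisation(n, compute, sync):.1%}")
    print(f"overlapped:  {overlap:.2f} s, utilisation {kernel_utilisation(n, compute, overlap):.1%}")
```

When compute dominates and the other two stages are negligible, the two estimates converge, which corresponds to the "no gain" case above.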

Architecture¶

Main Python thread             Compute Stream      Copy Stream      Worker thread
─────────────────────────────────────────────────────────────────────────────────
launch decode chunk N  ───►  [decode N kernels]
                                    │
                                    ▼ (event: "decode N done")
                             [decode chunk N+1]    [D2H copy N]
                             on Compute Stream      on Copy Stream
                                    │                     │
                                    │                     ▼ (event: "D2H N done")
                                    │              (frees VRAM-buf-A and
                                    │               wakes worker thread)
                                    ▼
                             [decode chunk N+2]                       [worker drains
                             on Compute Stream                         pinned-host-buf-A
                             (uses VRAM-buf-A,                         to file]
                              now freed)
                                                                              │
                                                                              ▼ (event:
                                                                       "host I/O N done")
                                                                       (frees pinned-host-buf-A
                                                                        for the next D2H)

Three consequences of this layout:

Compute kernels run uninterrupted on the Compute Stream while D2H transfers and host-side file writes happen on the Copy Stream and Worker thread respectively.
Two pinned host buffers + two VRAM buffers are needed — one set "in flight" (currently being used by compute / D2H), one set "draining" (currently being read by worker / freed after worker finishes).
CUDA events mediate cross-stream + cross-thread synchronisation ("decoding of chunk N completed?", "D2H of chunk N completed?", "host I/O of chunk N completed?").
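
The layout above can be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the code from the aws-samples repository: `decode_chunks_async`, `decode_fn`, `write_fn` and `out_shape` are hypothetical names, every chunk is assumed to decode to the same shape, and the "host I/O done" signal is modelled with a host-side `threading.Event` because it originates on a CPU thread. It needs a CUDA-capable GPU to run.

```python
import queue
import threading

import torch


@torch.no_grad()
def decode_chunks_async(latent_chunks, decode_fn, write_fn, out_shape, out_dtype=torch.float32):
    """Overlap decode (compute stream), D2H (copy stream) and host I/O (worker thread).

    Hypothetical callables supplied by the caller:
      decode_fn(latents) -> CUDA tensor of shape `out_shape` and dtype `out_dtype`
      write_fn(index, cpu_tensor) -> persists the chunk; it must finish reading
                                     cpu_tensor before it returns
    """
    compute_stream = torch.cuda.Stream()
    copy_stream = torch.cuda.Stream()
    # Inputs may have been produced on the current stream; order after them.
    compute_stream.wait_stream(torch.cuda.current_stream())

    # Double buffering: chunk i uses slot i % 2 of each pool.
    vram = [torch.empty(out_shape, dtype=out_dtype, device="cuda") for _ in range(2)]
    pinned = [torch.empty(out_shape, dtype=out_dtype, pin_memory=True) for _ in range(2)]
    decode_done = [torch.cuda.Event() for _ in range(2)]
    d2h_done = [torch.cuda.Event() for _ in range(2)]
    host_free = [threading.Event() for _ in range(2)]  # host-side "I/O done" flags
    for flag in host_free:
        flag.set()

    jobs = queue.Queue()
    errors = []

    def worker():
        while True:
            job = jobs.get()
            if job is None:
                return
            index, slot = job
            try:
                d2h_done[slot].synchronize()   # wait until this chunk's D2H has finished
                write_fn(index, pinned[slot])  # host I/O runs off the main thread
            except Exception as exc:           # keep draining so the main thread cannot hang
                errors.append(exc)
            finally:
                host_free[slot].set()          # pinned slot may be reused

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        for i, latents in enumerate(latent_chunks):
            slot = i % 2
            with torch.cuda.stream(compute_stream):
                if i >= 2:
                    # Do not overwrite vram[slot] before chunk i-2 has been copied out.
                    compute_stream.wait_event(d2h_done[slot])
                vram[slot].copy_(decode_fn(latents))
                decode_done[slot].record(compute_stream)

            host_free[slot].wait()   # back-pressure: worker must have drained chunk i-2
            host_free[slot].clear()
            with torch.cuda.stream(copy_stream):
                copy_stream.wait_event(decode_done[slot])
                pinned[slot].copy_(vram[slot], non_blocking=True)
                d2h_done[slot].record(copy_stream)
            jobs.put((i, slot))
    finally:
        jobs.put(None)
        thread.join()
    if errors:
        raise errors[0]
```

The extra `copy_` into `vram[slot]` keeps buffer ownership explicit at the cost of one device-to-device copy; a production version would more likely have the decoder write into the buffer directly.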

Wiki-attested results¶

On a g7e.2xlarge instance running the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder against a 41-latent-frame test video (10 consecutive cycles after warmup):

Metric	Synchronous	Asynchronous
Mean latency (s/video)	21.99	20.17
P99 latency (s/video)	22.01	20.20
GPU kernel utilisation	82%	99.9%
Real Time Factor	3.21	2.95

Latency reduction: 8.2%.
GPU stalls: absent from the asynchronous profiling trace (Fig. 7 of the AWS post), whereas the synchronous trace (Fig. 6) shows clear stalls between chunks.
Theoretical saving: ~$896 per 1,000 hours of decoded video per GPU at the published g7e.2xlarge price ($3.36 / GPU-hour in us-east-2).

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances, Table 1 + the conclusion paragraph.)

Why each piece is necessary¶

Drop any one component and the pattern doesn't work:

Without dual streams — D2H serialises with compute on the default stream → GPU stalls anyway.
Without pinned host buffers — the D2H copy is staged through a pageable bounce buffer and is no longer truly asynchronous, so the call blocks the host and cannot overlap with compute → GPU stalls anyway.
Without double buffering — chunk N+1's compute aliases chunk N's VRAM buffer → either correctness break or implicit serialisation through buffer reuse.
Without CUDA events — worker thread reads half-written buffer / chunk N+1's compute corrupts chunk N's bytes → correctness break.
Without dedicated worker thread — main Python thread is busy writing files → not launching CUDA kernels → GPU stalls on kernel-launch latency, defeating the dual-stream overlap.
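
The worker-thread point can be illustrated without a GPU. In the sketch below, which uses only the Python standard library, `run_with_writer`, `compute` and `write` are hypothetical names: `compute` stands in for launching the next chunk's kernels and `write` for the file write. The bounded queue plays the role of the two-buffer pool, so the main thread blocks only when `depth` results are still waiting to be drained.

```python
import queue
import threading


def run_with_writer(chunks, compute, write, depth=2):
    """Run compute(chunk) on the caller's thread and write(result) on a worker thread."""
    pending = queue.Queue(maxsize=depth)  # bounded: at most `depth` undrained results
    errors = []

    def writer():
        while True:
            item = pending.get()
            if item is None:              # sentinel: no more results
                return
            try:
                write(item)
            except Exception as exc:      # keep draining so the producer cannot deadlock
                errors.append(exc)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        for chunk in chunks:
            pending.put(compute(chunk))   # blocks only when the queue is full
    finally:
        pending.put(None)
        thread.join()
    if errors:
        raise errors[0]


if __name__ == "__main__":
    written = []
    run_with_writer(range(5), lambda x: x * x, written.append)
    print(written)
```

Blocking file writes release Python's global interpreter lock, which is why a plain thread is typically enough to keep the main thread free for kernel launches.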

Trade-offs¶

Memory cost: 2× VRAM and 2× pinned host RAM for chunk buffers — a small fixed overhead that scales with the chunk size, not the total video length. Acceptable for chunked workloads where the per-chunk size is bounded.
Code complexity: dual streams, two buffer pools, CUDA events, a worker thread, and explicit synchronisation. The reference implementation runs to hundreds of lines of careful code; correctness depends on getting all five pieces right.
Profile-driven, not auto-tuned: the right number of buffers (2 vs 3 vs more) and the right worker-thread pool size depend on profiling per workload.
Compiler / fused kernel interaction: AWS notes that the kernel-utilisation gain will be larger on optimised / compiled models. Fused kernels shorten per-chunk GPU compute time while the per-chunk D2H and I/O time stays the same, so in a synchronous pipeline the stall takes up a larger share of wall-clock time. The pattern becomes more important on optimised models, not less.
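
The last trade-off is simple arithmetic. The snippet below uses made-up per-chunk timings (the `stall_share` helper and all numbers are hypothetical, not measurements from the AWS post) to show how halving compute time raises the share of each chunk lost to the stall in a synchronous pipeline.

```python
def stall_share(compute_s, stall_s):
    """Share of per-chunk wall-clock time the GPU sits idle in a synchronous pipeline."""
    return stall_s / (compute_s + stall_s)


if __name__ == "__main__":
    stall = 0.2  # hypothetical D2H + I/O time per chunk, in seconds
    # Hypothetical per-chunk compute times before and after kernel fusion.
    for label, compute in [("eager", 1.0), ("compiled", 0.5)]:
        print(f"{label}: {stall_share(compute, stall):.1%} of each chunk is stall")
```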

Generalisation¶

The pattern applies to any chunked inference pipeline that transfers output to host between chunks:

Latent-diffusion video (the wiki-attested case).
Image-batch generation at batch granularity.
Streaming audio generation at audio-frame granularity.
LLM batch serving at the post-processing boundary — see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-altitude framing (post-process batch N while GPU runs batch N+1, parallelism along the time axis within a process).
Real-time video processing outside generative AI — same shape applies to e.g. on-device inference for real-time video filters where output frames must be drained to display while next frames decode.

AWS is explicit: the pattern is specific neither to the Wan architecture nor to the GPU used.

Reference implementation¶

aws-samples/sample-asynchronous-video-decoding applies the pattern to the Hugging Face Diffusers Wan 2.2 14B model in PyTorch:

1-benchmark-video-decoding.ipynb — synchronous vs asynchronous benchmark.
2-profile-video-decoding.ipynb — Nsight-style profile comparison showing kernel-stall removal.

Seen in¶

sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first and only wiki appearance. AWS + Synthesia Research Engineering co-designed the pattern for VAE-decoder inference on G7e instances. Lifts kernel utilisation from 82% to 99.9%, reduces decode latency by 8.2%, and lowers the Real Time Factor from 3.21 to 2.95. Reference implementation in PyTorch on Wan 2.2 14B; technique generalises to any chunked video generation pipeline.

Related¶

patterns/dual-cuda-stream-compute-and-copy-overlap — CUDA-stream half of the pattern.
patterns/double-buffer-cuda-events-pipeline-overlap — buffer + barrier half of the pattern.
concepts/cuda-stream — primitive used.
concepts/pinned-memory — primitive used.
concepts/device-to-host-transfer — operation overlapped.
concepts/gpu-kernel-utilization — saturation metric.
concepts/latent-diffusion-video-generation — workload shape this pattern is wiki-attested on.
concepts/async-cpu-gpu-pipelined-scheduling — same overlap shape at LLM-batch-serving altitude.
systems/vae-decoder — wiki-attested target stage.
systems/aws-ec2-g7e — wiki-attested substrate.
systems/wan-video-model — wiki-attested benchmark model.
companies/synthesia — co-designer of the pattern.
companies/aws — co-designer + publisher.