PATTERN Cited by 1 source
Asynchronous Frame Generation Pipeline¶
Definition¶
The Asynchronous Frame Generation Pipeline is the umbrella pattern AWS and Synthesia Research Engineering designed for chunked latent-diffusion video inference (specifically the VAE decoder stage). It overlaps three operations that, in a naive synchronous pipeline, would serialise:
- GPU compute — VAE-decoder kernels for chunk N+1.
- Device-to-host (D2H) copy — moving chunk N's decoded pixels from VRAM to host RAM.
- Host-side I/O — committing chunk N's pixels from host RAM to file or downstream stage.
The pattern is the composition of two sub-patterns plus one software-side primitive:
- patterns/dual-cuda-stream-compute-and-copy-overlap — splits GPU work onto two streams so compute and D2H run on the separate compute and copy engines concurrently.
- patterns/double-buffer-cuda-events-pipeline-overlap — duplicates VRAM and pinned-host buffers so adjacent chunks operate on distinct memory regions, with CUDA events as cross-stream barriers.
- A dedicated worker CPU thread — drains pinned host buffers to disk so the main Python thread is free to launch CUDA kernels and schedule D2H transfers.
(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances; named after the post's own term.)
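The overlap described above can be approximated with a simple timing model. The following is a minimal sketch, assuming constant per-chunk stage times and an idealised three-stage pipeline; the function names and the timings are hypothetical illustrations, not figures from the AWS post.

```python
def synchronous_time(n_chunks, compute_s, d2h_s, io_s):
    """Every chunk pays compute + copy + I/O back to back."""
    return n_chunks * (compute_s + d2h_s + io_s)


def pipelined_time(n_chunks, compute_s, d2h_s, io_s):
    """Idealised 3-stage pipeline: fill once, then the slowest stage sets the pace."""
    bottleneck = max(compute_s, d2h_s, io_s)
    return compute_s + d2h_s + io_s + (n_chunks - 1) * bottleneck


# Hypothetical per-chunk timings in seconds (assumed, not measured).
sync = synchronous_time(10, compute_s=1.0, d2h_s=0.1, io_s=0.1)
overlapped = pipelined_time(10, compute_s=1.0, d2h_s=0.1, io_s=0.1)
print(f"synchronous: {sync:.1f} s, pipelined: {overlapped:.1f} s")
```

With these assumed numbers the synchronous pipeline takes about 12.0 s and the overlapped one about 10.2 s: once compute is the slowest stage, the D2H and I/O time is hidden behind it for every chunk except the last.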
When to apply¶
Apply when all of these hold:
- Inference is chunked — output produced in batches, where each batch must leave the GPU before being consumable.
- Each chunk's output is transferred device-to-host between consecutive kernel launches.
- GPU kernel utilisation is visibly below 100% in profiling, with gaps coinciding with per-chunk D2H + I/O windows.
- The host-side I/O step (file write, network send, downstream hand-off) is non-trivial — i.e. blocks the main Python thread enough to delay the next kernel launch.
If any of these conditions does not hold, simpler patterns suffice. In particular, if kernel utilisation is already 99%+, this pattern adds complexity for no gain.
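The utilisation criterion can be checked from a profile export. The sketch below assumes a hypothetical trace format, a list of `(start_s, end_s)` kernel intervals, and reports the busy fraction plus the idle gaps between kernels; real profilers export richer data that would need converting first.

```python
def kernel_utilisation(intervals):
    """Return (utilisation, gaps) for a list of (start_s, end_s) kernel intervals."""
    ordered = sorted(intervals)
    merged = [list(ordered[0])]
    for start, end in ordered[1:]:
        if start <= merged[-1][1]:          # overlaps or touches the previous busy span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    busy = sum(end - start for start, end in merged)
    wall = merged[-1][1] - merged[0][0]
    gaps = [(a[1], b[0]) for a, b in zip(merged, merged[1:])]
    return busy / wall, gaps


# Hypothetical trace: three chunk decodes separated by D2H + I/O windows.
trace = [(0.0, 1.0), (1.25, 2.25), (2.5, 3.5)]
util, gaps = kernel_utilisation(trace)
print(f"utilisation: {util:.1%}, gaps: {gaps}")
```

If the gaps line up with the per-chunk D2H and I/O windows, the workload is a candidate for this pattern.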
Architecture¶
Main Python thread         Compute Stream          Copy Stream              Worker thread
─────────────────────────────────────────────────────────────────────────────────────────────────────
launch decode chunk N ───► [decode N kernels]
                                 │
                                 ▼ (event: "decode N done")
                           [decode chunk N+1]      [D2H copy N]
                           on Compute Stream       on Copy Stream
                                 │                      │
                                 │                      ▼ (event: "D2H N done")
                                 │                 (frees VRAM-buf-A and
                                 │                  wakes worker thread)
                                 ▼
                           [decode chunk N+2]                               [worker drains
                           on Compute Stream                                 pinned-host-buf-A
                           (uses VRAM-buf-A,                                 to file]
                            now freed)
                                                                                  │
                                                                                  ▼ (event:
                                                                                    "host I/O N done")
                                                                            (frees pinned-host-buf-A
                                                                             for the next D2H)
Three consequences of this layout:
- Compute kernels run uninterrupted on the Compute Stream while D2H transfers and host-side file writes happen on the Copy Stream and Worker thread respectively.
- Two pinned host buffers + two VRAM buffers are needed — one set "in flight" (currently being used by compute / D2H), one set "draining" (currently being read by worker / freed after worker finishes).
- CUDA events mediate cross-stream + cross-thread synchronisation ("decoding of chunk N completed?", "D2H of chunk N completed?", "host I/O of chunk N completed?").
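The layout above can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the reference implementation: `decode_chunk`, `write_chunk` and `out_shape` are hypothetical stand-ins for the VAE decoder call, the file writer and the decoded chunk shape, and the "host I/O done" signal is modelled with a Python semaphore rather than a CUDA event. It needs a CUDA-capable GPU to run.

```python
import queue
import threading

try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def decode_async(decode_chunk, latent_chunks, out_shape, write_chunk, dtype=None):
    """Decode on one stream, copy on another, drain on a worker thread."""
    dtype = dtype or torch.float16
    compute, copy = torch.cuda.Stream(), torch.cuda.Stream()
    compute.wait_stream(torch.cuda.current_stream())  # inputs may still be in flight
    vram = [torch.empty(out_shape, dtype=dtype, device="cuda") for _ in range(2)]
    host = [torch.empty(out_shape, dtype=dtype, pin_memory=True) for _ in range(2)]
    decoded = [torch.cuda.Event() for _ in range(2)]         # "decode N done"
    copied = [torch.cuda.Event() for _ in range(2)]          # "D2H N done"
    host_free = [threading.Semaphore(1) for _ in range(2)]   # "host I/O N done"
    jobs = queue.Queue()

    def worker():
        while (job := jobs.get()) is not None:
            slot, index = job
            copied[slot].synchronize()      # pinned buffer is now fully written
            write_chunk(index, host[slot])  # host-side I/O, off the main thread
            host_free[slot].release()       # pinned slot can take the next D2H

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        for index, latents in enumerate(latent_chunks):
            slot = index % 2
            with torch.cuda.stream(compute):
                if index >= 2:
                    compute.wait_event(copied[slot])  # VRAM slot was copied out
                vram[slot].copy_(decode_chunk(latents))
                decoded[slot].record(compute)
            host_free[slot].acquire()                 # pinned slot was drained
            with torch.cuda.stream(copy):
                copy.wait_event(decoded[slot])
                host[slot].copy_(vram[slot], non_blocking=True)
                copied[slot].record(copy)
            jobs.put((slot, index))
    finally:
        jobs.put(None)   # tell the worker to stop
        thread.join()
```

The main thread only enqueues work, so it returns to launching the next decode while the previous chunk's copy and file write are still in progress. Error handling is deliberately omitted: if `write_chunk` raises, the semaphore is never released and the main thread would block.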
Wiki-attested results¶
On a g7e.2xlarge instance running the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder against a 41-latent-frame test video (10 consecutive cycles after warmup):
| Metric | Synchronous | Asynchronous |
|---|---|---|
| Mean latency (s/video) | 21.99 | 20.17 |
| P99 latency (s/video) | 22.01 | 20.20 |
| GPU kernel utilisation | 82% | 99.9% |
| Real Time Factor | 3.21 | 2.95 |
- Latency reduction: 8.2%.
- GPU stalls: visibly absent in profiling traces (Fig. 7 of the AWS post) compared to the synchronous pipeline (Fig. 6) where stalls are clearly visible between chunks.
- Theoretical saving: ~$896 per 1,000 hours of decoded video per GPU at the published g7e.2xlarge price ($3.36 / GPU-hour in us-east-2).
(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances, Table 1 + the conclusion paragraph.)
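As a rough cross-check, the derived figures can be recomputed from the rounded table values. The sketch assumes that Real Time Factor means seconds of processing per second of decoded video (so lower is better), which is consistent with the table but is not stated explicitly here.

```python
sync_latency_s, async_latency_s = 21.99, 20.17   # mean latency, from the table
sync_rtf, async_rtf = 3.21, 2.95                 # Real Time Factor, from the table
price_per_gpu_hour = 3.36                        # published g7e.2xlarge price, USD

latency_reduction = (sync_latency_s - async_latency_s) / sync_latency_s
gpu_hours_saved = (sync_rtf - async_rtf) * 1000  # per 1,000 hours of decoded video
saving_usd = gpu_hours_saved * price_per_gpu_hour

print(f"latency reduction: {latency_reduction:.2%}")
print(f"GPU-hours saved per 1,000 video hours: {gpu_hours_saved:.0f}")
print(f"approximate saving: ${saving_usd:.0f}")
```

The rounded inputs give a latency reduction just under 8.3% and a saving of roughly $874, close to the published 8.2% and ~$896; the small differences presumably come from the published figures using unrounded measurements.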
Why each piece is necessary¶
Drop any one component and the pattern doesn't work:
- Without dual streams — D2H serialises with compute on the default stream → GPU stalls anyway.
- Without pinned host buffers — D2H from pageable memory is staged through a driver-managed bounce buffer and is no longer truly asynchronous, so the copy stops overlapping with compute → GPU stalls anyway.
- Without double buffering — chunk N+1's compute aliases chunk N's VRAM buffer → either correctness break or implicit serialisation through buffer reuse.
- Without CUDA events — worker thread reads half-written buffer / chunk N+1's compute corrupts chunk N's bytes → correctness break.
- Without dedicated worker thread — main Python thread is busy writing files → not launching CUDA kernels → GPU stalls on kernel-launch latency, defeating the dual-stream overlap.
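The role of double buffering and cross-thread signalling can be illustrated without a GPU. The sketch below is a plain-Python stand-in: semaphores play the part of the "buffer ready" and "buffer free" events, a producer loop plays the decoder, and a consumer thread plays the worker. All names are hypothetical.

```python
import threading


def run_pipeline(n_chunks, chunk_len=4):
    buffers = [[None] * chunk_len for _ in range(2)]      # two reusable slots
    ready = [threading.Semaphore(0) for _ in range(2)]    # producer -> consumer
    free = [threading.Semaphore(1) for _ in range(2)]     # consumer -> producer
    drained = []

    def consumer():
        for index in range(n_chunks):
            slot = index % 2
            ready[slot].acquire()                # wait for a fully written slot
            drained.append(list(buffers[slot]))  # snapshot, like a file write
            free[slot].release()                 # slot may now be overwritten

    thread = threading.Thread(target=consumer)
    thread.start()
    for index in range(n_chunks):
        slot = index % 2
        free[slot].acquire()                     # never overwrite undrained data
        for i in range(chunk_len):
            buffers[slot][i] = index             # "decode" the chunk into the slot
        ready[slot].release()
    thread.join()
    return drained


chunks = run_pipeline(6)
print(chunks)
```

Removing the `free` semaphore would let the producer overwrite a slot the consumer is still reading, and removing `ready` would let the consumer read a half-written slot, which are the two correctness breaks the list above describes.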
Trade-offs¶
- Memory cost: 2× VRAM and 2× pinned host RAM for chunk buffers — a small fixed overhead that scales with the chunk size, not the total video length. Acceptable for chunked workloads where the per-chunk size is bounded.
- Code complexity: dual streams, two buffer pools, CUDA events, a worker thread, and explicit synchronisation. The reference implementation runs to hundreds of lines of careful code; correctness depends on getting all five pieces right.
- Profile-driven, not auto-tuned: the right number of buffers (2 vs 3 vs more) and the right worker-thread pool size depend on profiling per workload.
- Compiler / fused kernel interaction: AWS notes that the kernel-utilisation gain will be larger on optimised / compiled models. Fused kernels reduce per-chunk GPU compute time while D2H and host I/O time stay the same, so in a synchronous pipeline the per-chunk stall takes up a larger share of wall-clock time. The pattern becomes more important on optimised models, not less.
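Two of these trade-offs reduce to simple arithmetic. The sketch below uses hypothetical timings and a hypothetical chunk shape to show how the stall share grows when compute shrinks, and how the extra buffer memory depends only on chunk size.

```python
def stall_share(compute_s, d2h_s, io_s):
    """Fraction of synchronous per-chunk wall-clock spent outside GPU kernels."""
    return (d2h_s + io_s) / (compute_s + d2h_s + io_s)


def buffer_bytes(chunk_elems, bytes_per_elem, n_buffers=2):
    """Memory per side (VRAM or pinned host) for n_buffers chunk slots."""
    return n_buffers * chunk_elems * bytes_per_elem


# Hypothetical timings: fused kernels halve compute, transfers are unchanged.
baseline = stall_share(compute_s=1.0, d2h_s=0.1, io_s=0.1)
fused = stall_share(compute_s=0.5, d2h_s=0.1, io_s=0.1)
print(f"stall share: baseline {baseline:.1%}, fused {fused:.1%}")

# Hypothetical chunk: 16 RGB frames at 720x1280 stored as float16 (2 bytes each).
per_side = buffer_bytes(16 * 3 * 720 * 1280, bytes_per_elem=2)
print(f"double-buffer footprint per side: {per_side / 2**20:.0f} MiB")
```

In this assumed example the stall share rises from about 17% to about 29% once compute is halved, which is why the overlap matters more on optimised models, while the buffer footprint stays fixed regardless of how many chunks the video contains.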
Generalisation¶
The pattern applies to any chunked inference pipeline that transfers output to host between chunks:
- Latent-diffusion video (the wiki-attested case).
- Image-batch generation at batch granularity.
- Streaming audio generation at audio-frame granularity.
- LLM batch serving at the post-processing boundary — see concepts/async-cpu-gpu-pipelined-scheduling for the LLM-altitude framing (post-process batch N while GPU runs batch N+1, parallelism along the time axis within a process).
- Real-time video processing outside generative AI — same shape applies to e.g. on-device inference for real-time video filters where output frames must be drained to display while next frames decode.
AWS is explicit: the pattern is not specific to the Wan architecture, nor to the specific GPU utilised.
Reference implementation¶
aws-samples/sample-asynchronous-video-decoding applies the pattern to the Hugging Face Diffusers Wan 2.2 14B model in PyTorch:
- 1-benchmark-video-decoding.ipynb — synchronous vs asynchronous benchmark.
- 2-profile-video-decoding.ipynb — Nsight-style profile comparison showing kernel-stall removal.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first and only wiki appearance. AWS + Synthesia Research Engineering co-designed the pattern for VAE-decoder inference on G7e instances. Lifts kernel utilisation from 82% → 99.9%, reduces decode latency by 8.2%, drops Real Time Factor from 3.21 → 2.95. Reference implementation in PyTorch on Wan 2.2 14B; technique generalises to any chunked video generation pipeline.
Related¶
- patterns/dual-cuda-stream-compute-and-copy-overlap — CUDA-stream half of the pattern.
- patterns/double-buffer-cuda-events-pipeline-overlap — buffer + barrier half of the pattern.
- concepts/cuda-stream — primitive used.
- concepts/pinned-memory — primitive used.
- concepts/device-to-host-transfer — operation overlapped.
- concepts/gpu-kernel-utilization — saturation metric.
- concepts/latent-diffusion-video-generation — workload shape this pattern is wiki-attested on.
- concepts/async-cpu-gpu-pipelined-scheduling — same overlap shape at LLM-batch-serving altitude.
- systems/vae-decoder — wiki-attested target stage.
- systems/aws-ec2-g7e — wiki-attested substrate.
- systems/wan-video-model — wiki-attested benchmark model.
- companies/synthesia — co-designer of the pattern.
- companies/aws — co-designer + publisher.