AWS — How Synthesia optimizes generative AI video inference on Amazon EC2 G7e instances

Summary

AWS Architecture Blog post (2026-05-19) co-authored with Synthesia Research Engineering describing a video-decoding optimisation technique — the Asynchronous Frame Generation Pipeline — applied to the VAE decoder stage of a latent-diffusion video generation model running on Amazon EC2 G7e instances (NVIDIA RTX PRO 6000 Blackwell GPUs, 96 GB GPU memory). The bottleneck the post addresses is well-defined: when generating videos using AI models with a VAE decoder in the architecture, GPU utilisation is bottlenecked by the rate at which decoded video frames are saved to host storage. Each chunk of decoded frames must be transferred device-to-host (D2H) and written to a file before the next chunk's CUDA kernels can be launched, leaving the GPU stalled between chunks.

The fix decouples three things — GPU compute, D2H copy, and host-side I/O — so they overlap in a pipelined fashion instead of serialising. Three primitives carry the load: (1) two CUDA streams (a "Compute Stream" for kernels and a dedicated "Copy Stream" for D2H), (2) a double buffer on both VRAM and pinned host RAM, with CUDA events as cross-stream barriers, and (3) a dedicated worker CPU thread that drains pinned host buffers to disk while the main Python thread keeps launching kernels. The Asynchronous Frame Generation Pipeline is what you build out of those three primitives so adjacent chunks operate on distinct memory regions and never block each other.
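The effect of overlapping the three stages can be illustrated with a small timing model. This is a minimal sketch, not code from the AWS post: the function names and the per-chunk durations are hypothetical, and it assumes every chunk costs the same and that each stage handles one chunk at a time.

```python
def sequential_time(n_chunks, compute, copy, write):
    """Each chunk is decoded, copied to host and written before the next starts."""
    return n_chunks * (compute + copy + write)


def pipelined_time(n_chunks, compute, copy, write, n_buffers=2):
    """Finish time when compute, D2H copy and file write overlap across chunks.

    A stage starts once (a) the same chunk has left the previous stage,
    (b) the stage has finished the previous chunk, and (c) a buffer is free,
    i.e. the chunk `n_buffers` positions earlier has left the next stage.
    """
    compute_done, copy_done, write_done = [], [], []
    for n in range(n_chunks):
        start = compute_done[n - 1] if n else 0
        if n >= n_buffers:  # wait for a free VRAM buffer
            start = max(start, copy_done[n - n_buffers])
        compute_done.append(start + compute)

        start = max(compute_done[n], copy_done[n - 1] if n else 0)
        if n >= n_buffers:  # wait for a free host buffer
            start = max(start, write_done[n - n_buffers])
        copy_done.append(start + copy)

        start = max(copy_done[n], write_done[n - 1] if n else 0)
        write_done.append(start + write)
    return write_done[-1]


# Hypothetical per-chunk durations in milliseconds.
print(sequential_time(10, compute=100, copy=4, write=6))  # 1100
print(pipelined_time(10, compute=100, copy=4, write=6))   # 1010
```

With these made-up numbers the compute stage sets the pace once the pipeline has filled, so the total approaches `n_chunks * compute` plus a single copy and a single write.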

Benchmarks on the unoptimised Hugging Face Diffusers Wan 2.2 14B model on a g7e.2xlarge instance (10 consecutive 41-latent-frame decode runs after warmup) show:

Metric	Synchronous (s/video)	Asynchronous (s/video)
min	21.98	20.16
mean	21.99	20.17
P99	22.01	20.20

That's an 8.2% latency reduction and a corresponding throughput gain. GPU kernel utilisation rises from 82% to 99.9% in the steady state for two consecutive chunks. The Real Time Factor (decode-time / video-duration) drops from 3.21 to 2.95. At g7e.2xlarge on-demand pricing of $3.36 / GPU-hour in Ohio, this is a theoretical saving of ~$896 per 1,000 hours of decoded video on a single GPU. The gain comes purely from better hardware utilisation — no model-weight or quality changes — and AWS expects the gain to be even larger on optimised / compiled models that use the GPU more efficiently.
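The headline percentages can be re-derived from the table. The sketch below uses only the rounded means and the reported synchronous RTF; the variable names are illustrative, and because the inputs are rounded the results can differ from the quoted figures in the last digit.

```python
sync_mean_s = 21.99   # synchronous mean latency, s/video (from the table)
async_mean_s = 20.17  # asynchronous mean latency, s/video (from the table)
rtf_sync = 3.21       # reported Real Time Factor for the synchronous pipeline

# Relative drop in time per video.
latency_reduction_pct = (sync_mean_s - async_mean_s) / sync_mean_s * 100

# Videos per unit time scale with 1 / latency.
throughput_gain_pct = (sync_mean_s / async_mean_s - 1) * 100

# RTF = decode time / video duration, so duration = decode time / RTF.
implied_video_s = sync_mean_s / rtf_sync
implied_rtf_async = async_mean_s / implied_video_s

print(f"latency reduction:  {latency_reduction_pct:.2f}%")
print(f"throughput gain:    {throughput_gain_pct:.2f}%")
print(f"implied video len:  {implied_video_s:.2f} s")
print(f"implied async RTF:  {implied_rtf_async:.2f}")
```

The rounded means give a reduction of roughly 8.3% and an asynchronous RTF near 2.94, close to the quoted 8.2% and 2.95. The small gaps are presumably due to rounding in the published numbers, which is an assumption rather than something the post states.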

The post is explicit that the technique is not specific to the Wan architecture or to G7e: any chunked video generation pipeline that transfers frames to host memory can apply it. AWS published an end-to-end PyTorch reference implementation on the Hugging Face Diffusers Wan 2.2 14B model.

Key takeaways

The VAE decoder is the inference bottleneck for latent-diffusion video, not the diffusion process itself. Latent-diffusion video models do the iterative denoising in compressed latent space and then need a final VAE-decoder pass to produce the human-readable pixel video. The pixel video is too large to hold in VRAM in one piece for arbitrarily-long videos, so decoding is chunked along the temporal dimension — typically one latent frame produces four time-consecutive pixel frames. That chunked structure is what makes the D2H + storage-write step a per-chunk serialisation point. (Source: "This causes GPU stalls and reduces average GPU kernel utilization"; see also concepts/latent-diffusion-video-generation.)
Synchronous decoding stalls the GPU between chunks. In the Sequential Frame Generation Pipeline, chunk N's frames must complete D2H copy + storage commit before the CPU launches CUDA kernels for chunk N+1. AWS's profile shows this as a visible idle gap in the GPU trace between chunks — the GPU is waiting on the CPU thread that is busy writing files. (Source: Fig. 6, "the GPU stream stalls, waiting for the CPU to launch the kernels needed to process Chunk N+1".)
Two CUDA streams decouple compute from copy. PyTorch by default schedules everything onto a single CUDA stream per device, so D2H copies serialise behind compute kernels. The fix is to enqueue compute kernels on the default (Compute) stream and D2H copies on a dedicated Copy Stream. CUDA can then run the two concurrently, because D2H transfers are handled by the GPU's copy (DMA) engines, which are separate hardware from the units that execute compute kernels. See concepts/cuda-stream and patterns/dual-cuda-stream-compute-and-copy-overlap. (Source: AWS post, "compute kernels are enqueued on the default stream...and D2H copies on a dedicated copy stream".)
A worker CPU thread takes host-side I/O off the main Python thread. Even with two CUDA streams, file-writing on the main Python thread would steal CPU time that should be launching kernels. The Asynchronous Frame Generation Pipeline introduces a dedicated Worker thread whose job is to read chunks from pinned host RAM and write them to file, leaving the main thread free to launch CUDA kernels and schedule D2H transfers. (Source: "A dedicated Worker CPU thread responsible for reading chunks from Host Memory (RAM), and writing them to file, leaving the main Python thread to focus on launching kernels and scheduling D2H transfers".)
Pinned (page-locked) host buffers are required for fully-async D2H. With pageable host memory, CUDA has to stage each D2H copy through an internal pinned bounce buffer, and the copy call does not return to the host until the transfer has finished, so it cannot overlap with launching the next chunk's kernels. Page-locking the destination buffers in host RAM lets the GPU's copy engine DMA the bytes directly while compute keeps running. See concepts/pinned-memory. (Source: "Two in-memory Buffers on the GPU Memory (VRAM), and on the Host Memory (RAM), and page-lock the required Host Memory buffers to make sure D2H copies are performed fully asynchronously".)
Double-buffering plus CUDA events make the overlap safe. A single buffer per side would race: chunk N+1's compute would overwrite chunk N's bytes mid-D2H, and the worker thread might read a half-written buffer. Two buffers per side (one "in-flight", one "draining") plus CUDA events as cross-stream barriers — "Has decoding of chunk N completed?" — guarantee adjacent chunks operate on distinct memory regions. See patterns/double-buffer-cuda-events-pipeline-overlap. (Source: "Using a double-buffer strategy makes sure that for adjacent chunks the compute, D2H transfer, and host processing can overlap safely as they operate on distinct memory buffers".)
Quantified gain: 82% → 99.9% kernel utilisation, 8.2% latency reduction, RTF 3.21 → 2.95. All numbers are on the unoptimised Diffusers Wan 2.2 14B model on g7e.2xlarge across ten consecutive decoding cycles after a warmup decode. Min, mean, and P99 differ by at most 0.04 s in either mode, so the gain is stable rather than a tail effect. AWS notes the gain will be larger on optimised / compiled models that already make better use of GPU compute, since the shorter the GPU stage, the more the per-chunk-stall fraction matters. (Source: Table 1 + the conclusion paragraph; see also concepts/gpu-kernel-utilization and concepts/real-time-factor.)
The technique generalises beyond Wan and beyond G7e. AWS is explicit that any chunked video generation pipeline that transfers frames to host memory can adopt the same structure, regardless of GPU architecture or model family. The reference implementation is in PyTorch but the primitives — dual streams, pinned buffers, worker thread, CUDA events — exist in every CUDA host-language binding. (Source: "The techniques presented here are not specific to the Wan architecture, nor to the specific GPU utilized".)
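The worker-thread and double-buffer takeaways above can be sketched on the host side with the standard library alone. This is a CPU-only analogue under stated assumptions, not the AWS reference implementation: `decode_chunk`, `run_pipeline` and `CHUNK_BYTES` are hypothetical names, plain `bytearray` objects stand in for pinned host buffers, and two queues play the "is this buffer free / ready?" role that CUDA events and stream waits play on the GPU side. No CUDA stream or D2H copy is exercised here.

```python
import io
import queue
import threading
from typing import BinaryIO

CHUNK_BYTES = 16  # hypothetical size of one decoded chunk


def decode_chunk(n: int) -> bytes:
    """Stand-in for VAE-decoding chunk n and copying it to the host."""
    return bytes([n % 256]) * CHUNK_BYTES


def run_pipeline(n_chunks: int, sink: BinaryIO, n_buffers: int = 2) -> None:
    buffers = [bytearray(CHUNK_BYTES) for _ in range(n_buffers)]
    free = queue.Queue()   # indices of buffers the main thread may overwrite
    ready = queue.Queue()  # indices of buffers holding a complete chunk
    for i in range(n_buffers):
        free.put(i)

    def worker() -> None:
        # Drains filled buffers to the sink so the main thread never does I/O.
        while True:
            i = ready.get()
            if i is None:  # sentinel: no more chunks
                return
            sink.write(bytes(buffers[i]))
            free.put(i)    # buffer can be reused by the main thread

    thread = threading.Thread(target=worker)
    thread.start()
    for n in range(n_chunks):
        i = free.get()                   # blocks only if every buffer is still draining
        buffers[i][:] = decode_chunk(n)  # fill one buffer while the other drains
        ready.put(i)
    ready.put(None)
    thread.join()


out = io.BytesIO()
run_pipeline(5, out)
print(len(out.getvalue()))  # 80
```

Because a buffer index only moves to `ready` after it is fully written and only returns to `free` after the worker has finished with it, adjacent chunks never share a buffer. That is the same invariant the double-buffer and event design aims for on the GPU.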

Architectural numbers

g7e.2xlarge instance: 1 × NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB of GPU memory (used for GPU-memory-intensive generative AI video models).
g7e.2xlarge price: $3.36 / GPU-hour on-demand in us-east-2 (Ohio) at time of writing.
Test workload: 41-latent-frame test video decoded by the unoptimised Hugging Face Diffusers Wan 2.2 14B VAE decoder.
Chunking: each latent frame yields a chunk of ~4 time-consecutive pixel frames after VAE decoding.
Benchmark protocol: 1 warmup decode cycle (initialise CUDA + PyTorch memory pools and cache) + 10 consecutive full-video decoding cycles.
Latency: synchronous mean 21.99 s/video, asynchronous mean 20.17 s/video → −8.2%; P99 22.01 → 20.20 (Δ −1.81 s).
GPU kernel utilisation (steady state, two consecutive chunks): 82% → 99.9% (a +17.9 percentage-point gain).
Real Time Factor (RTF): 3.21 → 2.95.
Theoretical saving: ~$896 per 1,000 hours of decoded video per single GPU. ("Theoretical" because it assumes the underlying model is already at full computational efficiency without other bottlenecks.)
Hardware primitives used: 2 × CUDA streams (Compute + Copy), 2 × VRAM buffer + 2 × pinned host RAM buffer, 1 × worker CPU thread, CUDA events as cross-stream barriers.
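The theoretical saving follows from treating RTF as GPU-seconds spent per second of decoded video. The helper below is an illustrative sketch with a hypothetical function name; it plugs in the rounded RTFs and the on-demand price listed above.

```python
def gpu_cost_saving(rtf_before, rtf_after, video_hours, price_per_gpu_hour):
    """Dollars saved decoding `video_hours` of video on one GPU.

    RTF is GPU time per unit of video time, so GPU-hours = RTF * video hours.
    """
    gpu_hours_saved = (rtf_before - rtf_after) * video_hours
    return gpu_hours_saved * price_per_gpu_hour


saving = gpu_cost_saving(3.21, 2.95, video_hours=1_000, price_per_gpu_hour=3.36)
print(f"${saving:,.0f}")  # $874
```

The rounded RTFs give about $874. The ~$896 quoted corresponds to an RTF gap of about 0.267 (896 / 3,360), which is within rounding of the published 3.21 and 2.95.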

Caveats

Theoretical-savings claim assumes no other bottleneck. AWS is upfront that the $896/1k-hours figure assumes the underlying model is operating at full computational efficiency — i.e. the only bottleneck removed is the GPU-stall-on-D2H, not e.g. attention bandwidth, KV-cache pressure, or model-loading time. Actual savings depend on whether D2H stalls were the binding constraint for a given workload.
Single-instance benchmark. Numbers are reported on a single g7e.2xlarge running ten consecutive decode cycles. No multi-instance, multi-tenant, or multi-region serving data is provided; the scope is a single decoder inside one model-serving worker.
Specific to chunked decoding pipelines. The technique presupposes a pipeline that already produces decoded frames in per-chunk batches and transfers them to host. A non-chunked pipeline that holds the whole decoded video in VRAM until done doesn't have this bottleneck — but doesn't scale to long videos either. (Memory-efficiency is the implicit reason the chunked pipeline exists in the first place.)
Synthesia's in-house production models are not disclosed. Synthesia uses "in-house developed models...based on various architectures, including latent diffusion video generation models." The AWS post benchmarks on the open-source Hugging Face Diffusers Wan 2.2 14B for reproducibility; Synthesia's actual production-deployed model isn't named. The technique applies, but the realised production gain at Synthesia is not quantified in the post.
Compiled / optimised models will see larger relative gains. AWS notes the kernel-utilisation gain will be more impactful on optimised models — by implication, on the unoptimised Diffusers baseline used here, the GPU stage was already long enough to amortise some of the per-chunk stall. Optimised models with shorter GPU stages have more headroom for the asynchronous pipeline to recover.
PyTorch-specific reference implementation. The published sample is PyTorch on top of Diffusers. The primitives generalise to any CUDA host language but the code is not portable without reimplementation.
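The caveat about optimised models seeing larger relative gains can be made concrete with a toy calculation. It assumes the per-chunk stall (copy plus file write) depends on frame size rather than on model speed, and that the asynchronous pipeline hides the stall completely; the function name and all numbers are hypothetical.

```python
def relative_latency_gain(compute_ms, stall_ms):
    """Fraction of per-chunk time removed if the stall is fully hidden."""
    return stall_ms / (compute_ms + stall_ms)


# Same 10 ms stall, progressively faster (more optimised) decode kernels.
for compute_ms in (100, 50, 25):
    gain = relative_latency_gain(compute_ms, stall_ms=10)
    print(f"compute={compute_ms} ms -> gain={gain:.1%}")
```

Under these assumptions the gain grows from about 9% to about 29% as the compute stage shrinks from 100 ms to 25 ms, which is the direction the post predicts for compiled models.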

Source

Original: https://aws.amazon.com/blogs/architecture/how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances/
Raw markdown: raw/aws/2026-05-19-how-synthesia-optimizes-generative-ai-video-inference-on-ama-5c943858.md
Reference implementation: aws-samples/sample-asynchronous-video-decoding
Benchmark notebook: 1-benchmark-video-decoding.ipynb
Profile notebook: 2-profile-video-decoding.ipynb
Wan 2.2 model card: Wan-AI/Wan2.2-T2V-A14B-Diffusers
Synthesia Research video models: synthesiaresearch.github.io/express-video
Amazon EC2 G7e: aws.amazon.com/ec2/instance-types/g7e/
PyTorch CUDA Stream docs: torch.cuda.Stream
PyTorch CUDA Event docs: torch.cuda.Event

companies/synthesia — first wiki face: in-house generative-AI video platform; latent-diffusion + VAE-decoder pipeline; G7e customer.
companies/aws — publisher; G7e is one of the GPU-memory-optimised inference instance families.
systems/aws-ec2-g7e — instance family used (96 GB Blackwell VRAM as the GPU-memory-intensive generative-AI workhorse).
systems/nvidia-rtx-pro-6000-blackwell — the GPU under the G7e family.
systems/wan-video-model — Wan 2.2 14B Hugging Face Diffusers benchmarked here.
systems/vae-decoder — the architectural component this post optimises around.
concepts/latent-diffusion-video-generation — algorithmic shape that produces the chunked-decoding bottleneck.
concepts/gpu-kernel-utilization — the metric that goes from 82% → 99.9%.
concepts/device-to-host-transfer — the inter-tier transfer the optimisation overlaps.
concepts/cuda-stream — the primitive used to decouple compute and copy.
concepts/pinned-memory — the primitive that makes D2H copies fully async.
concepts/real-time-factor — the video-decode performance metric (3.21 → 2.95).
patterns/asynchronous-frame-generation-pipeline — the umbrella pattern of this post.
patterns/dual-cuda-stream-compute-and-copy-overlap — the CUDA-stream half of the pattern.
patterns/double-buffer-cuda-events-pipeline-overlap — the buffer + barrier half of the pattern.
concepts/synchronous-vs-asynchronous-readback — the same sync/async axis at compute-pipeline altitude rather than graphics-API altitude.
concepts/async-cpu-gpu-pipelined-scheduling — the same pattern at LLM-batch-serving altitude (post-process N while GPU runs N+1) rather than video-frame altitude.