Wan 2.2 (Wan video generation model)¶
Definition¶
Wan (specifically Wan 2.2 14B in the wiki-attested benchmark) is a latent-diffusion video generation model family published by Wan-AI. The wiki sees Wan in the AWS / Synthesia 2026-05-19 post, where the Hugging Face Diffusers Wan 2.2 14B model (Wan-AI/Wan2.2-T2V-A14B-Diffusers) is used as a public-benchmark stand-in for Synthesia's production in-house latent-diffusion video models. It was chosen for reproducibility: it is an open-source baseline that does not require Synthesia's proprietary weights.
Stub page — extend as further sources cite Wan-internal architecture, training data, or model-quality benchmarks. The wiki's interest is currently in Wan as a representative chunked-VAE-decoded latent-diffusion video pipeline rather than in Wan-specific model quality.
Architectural shape¶
- Latent-diffusion video — denoising happens in compressed latent space (via a VAE), not pixel space.
- VAE decoder stage — the final step decodes the latent video back to a human-readable pixel video. This stage is chunked along the temporal dimension: one latent frame yields ~4 time-consecutive pixel frames per chunk.
- Per-chunk D2H — each decoded chunk must be copied device-to-host (D2H) into host memory between consecutive kernel launches; otherwise the full decoded video would have to fit in VRAM, breaking the scale-to-long-videos invariant.
- 14B parameters in the benchmarked variant (Wan-AI/Wan2.2-T2V-A14B-Diffusers).
See concepts/latent-diffusion-video-generation for the full shape and systems/vae-decoder for the optimised component.
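The chunked decode and per-chunk D2H steps listed above can be sketched with standard-library stand-ins. This is a minimal illustration, not the sample repository's code: `decode_chunk`, `copy_to_host`, `decode_sequential`, `decode_overlapped`, and `max_in_flight` are hypothetical names, no GPU is involved, and the sketch assumes exactly 4 pixel frames per latent frame where the source says roughly 4.

```python
import queue
import threading

FRAMES_PER_LATENT = 4  # assumption: exactly 4 pixel frames per latent frame


def decode_chunk(latent_index):
    """Stand-in for the VAE decoder kernel: one latent frame -> one chunk of pixel frames."""
    return [(latent_index, k) for k in range(FRAMES_PER_LATENT)]


def copy_to_host(chunk):
    """Stand-in for the device-to-host (D2H) copy of one decoded chunk."""
    return list(chunk)


def decode_sequential(num_latents):
    """Decode, then copy, one chunk at a time; the decoder waits during each copy."""
    host_frames = []
    for i in range(num_latents):
        chunk = decode_chunk(i)
        host_frames.extend(copy_to_host(chunk))
    return host_frames


def decode_overlapped(num_latents, max_in_flight=2):
    """Hand each decoded chunk to a copy thread so the next decode can start immediately."""
    pending = queue.Queue(maxsize=max_in_flight)  # bounds how many chunks wait for a copy
    host_frames = []

    def copier():
        while True:
            chunk = pending.get()
            if chunk is None:  # sentinel: no more chunks
                return
            host_frames.extend(copy_to_host(chunk))

    worker = threading.Thread(target=copier)
    worker.start()
    for i in range(num_latents):
        pending.put(decode_chunk(i))  # blocks only if the copier falls behind
    pending.put(None)
    worker.join()
    return host_frames


if __name__ == "__main__":
    frames = decode_overlapped(41)
    print(len(frames), frames == decode_sequential(41))
```

Both variants return the same frames in the same order, because a single consumer drains a FIFO queue. The bounded queue keeps the number of undelivered chunks small, which mirrors the reason per-chunk D2H exists in the first place: the full decoded video never has to be resident at once.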
Wiki-attested benchmark posture¶
On a g7e.2xlarge instance (1 × NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM):
- 41-latent-frame test video.
- 1 warmup decode + 10 consecutive full-video decoding cycles.
- Synchronous (sequential) pipeline: mean 21.99 s/video, P99 22.01 s.
- Asynchronous frame generation pipeline: mean 20.17 s/video, P99 20.20 s.
- Mean latency reduction: 8.2%.
- GPU kernel utilisation: 82% → 99.9% (steady state, two consecutive chunks).
- Real Time Factor: 3.21 → 2.95.
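As a quick cross-check of the figures above, the following sketch recomputes the latency reduction from the two reported means and derives the video duration implied by each Real Time Factor. It assumes RTF is defined as processing time divided by video duration, which the page does not state; the variable names are illustrative.

```python
# Reported benchmark figures (seconds per video and Real Time Factor).
sync_mean_s = 21.99
async_mean_s = 20.17
rtf_sync = 3.21
rtf_async = 2.95

# Relative reduction in mean latency, computed from the rounded means.
latency_reduction = (sync_mean_s - async_mean_s) / sync_mean_s

# Assumption: RTF = processing time / video duration, so duration = time / RTF.
implied_duration_sync = sync_mean_s / rtf_sync
implied_duration_async = async_mean_s / rtf_async

print(f"latency reduction: {latency_reduction:.2%}")
print(f"implied video duration (sync):  {implied_duration_sync:.2f} s")
print(f"implied video duration (async): {implied_duration_async:.2f} s")
```

Recomputed from the rounded means, the reduction comes out slightly above 8.2%, which is consistent with the reported figure given that the inputs are rounded to two decimals. Under the stated RTF assumption, both pipelines imply nearly the same video duration, as expected for the same test video.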
The post is careful to note that the Wan 2.2 14B Diffusers model is unoptimised (no model compilation, no fused kernels, no quantisation). On optimised or compiled video models, the relative gain from removing GPU stalls would be larger: the GPU stage takes less wall-clock time, so the per-chunk stall accounts for a larger fraction of the total.
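The stall-fraction argument can be made concrete with an idealised timing model. This is a sketch under stated assumptions: every chunk has the same compute and copy time, overlap is perfect, and the per-chunk times below are hypothetical values chosen for illustration, not measurements from the benchmark.

```python
def sequential_time(n_chunks, compute_s, copy_s):
    """Each chunk is decoded and then copied before the next decode starts."""
    return n_chunks * (compute_s + copy_s)


def overlapped_time(n_chunks, compute_s, copy_s):
    """Copies run concurrently with the next decode (idealised, perfect overlap).

    The first decode and the last copy have nothing to overlap with.
    """
    return compute_s + (n_chunks - 1) * max(compute_s, copy_s) + copy_s


def relative_gain(n_chunks, compute_s, copy_s):
    """Fraction of sequential wall-clock time saved by overlapping."""
    seq = sequential_time(n_chunks, compute_s, copy_s)
    return (seq - overlapped_time(n_chunks, compute_s, copy_s)) / seq


# Hypothetical per-chunk times: the copy cost stays fixed, the decode gets faster.
N_CHUNKS = 41
COPY_S = 0.10
gain_unoptimised = relative_gain(N_CHUNKS, compute_s=0.45, copy_s=COPY_S)
gain_optimised = relative_gain(N_CHUNKS, compute_s=0.15, copy_s=COPY_S)

print(f"unoptimised decoder: {gain_unoptimised:.1%} saved")
print(f"optimised decoder:   {gain_optimised:.1%} saved")
```

With the copy time held constant, shrinking the decode time raises the share of each chunk's wall-clock time that was spent stalled, so hiding the copy saves a larger fraction. Real pipelines carry overheads this model ignores, so the absolute percentages should not be compared with the measured figures.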
Reference implementation¶
- Public sample: aws-samples/sample-asynchronous-video-decoding applies the patterns/asynchronous-frame-generation-pipeline to the Hugging Face Diffusers Wan 2.2 14B model in PyTorch.
- Sample video used in figures: from the Wan 2.2 repository.
- Wan 2.2 paper: arxiv.org/pdf/2503.20314.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki appearance. Used as the public-benchmark stand-in for the Asynchronous Frame Generation Pipeline on g7e.2xlarge.
Related¶
- systems/stable-diffusion — sibling open-source latent-diffusion family (image, not video; both use VAE-compressed latent space).
- systems/huggingface-inference — the runtime ecosystem (Wan 2.2 14B is published in Diffusers format).
- systems/vae-decoder — the architectural component this benchmark optimises around.
- systems/aws-ec2-g7e — instance family hosting the benchmark.
- systems/nvidia-rtx-pro-6000-blackwell — GPU under the benchmark.
- concepts/latent-diffusion-video-generation — algorithmic shape Wan instantiates.
- patterns/asynchronous-frame-generation-pipeline — pattern applied to Wan in the reference implementation.