SYSTEM Cited by 1 source

Wan 2.2 (Wan video generation model)¶

Definition¶

Wan (specifically Wan 2.2 14B in the wiki-attested benchmark) is a latent-diffusion video generation model family published by Wan-AI. Wan appears on this wiki via the AWS / Synthesia 2026-05-19 post, where the Hugging Face Diffusers Wan 2.2 14B model (Wan-AI/Wan2.2-T2V-A14B-Diffusers) is used as a public-benchmark stand-in for Synthesia's production in-house latent-diffusion video models. It was chosen for reproducibility: it is an open-source baseline that does not require Synthesia's proprietary weights.

Stub page — extend as further sources cite Wan-internal architecture, training data, or model-quality benchmarks. The wiki's interest is currently in Wan as a representative chunked-VAE-decoded latent-diffusion video pipeline rather than in Wan-specific model quality.

Architectural shape¶

Latent-diffusion video — denoising happens in compressed latent space (via a VAE), not pixel space.
VAE decoder stage — the final step decodes the latent video back into a viewable pixel video. This stage is chunked along the temporal dimension: each chunk decodes one latent frame into ~4 time-consecutive pixel frames.
Per-chunk D2H — decoded chunks must be transferred to host memory between consecutive kernel launches (otherwise the full decoded video would have to fit in VRAM, breaking the scale-to-long-videos invariant).
14B parameters in the benchmarked variant (Wan-AI/Wan2.2-T2V-A14B-Diffusers).
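The chunked decode with per-chunk D2H described above can be sketched as a plain loop. This is a minimal illustration, not Wan or Diffusers code: `decode_chunk` is a hypothetical stand-in for one VAE decoder call, frames are represented as strings, and the ratio of exactly 4 pixel frames per latent frame is a simplifying assumption (the page only says "~4").

```python
from typing import List, Tuple

# Assumption: exactly 4 pixel frames per latent frame (the source says "~4").
PIXEL_FRAMES_PER_LATENT = 4


def decode_chunk(latent_index: int) -> List[str]:
    """Hypothetical stand-in for one VAE decoder call on the GPU."""
    return [f"latent{latent_index}/pixel{i}" for i in range(PIXEL_FRAMES_PER_LATENT)]


def decode_video(num_latent_frames: int) -> Tuple[List[str], int]:
    """Decode chunk by chunk, moving each chunk to the host before the next one.

    Returns the host-side frames and the peak number of decoded frames that
    were resident on the simulated device at any one time.
    """
    host_frames: List[str] = []  # host memory: grows with video length
    peak_device_frames = 0       # simulated VRAM footprint of decoded output

    for t in range(num_latent_frames):
        device_chunk = decode_chunk(t)                 # decoded chunk lives on the device
        peak_device_frames = max(peak_device_frames, len(device_chunk))
        host_frames.extend(device_chunk)               # stand-in for the D2H transfer
        del device_chunk                               # device buffer can be reused

    return host_frames, peak_device_frames


frames, peak = decode_video(41)
print(len(frames), peak)
```

Under these assumptions the device-side footprint of decoded output stays at one chunk regardless of video length, while only host memory grows. The frame count printed here follows from the exact-4 assumption and is not a statement about the real model's output length.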

See concepts/latent-diffusion-video-generation for the full shape and systems/vae-decoder for the optimised component.

Wiki-attested benchmark posture¶

On a g7e.2xlarge instance (1 × NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM):

41-latent-frame test video.
1 warmup decode + 10 consecutive full-video decoding cycles.
Synchronous (sequential) pipeline: mean 21.99 s/video, P99 22.01 s/video.
Asynchronous frame generation pipeline: mean 20.17 s/video, P99 20.20 s/video.
Latency reduction: −8.2%.
GPU kernel utilisation: 82% → 99.9% (steady state, two consecutive chunks).
Real Time Factor: 3.21 → 2.95.
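As a sanity check, the derived figures in this list can be recomputed from the published means. The sketch below assumes Real Time Factor means processing time divided by output video duration; that definition is an assumption, not something the page states.

```python
# Published means from the benchmark list above (seconds per video).
sync_mean_s = 21.99
async_mean_s = 20.17

# Published Real Time Factors.
sync_rtf = 3.21
async_rtf = 2.95

# Relative latency reduction, computed from the rounded means.
reduction_pct = (sync_mean_s - async_mean_s) / sync_mean_s * 100

# Assumption: RTF = processing time / output video duration, so the
# implied video duration is processing time / RTF.
implied_duration_sync_s = sync_mean_s / sync_rtf
implied_duration_async_s = async_mean_s / async_rtf

print(f"latency reduction: {reduction_pct:.2f}%")
print(f"implied video duration (sync):  {implied_duration_sync_s:.2f} s")
print(f"implied video duration (async): {implied_duration_async_s:.2f} s")
```

From the rounded means the reduction comes out at about 8.3%, close to the reported −8.2%; the small gap is plausibly rounding in the published figures. Both RTF values imply the same test-video duration to within a few hundredths of a second, which suggests the two rows are internally consistent under the assumed definition.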

The post is careful to note Wan 2.2 14B Diffusers is unoptimised (no model compilation, no fused kernels, no quantisation). On optimised / compiled video models the relative gain from removing GPU stalls would be larger, since the GPU stage takes less wall-clock time and the per-chunk stall fraction grows.
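The reasoning in the paragraph above can be illustrated with a toy model. It assumes a fixed per-chunk stall that does not shrink when the model is optimised, and an asynchronous pipeline that hides the stall completely behind GPU work; both are simplifying assumptions, and the timings are hypothetical rather than measured.

```python
def relative_gain(gpu_s: float, stall_s: float) -> float:
    """Fraction of synchronous per-chunk time saved by perfect overlap.

    Synchronous: GPU work and the stall run back to back.
    Asynchronous (idealised): the two overlap, so the slower one dominates.
    """
    sync_s = gpu_s + stall_s
    async_s = max(gpu_s, stall_s)
    return (sync_s - async_s) / sync_s


# Hypothetical numbers: a fixed 0.1 s stall per chunk, with the GPU stage
# getting faster as the model is compiled, fused, or quantised.
STALL_S = 0.1
for gpu_s in (1.0, 0.5, 0.25):
    print(f"gpu={gpu_s:.2f}s  relative gain={relative_gain(gpu_s, STALL_S):.1%}")
```

In this model the gain equals stall / (gpu + stall) whenever the GPU stage is the longer of the two, so shrinking the GPU stage raises the fraction of time recovered by removing the stall.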

Reference implementation¶

Public sample: aws-samples/sample-asynchronous-video-decoding applies the patterns/asynchronous-frame-generation-pipeline to the Hugging Face Diffusers Wan 2.2 14B model in PyTorch.
Sample video used in figures: from the Wan 2.2 repository.
Wan 2.2 paper: arxiv.org/pdf/2503.20314.
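For orientation, the general producer/consumer idea behind an asynchronous frame generation pipeline can be sketched with the Python standard library. This is not the code from aws-samples/sample-asynchronous-video-decoding: `decode_chunk` and `run_async` are hypothetical names, a worker thread stands in for the device-to-host transfer path, and a bounded queue stands in for the limit on decoded chunks held on the device.

```python
import queue
import threading
from typing import List, Optional


def decode_chunk(latent_index: int) -> List[str]:
    """Hypothetical stand-in for one VAE decoder call (4 frames per chunk assumed)."""
    return [f"latent{latent_index}/pixel{i}" for i in range(4)]


def run_async(num_chunks: int, max_in_flight: int = 2) -> List[str]:
    """Overlap decoding of chunk t+1 with the host transfer of chunk t."""
    # Bounded queue: at most `max_in_flight` decoded chunks wait for transfer.
    handoff: "queue.Queue[Optional[List[str]]]" = queue.Queue(maxsize=max_in_flight)
    host_frames: List[str] = []

    def transfer_worker() -> None:
        while True:
            chunk = handoff.get()
            if chunk is None:           # sentinel: no more chunks
                break
            host_frames.extend(chunk)   # stand-in for the D2H copy

    worker = threading.Thread(target=transfer_worker)
    worker.start()

    for t in range(num_chunks):
        # The decode loop blocks only when the queue is full, instead of
        # waiting for every transfer to finish before the next decode.
        handoff.put(decode_chunk(t))

    handoff.put(None)
    worker.join()
    return host_frames


frames = run_async(41)
print(len(frames))
```

A single consumer reading from a FIFO queue keeps frames in decode order. In a PyTorch setting, one common way to get similar overlap is a separate CUDA stream with non-blocking copies into pinned host memory; whether the public sample does exactly that is not stated on this page.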

Seen in¶

sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki appearance. Used as the public-benchmark stand-in for the Asynchronous Frame Generation Pipeline on g7e.2xlarge.

systems/stable-diffusion — sibling open-source latent-diffusion family (image, not video; both use VAE-compressed latent space).
systems/huggingface-inference — the runtime ecosystem (Wan 2.2 14B is published in Diffusers format).
systems/vae-decoder — the architectural component this benchmark optimises around.
systems/aws-ec2-g7e — instance family hosting the benchmark.
systems/nvidia-rtx-pro-6000-blackwell — GPU under the benchmark.
concepts/latent-diffusion-video-generation — algorithmic shape Wan instantiates.
patterns/asynchronous-frame-generation-pipeline — pattern applied to Wan in the reference implementation.