SYSTEM Cited by 1 source

Wan 2.2 (Wan video generation model)

Definition

Wan is a family of latent-diffusion video generation models published by Wan-AI; the wiki-attested benchmark uses Wan 2.2 14B. Wan enters the wiki through the AWS / Synthesia post of 2026-05-19, which uses the Hugging Face Diffusers build of Wan 2.2 14B (Wan-AI/Wan2.2-T2V-A14B-Diffusers) as a public-benchmark stand-in for Synthesia's production in-house latent-diffusion video models. The post chose it for reproducibility: it is an open-source baseline that does not require Synthesia's proprietary weights.

Stub page — extend as further sources cite Wan-internal architecture, training data, or model-quality benchmarks. The wiki's interest is currently in Wan as a representative chunked-VAE-decoded latent-diffusion video pipeline rather than in Wan-specific model quality.

Architectural shape

  • Latent-diffusion video — denoising happens in compressed latent space (via a VAE), not pixel space.
  • VAE decoder stage — the final step decodes the latent video back to a human-readable pixel video. This stage is chunked along the temporal dimension: one latent frame yields ~4 time-consecutive pixel frames per chunk.
  • Per-chunk D2H — each decoded chunk must be copied from device to host (D2H) memory between consecutive kernel launches. Otherwise the full decoded video would have to fit in VRAM, breaking the scale-to-long-videos invariant.
  • 14B parameters in the benchmarked variant (Wan-AI/Wan2.2-T2V-A14B-Diffusers).
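
To make the VRAM argument behind per-chunk D2H concrete, here is a minimal sketch of the memory arithmetic. It is not taken from the post: the output resolution, the float32 dtype, and treating "~4" as exactly 4 pixel frames per latent frame are all illustrative assumptions, and the function names are hypothetical.

```python
# Illustrative memory model for chunked VAE decoding.
# ASSUMPTIONS (not from the source): 1280x720 RGB output, float32 pixels,
# exactly 4 pixel frames per latent frame.
LATENT_FRAMES = 41            # test-video length quoted in the benchmark
FRAMES_PER_LATENT = 4         # "~4" in the text, taken as exactly 4 here
HEIGHT, WIDTH, CHANNELS = 720, 1280, 3
BYTES_PER_VALUE = 4           # float32


def chunk_bytes() -> int:
    """Bytes of decoded pixels produced from one latent frame."""
    return FRAMES_PER_LATENT * CHANNELS * HEIGHT * WIDTH * BYTES_PER_VALUE


def peak_device_bytes(latent_frames: int, offload_each_chunk: bool) -> int:
    """Peak decoded-pixel bytes resident on the device while decoding."""
    resident = 0
    peak = 0
    for _ in range(latent_frames):
        resident += chunk_bytes()          # decoder writes one chunk
        peak = max(peak, resident)
        if offload_each_chunk:
            resident -= chunk_bytes()      # chunk copied to host, buffer reused
    return peak


if __name__ == "__main__":
    mib = 1024 ** 2
    with_d2h = peak_device_bytes(LATENT_FRAMES, offload_each_chunk=True)
    without_d2h = peak_device_bytes(LATENT_FRAMES, offload_each_chunk=False)
    print(f"per-chunk D2H: {with_d2h / mib:.1f} MiB resident at peak")
    print(f"no offload:    {without_d2h / mib:.1f} MiB resident at peak")
```

Under these assumed sizes the offloaded pipeline holds about 42 MiB of decoded pixels at any time, while the non-offloaded one grows linearly with video length (about 1.7 GiB for 41 latent frames). The constant-versus-linear shape is the point; the absolute numbers depend entirely on the assumed resolution and dtype.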

See concepts/latent-diffusion-video-generation for the full shape and systems/vae-decoder for the optimised component.

Wiki-attested benchmark posture

On a g7e.2xlarge instance (1 × NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM):

  • 41-latent-frame test video.
  • 1 warmup decode + 10 consecutive full-video decoding cycles.
  • Synchronous (sequential) pipeline: mean 21.99 s/video, P99 22.01 s.
  • Asynchronous frame generation pipeline: mean 20.17 s/video, P99 20.20 s.
  • Latency reduction: −8.2%.
  • GPU kernel utilisation: 82% → 99.9% (steady state, two consecutive chunks).
  • Real Time Factor: 3.21 → 2.95.
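
The figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes Real Time Factor means processing time divided by video duration, a definition this page does not spell out; the helper names are hypothetical.

```python
# Cross-check of the quoted benchmark numbers.
# ASSUMPTION: Real Time Factor = processing seconds / video seconds.
SYNC_MEAN_S = 21.99
ASYNC_MEAN_S = 20.17
SYNC_RTF = 3.21
ASYNC_RTF = 2.95


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


def implied_video_seconds(processing_s: float, rtf: float) -> float:
    """Video duration implied by a processing time and an RTF."""
    return processing_s / rtf


if __name__ == "__main__":
    print(f"latency change: {relative_change(SYNC_MEAN_S, ASYNC_MEAN_S):.2%}")
    print(f"RTF change:     {relative_change(SYNC_RTF, ASYNC_RTF):.2%}")
    print(f"implied duration (sync):  "
          f"{implied_video_seconds(SYNC_MEAN_S, SYNC_RTF):.2f} s")
    print(f"implied duration (async): "
          f"{implied_video_seconds(ASYNC_MEAN_S, ASYNC_RTF):.2f} s")
```

From the rounded means the latency change comes out at roughly −8.3%, slightly off the quoted −8.2%; the difference presumably comes from the post working with unrounded timings. Both RTF values imply a test video of about 6.8 s under the assumed definition, which suggests the two sets of figures are mutually consistent.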

The post is careful to note that the Wan 2.2 14B Diffusers pipeline is unoptimised (no model compilation, no fused kernels, no quantisation). On optimised or compiled video models the relative gain from removing GPU stalls would be larger: the GPU stage takes less wall-clock time, so the fixed per-chunk stall makes up a larger fraction of the total.
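
The scaling argument can be illustrated with an idealised two-stage timing model. This is a sketch under stated assumptions, not the post's measurement method: per-chunk GPU and D2H times are constant, the asynchronous pipeline overlaps the copy of one chunk perfectly with the decode of the next, and the durations used are hypothetical placeholders.

```python
# Idealised timing model: per-chunk decode (GPU) followed by a D2H copy.
# ASSUMPTIONS: constant per-chunk times, perfect overlap in the async case.
# All durations below are hypothetical, not measured values.

def sync_seconds(n_chunks: int, gpu_s: float, d2h_s: float) -> float:
    """Sequential pipeline: the GPU idles during every copy."""
    return n_chunks * (gpu_s + d2h_s)


def async_seconds(n_chunks: int, gpu_s: float, d2h_s: float) -> float:
    """Two-stage pipeline: the copy of chunk i overlaps the decode of chunk i+1."""
    return gpu_s + (n_chunks - 1) * max(gpu_s, d2h_s) + d2h_s


def relative_gain(n_chunks: int, gpu_s: float, d2h_s: float) -> float:
    """Fraction of the synchronous time saved by overlapping."""
    return 1.0 - (async_seconds(n_chunks, gpu_s, d2h_s)
                  / sync_seconds(n_chunks, gpu_s, d2h_s))


if __name__ == "__main__":
    N_CHUNKS = 41        # one chunk per latent frame in the test video
    BASE_GPU_S = 0.50    # hypothetical unoptimised decode time per chunk
    D2H_S = 0.04         # hypothetical copy time per chunk
    for speedup in (1, 2, 4, 8):
        gpu_s = BASE_GPU_S / speedup
        gain = relative_gain(N_CHUNKS, gpu_s, D2H_S)
        print(f"GPU {speedup}x faster: relative gain {gain:.1%}")
```

In this model the relative gain rises as the decode gets faster, as long as the decode still takes longer than the copy. Once the copy becomes the slower stage the pipeline is transfer-bound and the trend no longer holds, so the sketch only supports the claim within that regime.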

Reference implementation

Seen in
