CONCEPT Cited by 1 source
Latent-diffusion video generation¶
Definition¶
Latent-diffusion video generation is an approach to generative video in which the diffusion process (iterative denoising from Gaussian noise toward a coherent output) runs in a compressed latent space rather than directly in pixel space. A VAE (variational auto-encoder) provides the compression: its encoder maps pixel-space video to a low-dimensional latent representation during training, and its decoder (systems/vae-decoder) maps latents back to pixels at inference time.
The technique extends the same trick that Stable Diffusion applied to image generation — operate in latent space to make compute and memory tractable — to temporally coherent video sequences.
High-level pipeline¶
text prompt
│
▼ (text encoder)
text embedding
│
▼
[Diffusion Process — many iterative denoising steps,
conditioned on text embedding,
in compressed latent space]
│
▼
final denoised LATENT video
│
▼ (VAE Decoder — chunked along temporal dimension)
final PIXEL video
(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances, Figure 1.)
The two cost regimes are very different:
- Diffusion process: many steps, each compute-bound on the GPU. Latents stay in VRAM throughout, with no per-step device-to-host (D2H) transfer. The bottleneck is FLOPs / attention bandwidth.
- VAE decoding: one pass, but chunked. The full pixel video doesn't fit in VRAM for arbitrarily long videos, so decoding emits one chunk of pixel frames at a time, and each chunk must be copied D2H and committed to storage before its memory is released. The bottleneck is D2H + host I/O serialisation between chunks, not raw GPU compute.
Why decoding is chunked along the temporal dimension¶
The pixel-space output is much larger than the latent representation. For arbitrarily long videos, holding the full decoded pixel video in VRAM is intractable: the VRAM required scales with the generated video length and quickly exceeds the 96 GB ceiling on RTX PRO 6000 Blackwell / G7e for non-trivial video lengths.
The fix is to decode along the temporal dimension one latent frame at a time. AWS notes a typical chunk shape: one latent frame yields ~4 time-consecutive pixel frames in the Wan 2.2 14B benchmark.
This makes VRAM scale with chunk size, not full-video size — but creates the per-chunk D2H + storage commit serialisation point.
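The bounded-memory property can be illustrated with a minimal, CPU-only sketch. Everything here is a hypothetical stand-in: `toy_decoder` is not a real VAE, frames are plain Python lists rather than GPU tensors, and the fixed ratio of exactly 4 pixel frames per latent frame is a simplification of the "~4" figure above.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Frame = List[float]  # hypothetical stand-in for one frame's tensor data


def decode_in_chunks(
    latent_frames: Iterable[Frame],
    decode_one: Callable[[Frame], List[Frame]],
) -> Iterator[List[Frame]]:
    """Yield the pixel frames for one latent frame at a time."""
    for latent in latent_frames:
        # Only this chunk's pixel frames are alive at this point. The caller
        # is expected to copy them out before requesting the next chunk.
        yield decode_one(latent)


def toy_decoder(latent: Frame) -> List[Frame]:
    # Hypothetical decoder: 1 latent frame -> 4 consecutive pixel frames.
    return [[x + 0.25 * i for x in latent] for i in range(4)]


def run(num_latent_frames: int) -> Tuple[int, int]:
    latents = ([float(i)] for i in range(num_latent_frames))
    total_frames = 0
    peak_live_frames = 0
    for chunk in decode_in_chunks(latents, toy_decoder):
        peak_live_frames = max(peak_live_frames, len(chunk))
        total_frames += len(chunk)  # "commit" the chunk, then let it go
    return total_frames, peak_live_frames


total, peak = run(41)
```

In this toy model the number of frames produced grows with the number of latent frames, while the number of pixel frames held at any one time stays at the chunk size.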
Why the VAE-decoder bottleneck is the binding inference cost¶
Within latent-diffusion video, the diffusion-process compute dominates wall-clock time, but the VAE decoder is the stage that benefits most from architectural overlap optimisation because:
- Diffusion steps are GPU-only and don't D2H per-step → no cheap structural fix.
- VAE decoding has the per-chunk D2H + I/O step that's already on the critical path → can be overlapped with the next chunk's compute → cheap structural fix.
Hence the AWS / Synthesia post focuses on the VAE decoder rather than the diffusion process.
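Why overlap is a cheap structural fix can be seen in a simple two-stage timing model. This is a sketch under stated assumptions, not a model from the AWS post: it assumes every chunk has the same decode time and the same copy + I/O time, that the two stages do not slow each other down, and the numbers used are purely illustrative.

```python
def synchronous_time(n_chunks: int, compute_s: float, copy_s: float) -> float:
    # Each chunk's D2H + I/O must finish before the next chunk's decode starts.
    return n_chunks * (compute_s + copy_s)


def overlapped_time(n_chunks: int, compute_s: float, copy_s: float) -> float:
    # Two-stage pipeline: chunk k's copy runs while chunk k+1 decodes,
    # so the steady-state pace is set by the slower of the two stages.
    if n_chunks == 0:
        return 0.0
    return compute_s + (n_chunks - 1) * max(compute_s, copy_s) + copy_s


def gpu_busy_fraction(n_chunks: int, compute_s: float, total_s: float) -> float:
    # Share of wall-clock time during which the GPU is running decode kernels.
    return n_chunks * compute_s / total_s


# Illustrative (hypothetical) values: 10 chunks, 0.4 s decode, 0.1 s copy + I/O.
n, c, t = 10, 0.4, 0.1
sync_total = synchronous_time(n, c, t)
async_total = overlapped_time(n, c, t)
sync_busy = gpu_busy_fraction(n, c, sync_total)
async_busy = gpu_busy_fraction(n, c, async_total)
```

With these illustrative inputs the copy time is hidden behind every decode except the last one, so the GPU-busy fraction rises without any change to the decode kernels themselves. Real measurements can differ because concurrent copies and kernels may contend for resources.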
Synthesia's production posture¶
companies/synthesia uses "in-house developed models...based on various architectures, including latent diffusion video generation models" hosted on EC2 G7e instances for the 96 GB VRAM headroom. Synthesia's specific in-house model architectures and sizes are not disclosed; the AWS post benchmarks on Wan 2.2 14B (open-source Hugging Face Diffusers) for reproducibility and as a public-benchmark stand-in.
Quantitative envelope (wiki-attested benchmark)¶
On a g7e.2xlarge running the Wan 2.2 14B Hugging Face Diffusers VAE decoder against a 41-latent-frame test video:
- Synchronous baseline: mean 21.99 s/video, 82% kernel utilisation, RTF 3.21.
- Asynchronous Frame Generation Pipeline: mean 20.17 s/video, 99.9% kernel utilisation, RTF 2.95.
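The two rows can be related with a few lines of arithmetic. The sketch below assumes RTF means processing time divided by output video duration; the definition is an assumption here, although it is consistent with both rows implying the same clip length.

```python
# Figures copied from the benchmark rows above.
sync_s, async_s = 21.99, 20.17      # mean seconds per video
sync_rtf, async_rtf = 3.21, 2.95    # real-time factor


def implied_duration_s(seconds_per_video: float, rtf: float) -> float:
    # Assumed definition: rtf = processing_time / video_duration.
    return seconds_per_video / rtf


time_saved_s = sync_s - async_s
relative_reduction = time_saved_s / sync_s   # fraction of wall-clock time removed
speedup = sync_s / async_s
sync_duration_s = implied_duration_s(sync_s, sync_rtf)
async_duration_s = implied_duration_s(async_s, async_rtf)

print(f"time saved: {time_saved_s:.2f} s ({relative_reduction:.1%})")
print(f"speedup: {speedup:.3f}x")
print(f"implied clip length: {sync_duration_s:.2f} s vs {async_duration_s:.2f} s")
```

Under that assumption, both rows imply an output clip of roughly 6.8 s, and the asynchronous pipeline removes about 8% of the per-video decode time.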
Architectural primitives required for efficient inference¶
For latent-diffusion video inference to sustain high kernel utilisation when decoding is chunked:
- Memory-efficient chunked VAE decoding — bounded VRAM per chunk.
- Two CUDA streams (Compute Stream + Copy Stream) to overlap VAE compute with D2H.
- Pinned host buffers — for fully-async D2H DMA.
- Double-buffering — adjacent chunks operate on distinct memory regions.
- Worker CPU thread for host-side I/O, freeing the main Python thread to launch CUDA kernels.
See patterns/asynchronous-frame-generation-pipeline for the full assembly.
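The double-buffering and worker-thread hand-off can be sketched with the Python standard library alone. This is a CPU-only simulation, not the pipeline from the AWS post: the two lists stand in for pinned host buffers, filling a buffer stands in for "decode + D2H copy", and `run_pipeline` and its parameters are hypothetical names. A GPU implementation would additionally use separate CUDA streams for compute and copy.

```python
import queue
import threading
from typing import List


def run_pipeline(num_chunks: int, frames_per_chunk: int = 4) -> List[int]:
    # Two reusable host buffers (hypothetical stand-ins for pinned memory).
    buffers = [[0] * frames_per_chunk for _ in range(2)]
    free_slots = queue.Queue()   # buffer indices safe to overwrite
    ready_slots = queue.Queue()  # buffer indices waiting to be committed
    for slot in range(2):
        free_slots.put(slot)

    committed: List[int] = []    # stand-in for storage

    def io_worker() -> None:
        while True:
            slot = ready_slots.get()
            if slot is None:                 # sentinel: producer is done
                return
            committed.extend(buffers[slot])  # "commit to storage"
            free_slots.put(slot)             # buffer may now be reused

    worker = threading.Thread(target=io_worker)
    worker.start()

    for chunk in range(num_chunks):
        # Blocks only when both buffers are still waiting on the worker.
        slot = free_slots.get()
        # "Decode + D2H": write this chunk's frame indices into the buffer.
        for i in range(frames_per_chunk):
            buffers[slot][i] = chunk * frames_per_chunk + i
        ready_slots.put(slot)    # hand off; main thread moves to next chunk

    ready_slots.put(None)
    worker.join()
    return committed


frames = run_pipeline(5)
```

The main thread never writes into a buffer the worker is still reading, because a buffer index only returns to `free_slots` after its contents have been committed. Frames are committed in order because a single worker drains a FIFO queue.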
Generalisation¶
The chunked-decoding-with-D2H-bottleneck shape generalises to:
- Streaming generative audio — same shape at audio-frame granularity.
- Image-batch generation — same shape at batch granularity (decode batch N's pixels while batch N+1 runs).
- Multimodal video models — adding audio + control signals on top of latent-diffusion video keeps the same pixel-reconstruction-stage shape.
Seen in¶
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation of the algorithmic shape. Synthesia hosts in-house latent-diffusion video models on EC2 G7e; AWS benchmarks on Wan 2.2 14B Hugging Face Diffusers as a public stand-in. The VAE decoder is the inference stage where the D2H bottleneck lives, and the Asynchronous Frame Generation Pipeline is the architectural fix.
Related¶
- systems/stable-diffusion — sibling latent-diffusion family (image, not video).
- systems/wan-video-model — wiki-attested instance.
- systems/vae-decoder — the bottleneck stage.
- concepts/device-to-host-transfer — the inter-tier transfer that serialises chunks.
- concepts/gpu-kernel-utilization — saturation metric.
- concepts/real-time-factor — performance metric (3.21 → 2.95 in the wiki-attested case).
- patterns/asynchronous-frame-generation-pipeline — fix pattern.
- companies/synthesia — wiki-attested production user.