CONCEPT Cited by 1 source

Latent-diffusion video generation¶

Definition¶

Latent-diffusion video generation is an approach to generative video in which the diffusion process (iterative denoising from Gaussian noise toward a coherent output) runs in a compressed latent space rather than directly in pixel space. A VAE (variational auto-encoder) provides the compression: its encoder maps pixel-space video to a low-dimensional latent representation during training, and its decoder (systems/vae-decoder) maps latents back to pixels at inference time.

The technique extends the same trick that Stable Diffusion applied to image generation — operate in latent space to make compute and memory tractable — to temporally coherent video sequences.

High-level pipeline¶

text prompt
    │
    ▼ (text encoder)
text embedding
    │
    ▼
[Diffusion Process — many iterative denoising steps,
                     conditioned on text embedding,
                     in compressed latent space]
    │
    ▼
final denoised LATENT video
    │
    ▼ (VAE Decoder — chunked along temporal dimension)
final PIXEL video

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances, Figure 1.)

The two cost regimes are very different:

Diffusion process: many steps, each compute-bound on the GPU. Latents stay in VRAM throughout, so there is no per-step device-to-host (D2H) transfer. The bottleneck is FLOPs and attention bandwidth.
VAE decoding: one pass, but chunked. The full pixel video doesn't fit in VRAM for arbitrarily long videos, so decoding emits one chunk of pixel frames at a time, and each chunk must be copied D2H and committed to storage before its memory is released. The bottleneck is the D2H and host I/O serialisation between chunks, not raw GPU compute.

Why decoding is chunked along the temporal dimension¶

The pixel-space output is much larger than the latent representation. For arbitrarily long videos, holding the full decoded pixel video in VRAM is intractable: the memory required scales with the generated video length and, for non-trivial lengths, quickly exceeds the 96 GB ceiling of the RTX PRO 6000 Blackwell GPU in G7e instances.

The fix is to decode along the temporal dimension one latent frame at a time. AWS notes a typical chunk shape: one latent frame yields ~4 time-consecutive pixel frames in the Wan 2.2 14B benchmark.

This makes VRAM scale with chunk size, not full-video size — but creates the per-chunk D2H + storage commit serialisation point.
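
The memory argument can be illustrated with a minimal sketch. It is not the Diffusers implementation: `decode_chunk` is a hypothetical stand-in for a VAE decoder call, and the fixed ratio of 4 pixel frames per latent frame is the approximate figure quoted above, assumed here to hold for every chunk.

```python
from typing import Iterator, List, Tuple

# Approximate ratio quoted above; assumed constant for illustration.
PIXEL_FRAMES_PER_LATENT = 4


def decode_chunk(latent_frame: int) -> List[str]:
    """Hypothetical stand-in for decoding one latent frame into pixel frames."""
    return [f"pixel[{latent_frame}:{i}]" for i in range(PIXEL_FRAMES_PER_LATENT)]


def decode_chunked(num_latent_frames: int) -> Iterator[List[str]]:
    """Yield one chunk at a time instead of materialising the whole video."""
    for t in range(num_latent_frames):
        yield decode_chunk(t)


def peak_and_total(num_latent_frames: int) -> Tuple[int, int]:
    """Return (peak frames resident at once, total frames produced)."""
    peak_resident = 0
    total = 0
    for chunk in decode_chunked(num_latent_frames):
        peak_resident = max(peak_resident, len(chunk))
        total += len(chunk)
        # A real pipeline would copy the chunk to host and commit it to
        # storage here, before the chunk's device memory is released.
    return peak_resident, total
```

Under these assumptions, the peak number of resident pixel frames stays at the chunk size regardless of how many latent frames are decoded, while the total grows linearly with video length.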

Why the VAE-decoder bottleneck is the binding inference cost¶

Within latent-diffusion video, diffusion-process compute dominates wall-clock time, but the VAE decoder is the stage that benefits most from overlapping compute with data transfer, because:

Diffusion steps are GPU-only and perform no per-step D2H transfer → no cheap structural fix.
VAE decoding has a per-chunk D2H + I/O step that already sits on the critical path → it can be overlapped with the next chunk's compute → cheap structural fix.
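
A simple timing model makes the structural argument concrete. This is a sketch under stated assumptions: every chunk takes the same compute and transfer time, the pipeline has exactly two stages, and the per-chunk timings used in the usage note are hypothetical values chosen for illustration, not figures from the benchmark.

```python
def synchronous_ms(n_chunks: int, compute_ms: int, transfer_ms: int) -> int:
    """Each chunk's compute waits for the previous chunk's D2H + commit."""
    return n_chunks * (compute_ms + transfer_ms)


def overlapped_ms(n_chunks: int, compute_ms: int, transfer_ms: int) -> int:
    """Chunk i's transfer runs while chunk i+1 computes (two-stage pipeline)."""
    steady_state = max(compute_ms, transfer_ms)
    # First chunk's compute, then one steady-state step per remaining chunk,
    # then the last chunk's transfer, which has nothing left to overlap with.
    return compute_ms + (n_chunks - 1) * steady_state + transfer_ms


def gpu_busy_fraction(n_chunks: int, compute_ms: int, total_ms: int) -> float:
    """Fraction of wall-clock time the GPU spends computing."""
    return n_chunks * compute_ms / total_ms
```

With hypothetical values of 41 chunks, 500 ms of compute and 100 ms of transfer per chunk, the synchronous model gives 24,600 ms and the overlapped model 20,600 ms. In the overlapped case only the final transfer remains exposed, which is why the GPU-busy fraction approaches 1.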

Hence the AWS / Synthesia post focuses on the VAE decoder rather than the diffusion process.

Synthesia's production posture¶

companies/synthesia uses "in-house developed models...based on various architectures, including latent diffusion video generation models" hosted on EC2 G7e instances for the 96 GB VRAM headroom. Synthesia's specific in-house model architectures and sizes are not disclosed; the AWS post benchmarks on Wan 2.2 14B (open-source Hugging Face Diffusers) for reproducibility and as a public-benchmark stand-in.

Quantitative envelope (wiki-attested benchmark)¶

On a g7e.2xlarge running the Wan 2.2 14B Hugging Face Diffusers VAE decoder against a 41-latent-frame test video:

Synchronous baseline: mean 21.99 s/video, 82% kernel utilisation, RTF 3.21.
Asynchronous Frame Generation Pipeline: mean 20.17 s/video, 99.9% kernel utilisation, RTF 2.95.

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
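
The two rows can be cross-checked with a little arithmetic. The sketch below assumes RTF is defined as processing time divided by the duration of the generated video; the source figures are the four numbers above, and everything derived from them is an inference rather than a reported result.

```python
# Figures quoted above from the benchmark.
SYNC_S, ASYNC_S = 21.99, 20.17
SYNC_RTF, ASYNC_RTF = 3.21, 2.95


def implied_duration_s(processing_s: float, rtf: float) -> float:
    """Video duration implied by RTF = processing time / video duration (assumed)."""
    return processing_s / rtf


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


sync_duration = implied_duration_s(SYNC_S, SYNC_RTF)
async_duration = implied_duration_s(ASYNC_S, ASYNC_RTF)
latency_saving = relative_reduction(SYNC_S, ASYNC_S)
```

Both rows imply a test video of roughly 6.8 s, so the two RTF values are mutually consistent under that definition, and the mean decode time drops by about 8%.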

Architectural primitives required for efficient inference¶

For latent-diffusion video to deliver high kernel utilisation during chunked decoding:

Memory-efficient chunked VAE decoding — bounded VRAM per chunk.
Two CUDA streams (a Compute Stream and a Copy Stream) to overlap VAE compute with D2H.
Pinned host buffers — for fully-async D2H DMA.
Double-buffering — adjacent chunks operate on distinct memory regions.
Worker CPU thread for host-side I/O, freeing the main Python thread to launch CUDA kernels.

See patterns/asynchronous-frame-generation-pipeline for the full assembly.
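
The host-side half of that assembly (double-buffering plus a worker thread) can be sketched with the standard library alone. This is a CPU-only analogue, not the pattern's actual implementation: CUDA streams and pinned memory are not modelled, the `bytearray` buffers stand in for pinned host buffers, and `run_pipeline`, `NUM_BUFFERS` and `CHUNK_BYTES` are hypothetical names introduced for illustration.

```python
import queue
import threading
from typing import List

NUM_BUFFERS = 2   # double-buffering: adjacent chunks use distinct buffers
CHUNK_BYTES = 8   # tiny stand-in for one chunk of pixel frames


def run_pipeline(num_chunks: int) -> List[bytes]:
    """Produce chunks on the main thread and commit them on a worker thread."""
    buffers = [bytearray(CHUNK_BYTES) for _ in range(NUM_BUFFERS)]
    free = queue.Queue()    # indices of buffers safe to overwrite
    ready = queue.Queue()   # (chunk index, buffer index) awaiting commit
    for i in range(NUM_BUFFERS):
        free.put(i)
    committed: List[bytes] = []

    def worker() -> None:
        while True:
            item = ready.get()
            if item is None:          # sentinel: no more chunks
                return
            _chunk_idx, buf_idx = item
            # Stand-in for the host-side I/O (storage commit).
            committed.append(bytes(buffers[buf_idx]))
            free.put(buf_idx)         # buffer may now be reused

    thread = threading.Thread(target=worker)
    thread.start()
    for chunk_idx in range(num_chunks):
        buf_idx = free.get()          # blocks only if both buffers are in flight
        # Stand-in for "decode chunk, then copy it into a host buffer".
        buffers[buf_idx][:] = bytes([chunk_idx % 256]) * CHUNK_BYTES
        ready.put((chunk_idx, buf_idx))
    ready.put(None)
    thread.join()
    return committed
```

A buffer is returned to the free queue only after its contents have been committed, so the producer never overwrites a chunk that is still being written out, and the single worker consuming a FIFO queue preserves chunk order.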

Generalisation¶

The chunked-decoding-with-D2H-bottleneck shape generalises to:

Streaming generative audio — same shape at audio-frame granularity.
Image-batch generation — same shape at batch granularity (decode batch N's pixels while batch N+1 runs).
Multimodal video models — adding audio + control signals on top of latent-diffusion video keeps the same pixel-reconstruction-stage shape.

Seen in¶

sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation of the algorithmic shape. Synthesia hosts in-house latent-diffusion video models on EC2 G7e; AWS benchmarks on Wan 2.2 14B Hugging Face Diffusers as a public stand-in. The VAE decoder is the inference stage where the D2H bottleneck lives, and the Asynchronous Frame Generation Pipeline is the architectural fix.

systems/stable-diffusion — sibling latent-diffusion family (image, not video).
systems/wan-video-model — wiki-attested instance.
systems/vae-decoder — the bottleneck stage.
concepts/device-to-host-transfer — the inter-tier transfer that serialises chunks.
concepts/gpu-kernel-utilization — saturation metric.
concepts/real-time-factor — performance metric (3.21 → 2.95 in the wiki-attested case).
patterns/asynchronous-frame-generation-pipeline — fix pattern.
companies/synthesia — wiki-attested production user.