CONCEPT Cited by 1 source

Latent-diffusion video generation

Definition

Latent-diffusion video generation is an approach to generative video in which the diffusion process (iterative denoising from Gaussian noise toward a coherent output) runs in a compressed latent space rather than directly in pixel space. A VAE (variational auto-encoder) provides the compression: its encoder maps pixel-space video to a low-dimensional latent representation during training, and its decoder (systems/vae-decoder) maps latents back to pixels at inference time.

The technique extends the same trick that Stable Diffusion applied to image generation — operate in latent space to make compute and memory tractable — to temporally coherent video sequences.

High-level pipeline

text prompt
    ▼ (text encoder)
text embedding
    ▼
[Diffusion Process: many iterative denoising steps,
                    conditioned on text embedding,
                    in compressed latent space]
    ▼
final denoised LATENT video
    ▼ (VAE Decoder, chunked along temporal dimension)
final PIXEL video

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances, Figure 1.)

The two cost regimes are very different:

  • Diffusion process: many steps, each compute-bound on the GPU. Latents stay in VRAM throughout, with no per-step device-to-host (D2H) copy. The bottleneck is FLOPs / attention bandwidth.
  • VAE decoding: one pass, but chunked. The full pixel video doesn't fit in VRAM for arbitrarily long videos, so decoding emits one chunk of pixel frames at a time, and each chunk must be copied D2H and committed to storage before its buffer is released. The bottleneck is the D2H + host I/O serialisation between chunks, not raw GPU compute.

Why decoding is chunked along the temporal dimension

The pixel-space output is much larger than the latent representation. For arbitrarily long videos, holding the full decoded pixel video in VRAM is intractable: VRAM use scales with the generated video length and quickly exceeds the 96 GB ceiling on RTX PRO 6000 Blackwell / G7e for non-trivial video lengths.

The fix is to decode along the temporal dimension one latent frame at a time. AWS notes a typical chunk shape: one latent frame yields ~4 time-consecutive pixel frames in the Wan 2.2 14B benchmark.

This makes VRAM scale with chunk size, not full-video size — but creates the per-chunk D2H + storage commit serialisation point.
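The scaling argument above can be made concrete with a small byte-accounting sketch. It counts only the decoded pixel-output tensor (not decoder activations or any cross-chunk state), and the 1280x720 resolution, fp16 storage, and the `DecodeShape` / `peak_output_bytes` names are illustrative assumptions rather than figures from the source. Only the 41 latent frames and the ~4 pixel frames per latent frame come from the benchmark description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeShape:
    latent_frames: int            # latent frames in the whole video
    pixel_frames_per_latent: int  # ~4 in the source's benchmark
    height: int                   # assumed output height in pixels
    width: int                    # assumed output width in pixels
    channels: int = 3             # RGB
    bytes_per_value: int = 2      # fp16, an assumption


def pixel_bytes(shape: DecodeShape, latent_frames: int) -> int:
    """Bytes of decoded pixel output produced by `latent_frames` latent frames."""
    frames = latent_frames * shape.pixel_frames_per_latent
    return frames * shape.height * shape.width * shape.channels * shape.bytes_per_value


def peak_output_bytes(shape: DecodeShape, chunk_latent_frames: int) -> int:
    """Pixel-output bytes resident at once when decoding `chunk_latent_frames` per chunk."""
    return pixel_bytes(shape, min(chunk_latent_frames, shape.latent_frames))


shape = DecodeShape(latent_frames=41, pixel_frames_per_latent=4, height=720, width=1280)
full_video = peak_output_bytes(shape, chunk_latent_frames=shape.latent_frames)
one_latent_chunk = peak_output_bytes(shape, chunk_latent_frames=1)

print(f"whole video resident: {full_video / 1e6:.1f} MB")
print(f"one-latent-frame chunk: {one_latent_chunk / 1e6:.1f} MB")
```

Under these assumptions the resident output shrinks by a factor equal to the number of latent frames, and the chunked figure stays constant however long the video gets.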

Why the VAE decoder is the optimisation target

Within latent-diffusion video, the diffusion-process compute dominates wall-clock time, but the VAE decoder is the stage that benefits most from architectural overlap optimisation because:

  • Diffusion steps are GPU-only and perform no per-step D2H copy → no cheap structural fix.
  • VAE decoding has a per-chunk D2H + I/O step that is already on the critical path and can be overlapped with the next chunk's compute → cheap structural fix.

Hence the AWS / Synthesia post focuses on the VAE decoder rather than the diffusion process.
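To illustrate why overlapping is a structural win, here is a toy timing model of a two-stage pipeline, offered as a sketch only. The per-chunk durations (400 ms of decode compute, 100 ms of D2H + I/O) are hypothetical and are not taken from the benchmark. The function names and the two-buffer constraint are likewise assumptions made for illustration.

```python
def synchronous_time(compute_ms, copy_ms):
    """Each chunk's D2H + I/O finishes before the next chunk's compute starts."""
    return sum(c + d for c, d in zip(compute_ms, copy_ms))


def overlapped_time(compute_ms, copy_ms, buffers=2):
    """Copy of chunk i may overlap compute of later chunks.

    Compute for chunk i waits for the previous compute and for the buffer
    used `buffers` chunks earlier to be drained. Copies run in order.
    """
    compute_end, copy_end = [], []
    for i, (c, d) in enumerate(zip(compute_ms, copy_ms)):
        prev_compute = compute_end[i - 1] if i >= 1 else 0
        buffer_free = copy_end[i - buffers] if i >= buffers else 0
        compute_end.append(max(prev_compute, buffer_free) + c)
        prev_copy = copy_end[i - 1] if i >= 1 else 0
        copy_end.append(max(prev_copy, compute_end[i]) + d)
    return copy_end[-1] if copy_end else 0


# Hypothetical per-chunk costs for a 41-chunk decode.
compute_ms = [400] * 41
copy_ms = [100] * 41

sync_total = synchronous_time(compute_ms, copy_ms)
overlap_total = overlapped_time(compute_ms, copy_ms)
print(sync_total, overlap_total)
print("GPU busy fraction:", sum(compute_ms) / sync_total, sum(compute_ms) / overlap_total)
```

In this model the overlapped total is the compute time plus a single trailing copy, so the GPU idles only while the last chunk drains. Real pipelines will deviate wherever copy time exceeds compute time or buffers are scarce.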

Synthesia's production posture

companies/synthesia uses "in-house developed models...based on various architectures, including latent diffusion video generation models" hosted on EC2 G7e instances for the 96 GB VRAM headroom. Synthesia's specific in-house model architectures and sizes are not disclosed; the AWS post benchmarks on Wan 2.2 14B (open-source Hugging Face Diffusers) for reproducibility and as a public-benchmark stand-in.

Quantitative envelope (wiki-attested benchmark)

On a g7e.2xlarge running the Wan 2.2 14B Hugging Face Diffusers VAE decoder against a 41-latent-frame test video:

  • Synchronous baseline: mean 21.99 s/video, 82% kernel utilisation, RTF 3.21.
  • Asynchronous Frame Generation Pipeline: mean 20.17 s/video, 99.9% kernel utilisation, RTF 2.95.

(Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.)
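The reported figures can be cross-checked with a few lines of arithmetic. This sketch assumes RTF means processing time divided by video duration, which the page does not state. The variable names are illustrative, and only the four input numbers and the 41-frame count come from the benchmark above.

```python
# Figures reported in the benchmark above.
sync_s, async_s = 21.99, 20.17
sync_rtf, async_rtf = 3.21, 2.95
latent_frames = 41

saved_s = sync_s - async_s      # wall-clock seconds saved per video
reduction = saved_s / sync_s    # fractional reduction in decode time
speedup = sync_s / async_s      # throughput ratio

# Assumption: RTF = processing time / video duration.
# If that holds, both rows should imply the same video duration.
implied_duration_sync = sync_s / sync_rtf
implied_duration_async = async_s / async_rtf

print(f"saved {saved_s:.2f} s ({reduction:.1%}), speedup {speedup:.3f}x")
print(f"implied duration: {implied_duration_sync:.2f} s vs {implied_duration_async:.2f} s")
print(f"per latent frame: {sync_s / latent_frames:.3f} s -> {async_s / latent_frames:.3f} s")
```

The two implied durations agree to within a few hundredths of a second, which is consistent with the assumed RTF definition but does not confirm it.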

Architectural primitives required for efficient inference

For latent-diffusion video to deliver high kernel utilisation under chunked decoding:

  • Memory-efficient chunked VAE decoding — bounded VRAM per chunk.
  • Two CUDA streams (a Compute Stream and a Copy Stream) to overlap VAE compute with D2H.
  • Pinned host buffers — for fully-async D2H DMA.
  • Double-buffering — adjacent chunks operate on distinct memory regions.
  • Worker CPU thread for host-side I/O, freeing the main Python thread to launch CUDA kernels.

See patterns/asynchronous-frame-generation-pipeline for the full assembly.
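The hand-off between the main thread and the I/O worker can be sketched without a GPU. The following is a CPU-only analogue using the Python standard library, not the pipeline from the source: `fake_decode_into` is a hypothetical stand-in for VAE decode plus the async D2H copy, the `bytearray` objects stand in for pinned host buffers, and appending to a list stands in for the storage commit. It shows only the double-buffering and worker-thread structure.

```python
import queue
import threading

NUM_BUFFERS = 2   # double-buffering: adjacent chunks use distinct buffers
CHUNK_BYTES = 4   # tiny stand-in for one chunk of pixel frames


def fake_decode_into(buffer, chunk_index):
    """Hypothetical stand-in for decoding a chunk and copying it into a host buffer."""
    buffer[:] = bytes([chunk_index % 256]) * len(buffer)


def run_pipeline(num_chunks):
    free_buffers = queue.Queue()
    for _ in range(NUM_BUFFERS):
        free_buffers.put(bytearray(CHUNK_BYTES))
    ready = queue.Queue()
    written = []

    def io_worker():
        # Host-side I/O runs here so the main thread can keep issuing work.
        while True:
            item = ready.get()
            if item is None:        # sentinel: no more chunks
                return
            index, buffer = item
            # In a GPU version, the worker would first wait until the copy
            # into this buffer has completed (for example via a CUDA event).
            written.append((index, bytes(buffer)))  # stand-in for storage commit
            free_buffers.put(buffer)                # release only after commit

    worker = threading.Thread(target=io_worker)
    worker.start()
    for index in range(num_chunks):
        buffer = free_buffers.get()  # blocks while both buffers are in flight
        fake_decode_into(buffer, index)
        ready.put((index, buffer))
    ready.put(None)
    worker.join()
    return written


result = run_pipeline(5)
print([index for index, _ in result])
```

Blocking on `free_buffers.get()` is what keeps the producer from overwriting a chunk that has not yet been committed. With a single worker and a FIFO queue, chunks are committed in order.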

Generalisation

The chunked-decoding-with-D2H-bottleneck shape generalises to:

  • Streaming generative audio — same shape at audio-frame granularity.
  • Image-batch generation — same shape at batch granularity (decode batch N's pixels while batch N+1 runs).
  • Multimodal video models — adding audio + control signals on top of latent-diffusion video keeps the same pixel-reconstruction-stage shape.

Seen in
