Synthesia

Synthesia (synthesia.io) is an enterprise-focused AI video platform that lets users create video content, including video avatars that synthesise the likeness and voice of real people, without cameras or microphones. The company surfaces in this wiki as the collaborator on AWS's 2026-05-19 post describing the Asynchronous Frame Generation Pipeline optimisation for VAE-decoder video inference. Synthesia's Research Engineering team contributed both the production problem statement (GPU stalls on device-to-host, or D2H, transfers during chunked video decoding) and the architectural co-design.

Source posture: Tier 3. The wiki has ingested a single article so far, the AWS co-marketing post; no Synthesia-run public engineering blog is in the wiki feed list.

What Synthesia builds

Architectural posture

  • Hosts inference on Amazon EC2 — explicitly chooses EC2 over managed inference services for the "flexibility and control over the underlying hardware that the service provides".
  • Runs on the G7e instance family as a cost-efficient option for GPU-memory-intensive generative AI video models — the 96 GB VRAM on the NVIDIA RTX PRO 6000 Blackwell is the load-bearing reason. Latent-diffusion video generation has a large activation + KV footprint per generated frame; VRAM ceilings constrain what models can run end-to-end on one GPU.
  • Uses VAE decoders in production model architectures — the component the AWS post optimises around. VAE-decoder stages had become a measurable bottleneck on Synthesia's serving GPUs because the rate of saving decoded video frames to storage was lower than the GPU's compute rate, causing per-chunk GPU stalls.
  • Co-designed the Asynchronous Frame Generation Pipeline with AWS to remove that bottleneck. Synthesia Research Engineering is named as the collaborating team on the technique that ships in AWS's 2026-05-19 post and reference implementation.
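The overlap idea behind the pipeline can be sketched with the standard library alone. This is a minimal illustration, not the post's reference implementation: `decode_chunk` and `copy_to_host` are hypothetical stand-ins that sleep instead of running a VAE decoder or a D2H copy, and the bounded queue depth of 2 is an assumed double-buffering choice. The point it shows is that the producer (decode) can start chunk N+1 while a background worker is still draining chunk N.

```python
import queue
import threading
import time


def decode_chunk(index: int, compute_s: float) -> list[str]:
    """Hypothetical stand-in for the GPU VAE-decoder step (sleeps, then returns frame labels)."""
    time.sleep(compute_s)
    return [f"chunk{index}-frame{j}" for j in range(4)]


def copy_to_host(frames: list[str], transfer_s: float) -> list[str]:
    """Hypothetical stand-in for the device-to-host copy of one decoded chunk."""
    time.sleep(transfer_s)
    return list(frames)


def run_serial(num_chunks: int, compute_s: float, transfer_s: float) -> list[str]:
    """Baseline: each chunk is decoded, then copied, before the next decode starts."""
    out: list[str] = []
    for i in range(num_chunks):
        out.extend(copy_to_host(decode_chunk(i, compute_s), transfer_s))
    return out


def run_pipelined(num_chunks: int, compute_s: float, transfer_s: float,
                  depth: int = 2) -> list[str]:
    """Overlapped: a worker thread copies chunk N while the main thread decodes chunk N+1."""
    pending: queue.Queue = queue.Queue(maxsize=depth)  # bounded, so decode cannot run far ahead
    out: list[str] = []

    def drain() -> None:
        while True:
            frames = pending.get()
            if frames is None:  # sentinel: no more chunks
                break
            out.extend(copy_to_host(frames, transfer_s))

    worker = threading.Thread(target=drain)
    worker.start()
    for i in range(num_chunks):
        pending.put(decode_chunk(i, compute_s))
    pending.put(None)
    worker.join()
    return out
```

Both functions return the same frames in the same order; only the scheduling differs. On real hardware the copy would more plausibly be issued on a separate GPU stream into pinned host memory rather than from a Python thread, but that detail is an assumption here and is not taken from the post.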

Why the sysdesign-wiki cares

The Synthesia collaboration is a customer-as-co-author disclosure on the AWS blog. The post is rare among AWS Architecture Blog content in that it discloses a specific production-bottleneck shape (VAE-decoder D2H stall on a chunked latent-diffusion video pipeline) that originated on a customer's serving fleet. The shape generalises beyond Synthesia: any chunked video generation pipeline that transfers frames to host has the same potential bottleneck. Synthesia's production load is what made the gap visible enough to instrument, profile, and fix.
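A simple cost model illustrates why the shape generalises. Assume every chunk costs `c` seconds of GPU decode and `t` seconds of D2H transfer; both symbols and the numbers used below are illustrative assumptions, not figures from the post. Run serially, the GPU idles for `t` per chunk. With overlap, only the first decode and the last transfer are exposed, and the steady state is bounded by whichever stage is slower.

```python
def serial_time(num_chunks: int, c: float, t: float) -> float:
    """Total time when each chunk is decoded and then transferred before the next starts."""
    return num_chunks * (c + t)


def pipelined_time(num_chunks: int, c: float, t: float) -> float:
    """Total time when the transfer of chunk N overlaps the decode of chunk N+1."""
    if num_chunks == 0:
        return 0.0
    # First decode and last transfer cannot be hidden; the rest runs at the slower stage's pace.
    return c + (num_chunks - 1) * max(c, t) + t


# Illustrative values only: 8 chunks, 2.0 s decode, 0.5 s transfer per chunk.
print(serial_time(8, 2.0, 0.5))     # 20.0
print(pipelined_time(8, 2.0, 0.5))  # 16.5
```

Under this model the saving approaches `t` per chunk when `t <= c`, so the benefit grows with the share of time each chunk spends on transfer.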

The disclosure is the first wiki canonicalisation of the VAE-decoder-bottleneck-on-D2H failure mode at production altitude, distinct from related patterns at LLM-batch-serving altitude (post-process batch N while GPU runs N+1, see concepts/async-cpu-gpu-pipelined-scheduling) and at graphics-API altitude (sync vs async readback, see concepts/synchronous-vs-asynchronous-readback).

Recent articles

Key systems (as surfaced in ingested sources)

  • systems/aws-ec2-g7e — production GPU substrate for Synthesia's video inference (NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM). Chosen for VRAM headroom on GPU-memory-intensive latent-diffusion video models.
  • systems/nvidia-rtx-pro-6000-blackwell — the GPU under the G7e family.
  • systems/vae-decoder — the production architectural component whose D2H bottleneck triggered the optimisation work.
  • systems/wan-video-model — public-benchmark stand-in (Hugging Face Diffusers Wan 2.2 14B) used in the reference benchmark for reproducibility; Synthesia's own in-house production models are not disclosed but use related latent-diffusion architectures.

Key concepts and patterns surfaced

Open questions / not disclosed

  • Synthesia's in-house production model names + sizes are not in the post (the published Wan 2.2 14B Diffusers benchmark is a reproducibility stand-in).
  • Production-fleet realised gain — the 8.2% / RTF 3.21→2.95 numbers are on the unoptimised public benchmark; Synthesia's own fleet-wide gains aren't quantified.
  • Multi-instance / multi-tenant serving posture — the post is per-decoder (one g7e.2xlarge); Synthesia's broader inference serving topology (autoscaling, queueing, request batching, cold-start) is not disclosed.
  • Other model stages — the post addresses the VAE-decoder stage only; the upstream diffusion-process stage's kernel utilisation is not covered.
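As a reading aid for the benchmark numbers quoted above, the sketch below relates the two RTF values to a percentage. It assumes RTF means processing time divided by output video duration (lower is better), which is the usual definition but is not spelled out on this page. This page also does not state which of the two ratios below the 8.2% figure refers to.

```python
def relative_time_reduction(rtf_before: float, rtf_after: float) -> float:
    """Fraction of processing time saved per second of output video."""
    return (rtf_before - rtf_after) / rtf_before


def throughput_gain(rtf_before: float, rtf_after: float) -> float:
    """Fractional increase in video seconds produced per second of processing."""
    return rtf_before / rtf_after - 1.0


print(f"{relative_time_reduction(3.21, 2.95):.1%}")  # 8.1%
print(f"{throughput_gain(3.21, 2.95):.1%}")          # 8.8%
```

The time-reduction reading gives about 8.1% from the rounded RTF values, which is within rounding distance of the quoted 8.2%.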