Synthesia
Synthesia (synthesia.io) is an enterprise-focused AI video platform that lets users create video content — including video avatars that synthesise the likeness and voice of real people — without cameras or microphones. The company surfaces in this wiki as the collaborator on AWS's 2026-05-19 post describing the Asynchronous Frame Generation Pipeline optimisation for VAE-decoder video inference, where Synthesia's Research Engineering team contributed both the production problem statement (GPU stalls on D2H during chunked video decoding) and the architectural co-design.
Source posture: Tier 3. The only ingest so far is the AWS co-marketing post, and no public engineering blog of Synthesia's own is in the wiki feed list.
What Synthesia builds
- AI video platform, enterprise-targeted, for camera-free video creation.
- Video avatars that synthesise likeness + voice of real people.
- A series of in-house developed video generation models based on various architectures, including latent-diffusion video generation models (Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances; Synthesia Research's published model family is at synthesiaresearch.github.io/express-video).
Architectural posture
- Hosts inference on Amazon EC2 — explicitly chooses EC2 over managed inference services for the "flexibility and control over the underlying hardware that the service provides".
- Runs on the G7e instance family as a cost-efficient option for GPU-memory-intensive generative AI video models — the 96 GB VRAM on the NVIDIA RTX PRO 6000 Blackwell is the load-bearing reason. Latent-diffusion video generation has a large activation + KV footprint per generated frame; VRAM ceilings constrain what models can run end-to-end on one GPU.
- Uses VAE decoders in production model architectures; this is the component the AWS post optimises around. VAE-decoder stages had become a measurable bottleneck on Synthesia's serving GPUs because decoded video frames could not be moved off the GPU and saved as fast as the GPU produced them, so the GPU stalled on every chunk.
- Co-designed the Asynchronous Frame Generation Pipeline with AWS to remove that bottleneck. Synthesia Research Engineering is named as the collaborating team on the technique that ships in AWS's 2026-05-19 post and reference implementation.
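The effect of removing the per-chunk stall can be illustrated with a small timing model. This is a minimal sketch, not the AWS reference implementation: the function names and per-chunk timings are hypothetical, and the model assumes each copy is shorter than a decode, so the compute stream never has to wait for a free buffer.

```python
def serial_total(compute_ms, copy_ms):
    """Baseline: the GPU waits for each chunk's D2H copy before decoding the next."""
    return sum(c + d for c, d in zip(compute_ms, copy_ms))


def overlapped_total(compute_ms, copy_ms):
    """Overlap: chunk i is copied on a second stream while chunk i+1 decodes."""
    compute_done = 0.0  # time at which the compute stream finishes the current chunk
    copy_done = 0.0     # time at which the copy stream finishes the current chunk
    for c, d in zip(compute_ms, copy_ms):
        compute_done += c  # in this model, compute never waits on a copy
        # A copy starts once its chunk is decoded and the previous copy is done.
        copy_done = max(compute_done, copy_done) + d
    return copy_done


def gpu_busy_fraction(compute_ms, total_ms):
    """Share of wall-clock time the GPU spends decoding."""
    return sum(compute_ms) / total_ms


# Hypothetical per-chunk timings in milliseconds (illustrative, not from the AWS post).
compute = [100.0] * 5
copy = [20.0] * 5

serial = serial_total(compute, copy)
overlapped = overlapped_total(compute, copy)
print(f"serial:     {serial:.0f} ms, GPU busy {gpu_busy_fraction(compute, serial):.1%}")
print(f"overlapped: {overlapped:.0f} ms, GPU busy {gpu_busy_fraction(compute, overlapped):.1%}")
```

In the overlapped case only the final chunk's copy remains exposed, which is why the busy fraction approaches 100% as the number of chunks grows.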
Why the sysdesign-wiki cares
The 2026-05-19 post is a customer-as-co-author disclosure. It is rare among AWS Architecture Blog content in that it discloses a specific production-bottleneck shape (VAE-decoder D2H stall on a chunked latent-diffusion video pipeline) that originated on a customer's serving fleet. The shape generalises beyond Synthesia, since any chunked video generation pipeline that transfers frames to host has the same potential bottleneck, but Synthesia's production load is what made the gap visible enough to instrument, profile, and fix.
The disclosure is the first wiki canonicalisation of the VAE-decoder-bottleneck-on-D2H failure mode at production altitude, distinct from related patterns at LLM-batch-serving altitude (post-process batch N while GPU runs N+1, see concepts/async-cpu-gpu-pipelined-scheduling) and at graphics-API altitude (sync vs async readback, see concepts/synchronous-vs-asynchronous-readback).
Recent articles
- 2026-05-19 — How Synthesia optimizes generative AI video inference on Amazon EC2 G7e instances (AWS Architecture Blog, co-authored with Synthesia Research Engineering). sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances. First wiki disclosure: VAE-decoder D2H stall as the binding bottleneck for chunked latent-diffusion video inference; the Asynchronous Frame Generation Pipeline (dual CUDA streams + double pinned host buffers + dedicated worker thread + CUDA events) lifts kernel utilisation from 82% → 99.9% and cuts decode latency by 8.2%. Reference implementation: aws-samples/sample-asynchronous-video-decoding.
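The headline numbers can be cross-checked with a few lines of arithmetic. This is a sketch under one assumption that the entry above does not state: that the real-time factor is processing time divided by output video duration, so a lower RTF means faster decoding and the RTF change corresponds to the quoted latency cut. The helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Figures quoted in this wiki from the AWS post.
rtf_before, rtf_after = 3.21, 2.95
util_before, util_after = 0.82, 0.999

rtf_reduction = relative_reduction(rtf_before, rtf_after)
idle_before = 1.0 - util_before  # share of time the GPU kernels were idle
idle_after = 1.0 - util_after

print(f"RTF reduction: {rtf_reduction:.1%}")
print(f"GPU idle share: {idle_before:.1%} -> {idle_after:.1%}")
```

The RTF change works out to roughly 8.1%, which agrees with the quoted 8.2% to within the rounding of the two-decimal RTF values.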
Key systems (as surfaced in ingested sources)
- systems/aws-ec2-g7e — production GPU substrate for Synthesia's video inference (NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM). Chosen for VRAM headroom on GPU-memory-intensive latent-diffusion video models.
- systems/nvidia-rtx-pro-6000-blackwell — the GPU under the G7e family.
- systems/vae-decoder — the production architectural component whose D2H bottleneck triggered the optimisation work.
- systems/wan-video-model — public-benchmark stand-in (Hugging Face Diffusers Wan 2.2 14B) used in the reference benchmark for reproducibility; Synthesia's own in-house production models are not disclosed but use related latent-diffusion architectures.
Key concepts and patterns surfaced
- concepts/latent-diffusion-video-generation — the algorithmic shape of Synthesia's video model family (denoise in compressed latent space, decode back to pixel space via VAE).
- concepts/gpu-kernel-utilization — the metric that surfaced the bottleneck (82% baseline).
- concepts/device-to-host-transfer — the inter-tier transfer step that was serialising the pipeline.
- concepts/cuda-stream — the CUDA primitive used to decouple compute from copy.
- concepts/pinned-memory — page-locked host RAM as the prerequisite for fully-async D2H.
- concepts/real-time-factor — the video-decode performance metric Synthesia tracks (RTF 3.21 → 2.95).
- patterns/asynchronous-frame-generation-pipeline — the umbrella pattern Synthesia + AWS co-designed.
- patterns/dual-cuda-stream-compute-and-copy-overlap — half of the pattern: CUDA-stream-side overlap.
- patterns/double-buffer-cuda-events-pipeline-overlap — half of the pattern: buffer + barrier-side overlap.
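The double-buffer handoff behind the patterns listed above can be sketched on the CPU alone. This is an analogy, not the reference implementation: it assumes `threading.Event` objects stand in for CUDA events, plain Python lists stand in for pinned host buffers, and doubling each value stands in for the VAE decode. The name `run_pipeline` is hypothetical.

```python
import threading

NUM_BUFFERS = 2  # double buffering


def run_pipeline(chunks):
    """Decode chunks on the main thread while a worker thread saves earlier ones."""
    buffers = [None] * NUM_BUFFERS
    ready = [threading.Event() for _ in range(NUM_BUFFERS)]  # buffer holds a finished chunk
    free = [threading.Event() for _ in range(NUM_BUFFERS)]   # buffer may be overwritten
    for event in free:
        event.set()  # both buffers start out free
    saved = []

    def worker(n_chunks):
        for i in range(n_chunks):
            slot = i % NUM_BUFFERS
            ready[slot].wait()                 # analogous to waiting on a CUDA event
            ready[slot].clear()
            saved.append(list(buffers[slot]))  # stand-in for encoding/saving frames
            free[slot].set()                   # hand the buffer back to the producer

    thread = threading.Thread(target=worker, args=(len(chunks),))
    thread.start()
    for i, chunk in enumerate(chunks):
        slot = i % NUM_BUFFERS
        decoded = [x * 2 for x in chunk]  # stand-in for the VAE decode of one chunk
        free[slot].wait()                 # block only if the worker still owns this buffer
        free[slot].clear()
        buffers[slot] = decoded           # stand-in for the async D2H copy
        ready[slot].set()                 # analogous to recording a CUDA event
    thread.join()
    return saved


result = run_pipeline([[1, 2], [3], [4, 5, 6]])
print(result)
```

Because the decode of the next chunk happens before the producer waits for a free buffer, compute can run up to one chunk ahead of the worker, which is the overlap the pattern relies on.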
Open questions / not disclosed
- Synthesia's in-house production model names + sizes are not in the post (the published Wan 2.2 14B Diffusers benchmark is a reproducibility stand-in).
- Production-fleet realised gain: the 8.2% / RTF 3.21→2.95 numbers come from the public Wan 2.2 14B benchmark, measured against its unoptimised baseline; Synthesia's own fleet-wide gains aren't quantified.
- Multi-instance / multi-tenant serving posture — the post is per-decoder (one g7e.2xlarge); Synthesia's broader inference serving topology (autoscaling, queueing, request batching, cold-start) is not disclosed.
- Other model stages — the post addresses the VAE-decoder stage only; the upstream diffusion-process stage's kernel utilisation is not covered.
Related
- companies/aws — co-author of the 2026-05-19 post.
- systems/aws-ec2-g7e — production GPU instance family.
- systems/vae-decoder — the architectural component being optimised.
- concepts/latent-diffusion-video-generation — Synthesia's algorithmic backbone.
- patterns/asynchronous-frame-generation-pipeline — collaboration output.