CONCEPT Cited by 1 source
Real Time Factor (RTF)
Definition
Real Time Factor (RTF) is the ratio of processing wall-clock time to output media duration:

$$\mathrm{RTF} = \frac{\text{processing wall-clock time}}{\text{output media duration}}$$

Lower is better. Concrete interpretations:
- RTF = 1.0 — processing keeps up with playback in real time.
- RTF < 1.0 — faster than real time (e.g. RTF 0.5 = output generated in half the time it takes to play back).
- RTF > 1.0 — slower than real time. RTF = 3.21 means a 10-second video takes 32.1 seconds to generate.
RTF is the standard performance metric in the speech synthesis and video generation literature. It is a dimensionless ratio, so the same number can be compared across model sizes and hardware, and it is directly meaningful to a product owner ("how long does the user wait per second of generated content?").
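The definition and its three interpretations can be sketched in a few lines of Python. The function names `real_time_factor` and `describe` are illustrative and do not come from the source; the example reuses the 10-second video that takes 32.1 seconds to generate.

```python
def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    """Wall-clock processing time divided by output media duration."""
    if media_seconds <= 0:
        raise ValueError("media_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / media_seconds


def describe(rtf: float) -> str:
    """Map an RTF value onto the three interpretations listed above."""
    if rtf < 1.0:
        return "faster than real time"
    if rtf > 1.0:
        return "slower than real time"
    return "real time"


# A 10-second video that takes 32.1 seconds to generate.
rtf = real_time_factor(processing_seconds=32.1, media_seconds=10.0)
print(f"RTF = {rtf:.2f} ({describe(rtf)})")
```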
Wiki-attested datapoint
On a g7e.2xlarge running the Wan 2.2 14B Hugging Face Diffusers VAE decoder against a 41-latent-frame test video:
| Pipeline | RTF |
|---|---|
| Synchronous frame generation | 3.21 |
| Asynchronous frame generation | 2.95 |
Both are above 1.0, so the pipeline is slower than real time in either mode. The optimisation lowers RTF by ~8.1% (3.21 → 2.95), a gain that can be spent either as lower latency for the user (faster decode) or as higher throughput (more videos per GPU per hour).
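The ~8.1% figure is the relative reduction in RTF itself. As a quick check of the arithmetic, the sketch below also computes the reduction measured against the distance to real time (RTF minus 1.0), which is a different and larger number. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


rtf_sync, rtf_async = 3.21, 2.95

# Reduction in RTF itself.
print(f"RTF reduction: {relative_reduction(rtf_sync, rtf_async):.1%}")  # 8.1%

# Reduction in the distance to real time (RTF = 1.0).
gap_reduction = relative_reduction(rtf_sync - 1.0, rtf_async - 1.0)
print(f"Gap-to-real-time reduction: {gap_reduction:.1%}")  # 11.8%
```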
RTF vs raw latency
RTF and per-video latency are equivalent metrics for a fixed video duration — they differ only by a constant factor (the video duration itself). Why prefer RTF in published benchmarks:
- Comparable across video lengths. A 10-second video and a 60-second video have very different latencies but ideally similar RTFs (assuming the pipeline scales linearly).
- Comparable across hardware. RTF lets you compare a benchmark on g7e.2xlarge with one on a different GPU — the user-visible answer ("how long do I wait per second of generated content?") is the same metric.
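- Connects directly to product economics. RTF × content duration is the GPU time spent per piece; multiplying that by the GPU-hour price (with the duration expressed in hours) gives the serving cost per generated piece. AWS's $896/1k-hours theoretical saving is a direct RTF consequence.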
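The cost relationship in the last bullet can be sketched as follows. The GPU-hour price used here is a made-up placeholder, not a figure from the source, and the sketch assumes one piece is generated at a time on a fully billed GPU.

```python
SECONDS_PER_HOUR = 3600.0


def cost_per_piece(rtf: float, media_seconds: float, price_per_gpu_hour: float) -> float:
    """Serving cost of one generated piece.

    RTF x media duration gives GPU seconds; dividing by 3600 converts
    to GPU hours, which are then priced at `price_per_gpu_hour`.
    """
    gpu_hours = rtf * media_seconds / SECONDS_PER_HOUR
    return gpu_hours * price_per_gpu_hour


# Hypothetical price, chosen only to make the arithmetic concrete.
HYPOTHETICAL_PRICE_PER_GPU_HOUR = 2.00

for label, rtf in [("synchronous", 3.21), ("asynchronous", 2.95)]:
    cost = cost_per_piece(rtf, media_seconds=10.0,
                          price_per_gpu_hour=HYPOTHETICAL_PRICE_PER_GPU_HOUR)
    print(f"{label}: {cost:.5f} per 10-second video")
```

Because cost is linear in RTF, any relative RTF reduction carries over unchanged to cost per piece under these assumptions.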
Limitations
- RTF doesn't decompose into per-stage costs. A single RTF number combines diffusion, VAE decoding, D2H, host I/O, and serialisation gaps. Profiling is needed to localise where the RTF is going.
- RTF is not a tail-latency metric. Reporting only mean RTF hides P99 spikes. AWS's wiki-attested benchmark shows P99 latency tracking the mean closely (P99 22.01 s vs mean 21.99 s in the synchronous case), so mean RTF and P99 RTF are nearly identical here, but that is not generally true across pipelines.
- RTF < 1.0 doesn't imply real-time-streamable output. The pipeline may produce the full video faster than playback time but only deliver it at the end, not as a stream. Streaming-real-time output requires both RTF < 1.0 and bounded per-chunk latency.
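The last two limitations can be illustrated with a minimal sketch. It assumes a nearest-rank percentile (one of several common definitions) and models "bounded per-chunk latency" as a simple maximum, which is a simplification. All names and sample numbers are illustrative and are not taken from the benchmark.

```python
import math
from typing import Sequence, Tuple


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def rtf_summary(latencies_s: Sequence[float], media_seconds: float) -> Tuple[float, float]:
    """Return (mean RTF, P99 RTF) for repeated runs on one fixed-length video."""
    mean_rtf = sum(latencies_s) / len(latencies_s) / media_seconds
    p99_rtf = percentile_nearest_rank(latencies_s, 99) / media_seconds
    return mean_rtf, p99_rtf


def is_streamable(chunk_latencies_s: Sequence[float],
                  chunk_media_seconds: float,
                  max_chunk_latency_s: float) -> bool:
    """Require overall RTF < 1.0 AND every chunk within the latency bound."""
    total_media = chunk_media_seconds * len(chunk_latencies_s)
    overall_rtf = sum(chunk_latencies_s) / total_media
    return overall_rtf < 1.0 and max(chunk_latencies_s) <= max_chunk_latency_s


# Overall RTF is about 0.73, yet one slow chunk breaks streaming.
print(is_streamable([0.1, 0.1, 2.0], chunk_media_seconds=1.0, max_chunk_latency_s=0.6))
```

The final call prints `False`: the run is faster than real time on aggregate, but a single 2-second chunk would stall playback.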
Related metrics in the wiki
- concepts/gpu-kernel-utilization: complementary saturation metric. The RTF improvement of 3.21 → 2.95 in the wiki-attested case coincides with a kernel-utilisation lift from 82% to 99.9%, i.e. the GPU spends ~18 percentage points more of its wall-clock time actually computing.
- Per-video latency: in this case the equivalent formulation; mean 21.99 s → 20.17 s.
- Throughput: videos per GPU-hour, the inverse-of-latency framing.
- Cost per video: RTF × video duration (in hours) × GPU-hour price.
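As a rough illustration of the inverse-of-latency framing, the sketch below converts the two mean latencies quoted above into videos per GPU-hour. It assumes one video is processed at a time with no batching or overlap between videos, which may not match a production deployment; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600.0


def videos_per_gpu_hour(mean_latency_s: float) -> float:
    """Throughput under the one-video-at-a-time assumption."""
    if mean_latency_s <= 0:
        raise ValueError("mean_latency_s must be positive")
    return SECONDS_PER_HOUR / mean_latency_s


for label, latency_s in [("synchronous", 21.99), ("asynchronous", 20.17)]:
    print(f"{label}: {videos_per_gpu_hour(latency_s):.1f} videos per GPU-hour")
```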
Generalisation
RTF as a concept generalises to any time-domain generative output:
- Speech / TTS — wall-clock to generate / duration of generated audio. The historical home of the RTF metric.
- Streaming audio generation — same.
- Video generation — wiki-attested.
- Music generation — same shape.
- Live agent voice generation — RTF < 1.0 is required for live response; RTF > 1.0 forces a buffered turn-based interaction.
The metric does not generalise meaningfully to non-time-domain outputs (a single image, a snippet of code) where "output duration" has no natural definition.
Seen in
- sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. RTF = 3.21 (synchronous) → 2.95 (asynchronous frame generation pipeline) on the unoptimised Wan 2.2 14B Hugging Face Diffusers VAE decoder running on g7e.2xlarge.
Related
- concepts/gpu-kernel-utilization: complementary saturation metric whose improvement underlies the RTF improvement.
- concepts/latent-diffusion-video-generation — workload shape RTF is reported on.
- patterns/asynchronous-frame-generation-pipeline — pattern whose effect is reported as RTF improvement.