CONCEPT Cited by 1 source

Real Time Factor (RTF)¶

Definition¶

Real Time Factor (RTF) is the ratio of processing wall-clock time to output media duration:

RTF = wall_clock_time_to_process / duration_of_output_media

Lower is better. Concrete interpretations:

RTF = 1.0 — processing keeps up with playback in real time.
RTF < 1.0 — faster than real time (e.g. RTF 0.5 = output generated in half the time it takes to play back).
RTF > 1.0 — slower than real time. RTF = 3.21 means a 10-second video takes 32.1 seconds to generate.
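
The definition and the three interpretations above can be written as a minimal Python sketch. The function names `real_time_factor` and `describe` are illustrative and not from the source; the sample values reuse the 10-second video from the list.

```python
def real_time_factor(wall_clock_s: float, media_duration_s: float) -> float:
    """Return wall-clock processing time divided by output media duration."""
    if media_duration_s <= 0:
        raise ValueError("media duration must be positive")
    return wall_clock_s / media_duration_s


def describe(rtf: float) -> str:
    """Classify an RTF value relative to real-time playback."""
    if rtf < 1.0:
        return "faster than real time"
    if rtf > 1.0:
        return "slower than real time"
    return "real time"


# The 10-second video from the text, generated in 32.1 s of wall-clock time.
rtf = real_time_factor(32.1, 10.0)
print(f"RTF = {rtf:.2f} ({describe(rtf)})")
```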

RTF is the standard performance metric in the speech synthesis and video generation literature. It uses the same units across model sizes and hardware, and those units are directly meaningful to a product owner ("how long does the user wait per second of generated content?").

Wiki-attested datapoint¶

Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.

On a g7e.2xlarge running the Wan 2.2 14B Hugging Face Diffusers VAE decoder against a 41-latent-frame test video:

Pipeline	RTF
Synchronous frame generation	3.21
Asynchronous frame generation	2.95

Both are above 1.0, so the pipeline is slower than real time in either mode. The optimisation lowers RTF by ~8.1% (3.21 → 2.95). That saving can be spent either as lower latency for the user (faster decode) or as higher throughput (more videos per GPU per hour).
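
The ~8.1% figure is the relative drop in RTF itself. As a rough check, the sketch below recomputes it from the two table values and, for comparison, also shows the reduction measured against the distance to real time (RTF = 1.0). The variable names are illustrative.

```python
rtf_sync = 3.21   # synchronous frame generation (from the table)
rtf_async = 2.95  # asynchronous frame generation (from the table)

# Relative reduction in RTF itself.
rtf_reduction = (rtf_sync - rtf_async) / rtf_sync

# Relative reduction in the remaining distance to real time (RTF = 1.0).
gap_reduction = (rtf_sync - rtf_async) / (rtf_sync - 1.0)

print(f"RTF reduction: {rtf_reduction:.1%}")                 # 8.1%
print(f"Gap-to-real-time reduction: {gap_reduction:.1%}")    # 11.8%
```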

RTF vs raw latency¶

RTF and per-video latency are equivalent metrics for a fixed video duration: they differ only by a constant factor (the video duration itself). Reasons to prefer RTF in published benchmarks:

Comparable across video lengths. A 10-second video and a 60-second video have very different latencies but ideally similar RTFs (assuming the pipeline scales linearly).
Comparable across hardware. RTF lets you compare a benchmark on g7e.2xlarge with one on a different GPU, because both answer the same user-visible question ("how long do I wait per second of generated content?").
Connects directly to product economics. RTF × content duration × GPU-hour cost = serving cost per generated piece, with the duration expressed in hours. AWS's $896/1k-hours theoretical saving follows directly from the RTF improvement.
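
The cost relation in the last item can be sketched as follows. The GPU-hour price here is a made-up placeholder chosen for easy arithmetic, not an actual g7e.2xlarge price, and `serving_cost` is an illustrative name.

```python
def serving_cost(rtf: float, content_duration_s: float, gpu_hour_price: float) -> float:
    """Cost of one generated piece: GPU-seconds consumed, converted to hours, times price."""
    gpu_seconds = rtf * content_duration_s
    return gpu_seconds / 3600.0 * gpu_hour_price


# Hypothetical price per GPU-hour (placeholder, NOT a real instance price).
HYPOTHETICAL_PRICE = 3.60

for label, rtf in [("synchronous", 3.21), ("asynchronous", 2.95)]:
    cost = serving_cost(rtf, content_duration_s=10.0, gpu_hour_price=HYPOTHETICAL_PRICE)
    print(f"{label}: {cost:.4f} per 10-second video")
```

Because the price and duration are constant factors, the relative cost saving equals the relative RTF reduction, whatever price is plugged in.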

Limitations¶

RTF doesn't decompose into per-stage costs. A single RTF number combines diffusion, VAE decoding, D2H, host I/O, and serialisation gaps. Profiling is needed to localise where the time is going.
RTF is not a tail-latency metric. Reporting only mean RTF hides P99 spikes. AWS's wiki-attested benchmark shows P99 latency tracking the mean closely (P99 22.01 vs mean 21.99 in the synchronous case), so RTF and P99-RTF are similar in this case — but that's not generally true across pipelines.
RTF < 1.0 doesn't imply real-time-streamable output. The pipeline may produce the full video faster than playback time but only deliver it at the end, not as a stream. Streaming-real-time output requires both RTF < 1.0 and bounded per-chunk latency.
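
One way to make the streaming condition concrete is to compute the startup delay a player would need to avoid stalling. The sketch below assumes chunks are generated sequentially and played back-to-back; the function name and all chunk timings are illustrative and not taken from the source.

```python
from itertools import accumulate


def min_startup_delay(chunk_gen_s, chunk_play_s):
    """Smallest playback delay (seconds) that avoids an underrun.

    Chunk i is ready at the cumulative generation time up to and including i,
    and is needed at delay + total playback duration of the chunks before it.
    """
    if len(chunk_gen_s) != len(chunk_play_s) or not chunk_gen_s:
        raise ValueError("need two non-empty sequences of equal length")
    ready_at = list(accumulate(chunk_gen_s))
    needed_at = [0.0] + list(accumulate(chunk_play_s))[:-1]
    return max(r - n for r, n in zip(ready_at, needed_at))


# Three hypothetical pipelines, each producing 8 s of media in 4 s (RTF = 0.5).
steady = min_startup_delay([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0])
spiky = min_startup_delay([0.5, 3.0, 0.25, 0.25], [2.0, 2.0, 2.0, 2.0])
all_at_end = min_startup_delay([4.0], [8.0])

print(steady, spiky, all_at_end)  # 1.0 1.5 4.0
```

All three cases share the same overall RTF, yet the wait before playback can start differs by a factor of four, which is why per-chunk latency has to be bounded separately.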

concepts/gpu-kernel-utilization: orthogonal saturation metric. The RTF improvement of 3.21 → 2.95 in the wiki-attested case corresponds to the kernel-utilisation lift from 82% to 99.9%; the GPU now spends ~18 percentage points more of its wall-clock time actually computing.
Per-video latency: in this case the equivalent formulation; mean 21.99 s → 20.17 s.
Throughput: videos / GPU-hour, the inverse-of-latency framing.
Cost per video: RTF × video duration (in hours) × GPU-hour price.
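
The latency and throughput framings can be related with a short calculation. The sketch below uses the two mean latencies quoted above and assumes one video is processed at a time per GPU (no batching or overlap between videos), which the source does not state; `videos_per_gpu_hour` is an illustrative name.

```python
def videos_per_gpu_hour(mean_latency_s: float) -> float:
    """Throughput under the assumption of one video at a time per GPU."""
    return 3600.0 / mean_latency_s


sync_tp = videos_per_gpu_hour(21.99)   # synchronous mean latency
async_tp = videos_per_gpu_hour(20.17)  # asynchronous mean latency

print(f"synchronous:  {sync_tp:.2f} videos/GPU-hour")
print(f"asynchronous: {async_tp:.2f} videos/GPU-hour")
print(f"throughput gain: {async_tp / sync_tp - 1:.1%}")

# Consistency check: latency / RTF gives the implied output duration,
# which should be about the same for both pipelines.
print(f"implied duration: {21.99 / 3.21:.2f} s vs {20.17 / 2.95:.2f} s")
```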

Generalisation¶

RTF as a concept generalises to any time-domain generative output:

Speech / TTS — wall-clock to generate / duration of generated audio. The historical home of the RTF metric.
Streaming audio generation — same.
Video generation — wiki-attested.
Music generation — same shape.
Live agent voice generation — RTF < 1.0 is required for live response; RTF > 1.0 forces a buffered turn-based interaction.

The metric does not generalise meaningfully to non-time-domain outputs (a single image, a snippet of code) where "output duration" has no natural definition.

Seen in¶

sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances — first wiki canonicalisation. RTF = 3.21 (synchronous) → 2.95 (asynchronous frame generation pipeline) on the unoptimised Wan 2.2 14B Hugging Face Diffusers VAE decoder running on g7e.2xlarge.

concepts/gpu-kernel-utilization — orthogonal saturation metric whose improvement underlies the RTF improvement.
concepts/latent-diffusion-video-generation — workload shape RTF is reported on.
patterns/asynchronous-frame-generation-pipeline — pattern whose effect is reported as RTF improvement.