CONCEPT Cited by 1 source

Real Time Factor (RTF)

Definition

Real Time Factor (RTF) is the ratio of processing wall-clock time to output media duration:

RTF = wall_clock_time_to_process / duration_of_output_media

Lower is better. Concrete interpretations:

  • RTF = 1.0 — processing keeps up with playback in real time.
  • RTF < 1.0 — faster than real time (e.g. RTF 0.5 = output generated in half the time it takes to play back).
  • RTF > 1.0 — slower than real time. RTF = 3.21 means a 10-second video takes 32.1 seconds to generate.
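
The definition and the three interpretations above can be sketched in a few lines of Python. The function names here are illustrative, not taken from any library, and the 32.1 s / 10 s figures are the worked example from the list.

```python
def real_time_factor(wall_clock_s: float, media_duration_s: float) -> float:
    """Return processing wall-clock time divided by output media duration."""
    if media_duration_s <= 0:
        raise ValueError("media duration must be positive")
    return wall_clock_s / media_duration_s


def describe(rtf: float) -> str:
    """Classify an RTF value relative to real-time playback."""
    if rtf < 1.0:
        return "faster than real time"
    if rtf > 1.0:
        return "slower than real time"
    return "real time"


# A 10-second clip that takes 32.1 s to generate.
rtf = real_time_factor(32.1, 10.0)
print(f"RTF = {rtf:.2f} ({describe(rtf)})")
```

Both arguments must be in the same time unit; the result is a dimensionless ratio.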

RTF is the standard performance metric in the speech synthesis and video generation literature. As a dimensionless ratio it stays comparable across model sizes and hardware, and it is directly meaningful to a product owner ("how long does the user wait per second of generated content?").

Wiki-attested datapoint

Source: sources/2026-05-19-aws-how-synthesia-optimizes-generative-ai-video-inference-on-amazon-ec2-g7e-instances.

On a g7e.2xlarge running the Wan 2.2 14B Hugging Face Diffusers VAE decoder against a 41-latent-frame test video:

Pipeline                         RTF
Synchronous frame generation     3.21
Asynchronous frame generation    2.95

Both are above 1.0, so the pipeline is slower than real time. The optimisation lowers RTF by ~8.1% ((3.21 − 2.95) / 3.21), which can be spent either as lower latency for the user (faster decode) or as higher throughput (more videos per GPU per hour).
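
As a quick check of that arithmetic, the sketch below recomputes the relative change from the two RTF values reported above. The variable names are illustrative. It also shows that the same change looks slightly larger when expressed as a throughput gain, since throughput is proportional to 1 / RTF.

```python
rtf_sync = 3.21   # synchronous frame generation (reported above)
rtf_async = 2.95  # asynchronous frame generation (reported above)

# Fraction of wall-clock time saved per second of generated video.
latency_reduction = (rtf_sync - rtf_async) / rtf_sync

# Relative gain in videos per GPU-hour, assuming throughput scales as 1 / RTF.
throughput_gain = rtf_sync / rtf_async - 1.0

print(f"Latency reduction: {latency_reduction:.1%}")
print(f"Throughput gain:   {throughput_gain:.1%}")
```

The two percentages differ because one is measured against the old RTF and the other against the new one.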

RTF vs raw latency

For a fixed video duration, RTF and per-video latency are equivalent metrics: they differ only by a constant factor (the video duration itself). Published benchmarks still tend to prefer RTF, for three reasons:

  • Comparable across video lengths. A 10-second video and a 60-second video have very different latencies but ideally similar RTFs (assuming the pipeline scales linearly).
  • Comparable across hardware. RTF lets you compare a benchmark on g7e.2xlarge with one on a different GPU, because both answer the same user-visible question ("how long do I wait per second of generated content?").
  • Connects directly to product economics. RTF × content duration gives GPU wall-clock time, and multiplying that by the GPU price per unit of time gives the serving cost per generated piece. AWS's $896/1k-hours theoretical saving is a direct consequence of the RTF reduction.
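
The cost relation in the last bullet can be made concrete with a small sketch. The function name and the price of 1.00 per GPU-hour are placeholders for illustration, not real instance pricing; the only unit handling is the conversion from GPU-seconds to GPU-hours, and the GPU is assumed to serve one video at a time.

```python
def cost_per_video(rtf: float, duration_s: float, gpu_hour_price: float) -> float:
    """Serving cost for one video: RTF x duration gives GPU-seconds, then price per hour."""
    gpu_seconds = rtf * duration_s
    return gpu_seconds / 3600.0 * gpu_hour_price


HYPOTHETICAL_PRICE = 1.00  # placeholder price per GPU-hour, NOT actual pricing

for duration_s in (10.0, 60.0):
    for label, rtf in (("sync", 3.21), ("async", 2.95)):
        cost = cost_per_video(rtf, duration_s, HYPOTHETICAL_PRICE)
        print(f"{duration_s:>5.0f} s video, {label:<5} RTF {rtf}: {cost:.4f} per video")
```

At a fixed RTF the cost grows linearly with content duration, which is why the ratio rather than the raw latency is the natural input to a cost model.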

Limitations

  • RTF doesn't decompose into per-stage costs. A single RTF number combines diffusion, VAE decoding, D2H, host I/O, and serialisation gaps. Profiling is needed to localise where the time is going.
  • RTF is not a tail-latency metric. Reporting only mean RTF hides P99 spikes. AWS's wiki-attested benchmark shows P99 latency tracking the mean closely (P99 22.01 s vs mean 21.99 s in the synchronous case), so mean RTF and P99 RTF are similar here, but that is not generally true across pipelines.
  • RTF < 1.0 doesn't imply real-time-streamable output. The pipeline may produce the full video faster than playback time but only deliver it at the end, not as a stream. Real-time streaming output requires both RTF < 1.0 and bounded per-chunk latency.

Related metrics

  • concepts/gpu-kernel-utilization: an orthogonal saturation metric. The RTF improvement of 3.21 → 2.95 in the wiki-attested case accompanies a kernel-utilisation lift from 82% to 99.9%, i.e. the GPU now spends ~18 percentage points more of its wall-clock time actually computing.
  • Per-video latency: in this case the equivalent formulation; mean 21.99 s → 20.17 s.
  • Throughput: videos / GPU-hour, the inverse-of-latency framing.
  • Cost per video: RTF × video duration × GPU-hour price.
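
The latency and throughput framings above can be derived from one another with simple arithmetic. The sketch below uses the reported mean latencies and RTFs; the helper names are illustrative, the throughput figure assumes one video at a time per GPU, and the implied video duration only holds if the latency and RTF figures describe the same runs.

```python
def videos_per_gpu_hour(latency_s: float) -> float:
    """Throughput when a GPU processes one video at a time."""
    return 3600.0 / latency_s


def implied_duration_s(latency_s: float, rtf: float) -> float:
    """Output duration implied by a latency / RTF pair (latency = RTF x duration)."""
    return latency_s / rtf


for label, latency_s, rtf in (("sync", 21.99, 3.21), ("async", 20.17, 2.95)):
    print(
        f"{label:<5} latency {latency_s:.2f} s, "
        f"throughput {videos_per_gpu_hour(latency_s):.1f} videos/GPU-hour, "
        f"implied duration {implied_duration_s(latency_s, rtf):.2f} s"
    )
```

Both rows imply roughly the same output duration, which is what the "equivalent formulation" claim requires.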

Generalisation

RTF as a concept generalises to any time-domain generative output:

  • Speech / TTS — wall-clock to generate / duration of generated audio. The historical home of the RTF metric.
  • Streaming audio generation — same.
  • Video generation — wiki-attested.
  • Music generation — same shape.
  • Live agent voice generation — RTF < 1.0 is required for live response; RTF > 1.0 forces a buffered turn-based interaction.
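
For the live and streaming cases, an aggregate RTF below 1.0 is necessary but not sufficient: each chunk also has to be ready before playback reaches it. The sketch below is one hypothetical way to check that, given the wall-clock time at which each chunk became available. The function name, the chunking scheme, and the sample timings are assumptions for illustration, not measurements from the source.

```python
from typing import Optional, Sequence


def first_stall(
    ready_times_s: Sequence[float],
    chunk_durations_s: Sequence[float],
    start_delay_s: float = 0.0,
) -> Optional[int]:
    """Return the index of the first chunk that is not ready when playback
    reaches it, or None if playback never stalls.

    ready_times_s: wall-clock time (from request start) each chunk is ready.
    start_delay_s: wall-clock time at which playback of chunk 0 begins.
    """
    playhead_s = start_delay_s  # wall-clock time the next chunk must start playing
    for index, (ready_s, duration_s) in enumerate(zip(ready_times_s, chunk_durations_s)):
        if ready_s > playhead_s:
            return index
        playhead_s += duration_s
    return None


durations = [1.0, 1.0, 1.0, 1.0]  # four 1-second chunks, 4 s of media in total

# Streamed delivery: a chunk becomes ready every 0.5 s (overall RTF 0.5).
print(first_stall([0.5, 1.0, 1.5, 2.0], durations, start_delay_s=0.5))

# Batch delivery: same overall RTF 0.5, but everything arrives at t = 2.0 s.
print(first_stall([2.0, 2.0, 2.0, 2.0], durations, start_delay_s=0.5))
```

Both hypothetical pipelines finish 4 s of media in 2 s, yet only the first can start playing after 0.5 s without stalling; the second stalls on the very first chunk unless playback is delayed until the whole output exists.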

The metric does not generalise meaningfully to non-time-domain outputs (a single image, a snippet of code) where "output duration" has no natural definition.

Seen in
