
PATTERN

Deduplicate decode across encoder lanes

Intent

When a multi-lane video transcoding pipeline must produce many encoded outputs from a single source (typically a DASH ladder), run one decoder and many parallel encoders inside the same process, rather than spawning a separate process per lane.

The pattern eliminates duplicate decoding, duplicate process startup, and per-lane cold-start overhead — while enabling an additional level of parallelism across encoders.

Motivation

Consider N DASH lanes at different resolutions/codecs:

  • N processes serial. Correct but wall-clock N× slower.
  • N processes parallel. Wall-clock OK but decodes the source N times and pays N × process-startup cost.
  • 1 process, N outputs. Decodes once; pays one process-startup cost; all N encoders share the decoded frame stream.
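The single-process shape corresponds to one FFmpeg command line with one input and several mapped outputs. A hedged sketch of a three-lane ladder (the resolutions, bitrates, and filenames here are illustrative, not Meta's actual lane settings):

```
# One decode of source.mp4 feeds three encoder lanes inside one process.
ffmpeg -i source.mp4 \
  -map 0:v -map 0:a -c:v libx264 -vf scale=1920:1080 -b:v 5M -c:a aac lane_1080p.mp4 \
  -map 0:v -map 0:a -c:v libx264 -vf scale=1280:720  -b:v 3M -c:a aac lane_720p.mp4 \
  -map 0:v -map 0:a -c:v libx264 -vf scale=640:360   -b:v 1M -c:a aac lane_360p.mp4
```

Each `-map 0:v` output reuses the same decoded frame stream; only the per-lane scaling and encoding differ.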

At Meta's scale (more than 1 billion video uploads per day, each requiring multiple FFmpeg executions), the decode duplication alone costs enough fleet CPU to make the pattern non-optional.

"To work around this, multiple outputs could be generated within a single FFmpeg command line, decoding the frames of a video once and sending them to each output's encoder instance. This eliminates a lot of overhead by deduplicating the video decoding and process startup time overhead incurred by each command line." (Source: sources/2026-03-09-meta-ffmpeg-at-meta-media-processing-at-scale)

The additional win: per-frame encoder parallelism

Having all N encoders in the same process also unlocks per-frame encoder parallelism: for each incoming decoded frame, fan the frame out to all N encoders in parallel, rather than letting them run serially per-frame.
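The fan-out can be sketched in plain Python. The frame source and encoders below are stand-ins, not FFmpeg APIs: for each decoded frame, one encode task per lane is submitted in parallel, and all lanes finish the frame before the next one is pulled.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_frames(source):
    """Stand-in decoder: yields decoded frames one at a time."""
    for frame in source:
        yield frame

def make_encoder(lane_name):
    """Stand-in per-lane encoder: records what it encoded."""
    encoded = []
    def encode(frame):
        encoded.append((lane_name, frame))
    return encode, encoded

def transcode(source, lanes):
    """Decode once; fan each frame out to all lane encoders in parallel."""
    encoders = {lane: make_encoder(lane) for lane in lanes}
    with ThreadPoolExecutor(max_workers=len(lanes)) as pool:
        for frame in decode_frames(source):
            # Per-frame parallelism: all N encoders work on this frame at once.
            futures = [pool.submit(enc, frame) for enc, _ in encoders.values()]
            for f in futures:
                f.result()  # keep the lanes frame-synchronised
    return {lane: out for lane, (_, out) in encoders.items()}

outputs = transcode(range(3), ["1080p", "720p", "360p"])
# Every lane sees every frame, but the source was "decoded" only once.
```

The per-frame barrier (`f.result()`) is what keeps memory bounded: only one decoded frame is in flight at a time, however many lanes consume it.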

FFmpeg historically ran encoders serially per frame even when intra-encoder threading was active. Meta's internal FFmpeg fork added per-frame parallelism across encoders as its headline optimisation, and it stayed Meta-private for years.

Upstream trajectory

Upstream FFmpeg began the refactoring needed to support this shape in 6.0; 8.0 finished it:

"Thanks to contributions from FFmpeg developers, including those at FFlabs and VideoLAN, more efficient threading was implemented starting with FFmpeg 6.0, with the finishing touches landing in 8.0. This was directly influenced by the design of our internal fork and was one of the main features we had relied on it to provide. This development led to the most complex refactoring of FFmpeg in decades and has enabled more efficient encodings for all FFmpeg users."

This is a high-impact instance of patterns/upstream-the-fix: a load-bearing internal optimisation was re-landed upstream over multiple releases so Meta could deprecate its fork and every other FFmpeg user gets the same efficiency win.

Consequences

  • + Source is decoded once per upload regardless of lane count.
  • + Process-startup cost pays once per upload.
  • + Enables per-frame encoder parallelism (after the 6.0-8.0 upstream work).
  • + Enables in-loop decoding for per-lane live quality metrics — another shape only viable in a single-process pipeline.
  • − Single process means a single blast radius: a crash or hang kills every lane. Error isolation at the outer orchestration layer is the mitigation.
  • − Harder to mix-and-match transcoding across heterogeneous hardware (some lanes on one accelerator, some on another) while staying in a single process.
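The in-loop quality-metric consequence above can be sketched with toy stand-ins (the "codec" and metric here are illustrative; a real pipeline would decode each lane's actual bitstream and compute something like SSIM against the source frame, which is only cheap because that frame is still in memory):

```python
def toy_encode(frame, quant):
    """Toy lossy 'codec': quantise a list of pixel values."""
    return [round(p / quant) * quant for p in frame]

def mse(a, b):
    """Mean squared error between source frame and round-tripped frame."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def transcode_with_metrics(frames, lane_quants):
    """Single-process loop: encode each frame per lane, decode it back
    in-loop, and score it against the still-in-memory source frame."""
    scores = {lane: [] for lane in lane_quants}
    for frame in frames:                              # decoded once
        for lane, quant in lane_quants.items():
            encoded = toy_encode(frame, quant)        # per-lane encode
            scores[lane].append(mse(frame, encoded))  # live quality metric
    return scores

scores = transcode_with_metrics([[10, 20, 30]], {"hi": 1, "lo": 8})
```

In a multi-process design, the source frame would be gone by the time each lane's output exists, so the same metric would require a second decode of the source.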

When to use

  • Any multi-output video transcoding pipeline, from hyperscale (Meta) down to smaller platforms that produce a DASH/HLS ladder.
  • Especially when source decode is a non-trivial CPU cost — which is almost always true.

When not to use

  • Single-output transcodes.
  • Cases where lane isolation matters more than efficiency (e.g. different SLAs per lane in the same multi-output command, and one lane's failure shouldn't take down another lane).

Seen in
