
SYSTEM

MediaFM (Netflix)

Definition

MediaFM (Netflix Media Foundational Model) is Netflix's first tri-modal (audio + video + timed-text) foundation model for media understanding — a BERT-style Transformer encoder pre-trained with a Masked Shot Modeling (MSM) self-supervised objective to produce contextual 2304-dimensional shot-level embeddings for Netflix catalog content (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).

MediaFM is a single frozen encoder designed to be the shared representation layer for many downstream Netflix services — ads relevancy, clip popularity prediction, clip tagging, recsys cold start for newly-launching titles, optimised promotional assets, and internal content-analysis tools. Netflix's choice to produce embeddings (rather than generative text outputs) is deliberate: "generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning." (footnote 1).

Architectural shape

Title (movie / episode)
  ↓ shot-boundary detection (Souček & Lokoč 2020)
[ shot_1, shot_2, ..., shot_N ]   (N ≤ 512)

For each shot_i:
  frames   → SeqCLIP (Netflix internal)     → v_i  (video embedding)
  audio    → wav2vec2 (Meta FAIR)           → a_i  (audio embedding)
  captions → OpenAI text-embedding-3-large  → t_i  (text embedding; zero if absent)
  fused_i  = unit_norm( concat(v_i, a_i, t_i) )     — 2304 dims

Title metadata (synopses + tags) → OpenAI text-embedding-3-large → g

Sequence input to Transformer (hidden dim h):
  [ linear_proj([CLS]_learnable),
    linear_proj(g),
    linear_proj(fused_1), linear_proj(fused_2), ..., linear_proj(fused_N) ]
  + positional embeddings

→ BERT-style Transformer encoder

→ linear output projection back to 2304 dims (per-shot)

→ per-shot contextual embeddings for downstream tasks

Inputs

  • Unit: a shot, obtained by shot-boundary detection on a movie or episode (collectively called a "title").
  • Three unimodal encoders (all pre-trained; used as feature extractors, not trained in MediaFM):
    • Video: SeqCLIP — Netflix's internal CLIP-style model fine-tuned on video retrieval datasets; operates over frames uniformly sampled from each shot.
    • Audio: Meta FAIR wav2vec2 over the shot's audio samples.
    • Timed text: OpenAI text-embedding-3-large over closed captions / audio descriptions / subtitles for the shot's time range.
  • Concatenation + unit-normalisation yields a single 2304-dim fused embedding per shot.
  • Zero-padding for missing timed text, which is "relatively likely to occur (e.g., in shots without dialogue)" (footnote 2). Audio + video are always present.
  • Sequence length cap: up to 512 shots per title per training example.
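The fusion step above can be sketched in a few lines. Note this is a sketch under assumptions: the per-modality widths are not disclosed, so `text_dim` is only a placeholder for the width of the text slice, and the three slices are assumed to sum to 2304 dims.

```python
import math

def fuse_shot(video_emb, audio_emb, text_emb=None, text_dim=768):
    """Concatenate per-shot unimodal embeddings and unit-normalise.

    Sketch only: per-modality widths are assumptions, not disclosed.
    A missing timed-text embedding (e.g. a shot with no dialogue)
    is replaced by a zero vector of the same width.
    """
    if text_emb is None:
        text_emb = [0.0] * text_dim          # zero-padding for missing text
    fused = list(video_emb) + list(audio_emb) + list(text_emb)
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused] if norm > 0 else fused
```

Because normalisation happens after concatenation, a shot with no timed text still yields a unit-length vector whose text slice is exactly zero.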

Sequence construction

  1. Input projection — each fused shot embedding is projected down to the model's hidden dimension via a linear layer.
  2. Special tokens prepended:
    • [CLS] — a learnable embedding at position 0, BERT-style.
    • [GLOBAL] — at position 1, constructed from title-level metadata ("synopses and tags") passed through text-embedding-3-large then projected to hidden dim. Every shot attends to [GLOBAL] → every shot gets title-level context even through a single attention layer.
  3. Positional embeddings added.
  4. Transformer stack — BERT-style encoder; depth + heads not disclosed.
  5. Output projection — final linear layer maps hidden states back up to 2304 dims for the MSM loss target.
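Steps 1–3 above amount to the following assembly (a sketch: `proj` stands in for the learned linear input projection and `pos_emb(i)` for the positional embedding at position i; both names are illustrative, and [CLS] is assumed to be a learnable vector already in hidden dim, so only the global and shot embeddings pass through the projection).

```python
def build_sequence(cls_emb, global_emb, fused_shots, proj, pos_emb):
    """Assemble the Transformer input: [CLS] at position 0, [GLOBAL]
    at position 1, then one projected token per shot, with the
    positional embedding for each position added element-wise."""
    tokens = [cls_emb, proj(global_emb)] + [proj(f) for f in fused_shots]
    return [[x + p for x, p in zip(tok, pos_emb(i))]
            for i, tok in enumerate(tokens)]
```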

Training objective — Masked Shot Modeling (MSM)

  • Masking rate: 20% of shots per sequence replaced with a learnable [MASK] embedding.
  • Target: predict the original fused 2304-dim shot embedding at masked positions.
  • Loss: cosine distance between predicted + ground-truth fused embedding, summed over masked positions.
  • Self-supervised: no labels needed; the supervision comes from the surrounding unmasked shots + the [GLOBAL] title context.

See concepts/masked-shot-modeling for the objective + BERT-MLM-analog framing; patterns/two-stage-pretraining-contrastive-then-masked for the single-stage masked-only contrast with VideoPrism's two-stage pipeline.
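A minimal sketch of the masking and loss. Two details here are assumptions not spelled out in the source: [CLS] and [GLOBAL] at positions 0 and 1 are never masked, and the 20% rate is applied independently per shot.

```python
import math
import random

def mask_shots(tokens, mask_emb, rate=0.2, rng=random):
    """Replace ~`rate` of shot tokens with the learnable [MASK] embedding."""
    masked, positions = list(tokens), []
    for i in range(2, len(tokens)):          # skip [CLS] and [GLOBAL]
        if rng.random() < rate:
            masked[i] = mask_emb
            positions.append(i)
    return masked, positions

def msm_loss(predicted, fused_targets, positions):
    """Sum of cosine distances at masked positions only."""
    def cos_dist(p, t):
        dot = sum(x * y for x, y in zip(p, t))
        return 1.0 - dot / (math.sqrt(sum(x * x for x in p))
                            * math.sqrt(sum(y * y for y in t)))
    return sum(cos_dist(predicted[i], fused_targets[i]) for i in positions)
```

Unmasked positions contribute nothing to the loss; supervision flows only through the masked slots, exactly as in BERT's MLM.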

Optimisation

  • Muon for hidden-layer parameters.
  • AdamW for the rest.
  • Netflix flags the switch to Muon as delivering "noticeable improvements" — no numerical ablation disclosed.

See concepts/muon-optimizer.
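The two-optimiser setup reduces to routing parameters into groups. The routing rule below is an assumption (Netflix does not disclose it): Muon is defined for 2-D weight matrices, so a common convention sends those to Muon and sends embeddings, norms, biases, and output heads to AdamW.

```python
def split_param_groups(named_shapes):
    """Route parameters to Muon vs AdamW by a simple shape/name rule.

    `named_shapes` is an iterable of (name, shape) pairs; the names
    in the example below are illustrative, not Netflix's.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_hidden_matrix = (len(shape) == 2
                            and "embed" not in name
                            and "head" not in name)
        (muon if is_hidden_matrix else adamw).append(name)
    return muon, adamw
```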

Evaluation — frozen-feature linear probes on five Netflix tasks

MediaFM's encoder is frozen after pre-training; each downstream task trains a task-specific linear layer on top of the contextualised embeddings (patterns/frozen-encoder-linear-probe).

Task | Signal | Metric | Role of MediaFM
Ad Relevancy | multi-label classification of clips to ad topics | Average Precision | retrieval stage — embeddings feed candidate-set identification upstream of the ad serving system
Clip Popularity Ranking | predicted CTR rank of clips within a title | 10-fold Kendall's τ | direct ranker
Clip Tone | 100 internal tone categories (creepy / scary / humorous ...) | micro Average Precision (tone-averaged) | classifier
Clip Genre | 11 Netflix core genres (Action / Anime / Comedy / Documentary / Drama / Fantasy / Horror / Kids / Romance / Sci-fi / Thriller) | macro Average Precision (genre-averaged) | classifier
Clip Retrieval | human-annotated "clip-worthy" binary (1:3 pos:neg, 6–10 positives / title) | Average Precision | ranker for clip selection

MediaFM beats all reported baselines on all five tasks.
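Each downstream head in the table above is just an affine map over the frozen embedding; the weights and bias are the only parameters trained per task (sketch under that framing, not Netflix's code).

```python
def linear_probe(embedding, weights, bias):
    """One task-specific linear layer over a frozen 2304-dim MediaFM
    embedding: logits[k] = weights[k] . embedding + bias[k].
    The encoder itself receives no gradient updates."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]
```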

"Embedding in context" inference pattern

A critical deployment-time finding: when embedding a short clip (seconds to a minute), run inference on the full episode / movie containing that clip and extract the contextualised vectors for the clip's shot span — do not run inference on just the clip's shots in isolation. The contextualisation baked into each shot's embedding cannot be reproduced by post-hoc aggregation.

See concepts/embedding-in-context.
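The pattern reduces to "slice, don't re-encode" (sketch: `encode_title` stands in for one frozen MediaFM forward pass returning one contextual vector per shot; the toy encoder in the example merely illustrates that output depends on how much context the model saw).

```python
def embed_clip_in_context(title_shots, clip_span, encode_title):
    """Embed a clip by running inference over the FULL title, then
    slicing out the clip's shot span. The anti-pattern would be
    encode_title(title_shots[start:end]), which discards the
    surrounding narrative context baked into each shot's embedding."""
    start, end = clip_span
    per_shot = encode_title(title_shots)   # one contextual vector per shot
    return per_shot[start:end]
```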

Ablation — contextualisation > multimodality

Netflix compares MediaFM to an uncontextualised tri-modal baseline (same tri-modal per-shot input; no transformer on top). Per-task findings:

  • Clip tone — multimodality helps "somewhat"; most of the lift comes from contextualisation.
  • Clip popularity ranking — multimodal without context is worse than a single-modality baseline; adding the transformer on top of tri-modal input lifts the model significantly above both. "Oddly, multiple uncontextualized modalities hurts the clip popularity ranking model, but adding contextualization significantly improves performance."
  • Clip retrieval — roughly +15% per step: adding modalities and then adding contextualisation each contribute ~15%.

Motivating observation: "Improvements seem to be larger for tasks that require more detailed narrative understanding." Ad-break placement is the prototypical narrative task.

Production consumers (named at framing level)

  • Ads relevancy — retrieval stage for candidate clip identification.
  • Clip popularity / tone / genre / retrieval — linear probes served as part of Netflix's catalog-tagging + clip-selection pipeline.
  • Recsys cold start for newly-launching titles — MediaFM's content-derived embedding means a brand-new title has a usable representation from the moment audio + video + text are available, with no user-interaction data required (concepts/cold-start).
  • Optimised promotional assets — art + trailers.
  • Internal content-analysis tools.

Netflix flags that MediaFM outputs are "utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion" — a decision-support layer, not an autonomous-action layer.

Forward direction

Netflix explicitly flags pretrained multimodal LLMs (Qwen3-Omni is named) as a potential stronger starting point — a model where the modality fusion has already been learned at massive scale, onto which Netflix can then layer MSM / contextualisation. MediaFM-v2 may drop the tri-modal concat step in favour of a pre-trained omnimodal backbone.

Relationship to adjacent wiki systems

  • CLIP — CLIP is the single-image multimodal ancestor; MediaFM is the long-form multimodal descendant. MediaFM uses a CLIP descendant (SeqCLIP) as a frozen sub-component for the video modality.
  • VideoPrism (Google) — comparable video foundation model from Google. Key architectural differences:
    • VideoPrism: two-stage contrastive-then-masked on raw pixels (patterns/two-stage-pretraining-contrastive-then-masked). MediaFM: masked-only, on top of frozen pre-trained unimodal encoders (SeqCLIP / wav2vec2 / OpenAI text).
    • VideoPrism: spatiotemporal patches as tokens. MediaFM: shots as tokens — much coarser, much longer sequences, designed for title-scale narrative, not second-scale action.
    • VideoPrism: video-only input (paired text is used only as a training-time signal). MediaFM: tri-modal (video + audio + text) at inference time.
  • patterns/multimodal-content-understanding (canonicalised from Dropbox Dash) — Dash operates at scene granularity for enterprise search; MediaFM operates at shot granularity for a single long-form title. Shots are a finer unit within a scene.

Caveats

  • No public numbers on model scale — hidden dim, layers, heads, parameter count, pre-training corpus size (beyond "tens of millions of individual shots").
  • No absolute metrics — only MediaFM-vs-baseline deltas shown as charts; no raw APs / Kendall's τ / numbers.
  • Muon win is asserted not quantified.
  • Title-level metadata source is shallow — only synopses + tags feed [GLOBAL].
  • Zero-padding missing text is a hack — modality-specific masking / gating would be more principled.
  • Not a systems paper — shot-boundary-detection throughput, ingestion pipeline, inference scheduling, embedding store all undescribed.
  • Evaluation is Netflix-internal — no public-benchmark numbers for head-to-head calibration against VideoPrism / VideoMAE / InternVideo.
  • "Various stages of deployment" — not all the reported wins are live in production.
