
SYSTEM

MediaFM (Netflix)

Definition

MediaFM (Netflix Media Foundational Model) is Netflix's first tri-modal (audio + video + timed-text) foundation model for media understanding — a BERT-style Transformer encoder pre-trained with a Masked Shot Modeling (MSM) self-supervised objective to produce contextual 2304-dimensional shot-level embeddings for Netflix catalog content (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).

MediaFM is a single frozen encoder designed to be the shared representation layer for many downstream Netflix services — ads relevancy, clip popularity prediction, clip tagging, recsys cold start for newly-launching titles, optimised promotional assets, and internal content-analysis tools. Netflix's choice to produce embeddings (rather than generative text outputs) is deliberate: "generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning." (footnote 1).

Architectural shape

Title (movie / episode)
  ↓ shot-boundary detection (Souček & Lokoč 2020)
[ shot_1, shot_2, ..., shot_N ]   (N ≤ 512)

For each shot_i:
  frames   → SeqCLIP (Netflix internal)     → v_i  (video embedding)
  audio    → wav2vec2 (Meta FAIR)           → a_i  (audio embedding)
  captions → OpenAI text-embedding-3-large  → t_i  (text embedding; zero if absent)
  fused_i  = unit_norm( concat(v_i, a_i, t_i) )     — 2304 dims

Title metadata (synopses + tags) → OpenAI text-embedding-3-large → g

Sequence input to Transformer (hidden dim h):
  [ linear_proj([CLS]_learnable),
    linear_proj(g),
    linear_proj(fused_1), linear_proj(fused_2), ..., linear_proj(fused_N) ]
  + positional embeddings

→ BERT-style Transformer encoder

→ linear output projection back to 2304 dims (per-shot)

→ per-shot contextual embeddings for downstream tasks

Inputs

  • Unit: a shot, obtained by shot-boundary detection on a movie or episode (collectively called a "title").
  • Three unimodal encoders (all pre-trained; used as feature extractors, not trained in MediaFM):
    • Video: SeqCLIP — Netflix's internal CLIP-style model fine-tuned on video retrieval datasets; operates over frames uniformly sampled from each shot.
    • Audio: Meta FAIR wav2vec2 over the shot's audio samples.
    • Timed text: OpenAI text-embedding-3-large over closed captions / audio descriptions / subtitles for the shot's time range.
  • Concatenation + unit-normalisation yields a single 2304-dim fused embedding per shot.
  • Zero-padding for missing timed text, which is "relatively likely to occur (e.g., in shots without dialogue)" (footnote 2). Audio + video are always present.
  • Sequence length cap: up to 512 shots per title per training example.
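The fusion step above can be sketched in a few lines. Note this is a sketch under assumptions: the per-modality widths are not disclosed, so `text_dim` is only a placeholder for the width of the text slice, and the three slices are assumed to sum to 2304 dims.

```python
import math

def fuse_shot(video_emb, audio_emb, text_emb=None, text_dim=768):
    """Concatenate per-shot unimodal embeddings and unit-normalise.

    Sketch only: per-modality widths are assumptions, not disclosed.
    A missing timed-text embedding (e.g. a shot with no dialogue)
    is replaced by a zero vector of the same width.
    """
    if text_emb is None:
        text_emb = [0.0] * text_dim          # zero-padding for missing text
    fused = list(video_emb) + list(audio_emb) + list(text_emb)
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused] if norm > 0 else fused
```

Because normalisation happens after concatenation, a shot with no timed text still yields a unit-length vector whose text slice is exactly zero.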

Sequence construction

  1. Input projection — each fused shot embedding is projected down to the model's hidden dimension via a linear layer.
  2. Special tokens prepended:
    • [CLS] — a learnable embedding at position 0, BERT-style.
    • [GLOBAL] — at position 1, constructed from title-level metadata ("synopses and tags") passed through text-embedding-3-large then projected to hidden dim. Every shot attends to [GLOBAL] → every shot gets title-level context even through a single attention layer.
  3. Positional embeddings added.
  4. Transformer stack — BERT-style encoder; depth + heads not disclosed.
  5. Output projection — final linear layer maps hidden states back up to 2304 dims for the MSM loss target.
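Steps 1–3 above amount to the following assembly (a sketch: `proj` stands in for the learned linear input projection and `pos_emb(i)` for the positional embedding at position i; both names are illustrative, and [CLS] is assumed to be a learnable vector already in hidden dim, so only the global and shot embeddings pass through the projection).

```python
def build_sequence(cls_emb, global_emb, fused_shots, proj, pos_emb):
    """Assemble the Transformer input: [CLS] at position 0, [GLOBAL]
    at position 1, then one projected token per shot, with the
    positional embedding for each position added element-wise."""
    tokens = [cls_emb, proj(global_emb)] + [proj(f) for f in fused_shots]
    return [[x + p for x, p in zip(tok, pos_emb(i))]
            for i, tok in enumerate(tokens)]
```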

Training objective — Masked Shot Modeling (MSM)

  • Masking rate: 20% of shots per sequence replaced with a learnable [MASK] embedding.
  • Target: predict the original fused 2304-dim shot embedding at masked positions.
  • Loss: cosine distance between predicted + ground-truth fused embedding, summed over masked positions.
  • Self-supervised: no labels needed; the supervision comes from the surrounding unmasked shots + the [GLOBAL] title context.

See concepts/masked-shot-modeling for the objective + BERT-MLM-analog framing; patterns/two-stage-pretraining-contrastive-then-masked for the single-stage masked-only contrast with VideoPrism's two-stage pipeline.
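A minimal sketch of the masking and loss. Two details here are assumptions not spelled out in the source: [CLS] and [GLOBAL] at positions 0 and 1 are never masked, and the 20% rate is applied independently per shot.

```python
import math
import random

def mask_shots(tokens, mask_emb, rate=0.2, rng=random):
    """Replace ~`rate` of shot tokens with the learnable [MASK] embedding."""
    masked, positions = list(tokens), []
    for i in range(2, len(tokens)):          # skip [CLS] and [GLOBAL]
        if rng.random() < rate:
            masked[i] = mask_emb
            positions.append(i)
    return masked, positions

def msm_loss(predicted, fused_targets, positions):
    """Sum of cosine distances at masked positions only."""
    def cos_dist(p, t):
        dot = sum(x * y for x, y in zip(p, t))
        return 1.0 - dot / (math.sqrt(sum(x * x for x in p))
                            * math.sqrt(sum(y * y for y in t)))
    return sum(cos_dist(predicted[i], fused_targets[i]) for i in positions)
```

Unmasked positions contribute nothing to the loss; supervision flows only through the masked slots, exactly as in BERT's MLM.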

Optimisation

  • Muon for hidden-layer parameters.
  • AdamW for the rest.
  • Netflix flags the switch to Muon as delivering "noticeable improvements" — no numerical ablation disclosed.

See concepts/muon-optimizer.
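The two-optimiser setup reduces to routing parameters into groups. The routing rule below is an assumption (Netflix does not disclose it): Muon is defined for 2-D weight matrices, so a common convention sends those to Muon and sends embeddings, norms, biases, and output heads to AdamW.

```python
def split_param_groups(named_shapes):
    """Route parameters to Muon vs AdamW by a simple shape/name rule.

    `named_shapes` is an iterable of (name, shape) pairs; the names
    in the example below are illustrative, not Netflix's.
    """
    muon, adamw = [], []
    for name, shape in named_shapes:
        is_hidden_matrix = (len(shape) == 2
                            and "embed" not in name
                            and "head" not in name)
        (muon if is_hidden_matrix else adamw).append(name)
    return muon, adamw
```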

Evaluation — frozen-feature linear probes on five Netflix tasks

MediaFM's encoder is frozen after pre-training; each downstream task trains a task-specific linear layer on top of the contextualised embeddings (patterns/frozen-encoder-linear-probe).

Task | Signal | Metric | Role of MediaFM
Ad Relevancy | multi-label classification of clips to ad topics | Average Precision | retrieval stage — embeddings feed candidate-set identification upstream of the ad serving system
Clip Popularity Ranking | predicted CTR rank of clips within a title | 10-fold Kendall's τ | direct ranker
Clip Tone | 100 internal tone categories (creepy / scary / humorous ...) | micro Average Precision (tone-averaged) | classifier
Clip Genre | 11 Netflix core genres (Action / Anime / Comedy / Documentary / Drama / Fantasy / Horror / Kids / Romance / Sci-fi / Thriller) | macro Average Precision (genre-averaged) | classifier
Clip Retrieval | human-annotated "clip-worthy" binary (1:3 pos:neg, 6–10 positives / title) | Average Precision | ranker for clip selection

MediaFM beats all reported baselines on all five tasks.
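Each downstream head in the table above is just an affine map over the frozen embedding; the weights and bias are the only parameters trained per task (sketch under that framing, not Netflix's code).

```python
def linear_probe(embedding, weights, bias):
    """One task-specific linear layer over a frozen 2304-dim MediaFM
    embedding: logits[k] = weights[k] . embedding + bias[k].
    The encoder itself receives no gradient updates."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]
```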

"Embedding in context" inference pattern

A critical deployment-time finding: when embedding a short clip (seconds to a minute), run inference on the full episode / movie containing that clip and extract the contextualised vectors for the clip's shot span — do not run inference on just the clip's shots in isolation. The contextualisation baked into each shot's embedding cannot be reproduced by post-hoc aggregation.

See concepts/embedding-in-context.
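The pattern reduces to "slice, don't re-encode" (sketch: `encode_title` stands in for one frozen MediaFM forward pass returning one contextual vector per shot; the toy encoder in the example merely illustrates that output depends on how much context the model saw).

```python
def embed_clip_in_context(title_shots, clip_span, encode_title):
    """Embed a clip by running inference over the FULL title, then
    slicing out the clip's shot span. The anti-pattern would be
    encode_title(title_shots[start:end]), which discards the
    surrounding narrative context baked into each shot's embedding."""
    start, end = clip_span
    per_shot = encode_title(title_shots)   # one contextual vector per shot
    return per_shot[start:end]
```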

Ablation — contextualisation > multimodality

Netflix compares MediaFM to an uncontextualised tri-modal baseline (same tri-modal per-shot input; no transformer on top). Per-task findings:

  • Clip tone — multimodality helps "somewhat"; most of the lift comes from contextualisation.
  • Clip popularity ranking — multimodal without context is worse than a single-modality baseline; adding the transformer on top of tri-modal input lifts the model significantly above both. "Oddly, multiple uncontextualized modalities hurts the clip popularity ranking model, but adding contextualization significantly improves performance."
  • Clip retrieval — roughly +15% per step: adding modalities and then adding contextualisation each contribute ~15%.

Motivating observation: "Improvements seem to be larger for tasks that require more detailed narrative understanding." Ad-break placement is the prototypical narrative task.

Production consumers (named at framing level)

  • Ads relevancy — retrieval stage for candidate clip identification.
  • Clip popularity / tone / genre / retrieval — linear probes served as part of Netflix's catalog-tagging + clip-selection pipeline.
  • Recsys cold start for newly-launching titles — MediaFM's content-derived embedding means a brand-new title has a usable representation from the moment audio + video + text are available, with no user-interaction data required (concepts/cold-start).
  • Optimised promotional assets — art + trailers.
  • Internal content-analysis tools.

Netflix flags that MediaFM outputs are "utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion" — a decision-support layer, not an autonomous-action layer.

Forward direction

Netflix explicitly flags pretrained multimodal LLMs (Qwen3-Omni is named) as a potential stronger starting point — a model where the modality fusion has already been learned at massive scale, onto which Netflix can then layer MSM / contextualisation. MediaFM-v2 may drop the tri-modal concat step in favour of a pre-trained omnimodal backbone.

Relationship to adjacent wiki systems

  • CLIP — CLIP is the single-image multimodal ancestor; MediaFM is the long-form multimodal descendant. MediaFM uses a CLIP descendant (SeqCLIP) as a frozen sub-component for the video modality.
  • VideoPrism (Google) — comparable video foundation model from Google. Key architectural differences:
    • VideoPrism: two-stage contrastive-then-masked on raw pixels (patterns/two-stage-pretraining-contrastive-then-masked). MediaFM: masked-only, on top of frozen pre-trained unimodal encoders (SeqCLIP / wav2vec2 / OpenAI text).
    • VideoPrism: spatiotemporal patches as tokens. MediaFM: shots as tokens — much coarser, much longer sequences, designed for title-scale narrative, not second-scale action.
    • VideoPrism: video-only input (paired text is used only as a training-time signal). MediaFM: tri-modal (video + audio + text) at inference time.
  • patterns/multimodal-content-understanding (canonicalised from Dropbox Dash) — Dash operates at scene granularity for enterprise search; MediaFM operates at shot granularity for a single long-form title. Shots are a finer unit within a scene.

Caveats

  • No public numbers on model scale — hidden dim, layers, heads, parameter count, pre-training corpus size (beyond "tens of millions of individual shots").
  • No absolute metrics — only MediaFM-vs-baseline deltas shown as charts; no raw APs / Kendall's τ / numbers.
  • Muon win is asserted not quantified.
  • Title-level metadata source is shallow — only synopses + tags feed [GLOBAL].
  • Zero-padding missing text is a hack — modality-specific masking / gating would be more principled.
  • Not a systems paper — shot-boundary-detection throughput, ingestion pipeline, inference scheduling, embedding store all undescribed.
  • Evaluation is Netflix-internal — no public-benchmark numbers for head-to-head calibration against VideoPrism / VideoMAE / InternVideo.
  • "Various stages of deployment" — not all the reported wins are live in production.
