MediaFM (Netflix)¶
Definition¶
MediaFM (Netflix Media Foundational Model) is Netflix's first tri-modal (audio + video + timed-text) foundation model for media understanding — a BERT-style Transformer encoder pre-trained with a Masked Shot Modeling (MSM) self-supervised objective to produce contextual 2304-dimensional shot-level embeddings for Netflix catalog content (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).
MediaFM is a single frozen encoder designed to be the shared representation layer for many downstream Netflix services — ads relevancy, clip popularity prediction, clip tagging, recsys cold start for newly-launching titles, optimised promotional assets, and internal content-analysis tools. Netflix's choice to produce embeddings (rather than generative text outputs) is deliberate: "generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning." (footnote 1).
Architectural shape¶
Title (movie / episode)
↓ shot-boundary detection (Souček & Lokoč 2020)
[ shot_1, shot_2, ..., shot_N ] (N ≤ 512)
For each shot_i:
frames → SeqCLIP (Netflix internal) → v_i (video embedding)
audio → wav2vec2 (Meta FAIR) → a_i (audio embedding)
captions → OpenAI text-embedding-3-large → t_i (text embedding; zero if absent)
fused_i = unit_norm( concat(v_i, a_i, t_i) ) — 2304 dims
Title metadata (synopses + tags) → OpenAI text-embedding-3-large → g
Sequence input to Transformer (hidden dim h):
[ linear_proj([CLS]_learnable),
linear_proj(g),
linear_proj(fused_1), linear_proj(fused_2), ..., linear_proj(fused_N) ]
+ positional embeddings
→ BERT-style Transformer encoder
→ linear output projection back to 2304 dims (per-shot)
→ per-shot contextual embeddings for downstream tasks
Inputs¶
- Unit: a shot, obtained by shot-boundary detection on a movie or episode (collectively called a "title").
- Three unimodal encoders (all pre-trained; used as feature extractors, not trained in MediaFM):
- Video: SeqCLIP — Netflix's internal CLIP-style model fine-tuned on video retrieval datasets; operates over frames uniformly sampled from each shot.
- Audio: Meta FAIR wav2vec2 over the shot's audio samples.
- Timed text: OpenAI text-embedding-3-large over closed captions / audio descriptions / subtitles for the shot's time range.
- Concatenation + unit-normalisation yields a single 2304-dim fused embedding per shot.
- Zero-padding for missing timed text — "relatively likely to occur (e.g., in shots without dialogue)" (footnote 2). Audio + video are always present.
- Sequence length cap: up to 512 shots per title per training example.
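The per-shot fusion described above can be sketched as follows. This is a minimal sketch: the individual encoder output dimensions are assumptions (the source only states the fused total of 2304 dims).

```python
import numpy as np

# Illustrative 768/768/768 split of the 2304-dim fused vector; the source
# does not disclose the per-encoder dims, only the fused total.
VIDEO_DIM, AUDIO_DIM, TEXT_DIM = 768, 768, 768

def fuse_shot(v, a, t=None):
    """Concat the frozen unimodal embeddings for one shot, then unit-normalise.
    Timed text is zero-padded when the shot has no captions/subtitles."""
    if t is None:
        t = np.zeros(TEXT_DIM)                # zero-pad missing timed text
    fused = np.concatenate([v, a, t])         # one fused vector per shot
    return fused / np.linalg.norm(fused)      # unit norm

# A shot with audio + video but no dialogue:
fused = fuse_shot(np.random.randn(VIDEO_DIM), np.random.randn(AUDIO_DIM))
```

Note that with zero-padding, the text third of the vector stays exactly zero after normalisation, so downstream layers see "no text" explicitly rather than a noisy placeholder.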
Sequence construction¶
- Input projection — each fused shot embedding is projected down to the model's hidden dimension via a linear layer.
- Special tokens prepended:
  - [CLS] — a learnable embedding at position 0, BERT-style.
  - [GLOBAL] — at position 1, constructed from title-level metadata ("synopses and tags") passed through text-embedding-3-large, then projected to hidden dim. Every shot attends to [GLOBAL] → every shot gets title-level context even through a single attention layer.
- Positional embeddings added.
- Transformer stack — BERT-style encoder; depth + heads not disclosed.
- Output projection — final linear layer maps hidden states back up to 2304 dims for the MSM loss target.
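The sequence assembly can be sketched in numpy. The hidden dimension is a guess (not disclosed), and for brevity a single shared input projection is used for both the metadata embedding and the fused shots; in the real model these are likely separate learned layers.

```python
import numpy as np

rng = np.random.default_rng(0)
FUSED_DIM, HIDDEN_DIM, MAX_SHOTS = 2304, 512, 512   # HIDDEN_DIM is a guess

W_in    = rng.standard_normal((FUSED_DIM, HIDDEN_DIM)) * 0.02  # input projection
cls_tok = rng.standard_normal(HIDDEN_DIM)                      # learnable [CLS]
pos_emb = rng.standard_normal((MAX_SHOTS + 2, HIDDEN_DIM))     # + [CLS], [GLOBAL] slots

def build_sequence(fused_shots, title_meta):
    """[CLS], then [GLOBAL], then projected shots, plus positional embeddings."""
    g     = title_meta @ W_in            # title-metadata embedding -> hidden dim
    shots = fused_shots @ W_in           # (N, HIDDEN_DIM)
    seq   = np.vstack([cls_tok, g, shots])
    return seq + pos_emb[: len(seq)]     # add positions

# A 10-shot title: sequence length is 10 shots + [CLS] + [GLOBAL] = 12.
seq = build_sequence(rng.standard_normal((10, FUSED_DIM)),
                     rng.standard_normal(FUSED_DIM))
```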
Training objective — Masked Shot Modeling (MSM)¶
- Masking rate: 20% of shots per sequence replaced with a learnable [MASK] embedding.
- Target: predict the original fused 2304-dim shot embedding at masked positions.
- Loss: cosine distance between predicted + ground-truth fused embedding, summed over masked positions.
- Self-supervised: no labels needed; the supervision comes from the surrounding unmasked shots + the [GLOBAL] title context.
See concepts/masked-shot-modeling for the objective + BERT-MLM-analog framing; patterns/two-stage-pretraining-contrastive-then-masked for the single-stage masked-only contrast with VideoPrism's two-stage pipeline.
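The masking step and loss can be sketched like this. The [MASK] token is a learnable vector in the real model (random here), and the encoder itself is elided: its predictions at masked positions are what the loss scores.

```python
import numpy as np

rng = np.random.default_rng(1)
FUSED_DIM, MASK_RATE = 2304, 0.20
mask_tok = rng.standard_normal(FUSED_DIM)   # stands in for the learnable [MASK]

def msm_loss(predicted, target):
    """Cosine distance between predicted and ground-truth fused embeddings,
    summed over masked positions."""
    p = predicted / np.linalg.norm(predicted, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.sum(1.0 - np.sum(p * t, axis=-1)))

shots = rng.standard_normal((50, FUSED_DIM))         # fused shot embeddings
n_mask = int(len(shots) * MASK_RATE)                 # 20% of shots
masked_idx = rng.choice(len(shots), size=n_mask, replace=False)
inputs = shots.copy()
inputs[masked_idx] = mask_tok                        # replace with [MASK]
# the encoder's outputs at masked_idx are scored against shots[masked_idx]
```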
Optimisation¶
- Muon for hidden-layer parameters.
- AdamW for the rest.
- Netflix flags the switch to Muon as delivering "noticeable improvements" — no numerical ablation disclosed.
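Netflix does not spell out the exact parameter split. A common Muon convention, assumed here, is Muon for 2-D hidden-layer weight matrices and AdamW for everything else (embeddings, norms, the output head). A sketch of the grouping logic only, with hypothetical parameter names:

```python
# Hypothetical parameter table: name -> shape. The actual split Netflix
# uses is not disclosed; this follows the usual "2-D hidden weights ->
# Muon, everything else -> AdamW" convention.
params = {
    "embed.cls": (512,),               # 1-D: AdamW
    "layer0.attn.w_qkv": (512, 1536),  # 2-D hidden weight: Muon
    "layer0.mlp.w1": (512, 2048),      # 2-D hidden weight: Muon
    "out_proj.weight": (512, 2304),    # output head: often kept on AdamW
}

def assign_optimizer(name, shape):
    """Route hidden-layer matrices to Muon, everything else to AdamW."""
    if len(shape) == 2 and name.startswith("layer"):
        return "muon"
    return "adamw"

groups = {n: assign_optimizer(n, s) for n, s in params.items()}
```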
Evaluation — frozen-feature linear probes on five Netflix tasks¶
MediaFM's encoder is frozen after pre-training; each downstream task trains a task-specific linear layer on top of the contextualised embeddings (patterns/frozen-encoder-linear-probe).
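The frozen-probe setup can be sketched as follows, with random vectors standing in for MediaFM embeddings and a ridge-regularised least-squares head for brevity (the per-task probe objective is not specified in the source):

```python
import numpy as np

rng = np.random.default_rng(2)

# Frozen contextual embeddings act as fixed features; only the linear
# head is fit. Labels here are an arbitrary binary stand-in (e.g. the
# "clip-worthy" signal from the retrieval task).
X = rng.standard_normal((200, 2304))       # frozen MediaFM shot embeddings
y = (rng.random(200) > 0.5).astype(float)  # task labels

lam = 1.0  # ridge regulariser
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
scores = X @ W                             # task scores from frozen features
```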
| Task | Signal | Metric | Role of MediaFM |
|---|---|---|---|
| Ad Relevancy | multi-label classification of clips to ad topics | Average Precision | retrieval stage — embeddings feed candidate-set identification upstream of the ad serving system |
| Clip Popularity Ranking | predicted CTR rank of clips within a title | 10-fold Kendall's τ | direct ranker |
| Clip Tone | 100 internal tone categories (creepy / scary / humorous ...) | micro Average Precision (tone-averaged) | classifier |
| Clip Genre | 11 Netflix core genres (Action / Anime / Comedy / Documentary / Drama / Fantasy / Horror / Kids / Romance / Sci-fi / Thriller) | macro Average Precision (genre-averaged) | classifier |
| Clip Retrieval | human-annotated "clip-worthy" binary (1:3 pos:neg, 6–10 positives / title) | Average Precision | ranker for clip selection |
MediaFM beats all reported baselines on all five tasks.
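As an illustration of the clip-popularity metric, here is a tie-free Kendall's τ between a predicted and a ground-truth ranking. The exact variant and fold handling Netflix reports are not disclosed; this is the plain tau-a form.

```python
def kendall_tau(pred_rank, true_rank):
    """Kendall's tau-a between two rankings of the same clips (no tie
    handling): (concordant - discordant) / total pairs."""
    n = len(pred_rank)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pred_rank[i] - pred_rank[j]) * (true_rank[i] - true_rank[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

tau = kendall_tau([3, 1, 2, 4], [3, 2, 1, 4])  # partial agreement
```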
"Embedding in context" inference pattern¶
A critical deployment-time finding: when embedding a short clip (seconds to a minute), run inference on the full episode / movie containing that clip and extract the contextualised vectors for the clip's shot span — do not run inference on just the clip's shots in isolation. The contextualisation baked into each shot's embedding cannot be reproduced by post-hoc aggregation.
See concepts/embedding-in-context.
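The rule reduces to "encode the whole title, then slice out the clip's span". A sketch, where encode_title is a hypothetical stand-in for frozen-MediaFM inference over all shots of the containing title:

```python
def embed_clip_in_context(title_shots, clip_start, clip_end, encode_title):
    """Embed a clip by encoding the FULL containing title, then slicing out
    the clip's shot span -- never by encoding the clip's shots in isolation."""
    contextual = encode_title(title_shots)   # per-shot contextual embeddings
    return contextual[clip_start:clip_end]   # clip span, with context baked in

# Stand-in encoder (identity over shot IDs) just to show the slicing:
clip_vecs = embed_clip_in_context(list(range(100)), 10, 14, lambda shots: shots)
```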
Ablation — contextualisation > multimodality¶
Netflix compares MediaFM to an uncontextualised tri-modal baseline (same tri-modal per-shot input; no transformer on top). Per-task findings:
- Clip tone — multimodality helps "somewhat"; most of the lift comes from contextualisation.
- Clip popularity ranking — multimodal without context is worse than a single-modality baseline; adding the transformer on top of tri-modal input lifts the model significantly above both. "Oddly, multiple uncontextualized modalities hurts the clip popularity ranking model, but adding contextualization significantly improves performance."
- Clip retrieval — adding modalities and then adding context each contribute roughly +15%.
Motivating observation: "Improvements seem to be larger for tasks that require more detailed narrative understanding." Ad-break placement is the prototypical narrative task.
Production consumers (named at framing level)¶
- Ads relevancy — retrieval stage for candidate clip identification.
- Clip popularity / tone / genre / retrieval — linear probes served as part of Netflix's catalog-tagging + clip-selection pipeline.
- Recsys cold start for newly-launching titles — MediaFM's content-derived embedding means a brand-new title has a usable representation from the moment audio + video + text are available, with no user-interaction data required (concepts/cold-start).
- Optimised promotional assets — art + trailers.
- Internal content-analysis tools.
Netflix flags that MediaFM outputs are "utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion" — a decision-support layer, not an autonomous-action layer.
Forward direction¶
Netflix explicitly flags pretrained multimodal LLMs (Qwen3-Omni is named) as a potential stronger starting point — a model where the modality fusion has already been learned at massive scale, onto which Netflix can then layer MSM / contextualisation. MediaFM-v2 may drop the tri-modal concat step in favour of a pre-trained omnimodal backbone.
Relationship to adjacent wiki systems¶
- CLIP — CLIP is the single-image multimodal ancestor; MediaFM is the long-form multimodal descendant. MediaFM uses a CLIP descendant (SeqCLIP) as a frozen sub-component for the video modality.
- VideoPrism (Google) — comparable video foundation model from Google. Key architectural differences:
- VideoPrism: two-stage contrastive-then-masked on raw pixels (patterns/two-stage-pretraining-contrastive-then-masked). MediaFM: masked-only, on top of frozen pre-trained unimodal encoders (SeqCLIP / wav2vec2 / OpenAI text).
- VideoPrism: spatiotemporal patches as tokens. MediaFM: shots as tokens — much coarser, much longer sequences, designed for title-scale narrative, not second-scale action.
- VideoPrism: single-modality (video + text only at training time). MediaFM: tri-modal (video + audio + text) at inference time.
- patterns/multimodal-content-understanding (canonicalised from Dropbox Dash) — Dash operates at scene granularity for enterprise search; MediaFM operates at shot granularity for a single long-form title. Shots are a finer unit within a scene.
Caveats¶
- No public numbers on model scale — hidden dim, layers, heads, parameter count, pre-training corpus size (beyond "tens of millions of individual shots").
- No absolute metrics — only MediaFM-vs-baseline deltas shown as charts; no raw APs / Kendall's τ / numbers.
- Muon win is asserted not quantified.
- Title-level metadata source is shallow — only synopses + tags feed [GLOBAL].
- Zero-padding missing text is a hack — modality-specific masking / gating would be more principled.
- Not a systems paper — shot-boundary-detection throughput, ingestion pipeline, inference scheduling, embedding store all undescribed.
- Evaluation is Netflix-internal — no public-benchmark numbers for head-to-head calibration against VideoPrism / VideoMAE / InternVideo.
- "Various stages of deployment" — not all the reported wins are live in production.
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — canonical wiki source; full architectural description, ablation findings, five downstream-task evaluation shape, embedding-in-context deployment rule, Muon + AdamW split, production-consumer list, Qwen3-Omni forward-direction flag.
Related¶
- systems/netflix-seqclip — video-modality sub-encoder.
- systems/wav2vec2 — audio-modality sub-encoder.
- systems/openai-text-embedding-3-large — text-modality + title-metadata sub-encoder.
- systems/clip-embedding-model — architectural ancestor of SeqCLIP.
- systems/videoprism — Google comparable; different corpus + architecture choices.
- concepts/masked-shot-modeling — the training objective.
- concepts/shot-level-embedding — the atomic unit.
- concepts/embedding-in-context — inference-time deployment rule.
- concepts/muon-optimizer — hidden-layer optimiser choice.
- concepts/linear-probe-evaluation — eval posture.
- concepts/vector-embedding — the general concept.
- concepts/cold-start — the recsys role.
- patterns/tri-modal-embedding-fusion — per-shot input construction.
- patterns/frozen-encoder-linear-probe — deployment posture.
- patterns/multimodal-content-understanding — adjacent enterprise pattern at scene granularity.
- patterns/two-stage-pretraining-contrastive-then-masked — the VideoPrism contrast.
- companies/netflix — producing company.