Netflix — MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix¶
Summary¶
Netflix's Machine Learning + Content Intelligence team describes
MediaFM — Netflix's first tri-modal (audio + video + timed
text) foundation model for media understanding. MediaFM is a
BERT-style Transformer encoder that ingests sequences of
shots (up to 512 per title) from Netflix's catalog and produces
contextual 2304-dimensional shot-level embeddings. The
pre-training objective is Masked Shot Modeling (MSM) — randomly
mask 20% of input shots, predict the original fused embedding at
masked positions by minimising cosine distance to the ground truth.
Shot-level input is constructed by concatenating three unimodal
embeddings: SeqCLIP (CLIP-style video
encoder fine-tuned on video retrieval) for the video frames,
Meta FAIR wav2vec2 for the audio, and OpenAI
text-embedding-3-large for the timed-text (captions / audio
descriptions / subtitles); each shot's three embeddings are
concatenated + unit-normalised to a single 2304-dim vector, then
projected to the model's hidden dim. Two special tokens are
prepended: a learnable [CLS] and a [GLOBAL] token carrying
title-level metadata (synopsis + tags, also embedded via OpenAI
text-embedding-3-large). The resulting contextualised embeddings
beat several strong external + internal baselines on five
downstream Netflix tasks evaluated via frozen-feature linear
probes: ad relevancy (mAP), clip popularity ranking (Kendall's
τ), clip tone classification (micro AP), clip genre classification
(macro AP), and clip retrieval (AP). Ablations isolate that
contextualisation — not additional modalities — delivers most of
MediaFM's gain, especially on tasks requiring narrative
understanding (ad-break placement, clip popularity). The
engineering choice flagged as especially impactful is the Muon
optimizer (for hidden parameters, with AdamW for the rest), which
gave "noticeable improvements" over pure-AdamW training. Embeddings
feed production capabilities at Netflix including cold-start of
newly-launching titles in recommendations, optimised promotional
assets (art + trailers), ads, and internal content-analysis tools.
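The per-shot fusion step described above can be sketched in a few lines. This is a minimal illustration, not Netflix code: the individual encoder output widths are not disclosed, so the 768-per-modality split below is an assumption chosen only because it sums to the stated 2304.

```python
import numpy as np

# Per-modality output widths are NOT disclosed in the post; 768 each is an
# illustrative assumption that happens to sum to the stated 2304 dims.
VIDEO_DIM = AUDIO_DIM = TEXT_DIM = 768

def fuse_shot(video_emb, audio_emb, text_emb=None):
    """Concatenate a shot's unimodal embeddings and unit-normalise.

    Missing timed text (e.g. a shot without dialogue) is zero-padded,
    per footnote 2 of the post; audio and video are always present.
    """
    if text_emb is None:
        text_emb = np.zeros(TEXT_DIM)
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
fused = fuse_shot(rng.normal(size=VIDEO_DIM),   # SeqCLIP stand-in
                  rng.normal(size=AUDIO_DIM))   # wav2vec2 stand-in; no text
```

Concatenation keeps each modality in a fixed slot of the fused vector, which is what makes the zero-padding convention for missing text well-defined.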
Key takeaways¶
- First tri-modal content-embedding foundation model at Netflix. MediaFM integrates audio + video + timed text into a single shot-level contextual embedding designed as the shared representation layer for many downstream Netflix services — ads relevancy, clip popularity / tone / genre / retrieval, and cold-start for new titles in recommendations. Embeddings-over-generative-text is a deliberate architectural choice justified in footnote 1: generate once, reuse everywhere across services; a cleaner abstraction than fine-tuning per service. (patterns/tri-modal-embedding-fusion; concepts/shot-level-embedding.)
- Shot is the fundamental unit. A shot — obtained from shot-boundary detection segmenting a title — is the atomic input. Each shot gets three independent modality embeddings: SeqCLIP (Netflix's fine-tuned video retriever; systems/netflix-seqclip) over uniformly-sampled frames, wav2vec2 (systems/wav2vec2) over the shot's audio, and OpenAI text-embedding-3-large (systems/openai-text-embedding-3-large) over the timed text (captions / AD / subtitles). The three are concatenated (rather than, say, late-fused via attention) and unit-normalised to a single 2304-dim fused vector per shot. Missing timed-text is zero-padded (footnote 2 — "we zero-pad for missing timed text data, which is relatively likely to occur (e.g., in shots without dialogue)"); audio + video are always present.
- BERT-style sequence model operates on shot sequences up to 512 shots per title. Each pre-training example is a temporally ordered sequence of fused shot embeddings from one movie or episode, capped at 512 shots, projected down to the model's hidden dimension by a linear input layer. Two special tokens are prepended: a learnable [CLS] (first position) and a [GLOBAL] token (second position) built from title-level metadata ("synopses and tags") — also embedded via OpenAI text-embedding-3-large, then projected. [GLOBAL] participates in self-attention with every shot, so every shot sees title-level context even through a single attention layer. Positional embeddings are added before the Transformer stack; a final linear output projection maps hidden states back up to the original 2304-dim fused-embedding space for the MSM prediction target.
- Masked Shot Modeling (MSM) — self-supervised objective directly adapted from BERT's MLM. 20% of input fused-shot embeddings are replaced with a learnable [MASK] embedding; the model predicts the original fused embedding at masked positions. Loss = cosine distance between predicted + ground-truth fused embedding per masked position. Training is entirely self-supervised — no labeled data required; the supervision comes from the surrounding unmasked shots + the title-level [GLOBAL] context. (concepts/masked-shot-modeling.)
- Muon optimizer for hidden parameters is a differential engineering win. Hidden-layer parameters are optimised with Muon; remaining parameters with AdamW. "It's worth noting that the switch to Muon resulted in noticeable improvements." This is the first wiki ingest naming Muon as a pre-training optimizer; a load-bearing but under-specified lever in the post. (concepts/muon-optimizer.)
- Evaluation is linear-probe on frozen MediaFM features across
five Netflix-specific tasks. Standard foundation-model
evaluation posture: the encoder is frozen after pre-training,
and task-specific linear layers are learned on top of the
output contextual embeddings. Five reported tasks, each with a
task-appropriate metric:
- Ad Relevancy — multi-label clip classification for ad placement; Average Precision. Embeddings operate at the retrieval stage (candidate-set identification) feeding the ad serving system.
- Clip Popularity Ranking — predict clip CTR rank relative to other clips from the same title; 10-fold Kendall's τ.
- Clip Tone — 100-category multi-label tone classification (e.g., "creepy, scary, humorous") from Netflix Metadata & Ratings; micro Average Precision (tone-averaged).
- Clip Genre — multi-label classification into the 11 Netflix core genres (Action / Anime / Comedy / Documentary / Drama / Fantasy / Horror / Kids / Romance / Sci-fi / Thriller) derived from parent-title genre; macro Average Precision (genre-averaged).
- Clip Retrieval — binary "clip-worthy" classification as labelled by human annotators (1:3 positive:negative ratio, 6–10 positives per title + matching negatives); AP. MediaFM beats all baselines on all five tasks. (patterns/frozen-encoder-linear-probe; concepts/linear-probe-evaluation.)
- "Embedding in context" beats "embedding the clip in isolation". Critical evaluation finding: when the downstream task is clip-level (seconds-to-a-minute), extracting the embedding from within the larger sequence (the full episode containing the clip) does materially better than extracting embeddings from only the shots that make up the clip. The transformer's contextualisation baked into each shot's embedding cannot be reproduced by post-hoc aggregation — you need to run inference on the full title-level sequence and pull out the contextualised vectors for the clip span of interest. (concepts/embedding-in-context.)
- Ablation — contextualisation dominates; multimodality is a
smaller add-on, mostly positive. The ablation baseline
concatenates the three uncontextualised unimodal embeddings
per shot — same input, no transformer. Findings (per the three
reported charts):
- Clip tone — additional modalities help "somewhat"; the main improvement comes from contextualisation.
- Clip popularity ranking — "oddly, multiple uncontextualised modalities hurts the clip popularity ranking model" — the extra modalities without contextualisation actually degrade performance over the unimodal baseline. Adding the transformer on top of the tri-modal input lifts performance significantly above both uncontextualised baselines.
- Clip retrieval — "natural progression of around 15% for each improvement" — approximately +15% from adding modalities then another +15% from contextualisation. The per-task pattern motivates Netflix's investment in the transformer over "just fuse and train a classifier" — for some tasks, multimodal fusion without context is strictly worse than video-only.
- Narrative-understanding tasks benefit most. "Improvements seem to be larger for tasks that require more detailed narrative understanding e.g., predicting the most relevant ads for an ad break given the surrounding context." Ad-break placement is the prototypical narrative task — you need to understand what's about to happen + what just happened at that moment in the title.
- Production consumers (disclosed at the framing level, not architected in detail): ads relevancy retrieval stage, clip popularity prediction, clip tagging (tone / genre / retrieval via linear probes), cold start of newly-launching titles in recommendations (embeddings are content-derived, so a brand-new title has a usable representation from the moment its audio + video + text are available — no user-interaction data required — see concepts/cold-start), optimised promotional assets (art + trailers), and internal content-analysis tools. Netflix flags pointedly: "the model outputs are utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion." — i.e. MediaFM is a decision-support layer, not an automated-action layer.
- Forward direction: swap pretraining for off-the-shelf multimodal LLMs. "We are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like Qwen3-Omni, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations." The future substitute for "train the fusion yourself" is a pre-trained omnimodal foundation model.
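A minimal PyTorch sketch of the shape described above: input projection, learnable [CLS], metadata-derived [GLOBAL], positional embeddings, an output projection back to fused space, and the cosine-distance MSM loss at masked positions. The post discloses none of the hidden size, depth, head count, or metadata-embedding width, so those values here are illustrative stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FUSED_DIM, MAX_SHOTS = 2304, 512      # disclosed in the post
HIDDEN, META_DIM = 256, 3072          # NOT disclosed; illustrative only

class MediaFMSketch(nn.Module):
    def __init__(self, layers=2, heads=8):
        super().__init__()
        self.in_proj = nn.Linear(FUSED_DIM, HIDDEN)       # fused -> hidden
        self.out_proj = nn.Linear(HIDDEN, FUSED_DIM)      # hidden -> fused (MSM target space)
        self.meta_proj = nn.Linear(META_DIM, HIDDEN)      # [GLOBAL] from title metadata
        self.cls = nn.Parameter(torch.randn(HIDDEN))      # learnable [CLS]
        self.mask_tok = nn.Parameter(torch.randn(FUSED_DIM))  # learnable [MASK]
        self.pos = nn.Embedding(MAX_SHOTS + 2, HIDDEN)    # +2 for [CLS], [GLOBAL]
        block = nn.TransformerEncoderLayer(HIDDEN, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, shots, meta):
        B, S, _ = shots.shape
        masked = torch.rand(B, S) < 0.20                  # mask 20% of shots
        x = torch.where(masked[..., None], self.mask_tok, shots)
        x = self.in_proj(x)
        x = torch.cat([self.cls.expand(B, 1, HIDDEN),
                       self.meta_proj(meta)[:, None, :], x], dim=1)
        h = self.encoder(x + self.pos(torch.arange(S + 2)))
        pred = self.out_proj(h[:, 2:])                    # shot positions only
        cos = F.cosine_similarity(pred[masked], shots[masked], dim=-1)
        return (1.0 - cos).mean(), h                      # cosine-distance MSM loss

torch.manual_seed(0)
model = MediaFMSketch()
loss, states = model(torch.randn(2, 16, FUSED_DIM), torch.randn(2, META_DIM))
```

Because [GLOBAL] sits in every self-attention window, each shot attends to title-level context even in a one-layer stack, which is the mechanism the post credits for contextualisation.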
Systems / concepts / patterns extracted¶
Systems¶
- systems/netflix-mediafm — the model itself (new wiki page).
- systems/netflix-seqclip — Netflix's internal CLIP-style video encoder fine-tuned on video retrieval datasets; used to embed frames uniformly sampled from each shot (new wiki page).
- systems/wav2vec2 — Meta FAIR's self-supervised speech representation model (arXiv 2006.11477); used to embed shot audio (new wiki page).
- systems/openai-text-embedding-3-large — OpenAI's large text-embedding model; used for timed-text per shot AND for title-level metadata (synopses + tags) fed into the [GLOBAL] token (new wiki page).
- OpenAI CLIP — CLIP family referenced as the architectural lineage of SeqCLIP. No direct MediaFM use, but SeqCLIP is "a CLIP-style model fine-tuned on video retrieval datasets".
Concepts (new)¶
- concepts/masked-shot-modeling — MediaFM's self-supervised pre-training objective; BERT's MLM applied to shots as tokens with cosine distance over fused continuous embeddings replacing softmax over a discrete vocabulary.
- concepts/shot-level-embedding — shot (not frame, not title) as the atomic unit of contextual representation for long-form video understanding; fixed 2304-dim fused-embedding shape in MediaFM.
- concepts/embedding-in-context — at inference time, embed a short clip by running the full containing-episode through the encoder and extracting the clip-span's contextualised vectors — not by running the clip alone. Material quality delta per Netflix's evaluation.
- concepts/muon-optimizer — second-order optimiser for hidden-layer parameters of Transformer models; outperformed AdamW on MediaFM hidden params.
- concepts/linear-probe-evaluation — frozen-feature evaluation: fit a linear layer per downstream task on top of frozen foundation-model features; MediaFM's evaluation posture for all five tasks.
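The embedding-in-context concept reduces to a simple inference convention, sketched here under assumptions: the encoder interface, slicing convention, and the omission of special-token offsets are all illustrative, not Netflix's documented API.

```python
import torch

def clip_embeddings_in_context(encoder, episode_shots, span):
    """Embed a clip by encoding the FULL containing episode, then slicing
    the clip's shot span from the contextualised output. Per the post this
    materially beats encoding the clip's shots in isolation; the index
    convention here (and ignoring [CLS]/[GLOBAL] offsets) is illustrative.
    """
    with torch.no_grad():
        contextual = encoder(episode_shots.unsqueeze(0))  # (1, S, D): whole title
    start, end = span
    return contextual[0, start:end]                       # clip-span vectors only

encoder = torch.nn.Linear(2304, 2304)   # stand-in for the frozen MediaFM stack
episode = torch.randn(128, 2304)        # 128 shots in the containing episode
clip_vecs = clip_embeddings_in_context(encoder, episode, (40, 52))
```

The point of the convention: contextualisation lives inside each output vector, so it cannot be recovered by post-hoc aggregation of clip-only embeddings.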
Patterns (new)¶
- patterns/tri-modal-embedding-fusion — per-shot: concatenate unimodal embeddings from three modalities (video + audio + text), unit-normalise, project into the sequence-model's hidden dim, then let a downstream Transformer contextualise across time. Unit-normalisation + concatenation is a deliberate simplicity choice vs late attention-based fusion.
- patterns/frozen-encoder-linear-probe — train a large self-supervised encoder once; evaluate + deploy downstream tasks via a linear layer on top of frozen features. Standard foundation-model evaluation posture; MediaFM canonicalises it at Netflix.
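The frozen-encoder linear-probe pattern is mechanically simple; a sketch on synthetic stand-in features (the data, task, and split here are invented for illustration, only the 2304-dim feature width comes from the post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for frozen MediaFM output embeddings (2304-dim). In the real
# posture the encoder is frozen after pre-training; only this linear
# layer is fit per downstream task.
X = rng.normal(size=(200, 2304))
w = rng.normal(size=2304)
y = (X @ w > 0).astype(int)     # synthetic binary task, e.g. "clip-worthy"

probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
acc = probe.score(X[150:], y[150:])
```

One encoder, many cheap task heads: this is what makes "generate once, reuse everywhere" viable across the five evaluation tasks.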
Extensions to existing pages¶
- companies/netflix — new section: MediaFM / multimodal media-understanding foundation model (first in-wiki Netflix ingest on the content-embedding model axis, distinct from ML platform / workflow orchestrator / codec / observability / media-production-suite / data-gateway / knowledge-graph / chaos axes). Recent articles prepended; frontmatter sources + related + tags updated.
- concepts/vector-embedding — extended with the MediaFM tri-modal concatenation → 2304 dims instance as a multimodal-embedding example beyond single-image CLIP. Also the content-derived embedding → recsys cold-start use case.
- concepts/cold-start — extended with MediaFM as the Netflix recsys instance: content-derived embedding from audio + video + text means a new title has a usable representation at launch, no user-interaction data needed.
- patterns/multimodal-content-understanding — extended with MediaFM's long-form temporal video case — shot-level contextual embeddings are the missing piece when the scene-segment-level understanding of the canonical Dropbox Dash framing doesn't capture narrative across scenes. Netflix shows the level below scene segmentation: at shot granularity, learn temporal dependencies across a whole episode.
- patterns/two-stage-pretraining-contrastive-then-masked — extended with MediaFM as an MSM-only contrast: no contrastive stage (since the upstream unimodal encoders — SeqCLIP / wav2vec2 / OpenAI — are already pre-trained contrastively / predictively in their own right; MediaFM layers MSM on top of those frozen embeddings). Whereas VideoPrism does contrastive-then-masked from scratch on raw pixels, MediaFM does masked only on top of frozen unimodal features. Complementary shape for the same broad pattern.
- systems/clip-embedding-model — extended with SeqCLIP as a Netflix-internal CLIP descendant (video-fine-tuned).
Operational numbers¶
- Three unimodal encoders: SeqCLIP (video, internal), wav2vec2 (audio, Meta FAIR), OpenAI text-embedding-3-large (text).
- Per-shot fused embedding: 2304 dims (concatenation of the three unimodal vectors, unit-normalised).
- Sequence length: up to 512 shots per title per training example.
- Mask ratio: 20% of shots per sequence replaced with a learnable [MASK] embedding for the MSM objective.
- Loss: cosine distance between predicted + ground-truth fused embedding at masked positions.
- Special tokens per sequence: [CLS] (position 0, learnable) and [GLOBAL] (position 1, title-metadata-derived).
- Optimisers: Muon for hidden parameters; AdamW for the rest.
- Downstream-task metrics (reported but not numerically disclosed): ad relevancy (AP), clip popularity ranking (10-fold Kendall's τ), clip tone (micro AP), clip genre (macro AP), clip retrieval (AP).
- Clip-retrieval positive:negative ratio: 1:3, with 6–10 positive clips per title + matching negatives.
- Ablation headline on clip retrieval: "natural progression of around 15% for each improvement" (adding modalities → +15%; adding contextualisation → another +15%).
- Downstream genres: 11 core (Action / Anime / Comedy / Documentary / Drama / Fantasy / Horror / Kids / Romance / Sci-fi / Thriller).
- Downstream tone categories: 100.
No public numbers disclosed for: model scale (hidden dim, layers, heads, parameter count), pre-training corpus size ("tens of millions of individual shots across multiple titles" is the only figure), training compute, absolute task metrics on any of the five evaluations (only MediaFM-vs-baseline deltas shown as charts), frame-sampling rate per shot into SeqCLIP, SeqCLIP / wav2vec2 / OpenAI-text-embedding-3-large output dims individually before concatenation, deployment latency, or inference batch size.
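Since no absolute task metrics are disclosed, the only checkable arithmetic is in the metrics themselves. A toy illustration of the clip-popularity metric, Kendall's τ over per-title clip rank lists (the ranks below are invented):

```python
from scipy.stats import kendalltau

# Clip-popularity ranking is scored with Kendall's tau between predicted
# and observed CTR ranks of clips from the same title (10-fold per the
# post). One swapped adjacent pair among five clips costs 0.2 tau:
# tau = (concordant - discordant) / total pairs = (9 - 1) / 10 = 0.8.
predicted_rank = [1, 2, 3, 4, 5]
observed_rank = [1, 3, 2, 4, 5]
tau, _ = kendalltau(predicted_rank, observed_rank)   # tau == 0.8
```

Rank correlation (rather than AP) fits this task because clips compete only against other clips from the same title.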
Caveats¶
- Evaluation is all linear-probes on Netflix-specific tasks. No comparison on standard public video-understanding benchmarks (MSR-VTT, Kinetics, etc.) where external foundation models (VideoPrism, VideoMAE, InternVideo) could be compared head-to- head. Netflix's evaluation is tuned to Netflix decision-support tasks — good for validating the internal investment, harder to externally calibrate.
- Baselines named in aggregate ("strong external + internal models") but individual identities only partially disclosed. Footnote 3 mentions VertexAI MM and Marengo (Twelve Labs) — those are the named external comparables. No explicit named Netflix internal baseline (SeqCLIP-alone? prior tri-modal concat-without-transformer?) — only implicit per the ablation chart captions.
- Muon is flagged as impactful but not characterised. "The switch to Muon resulted in noticeable improvements" is the entire evidence on Muon; no loss-curve, no ablation-delta, no which-layers.
- Title-level metadata source is shallow. The [GLOBAL] token is built from "synopses and tags"; Netflix has much richer title-level structure (cast, episodic information, critical reception, genre taxonomy). A deeper metadata ingest could be a future improvement.
- Zero-padding missing text is a hack, not a principled fix. Shots without dialogue zero-pad the text-modality slot. The post acknowledges this is "relatively likely to occur"; a stronger approach (modality-specific masking, per-modality gating) is not described.
- Production-ingress plumbing unexplained. The post is a model paper, not a systems paper. Shot-boundary-detection throughput at catalog scale, the training-data pipeline, how inference is scheduled for a new title at launch, and the embedding-store are not described. MediaFM as an artefact is documented; MediaFM as an infrastructure component is not.
- "Various stages of deployment" hedging. The post is explicit that the downstream improvements are "in various stages of deployment" — not all reported MediaFM-driven wins are actually shipping.
- Forward-direction is an open question. Netflix flags Qwen3-Omni / pretrained multimodal LLMs as a potential replacement for the "fuse yourself" approach but has not reported results of that exploration.
- Shot-boundary detection is load-bearing upstream. The atomic unit is the shot; wrong shot boundaries (over-segmenting a long handheld take, under-segmenting a rapid-cut sequence) would shift the representation. The cited paper (Souček & Lokoč, 2020) is the upstream; its failure modes propagate.
Source¶
- Original: https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d?source=rss----2615bd06b42e---4
- Raw markdown:
raw/netflix/2026-02-23-mediafm-the-multimodal-ai-foundation-for-media-understanding-1f5b9645.md
Related¶
- companies/netflix
- Systems: systems/netflix-mediafm · systems/netflix-seqclip · systems/wav2vec2 · systems/openai-text-embedding-3-large · systems/clip-embedding-model
- Concepts: concepts/masked-shot-modeling · concepts/shot-level-embedding · concepts/embedding-in-context · concepts/muon-optimizer · concepts/linear-probe-evaluation · concepts/vector-embedding · concepts/cold-start
- Patterns: patterns/tri-modal-embedding-fusion · patterns/frozen-encoder-linear-probe · patterns/multimodal-content-understanding · patterns/two-stage-pretraining-contrastive-then-masked