

Masked Shot Modeling (MSM)

Definition

Masked Shot Modeling (MSM) is a self-supervised pre-training objective in which a sequence model is trained to predict the original embeddings of masked shots in a video from the surrounding unmasked shots and any global-context tokens. It is the direct adaptation of BERT's Masked Language Modeling (MLM) to shots-as-tokens on long-form video, with two crucial mechanical differences:

  1. Tokens are continuous embeddings (e.g. the 2304-dim fused audio+video+text per-shot vectors in MediaFM), not discrete vocabulary items. There is no softmax over a vocabulary.
  2. Loss is cosine distance between predicted and ground-truth fused embeddings, not categorical cross-entropy.

The target is to predict the pre-existing fused embedding at the masked position — so MSM is a reconstruction-in-embedding-space objective, conceptually closer to Masked Autoencoders (MAE) than to vanilla MLM in loss shape, but taking shots as atomic units rather than pixel patches.

(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)
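The objective above can be sketched in a few lines. This is an illustrative toy, not MediaFM's implementation: the dimensions are shrunk, and the "model" is a stand-in that predicts a masked shot as the mean of its neighbours (the real predictor is a transformer over the whole sequence). Note that on unit-normalised vectors, cosine distance and squared L2 distance are equivalent up to a factor of 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_distance(pred, target):
    """1 - cosine similarity, averaged over the masked positions only."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

# Toy setup: 10 shots with 8-dim fused embeddings (MediaFM uses 2304-dim).
shots = rng.standard_normal((10, 8))
mask_idx = np.array([2, 7])  # positions replaced by [MASK] at input time

# Stand-in "model": predict each masked shot from its two neighbours.
preds = np.stack([(shots[i - 1] + shots[i + 1]) / 2 for i in mask_idx])

# Reconstruction-in-embedding-space loss, computed only where masking occurred.
loss = cosine_distance(preds, shots[mask_idx])
```

The key property mirrored here is that the loss touches only masked positions and compares continuous vectors, with no vocabulary and no softmax.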

Canonical instance — MediaFM

MediaFM uses MSM as its sole pre-training objective:

  • Masking ratio: 20% of input shots per sequence are replaced by a learnable [MASK] embedding.
  • Sequence: up to 512 shots per title per training example, with a learnable [CLS] token and a [GLOBAL] token derived from title metadata prepended.
  • Target: the original 2304-dim fused shot embedding at each masked position (pre-mask).
  • Loss: cosine distance, summed over masked positions.
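
The recipe above can be sketched as input-sequence construction. This is a hedged sketch under the post's stated numbers (20% masking, 512-shot cap, 2304-dim embeddings); the function name and the random stand-ins for the learnable [CLS]/[GLOBAL]/[MASK] parameters are illustrative, and independent (non-span) mask selection is an assumption the post does not confirm.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 2304          # fused per-shot embedding dim (from the post)
MAX_SHOTS = 512   # max shots per training example (from the post)
MASK_RATIO = 0.20 # 20% of shots masked (from the post)

# Learnable parameters in the real model; random stand-ins here.
cls_tok = rng.standard_normal(D)
global_tok = rng.standard_normal(D)  # would be derived from title metadata
mask_tok = rng.standard_normal(D)

def build_msm_example(shot_embeds):
    """Mask 20% of shots, keep their pre-mask embeddings as targets,
    and prepend [CLS] and [GLOBAL]."""
    n = min(len(shot_embeds), MAX_SHOTS)
    shots = shot_embeds[:n].copy()
    n_mask = max(1, int(round(n * MASK_RATio)) if False else int(round(n * MASK_RATIO)))
    mask_idx = rng.choice(n, size=n_mask, replace=False)  # independent masking (assumed)
    targets = shots[mask_idx].copy()          # original fused embeddings (pre-mask)
    shots[mask_idx] = mask_tok                # replace inputs with [MASK]
    seq = np.vstack([cls_tok, global_tok, shots])
    return seq, mask_idx + 2, targets         # +2 offsets past the two prefix tokens

seq, pos, targets = build_msm_example(rng.standard_normal((300, D)))
```

A 300-shot title yields a 302-token sequence with 60 masked positions; the loss would then be cosine distance between the model's predictions at `pos` and `targets`.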

Netflix's choice is MSM-only — no contrastive stage precedes it — which is distinct from VideoPrism's contrastive-then-masked recipe. This works for MediaFM because the unimodal encoders are already pre-trained contrastively or predictively (SeqCLIP, wav2vec2, OpenAI text-embedding-3-large) — MediaFM doesn't need to re-learn semantic grounding; it only needs to learn temporal contextualisation across shots.

Why shot-level masking works

The signal that makes MSM useful is that shots are temporally correlated — a shot is more predictable given its neighbours than given random shots from the corpus. Narrative structure (scene continuity, emotional arc, dialogue turn-taking) creates a rich prediction task even on fused embeddings that already encode per-shot content. The model must learn:

  • Local continuity — the shot right before / after a masked shot usually looks, sounds, and reads similarly (same scene, same characters).
  • Longer-range narrative — dialogue callbacks, musical themes, visual motifs that recur across an episode.
  • Title-level priors (via [GLOBAL]) — genre, tone, synopsis inform what shots of a horror movie look like vs a kids' cartoon.

Contrast with pixel-patch masked video modeling

Masked-video modeling in the literature (MAE-V, VideoMAE, VideoPrism stage 2) typically masks pixel patches (sub-frame spatial regions) or spatiotemporal tubelets (patches that extend across frames). MSM is coarser — the unit is an entire shot — which gives:

  • Shorter sequences — 512 shots can cover a full 45-min episode; 512 pixel patches cover a fraction of a second of video.
  • Higher-level semantic masking — masking a shot forces the model to reason about what happens at this point in the narrative, not what pixels fill this region.
  • Dependency on upstream unimodal encoders — pixel-patch models train the visual representation from scratch; MSM defers that to the unimodal encoders and trains only the temporal contextualisation layer.
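
The sequence-length gap is easy to make concrete with back-of-envelope arithmetic. The shot-length figure below is an illustrative average, not from the post; the tubelet figures are the commonly published VideoMAE configuration (16 frames of 224×224, 16×16 patches, tubelet depth 2), again assumed for comparison.

```python
# Shot tokens: a 45-minute episode at an assumed ~5 s average shot length.
episode_seconds = 45 * 60
avg_shot_seconds = 5
shot_tokens = episode_seconds // avg_shot_seconds  # fits near the 512-shot cap

# Tubelet tokens: (16/2) temporal slices x (224/16)^2 spatial patches,
# all for a clip of roughly half a second at 30 fps.
tubelet_tokens = (16 // 2) * (224 // 16) * (224 // 16)
```

So a shot-level vocabulary spends ~540 tokens on 45 minutes of video, while a tubelet vocabulary spends ~1568 tokens on well under a second.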

Relationship to BERT's MLM

|                 | BERT MLM                   | MediaFM MSM                            |
|-----------------|----------------------------|----------------------------------------|
| Token           | Word / subword ID          | Shot fused embedding (2304-dim)        |
| Vocabulary      | Finite, discrete           | Effectively continuous                 |
| Target          | Discrete token via softmax | Continuous vector via cosine distance  |
| Mask ratio      | 15%                        | 20%                                    |
| Loss            | Cross-entropy              | Cosine distance                        |
| Special tokens  | [CLS], [SEP]               | [CLS], [GLOBAL], [MASK]                |
| Global context  | None at sequence start     | [GLOBAL] token carries title metadata  |

The [GLOBAL] token is the notable departure. BERT has no explicit sequence-level prior token (beyond [CLS], which is learnt bottom-up), whereas MSM injects title-level synopsis and tag information at the input, giving the model top-down priors that a video-length document benefits from in ways a single sentence doesn't.

Design knobs

  • Masking ratio. MediaFM uses 20%; no ablation reported vs BERT's 15% or MAE's 75%.
  • Mask embedding sharing. MediaFM uses a single learnable [MASK]; variants could include a modality-specific mask per the three input modalities (no such split reported).
  • Loss metric. Cosine distance; L2 would be an alternative, but the unit-normalised fused-embedding input makes cosine a natural match.
  • Span masking vs token masking. The post doesn't describe whether masked shots are chosen independently or in contiguous spans (BERT has the latter variant via SpanBERT).
  • Prediction head. MediaFM maps contextualised hidden states back up to the 2304-dim fused-embedding space via a final linear output projection; deeper decoder heads (as in MAE) are not used per the post.
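
The prediction-head knob can be sketched as a single linear projection. The hidden size below is assumed (the post does not state it); only the 2304-dim output space comes from the source, and per the post no MAE-style deep decoder sits between the transformer and the targets.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 1024   # transformer hidden size: an assumption, not stated in the post
D_FUSED = 2304  # fused shot-embedding dim (from the post)

# A single linear output projection back to fused-embedding space.
W = rng.standard_normal((HIDDEN, D_FUSED)) / np.sqrt(HIDDEN)
b = np.zeros(D_FUSED)

def predict(hidden_states, masked_positions):
    """Project contextualised hidden states at the masked positions
    back up to the 2304-dim fused-embedding space."""
    return hidden_states[masked_positions] @ W + b

# [CLS] + [GLOBAL] + 512 shots -> 514 contextualised hidden states.
h = rng.standard_normal((514, HIDDEN))
preds = predict(h, np.array([10, 99]))
```

The resulting `preds` are what the cosine-distance loss compares against the pre-mask fused embeddings.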

When MSM fits

  • Long-form content where temporal structure carries signal (video, long audio, multi-turn dialogue).
  • Stacks with strong unimodal pre-trained encoders available per modality — MSM layers on top, does not replace them.
  • Downstream tasks that benefit from contextualised atomic-unit embeddings rather than whole-sequence pooled embeddings (clip-level retrieval, scene-level ad placement, per-shot tagging).

When MSM doesn't fit

  • Very short content — the temporal prediction signal is minimal when you only have a handful of shots.
  • Lacking a pre-trained unimodal encoder per modality — MSM assumes you have working per-modality features; if not, use contrastive or hybrid pretraining to generate them.
  • Tasks needing pixel-level reconstruction — MSM's loss is in feature space, not pixel space; you can't reconstruct actual frames from a MediaFM output.

Caveats

  • Netflix does not describe whether MSM was ablated against alternatives (e.g. adding a contrastive stage first, changing mask ratio, using span-based masking). The choice is presented as foundational rather than optimised.
  • No disclosure of whether curriculum / mask-ratio scheduling helps.
  • Loss-weighting between shot-level MSM and any auxiliary losses (global prediction, etc.) is not described, which suggests MSM is used as the sole loss.
