

Shot-level embedding

Definition

A shot-level embedding is a vector representation computed per shot of a piece of long-form video content — where a "shot" is the atomic temporal unit of video produced by shot-boundary detection, typically corresponding to a single continuous camera take before an edit / cut.

Shot-level embeddings sit in a hierarchy of video representation granularity:

title (movie / episode)
  └── scene            (narratively coherent group of shots, ~minutes)
       └── shot        (single continuous take, ~1-10 seconds)
            └── frame  (single image, 1/24th or 1/30th of a second)
                 └── patch / tubelet (sub-frame spatial / spatiotemporal region)

Most published video-understanding models operate on frames or patches (e.g. ViViT, VideoMAE, VideoPrism — patches or tubelets); shot-level embeddings operate one layer up the hierarchy. The MediaFM post (Netflix, 2026-02-23) is the canonical wiki source for the shot as the atomic unit of a foundation model.

(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)

Why shots as the unit

  • Narrative coherence. Within a shot, the camera POV, subjects, and action are continuous by construction. Within a 1-second frame-span, you may cross multiple shots in a rapid-cut sequence (a fight scene can have 30+ shots in a minute); within a patch, you have no semantic grouping at all. Shots are the smallest unit where "what's happening" is well-defined.
  • Sequence length budget. A typical 45-min Netflix episode has ~500-1000 shots; MediaFM caps at 512 shots per sequence. That fits a full episode's narrative into one self-attention window. Operating at frame granularity on the same content would require ~80,000 frames — infeasible for a BERT-style Transformer.
  • Modality alignment. Audio + captions + video all naturally align to a shot's time range. At patch granularity this alignment breaks — a patch has no audio, no caption.
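The sequence-length arithmetic above can be checked in a few lines; the frame rate and the shots-per-episode figure are illustrative assumptions drawn from the ~500-1000 range cited:

```python
# Back-of-envelope token budget for a 45-minute episode.
# FPS = 30 and shots = 750 are illustrative assumptions.
EPISODE_MIN = 45
FPS = 30
frames = EPISODE_MIN * 60 * FPS    # frame-granularity tokens
shots = 750                        # mid-range of the ~500-1000 shots cited

print(frames)           # 81000 -- far beyond one BERT-style attention window
print(frames // shots)  # 108 -- ~100x fewer tokens at shot granularity
```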

Constructing a shot-level embedding (MediaFM recipe)

  1. Shot-boundary detection on the full title to segment into shots.
  2. Per-modality encoders run over the shot's time range:
    • Video — frames uniformly sampled → pass through a video encoder (SeqCLIP) → per-shot video embedding.
    • Audio — shot's audio samples → pass through an audio encoder (wav2vec2) → per-shot audio embedding.
    • Text — shot's captions / AD / subtitles → pass through a text embedder (OpenAI text-embedding-3-large) → per-shot text embedding (zero-pad if absent).
  3. Concatenate the three unimodal embeddings + unit-normalise → one fused per-shot vector (2304 dims in MediaFM).
  4. (Downstream) — feed the sequence of fused shot embeddings through a sequence model (MediaFM's BERT-style encoder) to produce contextualised shot-level embeddings that incorporate information from surrounding shots + title-level metadata via the [GLOBAL] token.
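The fusion step (step 3) can be sketched as below. The per-modality dimensions (768 each, chosen only so that the concatenation sums to MediaFM's 2304 fused dims) and the function name are assumptions, not documented internals:

```python
import numpy as np

def fuse_shot_embedding(video_emb, audio_emb, text_emb=None, text_dim=768):
    """Step 3: concatenate unimodal embeddings and unit-normalise.

    A minimal sketch -- the 768-dims-per-modality split (3 x 768 = 2304,
    matching MediaFM's fused size) and this function name are assumptions.
    """
    if text_emb is None:                  # shot with no captions / AD / subtitles
        text_emb = np.zeros(text_dim)     # zero-pad the text slot
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / np.linalg.norm(fused)  # unit-normalise

# A dialogue-free shot: the text slot is zero-padded.
rng = np.random.default_rng(0)
fused = fuse_shot_embedding(rng.normal(size=768), rng.normal(size=768))
print(fused.shape)  # (2304,)
```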

The distinction between raw shot-level embedding (step 3) and contextualised shot-level embedding (step 4) matters — MediaFM's ablation shows most downstream-task gain comes from the contextualisation step.

What shot-level embeddings enable

  • Clip-level tasks at narrative resolution. A "clip" in Netflix's taxonomy is a short span of shots (seconds to a minute); clip-level tasks (ad relevance, tone, genre, retrieval, popularity) map directly to aggregations over shot-level embeddings.
  • Recsys cold-start for new titles. Content-derived representation available at launch, before any user-interaction data exists (concepts/cold-start).
  • Scene segmentation as downstream. Clustering / segmenting shot-level embeddings yields scene boundaries — a form of unsupervised scene discovery.
  • Efficient retrieval. Shot-granularity embeddings are ~100-1000× fewer than frame-granularity, making vector-store indexing of entire Netflix-scale catalogs feasible.
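The clip-level aggregation and retrieval bullets above can be sketched with mean-pooled shot embeddings and cosine search; both are illustrative choices, not documented MediaFM internals:

```python
import numpy as np

def clip_embedding(shot_embs):
    """Aggregate a contiguous span of shot embeddings into one clip vector.

    Mean-pool + renormalise is an illustrative aggregator, not necessarily
    what Netflix uses for clip-level tasks.
    """
    v = np.asarray(shot_embs).mean(axis=0)
    return v / np.linalg.norm(v)

def top_k_shots(query, index, k=5):
    """Cosine retrieval over a shot-level index (rows assumed unit-norm)."""
    scores = index @ query
    return np.argsort(-scores)[:k]

# Toy index: 1000 unit-norm "shot embeddings" of dim 2304.
rng = np.random.default_rng(1)
index = rng.normal(size=(1000, 2304))
index /= np.linalg.norm(index, axis=1, keepdims=True)

query = clip_embedding(index[10:14])   # a 4-shot "clip" drawn from the index
print(top_k_shots(query, index, k=4))  # the clip's own shots 10-13 dominate
```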

Caveats

  • Upstream failure modes. If shot-boundary detection over-segments (splitting a single long take into many false sub-shots) or under-segments (missing cuts), the representation shifts. MediaFM uses the detector of Souček & Lokoč (2020), so its failure modes propagate directly into the embeddings.
  • Zero-padding asymmetry. Shots without dialogue get a zero-padded text slot; shots without audio or video (rare) would break the pipeline — the wav2vec2 path assumes audio is always present.
  • Lost sub-shot detail. Sub-shot events (a gunshot mid-take, a reaction shot's specific facial expression) are subsumed into the shot-level vector; tasks needing finer temporal resolution need a different representation.
  • Duration collapse. Shot duration varies from fractions of a second to ~30 seconds; a fixed-dim per-shot representation discards this duration signal unless it is explicitly re-added via positional embeddings or metadata.
