

Shot-level embedding

Definition

A shot-level embedding is a vector representation computed per shot of a piece of long-form video content — where a "shot" is the atomic temporal unit of video produced by shot-boundary detection, typically corresponding to a single continuous camera take before an edit / cut.

Shot-level embeddings sit in a hierarchy of video representation granularity:

title (movie / episode)
  └── scene            (narratively coherent group of shots, ~minutes)
       └── shot        (single continuous take, ~1-10 seconds)
            └── frame  (single image, 1/24th or 1/30th of a second)
                 └── patch / tubelet (sub-frame spatial / spatiotemporal region)

Most published video-understanding models operate on frames or patches (e.g. ViViT, VideoMAE, VideoPrism — patches or tubelets); shot-level embeddings operate one layer up the hierarchy. The MediaFM post (Netflix, 2026-02-23) is the canonical wiki source for the shot as the atomic unit of a foundation model.

(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)

Why shots as the unit

  • Narrative coherence. Within a shot, the camera POV, subjects, and action are continuous by construction. Within a 1-second frame-span, you may cross multiple shots in a rapid-cut sequence (a fight scene can have 30+ shots in a minute); within a patch, you have no semantic grouping at all. Shots are the smallest unit where "what's happening" is well-defined.
  • Sequence length budget. A typical 45-min Netflix episode has ~500-1000 shots; MediaFM caps at 512 shots per sequence. That fits a full episode's narrative into one self-attention window. Operating at frame granularity on the same content would require ~80,000 frames — infeasible for a BERT-style Transformer.
  • Modality alignment. Audio + captions + video all naturally align to a shot's time range. At patch granularity this alignment breaks — a patch has no audio, no caption.
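The sequence-length arithmetic above can be checked in a few lines; the frame rate and the shots-per-episode figure are illustrative assumptions drawn from the ~500-1000 range cited:

```python
# Back-of-envelope token budget for a 45-minute episode.
# FPS = 30 and shots = 750 are illustrative assumptions.
EPISODE_MIN = 45
FPS = 30
frames = EPISODE_MIN * 60 * FPS    # frame-granularity tokens
shots = 750                        # mid-range of the ~500-1000 shots cited

print(frames)           # 81000 -- far beyond one BERT-style attention window
print(frames // shots)  # 108 -- ~100x fewer tokens at shot granularity
```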

Constructing a shot-level embedding (MediaFM recipe)

  1. Shot-boundary detection on the full title to segment into shots.
  2. Per-modality encoders run over the shot's time range:
    • Video — frames uniformly sampled → pass through a video encoder (SeqCLIP) → per-shot video embedding.
    • Audio — shot's audio samples → pass through an audio encoder (wav2vec2) → per-shot audio embedding.
    • Text — shot's captions / AD / subtitles → pass through a text embedder (OpenAI text-embedding-3-large) → per-shot text embedding (zero-pad if absent).
  3. Concatenate the three unimodal embeddings + unit-normalise → one fused per-shot vector (2304 dims in MediaFM).
  4. (Downstream) — feed the sequence of fused shot embeddings through a sequence model (MediaFM's BERT-style encoder) to produce contextualised shot-level embeddings that incorporate information from surrounding shots + title-level metadata via the [GLOBAL] token.
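The fusion step (step 3) can be sketched as below. The per-modality dimensions (768 each, chosen only so that the concatenation sums to MediaFM's 2304 fused dims) and the function name are assumptions, not documented internals:

```python
import numpy as np

def fuse_shot_embedding(video_emb, audio_emb, text_emb=None, text_dim=768):
    """Step 3: concatenate unimodal embeddings and unit-normalise.

    A minimal sketch -- the 768-dims-per-modality split (3 x 768 = 2304,
    matching MediaFM's fused size) and this function name are assumptions.
    """
    if text_emb is None:                  # shot with no captions / AD / subtitles
        text_emb = np.zeros(text_dim)     # zero-pad the text slot
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / np.linalg.norm(fused)  # unit-normalise

# A dialogue-free shot: the text slot is zero-padded.
rng = np.random.default_rng(0)
fused = fuse_shot_embedding(rng.normal(size=768), rng.normal(size=768))
print(fused.shape)  # (2304,)
```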

The distinction between raw shot-level embedding (step 3) and contextualised shot-level embedding (step 4) matters — MediaFM's ablation shows most downstream-task gain comes from the contextualisation step.

What shot-level embeddings enable

  • Clip-level tasks at narrative resolution. A "clip" in Netflix's taxonomy is a short span of shots (seconds to a minute); clip-level tasks (ad relevance, tone, genre, retrieval, popularity) map directly to aggregations over shot-level embeddings.
  • Recsys cold-start for new titles. Content-derived representation available at launch, before any user-interaction data exists (concepts/cold-start).
  • Scene segmentation as downstream. Clustering / segmenting shot-level embeddings yields scene boundaries — a form of unsupervised scene discovery.
  • Efficient retrieval. Shot-granularity embeddings are ~100-1000× fewer than frame-granularity, making vector-store indexing of entire Netflix-scale catalogs feasible.
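The clip-level aggregation and retrieval bullets above can be sketched with mean-pooled shot embeddings and cosine search; both are illustrative choices, not documented MediaFM internals:

```python
import numpy as np

def clip_embedding(shot_embs):
    """Aggregate a contiguous span of shot embeddings into one clip vector.

    Mean-pool + renormalise is an illustrative aggregator, not necessarily
    what Netflix uses for clip-level tasks.
    """
    v = np.asarray(shot_embs).mean(axis=0)
    return v / np.linalg.norm(v)

def top_k_shots(query, index, k=5):
    """Cosine retrieval over a shot-level index (rows assumed unit-norm)."""
    scores = index @ query
    return np.argsort(-scores)[:k]

# Toy index: 1000 unit-norm "shot embeddings" of dim 2304.
rng = np.random.default_rng(1)
index = rng.normal(size=(1000, 2304))
index /= np.linalg.norm(index, axis=1, keepdims=True)

query = clip_embedding(index[10:14])   # a 4-shot "clip" drawn from the index
print(top_k_shots(query, index, k=4))  # the clip's own shots 10-13 dominate
```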

Caveats

  • Upstream failure modes. If shot-boundary detection over-segments (splitting a single long take into many false sub-shots) or under-segments (missing cuts), the representation shifts. MediaFM uses the detector of Souček & Lokoč (2020), so its failure modes propagate directly into the embeddings.
  • Zero-padding asymmetry. Shots without dialogue get a zero-padded text slot; shots without audio or video (rare) would break the pipeline — the wav2vec2 path assumes audio is always present.
  • Lost sub-shot detail. Sub-shot events (a gunshot mid-take, a reaction shot's specific facial expression) are subsumed into the shot-level vector; tasks needing finer temporal resolution need a different representation.
  • Duration collapse. Shot duration varies from fractions of a second to ~30 seconds; a fixed-dim per-shot representation discards this duration signal unless it is explicitly re-added via positional embeddings or metadata.
