Shot-level embedding¶
Definition¶
A shot-level embedding is a vector representation computed per shot of a piece of long-form video content — where a "shot" is the atomic temporal unit of video produced by shot-boundary detection, typically corresponding to a single continuous camera take before an edit / cut.
Shot-level embeddings sit in a hierarchy of video representation granularity:
title (movie / episode)
└── scene (narratively coherent group of shots, ~minutes)
└── shot (single continuous take, ~1-10 seconds)
└── frame (single image, 1/24th or 1/30th of a second)
└── patch / tubelet (sub-frame spatial / spatiotemporal region)
Most published video-understanding models operate on frames or patches (e.g. ViViT, VideoMAE, VideoPrism — patches or tubelets); shot-level embeddings operate one level up. The MediaFM post (Netflix, 2026-02-23) is the canonical wiki source for the shot as a foundation model's atomic unit.
(Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding)
Why shots as the unit¶
- Narrative coherence. Within a shot, the camera POV, subjects, and action are continuous by construction. A 1 s span of frames may cross multiple shots in a rapid-cut sequence (a fight scene can have 30+ shots in a minute); a patch has no semantic grouping at all. Shots are the smallest unit where "what's happening" is well-defined.
- Sequence length budget. A typical 45-min Netflix episode has ~500-1000 shots; MediaFM caps at 512 shots per sequence. That fits a full episode's narrative into one self-attention window. Operating at frame granularity on the same content would require ~80,000 frames — infeasible for a BERT-style Transformer.
- Modality alignment. Audio + captions + video all naturally align to a shot's time range. At patch granularity this alignment breaks — a patch has no audio, no caption.
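The sequence-length argument can be made concrete with back-of-envelope arithmetic. The frame rate and average shot length below are assumptions chosen to be consistent with the figures above:

```python
# Token-budget sketch for a 45-minute episode (FPS and 5 s average
# shot length are assumptions, not figures from the MediaFM post).
FPS = 30
EPISODE_S = 45 * 60

frames = EPISODE_S * FPS     # frame-granularity sequence length
shots = EPISODE_S // 5       # ~5 s average shot -> shot-granularity length
MAX_SHOTS = 512              # MediaFM's per-sequence cap

print(frames)                # 81000 -- far beyond a BERT-style attention window
print(shots)                 # 540 -- roughly one 512-shot window
print(frames // shots)       # 150 frames folded into each shot token
```

The ~150× compression per token is what makes whole-episode self-attention tractable at shot granularity.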
Constructing a shot-level embedding (MediaFM recipe)¶
1. Shot-boundary detection on the full title to segment it into shots.
2. Per-modality encoders run over the shot's time range:
   - Video — frames uniformly sampled → pass through a video encoder (SeqCLIP) → per-shot video embedding.
   - Audio — shot's audio samples → pass through an audio encoder (wav2vec2) → per-shot audio embedding.
   - Text — shot's captions / AD / subtitles → pass through a text embedder (OpenAI text-embedding-3-large) → per-shot text embedding (zero-padded if absent).
3. Concatenate the three unimodal embeddings + unit-normalise → one fused per-shot vector (2304 dims in MediaFM).
4. (Downstream) Feed the sequence of fused shot embeddings through a sequence model (MediaFM's BERT-style encoder) to produce contextualised shot-level embeddings that incorporate information from surrounding shots plus title-level metadata via the `[GLOBAL]` token.
The distinction between raw shot-level embedding (step 3) and contextualised shot-level embedding (step 4) matters — MediaFM's ablation shows most downstream-task gain comes from the contextualisation step.
What shot-level embeddings enable¶
- Clip-level tasks at narrative resolution. A "clip" in Netflix's taxonomy is a short span of shots (seconds to a minute); clip-level tasks (ad relevance, tone, genre, retrieval, popularity) map directly to aggregations over shot-level embeddings.
- Recsys cold-start for new titles. Content-derived representation available at launch, before any user-interaction data exists (concepts/cold-start).
- Scene segmentation as downstream. Clustering / segmenting shot-level embeddings yields scene boundaries — a form of unsupervised scene discovery.
- Efficient retrieval. Shot-granularity embeddings are ~100-1000× fewer than frame-granularity, making vector-store indexing of entire Netflix-scale catalogs feasible.
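Because the fused vectors are unit-normalised, shot retrieval reduces to a dot product. A minimal sketch (index size and noise level are illustrative, not from the source):

```python
import numpy as np

def top_k_shots(query, index, k=3):
    """Nearest shots by cosine similarity; `index` rows are unit-normalised."""
    scores = index @ query            # dot product == cosine for unit vectors
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(1)
index = rng.standard_normal((1000, 2304))
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Query: a slightly perturbed copy of shot 42's embedding.
query = index[42] + 0.01 * rng.standard_normal(2304)
query /= np.linalg.norm(query)
print(top_k_shots(query, index)[0])   # 42 -- the source shot ranks first
```

At ~1000 shots per episode this brute-force scan is already cheap; catalog scale would swap in an ANN index over the same vectors.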
Caveats¶
- Upstream failure modes. If shot-boundary detection over-segments (splitting a single long take into many false sub-shots) or under-segments (missing cuts), the representation shifts. MediaFM uses the cited Souček & Lokoč (2020) detector.
- Zero-padding asymmetry. Shots without dialogue get a zero-padded text slot; shots without audio or video (rare) would break the pipeline — see the wav2vec2 always-present assumption.
- Lost sub-shot detail. Sub-shot events (a gunshot mid-take, a reaction shot's specific facial expression) are subsumed into the shot-level vector; tasks needing finer temporal resolution need a different representation.
- Shot duration varies from fractions of a second to ~30 seconds. Fixed-dim per-shot representation discards duration signal unless explicitly re-added via positional embeddings or metadata.
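One way to mitigate the last caveat is to re-attach duration as an explicit feature before the sequence model. This is a hypothetical sketch, not MediaFM's documented approach:

```python
import numpy as np

def add_duration_feature(shot_emb, duration_s):
    """Append a log-compressed duration scalar to a fused shot embedding
    (hypothetical fix for the lost-duration caveat)."""
    dur = np.log1p(duration_s)                 # compress the ~0.1 s - 30 s range
    return np.concatenate([shot_emb, [dur]])   # 2304 -> 2305 dims

emb = np.zeros(2304)
print(add_duration_feature(emb, 8.0).shape)    # (2305,)
```

Learned positional embeddings over shot index, as the caveat notes, are the alternative that keeps the fused dimension unchanged.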
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — canonical wiki source; defines shots as MediaFM's atomic unit, describes tri-modal concatenation recipe producing a 2304-dim fused per-shot embedding.
Related¶
- systems/netflix-mediafm — canonical instance.
- systems/netflix-seqclip / systems/wav2vec2 / systems/openai-text-embedding-3-large — per-modality sub-encoders.
- concepts/vector-embedding — general concept.
- concepts/masked-shot-modeling — self-supervised objective over sequences of shot-level embeddings.
- concepts/embedding-in-context — inference-time rule for extracting shot-level embeddings from a long sequence.
- patterns/multimodal-content-understanding — Dropbox Dash's scene-granularity ancestor of this pattern.
- patterns/tri-modal-embedding-fusion — the per-shot fusion construction.