Embedding in context

Definition

Embedding in context is an inference-time deployment rule for Transformer encoders that produce contextualised embeddings over a sequence: to extract a sub-span's embeddings, run inference on the full sequence containing the sub-span and slice out the contextualised vectors at the sub-span's positions, rather than running inference on only the sub-span's tokens in isolation.

The rule is specifically called out in Netflix's MediaFM post (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding):

"When embedding these clips, we find that 'embedding in context', namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), naturally does much better than embedding only the shots from a clip."

Why context matters

A contextualised embedding, by construction, incorporates information from other positions in the input sequence through self-attention. If you truncate the input to only the sub-span of interest, the self-attention mechanism has no other positions to attend to, so the resulting embedding loses the dependency on the larger context that the pre-training objective taught the model to use.
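The effect is easy to demonstrate with a minimal single-head self-attention layer (a toy stand-in, not MediaFM's architecture): slicing a span's outputs from a full-sequence pass gives different vectors than running the span alone, because attention over the full sequence mixes in positions outside the span.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head self-attention: every output position is a weighted
    # mix over ALL input positions, so context changes the output.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

title = rng.normal(size=(16, d))  # 16 "shots" standing in for a full title
span = slice(4, 8)                # the clip's shot positions

# Embedding in context: full-sequence inference, then slice the span.
in_context = self_attention(title, Wq, Wk, Wv)[span]
# Embedding the span alone: inference on only the clip's shots.
alone = self_attention(title[span], Wq, Wk, Wv)

print(np.allclose(in_context, alone))  # False
```

The two results disagree precisely because the in-context vectors attended to shots outside the clip, which is the dependency the pre-training objective taught the model to exploit.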

In MediaFM, this shows up clearly. The model's MSM pre-training teaches it to predict a masked shot's embedding from all the other shots in the same title. At inference time, if you ask for a clip's embeddings using only the clip's 30 shots, you force the model into a distribution-shift situation it was never trained on — the [GLOBAL] token still works, positional embeddings still work, but the cross-shot contextualisation that was the point of pre-training has nothing to chew on.

MediaFM deployment shape

For Netflix downstream tasks that operate at clip granularity (seconds-to-a-minute spans), the deployment path is:

  1. Identify the containing title (movie / episode) for the clip.
  2. Run MediaFM inference on the full title — the sequence of up to 512 shot embeddings + [CLS] + [GLOBAL].
  3. Slice out the contextualised output vectors at the shot positions that correspond to the clip's time range.
  4. Feed those contextualised vectors to the downstream linear-probe classifier / ranker.
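The four steps above can be sketched as a short data-flow, with hypothetical placeholders standing in for MediaFM and the trained linear probe (neither function name comes from the source):

```python
import numpy as np

rng = np.random.default_rng(1)

def mediafm_title_inference(title_shots: np.ndarray) -> np.ndarray:
    # Placeholder for the full-title Transformer forward pass:
    # one contextualised vector per shot, computed over the whole title.
    return title_shots

def linear_probe(clip_vectors: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Placeholder downstream linear-probe classifier over the clip.
    return clip_vectors.mean(axis=0) @ W

title_shots = rng.normal(size=(512, 32))               # step 1: the containing title
contextualised = mediafm_title_inference(title_shots)  # step 2: full-title inference
clip_vectors = contextualised[100:130]                 # step 3: slice the clip's ~30 shots
scores = linear_probe(clip_vectors, rng.normal(size=(32, 5)))  # step 4
```

The key property is that step 3 slices outputs of the full-title pass; the clip's vectors were never computed in isolation.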

The alternative, running MediaFM on only the clip's shot span, gives measurably worse downstream task quality in the comparison Netflix reports.

Relationship to "full-document embedding" in text

The concept has a direct analog in NLP: when embedding a span of a document (a sentence, paragraph) with a contextualised encoder like BERT, you get better results by running the full document and extracting the span's token embeddings than by running the span alone. This is especially clear for long-document encoders (Longformer, BigBird) where the pre-training context is explicitly longer than a sentence.

MediaFM's video-domain instance is the same principle scaled up to long-form video: the pre-training context is a full episode, so inference should also be on a full episode.

Operational implications

  • Inference latency is title-scale, not clip-scale. A clip-level downstream task still requires title-level inference, i.e. a Transformer forward pass over up to 512 shots per title. Caching title-level outputs once and slicing per clip amortises this.
  • Caching is natural. Once a title has been processed, the per-shot contextualised embeddings can be stored and reused across downstream tasks. MediaFM is explicitly positioned as a "representation [generated] once, and consumed across our entire suite of services" (footnote 1).
  • New content triggers full-title inference. When a new title launches, MediaFM needs to process the whole thing before any clip-level application can use it; but that's a one-shot cost per title.
  • Re-running with updated models. If MediaFM is upgraded, all titles need re-inference — a catalogue-wide refresh job. No incremental / partial-update path is described.
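A minimal sketch of the caching and refresh behaviour described above, assuming a hypothetical run_inference(title_id) that stands in for a full-title MediaFM forward pass (all names here are illustrative, not from the source):

```python
_cache: dict[tuple[str, str], list[float]] = {}

def title_embeddings(model_version: str, title_id: str, run_inference) -> list[float]:
    # Keying the cache on (model version, title) means a model upgrade
    # invalidates every title at once, forcing a catalogue-wide refresh,
    # while each title is otherwise a one-shot cost.
    key = (model_version, title_id)
    if key not in _cache:
        _cache[key] = run_inference(title_id)
    return _cache[key]

calls = []
def run_inference(title_id):
    calls.append(title_id)       # count actual forward passes
    return [0.0]                 # placeholder per-shot embeddings

title_embeddings("v1", "episode-101", run_inference)
title_embeddings("v1", "episode-101", run_inference)  # cache hit: no re-run
title_embeddings("v2", "episode-101", run_inference)  # model upgrade: re-run
```

With this key structure the "no incremental / partial-update path" caveat falls out naturally: bumping the version leaves no entry reusable.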

Contrast with "embedding just the span"

|                                          | Embed the span alone                           | Embed in context                                          |
|------------------------------------------|------------------------------------------------|-----------------------------------------------------------|
| Inference cost per clip                  | Low: runs on ~30 shots                         | High: runs on ~500-1000 shots (amortised via caching)     |
| Output dependency on surrounding content | None                                           | Full self-attention scope                                 |
| Cold-start behaviour                     | Can compute before rest of title is available  | Requires the whole title                                  |
| Quality (MediaFM-reported)               | Lower                                          | "Much better"                                             |

When this matters

  • Any Transformer-encoder with long-sequence pre-training. MediaFM, Longformer, video foundation models, protein-language models at whole-protein scale.
  • Downstream tasks on sub-spans. Clip-level, sentence-level, residue-level — anywhere the task granularity is finer than pre-training granularity.
  • Evaluation methodology. When benchmarking a contextualised encoder against an uncontextualised baseline, the encoder must be run in context or the comparison is unfairly rigged against it.

When it doesn't matter

  • Models trained at span granularity. If the pre-training objective is already span-level (e.g. sentence encoders like Sentence-BERT), running on the span alone is what the model was trained for.
  • Pooling models. If the downstream use is a whole-sequence pooled representation (CLS token over full sequence), there's no slicing / sub-span concern.

Caveats

  • Netflix reports "much better" only qualitatively; no explicit delta number comparing span-alone vs in-context inference is given for each of the five downstream tasks.
  • The rule assumes the full-title context is actually available at inference time. Real-time use cases where the downstream application knows only a live clip (e.g. real-time content tagging of a live stream) would need a different approach.
