

SeqCLIP (Netflix)

Definition

SeqCLIP is a Netflix-internal, CLIP-style model fine-tuned on video retrieval datasets that produces a video embedding from frames sampled at uniform intervals from a video shot. It is referenced (but not documented in its own right) in Netflix's MediaFM blog post as the video-modality sub-encoder: "an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots." (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).

The linked internal reference is "Building in Video Search" (netflixtechblog.com article) — not yet ingested into this wiki.
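The "uniform intervals" sampling can be sketched as follows. Netflix does not disclose how many frames are taken per shot, so the frame count `k` and the centred-index scheme here are assumptions for illustration:

```python
def sample_uniform_frames(num_frames_in_shot: int, k: int = 8) -> list[int]:
    """Pick k frame indices spaced at uniform intervals across a shot.

    Hypothetical helper: the MediaFM post says frames are "sampled at
    uniform intervals from segmented shots" but does not give k or the
    exact indexing, so this centres each sample within its interval.
    """
    # Centring at (i + 0.5) avoids always picking the shot's boundary frames.
    return [int((i + 0.5) * num_frames_in_shot / k) for i in range(k)]
```

For an 80-frame shot with `k=8`, this yields indices 5, 15, 25, ..., 75; each sampled frame would then be embedded by SeqCLIP's image tower.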

Relationship to CLIP

  • Base: a CLIP-style dual-encoder contrastive model (systems/clip-embedding-model) — text + image encoders trained to align in a shared vector space via contrastive loss.
  • Netflix fine-tune: the fine-tuning corpus is video retrieval datasets rather than web image-caption pairs. The model consumes sampled frames (image-like inputs) and produces embeddings that are optimised for video-retrieval queries — i.e. match-a-video-by-text-description.
  • Netflix does not disclose whether SeqCLIP's text tower is used downstream in MediaFM (MediaFM's text modality comes from OpenAI's text-embedding-3-large, not a SeqCLIP text encoder), so the operational dependency is likely only on SeqCLIP's image tower, used as a frozen frame encoder.
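The dual-encoder contrastive objective referred to above can be sketched as the standard symmetric InfoNCE loss used by CLIP. This is the generic CLIP-style loss, not Netflix's disclosed training code; the temperature value is an assumption:

```python
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Sketch of the standard CLIP objective: matched pairs sit on the
    diagonal of the similarity matrix and are pushed toward high
    similarity relative to all mismatched pairs.
    """
    # L2-normalise so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # matched pairs on the diagonal

    def xent(l: np.ndarray) -> float:
        # Numerically stable cross-entropy against the diagonal labels.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

In SeqCLIP's case the fine-tuning pairs would be drawn from video retrieval datasets rather than web image-caption pairs, per the blog post.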

Role in MediaFM

Per shot, MediaFM samples frames at uniform intervals, passes each frame through SeqCLIP, and aggregates the results (the post does not disclose the method; average or attention pooling is likely, but it doesn't commit) into a single per-shot video embedding. This video embedding is then concatenated with wav2vec2's audio embedding and OpenAI text-embedding-3-large's text embedding, unit-normalised, and fed to MediaFM's Transformer encoder as the fused 2304-dim per-shot token. See patterns/tri-modal-embedding-fusion.
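The fusion step above can be sketched as follows. Two assumptions are baked in, both undisclosed in the post: mean pooling over frame embeddings (the aggregation method is not stated) and an equal 768/768/768 per-modality split of the 2304-dim token (only the total is stated):

```python
import numpy as np

def fuse_shot_token(frame_embs: np.ndarray,
                    audio_emb: np.ndarray,
                    text_emb: np.ndarray) -> np.ndarray:
    """Build a fused per-shot token from the three modality embeddings.

    Assumptions (not confirmed by the post): frame embeddings are
    mean-pooled into the video embedding, and the three modalities
    contribute equal-sized slices of the 2304-dim token.
    """
    video_emb = frame_embs.mean(axis=0)            # (k, d_v) -> (d_v,)
    token = np.concatenate([video_emb, audio_emb, text_emb])
    return token / np.linalg.norm(token)           # unit-normalised, per the post
```

Only the concatenation and unit-normalisation are stated in the source; the pooling and the dimensional split are placeholders to make the shape arithmetic concrete.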

What's not disclosed

  • Backbone — ResNet, ViT, or custom?
  • Output dimensionality of SeqCLIP, and therefore its individual contribution to the 2304-dim fused per-shot embedding.
  • Frame-sampling rate into SeqCLIP per shot.
  • Per-frame pool method (mean, attention, CLS-token).
  • Fine-tuning corpus specifics (retrieval datasets used, losses).
  • Deployment surface (batched on GPU, SageMaker-equivalent internal, etc.).
  • Whether SeqCLIP is used elsewhere at Netflix outside MediaFM.
