

SeqCLIP (Netflix)

Definition

SeqCLIP is a Netflix-internal, CLIP-style model fine-tuned on video retrieval datasets that produces a video embedding from frames sampled at uniform intervals from a video shot. It is referenced (but not documented in its own right) in Netflix's MediaFM blog post as the video-modality sub-encoder: "an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots." (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).

The linked internal reference is "Building in Video Search" (netflixtechblog.com article) — not yet ingested into this wiki.
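The "uniform intervals" sampling can be sketched as follows. Netflix does not disclose how many frames are taken per shot, so the frame count `k` and the centred-index scheme here are assumptions for illustration:

```python
def sample_uniform_frames(num_frames_in_shot: int, k: int = 8) -> list[int]:
    """Pick k frame indices spaced at uniform intervals across a shot.

    Hypothetical helper: the MediaFM post says frames are "sampled at
    uniform intervals from segmented shots" but does not give k or the
    exact indexing, so this centres each sample within its interval.
    """
    # Centring at (i + 0.5) avoids always picking the shot's boundary frames.
    return [int((i + 0.5) * num_frames_in_shot / k) for i in range(k)]
```

For an 80-frame shot with `k=8`, this yields indices 5, 15, 25, ..., 75; each sampled frame would then be embedded by SeqCLIP's image tower.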

Relationship to CLIP

  • Base: a CLIP-style dual-encoder contrastive model (systems/clip-embedding-model) — text + image encoders trained to align in a shared vector space via contrastive loss.
  • Netflix fine-tune: the fine-tuning corpus is video retrieval datasets rather than web image-caption pairs. The model consumes sampled frames (image-like inputs) and produces embeddings that are optimised for video-retrieval queries — i.e. match-a-video-by-text-description.
  • Netflix does not disclose whether SeqCLIP's text tower is used downstream in MediaFM (MediaFM's text modality comes from OpenAI's text-embedding-3-large, not a SeqCLIP text encoder), so the operational dependency is likely only on SeqCLIP's image tower, used as a frozen frame encoder.
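The dual-encoder contrastive objective referred to above can be sketched as the standard symmetric InfoNCE loss used by CLIP. This is the generic CLIP-style loss, not Netflix's disclosed training code; the temperature value is an assumption:

```python
import numpy as np

def clip_contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Sketch of the standard CLIP objective: matched pairs sit on the
    diagonal of the similarity matrix and are pushed toward high
    similarity relative to all mismatched pairs.
    """
    # L2-normalise so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # matched pairs on the diagonal

    def xent(l: np.ndarray) -> float:
        # Numerically stable cross-entropy against the diagonal labels.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

In SeqCLIP's case the fine-tuning pairs would be drawn from video retrieval datasets rather than web image-caption pairs, per the blog post.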

Role in MediaFM

Per shot, MediaFM samples frames at uniform intervals, passes each frame through SeqCLIP, and aggregates the results (the post does not disclose the method; average or attention pooling is likely, but it doesn't commit) into a single per-shot video embedding. This video embedding is then concatenated with wav2vec2's audio embedding and OpenAI text-embedding-3-large's text embedding, unit-normalised, and fed to MediaFM's Transformer encoder as the fused 2304-dim per-shot token. See patterns/tri-modal-embedding-fusion.
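The fusion step above can be sketched as follows. Two assumptions are baked in, both undisclosed in the post: mean pooling over frame embeddings (the aggregation method is not stated) and an equal 768/768/768 per-modality split of the 2304-dim token (only the total is stated):

```python
import numpy as np

def fuse_shot_token(frame_embs: np.ndarray,
                    audio_emb: np.ndarray,
                    text_emb: np.ndarray) -> np.ndarray:
    """Build a fused per-shot token from the three modality embeddings.

    Assumptions (not confirmed by the post): frame embeddings are
    mean-pooled into the video embedding, and the three modalities
    contribute equal-sized slices of the 2304-dim token.
    """
    video_emb = frame_embs.mean(axis=0)            # (k, d_v) -> (d_v,)
    token = np.concatenate([video_emb, audio_emb, text_emb])
    return token / np.linalg.norm(token)           # unit-normalised, per the post
```

Only the concatenation and unit-normalisation are stated in the source; the pooling and the dimensional split are placeholders to make the shape arithmetic concrete.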

What's not disclosed

  • Backbone — ResNet, ViT, or custom?
  • Output dimensionality of SeqCLIP, and therefore its individual contribution to the 2304-dim fused per-shot embedding.
  • Frame-sampling rate into SeqCLIP per shot.
  • Per-frame pool method (mean, attention, CLS-token).
  • Fine-tuning corpus specifics (retrieval datasets used, losses).
  • Deployment surface (batched on GPU, SageMaker-equivalent internal, etc.).
  • Whether SeqCLIP is used elsewhere at Netflix outside MediaFM.
