SeqCLIP (Netflix)¶
Definition¶
SeqCLIP is a Netflix-internal CLIP-style model fine-tuned on video retrieval datasets that produces a video embedding from a set of frames sampled at uniform intervals from a video shot. It is referenced, but not documented stand-alone, in Netflix's MediaFM blog post as the video-modality sub-encoder: "an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots." (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).
The linked internal reference is "Building in Video Search" (netflixtechblog.com article) — not yet ingested into this wiki.
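The "uniform intervals" sampling the post describes can be sketched as follows. This is a minimal illustration, not Netflix's implementation; the function name, frame count, and sample count are assumptions.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Return `num_samples` frame indices spaced uniformly across a shot.

    Hypothetical helper: Netflix does not disclose its sampling rate or
    whether endpoints are included.
    """
    # Spread sample points evenly over [0, num_frames - 1], then snap to
    # valid integer frame indices.
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# e.g. 8 frames drawn evenly from a 120-frame shot
print(uniform_frame_indices(120, 8))  # [  0  17  34  51  68  85 102 119]
```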
Relationship to CLIP¶
- Base: a CLIP-style dual-encoder contrastive model (systems/clip-embedding-model) — text + image encoders trained to align in a shared vector space via contrastive loss.
- Netflix fine-tune: the fine-tuning corpus is video retrieval datasets rather than web image-caption pairs. The model consumes sampled frames (image-like inputs) and produces embeddings that are optimised for video-retrieval queries — i.e. match-a-video-by-text-description.
- Netflix does not disclose whether SeqCLIP's text tower is used downstream in MediaFM (the text modality there is produced by OpenAI's text-embedding-3-large, not SeqCLIP-text), so the operational dependency is likely only on the SeqCLIP image tower as a frozen frame encoder.
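The contrastive-alignment objective mentioned above can be sketched with a CLIP-style symmetric InfoNCE loss: paired embeddings are L2-normalised, an all-pairs similarity matrix is built, and cross-entropy with the diagonal as targets pulls matched pairs together. Shapes, the temperature value, and the function name are illustrative assumptions, not details from the post.

```python
import numpy as np

def clip_contrastive_loss(img: np.ndarray, txt: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings.

    img, txt: (batch, dim) arrays where row i of each is a matched pair.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities

    # log-softmax along each direction; matched pairs sit on the diagonal
    log_p_img = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_txt = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    n = len(img)
    return float(-(np.trace(log_p_img) + np.trace(log_p_txt)) / (2 * n))

rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))
# perfectly aligned pairs score much lower than shuffled pairs
print(clip_contrastive_loss(e, e), clip_contrastive_loss(e, e[::-1]))
```

Fine-tuning on video retrieval datasets swaps the web image-caption pairs in this objective for (frame set, retrieval query) pairs, but the loss shape stays the same.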
Role in MediaFM¶
Per shot, MediaFM samples frames at uniform intervals, passes each frame through SeqCLIP, and aggregates the outputs (method not disclosed in the post; likely average or attention pooling, but the post doesn't commit) into a single per-shot video embedding. The SeqCLIP output is then concatenated with wav2vec2's audio embedding and OpenAI text-embedding-3-large's text embedding, unit-normalised, and fed to MediaFM's Transformer encoder as the fused 2304-dim per-shot token.
See patterns/tri-modal-embedding-fusion.
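The pool-concatenate-normalise pipeline above can be sketched as below. The per-modality dimensions (1024 video + 768 audio + 512 text, summing to 2304) and the mean pooling are assumptions for illustration only; the post discloses neither the split nor the pooling method.

```python
import numpy as np

def fuse_shot(video_frames: np.ndarray, audio: np.ndarray,
              text: np.ndarray) -> np.ndarray:
    """Build one fused per-shot token from per-modality embeddings.

    video_frames: (n_frames, d_v) per-frame SeqCLIP outputs
    audio:        (d_a,) wav2vec2-style shot embedding
    text:         (d_t,) text-embedding-3-large-style shot embedding
    """
    video = video_frames.mean(axis=0)             # pooling method is an assumption
    token = np.concatenate([video, audio, text])  # fused per-shot token
    return token / np.linalg.norm(token)          # unit-normalise

rng = np.random.default_rng(0)
token = fuse_shot(rng.normal(size=(8, 1024)),  # 8 sampled frames, assumed d_v=1024
                  rng.normal(size=768),        # assumed d_a=768
                  rng.normal(size=512))        # assumed d_t=512
print(token.shape)  # (2304,)
```

Only the total 2304 comes from the post; if the real per-modality widths differ, the concatenation order and normalisation step still hold.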
What's not disclosed¶
- Backbone — ResNet, ViT, or custom?
- Output dimensionality of SeqCLIP, i.e. its individual contribution to the 2304-dim per-shot fused embedding.
- Frame-sampling rate into SeqCLIP per shot.
- Per-frame pool method (mean, attention, CLS-token).
- Fine-tuning corpus specifics (retrieval datasets used, losses).
- Deployment surface (batched on GPU, SageMaker-equivalent internal, etc.).
- Whether SeqCLIP is used elsewhere at Netflix outside MediaFM.
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — single-sentence attribution as MediaFM's video-modality sub-encoder.
Related¶
- systems/netflix-mediafm — primary consumer.
- systems/clip-embedding-model — CLIP family (architectural ancestor).
- concepts/vector-embedding — general concept.
- patterns/tri-modal-embedding-fusion — consumption pattern.
- companies/netflix — owning company.