VideoPrism (Google)¶
Definition¶
VideoPrism is a video foundation model (ViFM) published by Google Research (blog 2024-05-09, arXiv 2402.13217) — a single visual encoder pre-trained once and then frozen for use across a broad set of video-understanding tasks: classification, spatiotemporal localization, video-text retrieval, captioning, question answering, and scientific-domain video tasks. Reported results hit SOTA on 30 of 33 video-understanding benchmarks with minimal task-specific adaptation.
VideoPrism is the video-modality analog of CLIP: both are frozen visual encoders designed to feed downstream adapters (text encoders for retrieval, language decoders for captioning / QA, task heads for classification). CLIP handles single-image + text; VideoPrism extends to temporal input.
(Source: sources/2024-05-09-google-videoprism-foundational-visual-encoder)
Architecture¶
- Backbone. Standard vision transformer (ViT) with factorized spatial + temporal encoding following ViViT (arXiv:2103.15691). Factored attention (encode spatially, then temporally, in sequence) is the efficiency choice over full space-time attention, whose cost is quadratic in tokens × frames.
- Two-stage pre-training — see patterns/two-stage-pretraining-contrastive-then-masked:
- Stage 1: video-text contrastive learning. CLIP-style objective on the full corpus (clean + noisy video-text pairs). Teaches semantic alignment to text.
- Stage 2: masked-video modeling (MVM). Self-supervised on video-only content. The model predicts masked patches. Two tweaks: (a) it predicts both the video-level global embedding and the token-wise embeddings produced by the stage-1 model — effectively an online distillation of stage 1 into stage 2 rather than a fresh random init; (b) predicted tokens are randomly shuffled so the model cannot learn spatial order as a shortcut.
- Training corpus — see concepts/hybrid-clean-noisy-training-corpus:
- 36M high-quality video-text pairs (curated captions).
- 582M clips with noisy / machine-generated text (YT-Temporal-180M, InternVid, VideoCC, WTS-70M and others).
- Total ≈ 618M clips. Claimed "largest and most diverse video training corpus of its kind" (author claim; unverified).
- Downstream composition — see patterns/frozen-encoder-multi-task-adaptation:
- Video-text retrieval: VideoPrism + LiT-style text encoder.
- Captioning / QA: VideoPrism + PaLM-2 language decoder.
- Classification / localization: VideoPrism + small task head.
- Backbone weights are frozen; only the adapter layer is trained per task.
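The efficiency argument for factored attention can be made concrete with a toy sketch. Below is a minimal numpy illustration (no learned projections, single head — not the paper's implementation) of spatial-then-temporal attention over a `(frames, patches, dim)` token grid, plus the attention-matrix size comparison against full space-time attention. All function names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax_attention(x):
    """Toy single-head self-attention over the second-to-last axis
    (no learned Q/K/V projections -- illustration only)."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_attention(tokens):
    """ViViT-style factorization sketch: attend spatially within each
    frame, then temporally across frames at each patch position.
    tokens: (T, N, D)."""
    spatial = softmax_attention(tokens)                   # over N patches per frame
    temporal = softmax_attention(spatial.swapaxes(0, 1))  # over T frames per patch
    return temporal.swapaxes(0, 1)                        # back to (T, N, D)

T, N, D = 8, 196, 64          # frames, patches per frame, embedding dim (illustrative)
video_tokens = np.random.default_rng(0).normal(size=(T, N, D))
out = factorized_attention(video_tokens)
print(out.shape)              # (8, 196, 64)

# Attention-matrix sizes: full space-time attention builds one (T*N)^2
# score matrix; the factorized form builds T matrices of N^2 plus
# N matrices of T^2.
full = (T * N) ** 2
factored = T * N**2 + N * T**2
print(full, factored)         # 2458624 vs 319872
```

Even at this small scale the factorized form touches roughly 8× fewer score entries, which is the "efficiency choice" the backbone bullet refers to.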
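The frozen-backbone + light-adapter composition above can be sketched in a few lines. This is a hypothetical illustration, not VideoPrism's API: the encoder stand-in, cache, head shapes, and the embedding dimensionality (undisclosed in the blog) are all assumptions. The point it shows is structural — one frozen encoder pass, cached, serving several per-task heads.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32   # clip-embedding dim (illustrative; not disclosed by the blog)

def frozen_encoder(video):
    """Stand-in for the frozen VideoPrism backbone: a fixed projection
    that is never updated after pre-training."""
    W = np.linspace(-1, 1, video.shape[-1] * D).reshape(video.shape[-1], D)
    return video.mean(axis=0) @ W          # pool frames -> one clip embedding

cache = {}
def embed(clip_id, video):
    """Compute the clip embedding once; reuse it for every downstream task."""
    if clip_id not in cache:
        cache[clip_id] = frozen_encoder(video)
    return cache[clip_id]

# Per-task adapters: only these small heads would be trained per task.
classifier_head = rng.normal(size=(D, 10))   # e.g. 10 action classes
retrieval_proj  = rng.normal(size=(D, 16))   # project into a shared text space

video = rng.normal(size=(8, 64))             # 8 frames x 64 raw features
e = embed("clip-0", video)
class_logits = e @ classifier_head           # classification task
retrieval_vec = e @ retrieval_proj           # retrieval task

print(class_logits.shape, retrieval_vec.shape, len(cache))
# one encoder pass served both tasks; the cache holds a single entry
```

Swapping or retraining an adapter never touches the cached embeddings, which is what makes the N-tasks-one-encoder economics work.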
Why it matters architecturally¶
- Single serving surface for many downstream video tasks. The frozen-backbone pattern collapses what would otherwise be N per-task video models into one encoder weight set + N light adapters. Per-video-clip embeddings computed once can be cached and reused across retrieval, captioning, classification, etc.
- Hybrid-corpus shape scales data beyond curation budget. The clean tier (36M) is the practical upper bound of human-curated video-text pairs at publication time; the noisy tier (582M, ~16×) extends coverage without the curation cost. Both contribute signal the other can't. See concepts/hybrid-clean-noisy-training-corpus for the general framing.
- Two-stage pre-training learns what single-stage can't. Text captions are appearance-heavy ("a red car drives down the street"); stage-1 contrastive learning absorbs that signal. But captions under-describe dynamics — how things move, what changes. Stage-2 MVM on video-only content fills that gap without requiring paired text for the motion signal. Each stage learns signal the other cannot.
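The two complementary signals can be sketched with generic stand-in losses. The blog gives no loss formulations, so the following is a conventional-approximation sketch: a CLIP-style symmetric InfoNCE loss for stage 1, and a masked-token regression against stage-1 teacher embeddings for stage 2. All names, the mask ratio, and the squared-error choice are assumptions, not the paper's definitions.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Stage-1-style symmetric contrastive loss (CLIP-style sketch):
    matched video/text pairs sit on the diagonal of the similarity matrix."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    labels = np.arange(len(v))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return (xent(logits) + xent(logits.T)) / 2   # video->text and text->video

def mvm_distill_loss(student_tokens, teacher_tokens, mask):
    """Stage-2-style masked-video loss sketch: regress the stage-1
    teacher's token embeddings at masked positions (the online-distillation
    idea; the exact loss is not given in the blog)."""
    diff = student_tokens[mask] - teacher_tokens[mask]
    return float((diff ** 2).mean())

rng = np.random.default_rng(0)
B, N, D = 4, 16, 8                           # batch, tokens per clip, dim
loss1 = info_nce(rng.normal(size=(B, D)), rng.normal(size=(B, D)))
mask = rng.random((B, N)) < 0.9              # high mask ratio, typical of MVM
loss2 = mvm_distill_loss(rng.normal(size=(B, N, D)),
                         rng.normal(size=(B, N, D)), mask)
print(round(loss1, 3), round(loss2, 3))
```

Stage 1 only ever sees what captions describe; stage 2's target comes from the video tokens themselves, which is why it can pick up motion signal that paired text never mentions.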
Operational numbers (from the blog)¶
- Pre-training corpus: 36M (clean) + 582M (noisy) ≈ 618M clips.
- Benchmark coverage: SOTA on 30 of 33 video-understanding benchmarks; four categories (classification/localization, retrieval, captioning/QA, scientific).
- Downstream adaptation: "minimal" adaptation of a single frozen model; adapter parameter counts not disclosed.
- No serving numbers disclosed — no QPS, latency, fleet size, cost/inference, embedding dimensionality, or cache semantics. The blog is a research-announcement post; serving architecture is entirely absent.
Contrast with CLIP¶
| | CLIP | VideoPrism |
|---|---|---|
| Modality | Single image + text | Video (temporal token stream) + text |
| Pre-training stages | Single-stage contrastive | Two-stage: contrastive + masked-video modeling |
| Pre-training corpus | ~400M (image, text) pairs (public web) | 36M clean + 582M noisy video-text pairs |
| Backbone | ResNet / ViT | ViT + ViViT factorized attention |
| Shared embedding space | Text + image | Video + text (via stage-1 contrastive) |
| Typical adapter | Text encoder; small classifier heads | Text encoder (LiT); language decoder (PaLM-2); small heads |
| Open weights | Yes | No public weights at blog publication |
| Canonical production deployment in this wiki | systems/figma-ai-search | None observed |
Caveats¶
- Not observed deployed in this wiki. No ingested source so far describes a production system that consumes VideoPrism inference. Unlike CLIP (systems/figma-ai-search serves it in production), VideoPrism as of its blog announcement is a research artifact.
- "Surpasses domain experts" on scientific datasets — Fly vs. Fly, CalMS21, ChimpACT, KABR. Benchmark claim, not deployment claim. Real-world adoption by those domains not discussed.
- Private-data component. The 36M clean tier is "several public and private datasets"; the private portion is not enumerated (likely Google-internal YouTube-derived corpora). Model not reproducible from the blog alone.
- No architectural / loss-function depth in the blog — factored attention details, MVM loss formulation, shuffle mechanics, mask ratios, adapter shapes are deferred to the arXiv paper. Wiki treatment tracks the blog, not the paper.
Seen in¶
- sources/2024-05-09-google-videoprism-foundational-visual-encoder — VideoPrism's announcement post; names the two-stage pipeline, the 36M + 582M corpus split, the ViT/ViViT backbone, and the frozen-encoder + downstream-adapter serving pattern. 30/33 benchmark claim. Predecessor models named: VideoCLIP, InternVideo, VideoCoCa, UMT.