
VideoPrism (Google)

Definition

VideoPrism is a video foundation model (ViFM) published by Google Research (blog 2024-05-09, arXiv 2402.13217) — a single visual encoder pre-trained once and then frozen for use across a broad set of video-understanding tasks: classification, spatiotemporal localization, video-text retrieval, captioning, question answering, and scientific-domain video tasks. Reported results hit SOTA on 30 of 33 video-understanding benchmarks with minimal task-specific adaptation.

VideoPrism is the video-modality analog of CLIP: both are frozen visual encoders designed to feed downstream adapters (text encoders for retrieval, language decoders for captioning / QA, task heads for classification). CLIP handles single-image + text; VideoPrism extends to temporal input.

(Source: sources/2024-05-09-google-videoprism-foundational-visual-encoder)

Architecture

  • Backbone. Standard vision transformer (ViT) with factorized spatial + temporal encoding following ViViT (arXiv:2103.15691). Factorized attention (attend within each frame spatially, then across frames temporally) is the efficiency choice over full space-time attention, whose cost is quadratic in the total token count (patches per frame × frames).
  • Two-stage pre-training — see patterns/two-stage-pretraining-contrastive-then-masked:
  • Stage 1: video-text contrastive learning. CLIP-style objective on the full corpus (clean + noisy video-text pairs). Teaches semantic alignment to text.
  • Stage 2: masked-video modeling (MVM). Self-supervised on video-only content. Model predicts masked patches. Two tweaks: (a) predicts both the video-level global embedding and the token-wise embeddings from the stage-1 model — effectively an online distillation of stage-1 into stage-2 rather than a fresh random-init; (b) predicted tokens are randomly shuffled to prevent the model from learning spatial order as a shortcut.
  • Training corpus — see concepts/hybrid-clean-noisy-training-corpus:
  • 36M high-quality video-text pairs (curated captions).
  • 582M clips with noisy / machine-generated text (YT-Temporal-180M, InternVid, VideoCC, WTS-70M and others).
  • Total ≈ 618M clips. Claimed "largest and most diverse video training corpus of its kind" (author claim; unverified).
  • Downstream composition — see patterns/frozen-encoder-multi-task-adaptation:
  • Video-text retrieval: VideoPrism + LiT-style text encoder.
  • Captioning / QA: VideoPrism + PaLM-2 language decoder.
  • Classification / localization: VideoPrism + small task head.
  • Backbone weights are frozen; only the adapter layer is trained per task.
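The factorized spatial-then-temporal attention named in the Backbone bullet can be sketched in a few lines. This is a toy NumPy illustration with identity Q/K/V projections and no learned weights or heads; `factorized_attention` and the clip shape are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # Toy scaled dot-product self-attention with identity Q/K/V
    # projections (a real layer would use learned multi-head weights).
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_attention(tokens):
    # tokens: (T frames, P patches per frame, D dims)
    x = attend(tokens)               # spatial: P x P scores within each frame
    x = attend(x.swapaxes(0, 1))     # temporal: T x T scores per patch position
    return x.swapaxes(0, 1)          # back to (T, P, D)

video = np.random.default_rng(0).normal(size=(8, 196, 64))  # 8 frames of 14x14 patches
out = factorized_attention(video)

# Score-matrix entries: T*P^2 + P*T^2 factorized vs. (T*P)^2 for full
# space-time attention -- 319,872 vs. 2,458,624 at this clip size.
```

The cost comment is the whole point of the factorization: the spatial and temporal passes each attend over a short axis, so neither ever materializes the full (T·P) × (T·P) score matrix.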
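Stage 1's CLIP-style objective is a symmetric InfoNCE loss over a batch of paired video/text embeddings. A minimal NumPy sketch, assuming unit-normalized embeddings and a fixed temperature; `clip_style_loss` is a hypothetical name, and the paper's exact loss details are deferred to the arXiv version.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: matched (video, text) pairs sit on the diagonal
    # of the batch similarity matrix; everything else is a negative.
    v, t = l2norm(video_emb), l2norm(text_emb)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return (xent(logits) + xent(logits.T)) / 2  # video->text and text->video

rng = np.random.default_rng(0)
videos = rng.normal(size=(16, 32))
texts = rng.normal(size=(16, 32))
loss_unaligned = clip_style_loss(videos, texts)
loss_aligned = clip_style_loss(videos, videos)  # perfectly matched pairs
```

Minimizing this pulls each clip's embedding toward its caption and away from the other captions in the batch, which is what yields the shared video-text space that stage 2 then distills from.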

Why it matters architecturally

  • Single serving surface for many downstream video tasks. The frozen-backbone pattern collapses what would otherwise be N per-task video models into one encoder weight set + N light adapters. Per-video-clip embeddings computed once can be cached and reused across retrieval, captioning, classification, etc.
  • Hybrid-corpus shape scales data beyond curation budget. The clean tier (36M) is the practical upper bound of human-curated video-text pairs at publication time; the noisy tier (582M, ~16×) extends coverage without the curation cost. Both contribute signal the other can't. See concepts/hybrid-clean-noisy-training-corpus for the general framing.
  • Two-stage pre-training learns what single-stage can't. Text captions are appearance-heavy ("a red car drives down the street"); stage-1 contrastive learning absorbs that signal. But captions under-describe dynamics — how things move, what changes. Stage-2 MVM on video-only content fills that gap without requiring paired text for the motion signal. Each stage learns signal the other cannot.
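The compute-once, reuse-everywhere point above can be made concrete: one forward pass through the frozen backbone per clip, with every light adapter reading the same cached embedding. All weights and names below are toy placeholders standing in for the frozen VideoPrism backbone and the LiT / PaLM-2 / task-head adapters.

```python
import numpy as np

rng = np.random.default_rng(1)
W_BACKBONE = rng.normal(size=(512, 256))   # frozen encoder weights (toy)

def encode(clip_features):
    # One pass through the frozen backbone; deterministic, so the
    # result can be cached and shared across all downstream tasks.
    return np.tanh(clip_features @ W_BACKBONE)

_cache = {}
def embedding(clip_id, clip_features):
    if clip_id not in _cache:
        _cache[clip_id] = encode(clip_features)
    return _cache[clip_id]

# N light adapters, each trained per task, all reading one embedding.
W_RETRIEVAL = rng.normal(size=(256, 128))  # LiT-style projection (toy)
W_CLASSIFY = rng.normal(size=(256, 10))    # small task head (toy)

feats = rng.normal(size=(512,))
e = embedding("clip-0001", feats)          # backbone runs once
retrieval_vec = e @ W_RETRIEVAL            # retrieval adapter
class_logits = e @ W_CLASSIFY              # classification adapter
```

Adding a new task means training another small projection against cached embeddings; the expensive video encode is never repeated.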

Operational numbers (from the blog)

  • Pre-training corpus: 36M (clean) + 582M (noisy) ≈ 618M clips.
  • Benchmark coverage: SOTA on 30 of 33 video-understanding benchmarks; four categories (classification/localization, retrieval, captioning/QA, scientific).
  • Downstream adaptation: "minimal" adaptation of a single frozen model; adapter parameter counts not disclosed.
  • No serving numbers disclosed — no QPS, latency, fleet size, cost/inference, embedding dimensionality, or cache semantics. The blog is a research-announcement post; serving architecture is entirely absent.

Contrast with CLIP

|                                              | CLIP                                   | VideoPrism                                                  |
| -------------------------------------------- | -------------------------------------- | ----------------------------------------------------------- |
| Modality                                     | Single image + text                    | Video (temporal token stream) + text                        |
| Pre-training stages                          | Single-stage contrastive               | Two-stage: contrastive + masked-video modeling              |
| Pre-training corpus                          | ~400M (image, text) pairs (public web) | 36M clean + 582M noisy video-text pairs                     |
| Backbone                                     | ResNet / ViT                           | ViT with ViViT factorized attention                         |
| Shared embedding space                       | Text + image                           | Video + text (via stage-1 contrastive)                      |
| Typical adapter                              | Text encoder; small classifier heads   | Text encoder (LiT); language decoder (PaLM-2); small heads  |
| Open weights                                 | Yes                                    | No public weights at blog publication                       |
| Canonical production deployment in this wiki | systems/figma-ai-search                | None observed                                               |

Caveats

  • Not observed deployed in this wiki. No ingested source so far describes a production system that consumes VideoPrism inference. Unlike CLIP (systems/figma-ai-search serves it in production), VideoPrism as of its blog announcement is a research artifact.
  • "Surpasses domain experts" on scientific datasets — Fly vs. Fly, CalMS21, ChimpACT, KABR. Benchmark claim, not deployment claim. Real-world adoption by those domains not discussed.
  • Private-data component. The 36M clean tier is "several public and private datasets"; the private portion is not enumerated (likely Google-internal YouTube-derived corpora). Model not reproducible from the blog alone.
  • No architectural / loss-function depth in the blog — factored attention details, MVM loss formulation, shuffle mechanics, mask ratios, adapter shapes are deferred to the arXiv paper. Wiki treatment tracks the blog, not the paper.

Seen in

  • sources/2024-05-09-google-videoprism-foundational-visual-encoder — VideoPrism's announcement post; names the two-stage pipeline, the 36M + 582M corpus split, the ViT/ViViT backbone, and the frozen-encoder + downstream-adapter serving pattern. 30/33 benchmark claim. Predecessor models named: VideoCLIP, InternVideo, VideoCoCa, UMT.