PATTERN

Two-stage pre-training — contrastive then masked

Intent

Train a visual foundation-model encoder in two sequential stages to learn complementary signals that a single-stage pre-training regime cannot acquire together:

  1. Stage 1: cross-modal contrastive learning over paired video/image ↔ text data. Teaches the encoder to align its visual output with the semantic content of captions — effectively learning appearance features (what things look like), since captions are appearance-heavy.
  2. Stage 2: masked modeling over the visual data alone (captions discarded). Teaches the encoder to reconstruct masked patches — effectively learning motion/dynamics features, which captions under-describe.

The pattern is motivated by the observation that text captions and visual pixels encode complementary information. A single objective cannot extract both — contrastive learning captures what captions describe, masked modeling captures what they omit — but two stages run in sequence can.

Canonical instance — VideoPrism

systems/videoprism is the canonical in-wiki realization.

Stage 1 — video-text contrastive learning:

  • Objective: CLIP-style contrastive loss (positive pairs close, negatives far) between video embeddings and text embeddings.
  • Corpus: full 618M-clip hybrid — 36M clean pairs + 582M noisy pairs (concepts/hybrid-clean-noisy-training-corpus).
  • Outcome: a shared video-text embedding space; the encoder learns appearance features that match caption semantics.
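The stage-1 objective is a symmetric InfoNCE loss: each video's embedding is pulled toward its own caption and pushed from every other caption in the batch. A minimal NumPy sketch — toy embeddings standing in for the real video/text towers, and the temperature value is illustrative, not VideoPrism's:

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a positive
    pair, and all other rows in the batch serve as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # positives sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video→text and text→video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs drive the loss toward zero; random pairings leave it near log(batch size).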

Stage 2 — masked-video modeling (MVM):

  • Objective: predict masked patches of video-only input, with two tweaks over vanilla MVM:
      • Predict both the video-level global embedding and the token-wise embeddings from the stage-1 model. This is effectively distillation of stage-1 knowledge into stage 2: stage 2 is initialized from stage-1's weights, and the MVM loss is computed against stage-1's predictions rather than a fresh random target, so stage 2 cannot overwrite what stage 1 learned.
      • Randomly shuffle the predicted tokens post-hoc. Prevents the model from learning spatial/temporal order as a shortcut for reconstruction.
  • Corpus: video clips from the stage-1 corpus without their text captions.
  • Outcome: motion / temporal-dynamics features layered on top of stage-1's appearance features, without forgetting stage-1.
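The stage-2 objective above amounts to regression against frozen stage-1 outputs at both granularities. A toy NumPy sketch — mean-pooling stands in for whatever global aggregation VideoPrism actually uses, and MSE is an assumed loss form; neither is confirmed by the blog:

```python
import numpy as np

def mvm_distillation_loss(student_tokens, teacher_tokens, global_weight=1.0):
    """Stage-2 MVM target: match frozen stage-1 outputs, not raw pixels.

    student_tokens: (num_tokens, dim) stage-2 predictions at masked positions.
    teacher_tokens: (num_tokens, dim) stage-1 encoder outputs, same positions.
    Combines a token-wise term with a video-level (global) term.
    """
    token_loss = np.mean((student_tokens - teacher_tokens) ** 2)
    # Mean-pool as a simple stand-in for the video-level global embedding.
    student_global = student_tokens.mean(axis=0)
    teacher_global = teacher_tokens.mean(axis=0)
    global_loss = np.mean((student_global - teacher_global) ** 2)
    return token_loss + global_weight * global_loss
```

Because the target is the teacher's output rather than pixels, a student that drifts away from stage-1's feature space is penalized directly.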

Mechanism — why both stages are needed

Stage 1 alone: Text captions encode "what things look like" — nouns, adjectives, scene descriptions. Contrastive training on (video, caption) pairs teaches the encoder to pack appearance into its features. But captions are under-specified about dynamics: a caption rarely describes how things move, the temporal order of events, or the fine-grained motion that distinguishes similar-looking actions.

Stage 2 alone: Masked-video modeling reconstructs pixels from surrounding pixels. It learns statistical regularities of video but has no semantic grounding — no concept of what the masked region represents. Features tend to cluster by visual appearance without alignment to language.

Stages 1 → 2 together: Stage 1 grounds the feature space semantically. Stage 2 refines it along the temporal axis without losing that grounding, specifically because the MVM target is stage-1's own predictions rather than a fresh objective. Each stage fixes what the other can't address.

Anti-shortcut mechanics

VideoPrism's blog explicitly calls out that naive two-stage MVM has a shortcut-learning risk: the model can learn to use spatial or temporal order as a cheap reconstruction signal rather than learning the content.

Token shuffling fixes this: the predicted tokens are randomly permuted before the MVM loss is applied, so position-based prediction breaks. The model must actually reconstruct content.
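A minimal sketch of the mechanic, assuming a generic decoder function and an MSE reconstruction loss — the exact placement of the shuffle in VideoPrism's pipeline is detailed in the paper, not the blog:

```python
import numpy as np

def decode_with_shuffled_tokens(encoder_tokens, targets, decode_fn, rng):
    """Apply the reconstruction loss under a random token permutation.

    The decoder sees tokens in a random order, and the targets are permuted
    identically, so fixed spatial/temporal position carries no signal —
    only token content can drive the reconstruction.
    """
    perm = rng.permutation(len(encoder_tokens))
    preds = decode_fn(encoder_tokens[perm])
    return np.mean((preds - targets[perm]) ** 2)
```

A position-based shortcut (e.g. "token 7 always looks like its neighbors at positions 6 and 8") breaks under the permutation, while a content-based reconstruction is unaffected.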

This is an instance of a broader pattern — self-supervised objectives need anti-shortcut mechanics because any regularity the model can exploit cheaply, it will, regardless of whether the regularity is useful.

Scheduling and compute

  • Stage 1 dominates training compute — runs over the full hybrid corpus (618M in VideoPrism) once.
  • Stage 2 is shorter — runs over the video-only portion, initialized from stage-1 weights, not training from scratch.
  • Checkpoint handoff. Stage 2 loads stage-1's final checkpoint as initialization and uses stage-1's forward-pass outputs as MVM targets. Stage 1 is frozen throughout stage 2 and runs only in inference, to produce those targets.
  • Curriculum alternatives (alternating contrastive + MVM minibatches in a single run) exist in the literature but are not what VideoPrism does.
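The checkpoint handoff can be illustrated with a toy linear "encoder" — hypothetical wiring, not VideoPrism's actual training code:

```python
import copy
import numpy as np

class Encoder:
    """Toy encoder: one linear map, standing in for a video transformer."""
    def __init__(self, dim=4, seed=0):
        self.w = np.random.default_rng(seed).normal(size=(dim, dim))
    def __call__(self, x):
        return x @ self.w

stage1 = Encoder()                 # result of stage-1 contrastive training
teacher = copy.deepcopy(stage1)    # frozen: supplies MVM targets via forward passes
student = copy.deepcopy(stage1)    # stage-2 model, initialized from stage-1 weights

x = np.ones((2, 4))
target = teacher(x)                # stage-1 prediction = MVM regression target
student.w += 0.1                   # stage-2 training updates only the student
assert np.allclose(teacher(x), target)   # teacher outputs never drift
```

Two separate copies make the freeze explicit: gradient updates touch only `student`, so the regression target stays anchored to stage-1's feature space.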

Contrast with single-stage approaches

|  | Stage-1 contrastive only (CLIP-style) | Stage-2 masked only (MAE-style) | Two-stage (VideoPrism) |
| --- | --- | --- | --- |
| Appearance alignment to text | Strong | None | Strong (inherited from stage 1) |
| Temporal / motion features | Weak | Strong | Strong (learned in stage 2) |
| Requires paired text data | Yes (all of it) | No | Yes (stage 1 only; stage 2 doesn't need it) |
| Data efficiency | Bottlenecked by paired-data size | Bottlenecked by video-data size | Uses both; more flexible |
| Canonical wiki instance | systems/clip-embedding-model | (none ingested) | systems/videoprism |

Trade-offs

  • More training compute. Two stages end-to-end cost more than one, even though stage 2 is shorter.
  • More complex pipeline. Checkpoint handoff, objective switching, anti-shortcut mechanics, MVM target from stage-1's forward pass — more moving parts than a single loss.
  • Stage-1 quality is load-bearing. Stage-2 distills stage-1; if stage-1 is weak, stage-2 cannot recover. Motivates investing in corpus quality (the concepts/hybrid-clean-noisy-training-corpus) and the contrastive loss details.
  • Under-disclosed mechanics. Mask ratios, shuffle schedules, and the loss weighting between the global and token-wise MVM targets are critical knobs; the blog omits them, so they must be pulled from the arXiv paper.

When this fits

  • Video or other temporal-visual modalities where captions under-describe dynamics.
  • Foundation models where the goal is a general frozen encoder for many downstream tasks (see patterns/frozen-encoder-multi-task-adaptation), not task-specific fine-tuning.
  • Large-scale pre-training budgets where two-stage compute is affordable.

When it doesn't fit

  • Single-image modalities where stage-2 buys less — CLIP-style single-stage contrastive is enough because appearance is the primary signal in images.
  • Small budgets — two stages double the curriculum complexity with no guaranteed payoff under limited compute.
  • Workloads where the downstream objective is well-known at pre-training time — fine-tune a simpler model directly.

Seen in

  • sources/2024-05-09-google-videoprism-foundational-visual-encoder — VideoPrism names the two-stage pipeline explicitly, flags both the global + token-wise distillation target and the token-shuffle anti-shortcut mechanic. Motivates the split via "text descriptions often focus on what things look like, while the video content provides information about movement and visual dynamics."