PATTERN Cited by 1 source
Two-stage pre-training — contrastive then masked¶
Intent¶
Train a visual foundation-model encoder in two sequential stages to learn complementary signals that a single-stage pre-training regime cannot acquire together:
- Stage 1: cross-modal contrastive learning over paired video/image ↔ text data. Teaches the encoder to align its visual output with the semantic content of captions — effectively learning appearance features (what things look like), since captions are appearance-heavy.
- Stage 2: masked modeling over the visual data alone (captions discarded). Teaches the encoder to reconstruct masked patches — effectively learning motion/dynamics features, which captions under-describe.
The pattern is motivated by the observation that text captions and visual pixels encode complementary information. No single objective can extract both from a dataset that only has one modality paired with the other; two stages in sequence can.
Canonical instance — VideoPrism¶
systems/videoprism is the canonical in-wiki realization.
Stage 1 — video-text contrastive learning:
- Objective: CLIP-style contrastive loss (positive pairs close, negatives far) between video embeddings and text embeddings.
- Corpus: full 618M-clip hybrid — 36M clean pairs + 582M noisy pairs (concepts/hybrid-clean-noisy-training-corpus).
- Outcome: a shared video-text embedding space; the encoder learns appearance features that match caption semantics.
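The stage-1 objective above can be sketched as a symmetric InfoNCE loss. This is a minimal numpy illustration of the CLIP-style loss family the page names, not VideoPrism's actual implementation; the function name, temperature value, and batch layout are assumptions.

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a positive
    pair, and every other row in the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = v @ t.T / temperature   # (batch, batch); positives on the diagonal
    n = len(logits)

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the video->text and text->video directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Pulling positives together and pushing negatives apart in this shared space is what forces the encoder's visual output to match caption semantics.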
Stage 2 — masked-video modeling (MVM):
- Objective: predict masked patches of video-only input. Two tweaks over vanilla MVM:
- Predict both the video-level global embedding and the token-wise embeddings from the stage-1 model. This is effectively online distillation of stage-1 knowledge into stage-2: stage-2 is initialized from stage-1's weights, and the MVM loss is computed against stage-1's outputs rather than raw pixels, so stage-2 cannot simply overwrite what stage-1 learned.
- Randomly shuffle the predicted tokens post-hoc. Prevents the model from learning spatial/temporal order as a shortcut for reconstruction.
- Corpus: video clips from the stage-1 corpus without their text captions.
- Outcome: motion / temporal-dynamics features layered on top of stage-1's appearance features, without forgetting stage-1.
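The two stage-2 tweaks can be sketched together as a loss over masked positions. This is an illustrative simplification under stated assumptions: the function names and the `global_weight` knob are hypothetical, the exact shuffle placement and loss weighting are not disclosed in the blog, and here the permutation is applied jointly to predictions and targets purely to show the bookkeeping.

```python
import numpy as np

def stage2_mvm_loss(student_tokens, student_global,
                    teacher_tokens, teacher_global,
                    mask, rng, global_weight=0.5):
    """Sketch of the stage-2 objective: regress the frozen stage-1
    (teacher) embeddings instead of raw pixels.

    student_tokens / teacher_tokens: (num_tokens, dim) token-wise embeddings.
    student_global / teacher_global: (dim,) video-level embeddings.
    mask: boolean (num_tokens,) marking patches hidden from the student.
    """
    idx = np.flatnonzero(mask)
    perm = rng.permutation(len(idx))     # token shuffle: break positional order
    pred = student_tokens[idx][perm]
    target = teacher_tokens[idx][perm]   # shuffled identically to keep pairing

    # Token-wise distillation term: match stage-1's per-token embeddings.
    token_loss = np.mean((pred - target) ** 2)
    # Global distillation term: match stage-1's video-level embedding.
    global_loss = np.mean((student_global - teacher_global) ** 2)
    return (1 - global_weight) * token_loss + global_weight * global_loss
```

Because the target is the stage-1 model's own output, driving this loss to zero preserves stage-1's appearance features by construction.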
Mechanism — why both stages are needed¶
Stage 1 alone: Text captions encode "what things look like" — nouns, adjectives, scene descriptions. Contrastive training on (video, caption) pairs teaches the encoder to pack appearance into its features. But captions are under-specified about dynamics: a caption rarely describes how things move, the temporal order of events, or the fine-grained motion that distinguishes similar-looking actions.
Stage 2 alone: Masked-video modeling reconstructs pixels from surrounding pixels. It learns statistical regularities of video but has no semantic grounding — no concept of what the masked region represents. Features tend to cluster by visual appearance without alignment to language.
Stages 1 → 2 together: Stage 1 grounds the feature space semantically. Stage 2 refines it along the temporal axis without losing that grounding, specifically because the MVM target is stage-1's own predictions rather than a fresh objective. Each stage fixes what the other can't address.
Anti-shortcut mechanics¶
VideoPrism's blog explicitly calls out that naive two-stage MVM has a shortcut-learning risk: the model can learn to use spatial or temporal order as a cheap reconstruction signal rather than learning the content.
Token shuffling fixes this: the predicted tokens are randomly permuted before the MVM loss is applied, so position-based prediction breaks. The model must actually reconstruct content.
This is an instance of a broader pattern — self-supervised objectives need anti-shortcut mechanics because any regularity the model can exploit cheaply, it will, regardless of whether the regularity is useful.
Scheduling and compute¶
- Stage 1 dominates training compute — runs over the full hybrid corpus (618M in VideoPrism) once.
- Stage 2 is shorter — runs over the video-only portion, initialized from stage-1 weights, not training from scratch.
- Checkpoint handoff. Stage 2 loads stage-1's final checkpoint as initialization and uses stage-1's forward-pass outputs as MVM targets. The stage-1 model is effectively frozen during stage 2, used only for the forward passes that produce those targets.
- Curriculum alternatives (alternating contrastive + MVM minibatches in a single run) exist in the literature but are not what VideoPrism does.
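The schedule above reduces to a simple driver: train stage 1 on paired batches, snapshot the weights as a frozen teacher, then train stage 2 on video-only batches. This skeleton is a sketch; the function and parameter names are illustrative, and the per-step objectives are assumed to be supplied by the caller.

```python
import copy

def two_stage_pretrain(encoder, paired_batches, video_batches,
                       contrastive_step, mvm_step):
    """Skeleton of the two-stage schedule (names are hypothetical).

    Stage 1 trains on (video, caption) pairs; stage 2 initializes from
    the stage-1 checkpoint, keeps a frozen copy as the MVM target model,
    and trains on video-only batches.
    """
    # Stage 1: contrastive pre-training over the full paired corpus.
    for video, caption in paired_batches:
        contrastive_step(encoder, video, caption)

    # Checkpoint handoff: the frozen teacher is the final stage-1 model.
    teacher = copy.deepcopy(encoder)  # forward passes only during stage 2

    # Stage 2: masked modeling on video alone, distilling the teacher.
    for video in video_batches:
        mvm_step(student=encoder, teacher=teacher, video=video)

    return encoder
```

Note that the teacher is a deep copy, so stage-2 updates to the student cannot leak into the distillation targets.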
Contrast with single-stage approaches¶
| | Stage-1 contrastive only (CLIP-style) | Stage-2 masked only (MAE-style) | Two-stage (VideoPrism) |
|---|---|---|---|
| Appearance alignment to text | Strong | None | Strong (inherited from stage 1) |
| Temporal / motion features | Weak | Strong | Strong (learned in stage 2) |
| Requires paired text data | Yes (all of it) | No | Yes (stage 1 only; stage 2 doesn't need it) |
| Data efficiency | Bottlenecked by paired-data size | Bottlenecked by video-data size | Uses both; more flexible |
| Canonical wiki instance | systems/clip-embedding-model | (none ingested) | systems/videoprism |
Trade-offs¶
- More training compute. Two stages end-to-end cost more than one, even with stage-2 shorter.
- More complex pipeline. Checkpoint handoff, objective switching, anti-shortcut mechanics, MVM target from stage-1's forward pass — more moving parts than a single loss.
- Stage-1 quality is load-bearing. Stage 2 distills stage 1; if stage 1 is weak, stage 2 cannot recover. This motivates investing in corpus quality (concepts/hybrid-clean-noisy-training-corpus) and in the contrastive-loss details.
- Under-disclosed mechanics. Mask ratios, shuffle schedules, and the loss weighting between global and token-wise MVM targets are critical knobs that the blog omits; they appear only in the arXiv paper.
When this fits¶
- Video or other temporal-visual modalities where captions under-describe dynamics.
- Foundation models where the goal is a general frozen encoder for many downstream tasks (see patterns/frozen-encoder-multi-task-adaptation), not task-specific fine-tuning.
- Large-scale pre-training budgets where two-stage compute is affordable.
When it doesn't fit¶
- Single-image modalities where stage-2 buys less — CLIP-style single-stage contrastive is enough because appearance is the primary signal in images.
- Small budgets — two stages double the curriculum complexity with no guaranteed payoff under limited compute.
- Workloads where the downstream objective is well-known at pre-training time — fine-tune a simpler model directly.
Seen in¶
- sources/2024-05-09-google-videoprism-foundational-visual-encoder — VideoPrism names the two-stage pipeline explicitly, flags both the global + token-wise distillation target and the token-shuffle anti-shortcut mechanic. Motivates the split via "text descriptions often focus on what things look like, while the video content provides information about movement and visual dynamics."
Related¶
- systems/videoprism — canonical instance.
- systems/clip-embedding-model — single-stage contrastive counterpart on single images.
- concepts/hybrid-clean-noisy-training-corpus — the corpus shape stage-1 typically consumes.
- patterns/frozen-encoder-multi-task-adaptation — the serving pattern that consumes the two-stage-trained encoder.