PATTERN

Two-stage pre-training — contrastive then masked

Intent

Train a visual foundation-model encoder in two sequential stages to learn complementary signals that a single-stage pre-training regime cannot acquire together:

  1. Stage 1: cross-modal contrastive learning over paired video/image ↔ text data. Teaches the encoder to align its visual output with the semantic content of captions — effectively learning appearance features (what things look like), since captions are appearance-heavy.
  2. Stage 2: masked modeling over the visual data alone (captions discarded). Teaches the encoder to reconstruct masked patches — effectively learning motion/dynamics features, which captions under-describe.

The pattern is motivated by the observation that text captions and visual pixels encode complementary information. A single objective cannot extract both — contrastive learning captures what captions describe, masked modeling captures what they omit — but two stages run in sequence can.

Canonical instance — VideoPrism

systems/videoprism is the canonical in-wiki realization.

Stage 1 — video-text contrastive learning:

  • Objective: CLIP-style contrastive loss (positive pairs close, negatives far) between video embeddings and text embeddings.
  • Corpus: full 618M-clip hybrid — 36M clean pairs + 582M noisy pairs (concepts/hybrid-clean-noisy-training-corpus).
  • Outcome: a shared video-text embedding space; the encoder learns appearance features that match caption semantics.
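The stage-1 objective is a symmetric InfoNCE loss: each video's embedding is pulled toward its own caption and pushed from every other caption in the batch. A minimal NumPy sketch — toy embeddings standing in for the real video/text towers, and the temperature value is illustrative, not VideoPrism's:

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings.

    video_emb, text_emb: (batch, dim) arrays; row i of each is a positive
    pair, and all other rows in the batch serve as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # positives sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video→text and text→video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned pairs drive the loss toward zero; random pairings leave it near log(batch size).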

Stage 2 — masked-video modeling (MVM):

  • Objective: predict masked patches of video-only input, with two tweaks over vanilla MVM:
      • Predict both the video-level global embedding and the token-wise embeddings from the stage-1 model. This is effectively distillation of stage-1 knowledge into stage 2: stage 2 is initialized from stage-1's weights, and the MVM loss is computed against stage-1's predictions rather than a fresh random target, so stage 2 cannot overwrite what stage 1 learned.
      • Randomly shuffle the predicted tokens post-hoc. Prevents the model from learning spatial/temporal order as a shortcut for reconstruction.
  • Corpus: video clips from the stage-1 corpus without their text captions.
  • Outcome: motion / temporal-dynamics features layered on top of stage-1's appearance features, without forgetting stage-1.
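The stage-2 objective above amounts to regression against frozen stage-1 outputs at both granularities. A toy NumPy sketch — mean-pooling stands in for whatever global aggregation VideoPrism actually uses, and MSE is an assumed loss form; neither is confirmed by the blog:

```python
import numpy as np

def mvm_distillation_loss(student_tokens, teacher_tokens, global_weight=1.0):
    """Stage-2 MVM target: match frozen stage-1 outputs, not raw pixels.

    student_tokens: (num_tokens, dim) stage-2 predictions at masked positions.
    teacher_tokens: (num_tokens, dim) stage-1 encoder outputs, same positions.
    Combines a token-wise term with a video-level (global) term.
    """
    token_loss = np.mean((student_tokens - teacher_tokens) ** 2)
    # Mean-pool as a simple stand-in for the video-level global embedding.
    student_global = student_tokens.mean(axis=0)
    teacher_global = teacher_tokens.mean(axis=0)
    global_loss = np.mean((student_global - teacher_global) ** 2)
    return token_loss + global_weight * global_loss
```

Because the target is the teacher's output rather than pixels, a student that drifts away from stage-1's feature space is penalized directly.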

Mechanism — why both stages are needed

Stage 1 alone: Text captions encode "what things look like" — nouns, adjectives, scene descriptions. Contrastive training on (video, caption) pairs teaches the encoder to pack appearance into its features. But captions are under-specified about dynamics: a caption rarely describes how things move, the temporal order of events, or the fine-grained motion that distinguishes similar-looking actions.

Stage 2 alone: Masked-video modeling reconstructs pixels from surrounding pixels. It learns statistical regularities of video but has no semantic grounding — no concept of what the masked region represents. Features tend to cluster by visual appearance without alignment to language.

Stages 1 → 2 together: Stage 1 grounds the feature space semantically. Stage 2 refines it along the temporal axis without losing that grounding, specifically because the MVM target is stage-1's own predictions rather than a fresh objective. Each stage fixes what the other can't address.

Anti-shortcut mechanics

VideoPrism's blog explicitly calls out that naive two-stage MVM has a shortcut-learning risk: the model can learn to use spatial or temporal order as a cheap reconstruction signal rather than learning the content.

Token shuffling fixes this: the predicted tokens are randomly permuted before the MVM loss is applied, so position-based prediction breaks. The model must actually reconstruct content.
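A minimal sketch of the mechanic, assuming a generic decoder function and an MSE reconstruction loss — the exact placement of the shuffle in VideoPrism's pipeline is detailed in the paper, not the blog:

```python
import numpy as np

def decode_with_shuffled_tokens(encoder_tokens, targets, decode_fn, rng):
    """Apply the reconstruction loss under a random token permutation.

    The decoder sees tokens in a random order, and the targets are permuted
    identically, so fixed spatial/temporal position carries no signal —
    only token content can drive the reconstruction.
    """
    perm = rng.permutation(len(encoder_tokens))
    preds = decode_fn(encoder_tokens[perm])
    return np.mean((preds - targets[perm]) ** 2)
```

A position-based shortcut (e.g. "token 7 always looks like its neighbors at positions 6 and 8") breaks under the permutation, while a content-based reconstruction is unaffected.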

This is an instance of a broader pattern — self-supervised objectives need anti-shortcut mechanics because any regularity the model can exploit cheaply, it will, regardless of whether the regularity is useful.

Scheduling and compute

  • Stage 1 dominates training compute — runs over the full hybrid corpus (618M in VideoPrism) once.
  • Stage 2 is shorter — runs over the video-only portion, initialized from stage-1 weights, not training from scratch.
  • Checkpoint handoff. Stage 2 loads stage-1's final checkpoint as initialization and uses stage-1's forward-pass outputs as MVM targets. Stage 1 is frozen throughout stage 2 and runs only in inference, to produce those targets.
  • Curriculum alternatives (alternating contrastive + MVM minibatches in a single run) exist in the literature but are not what VideoPrism does.
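The checkpoint handoff can be illustrated with a toy linear "encoder" — hypothetical wiring, not VideoPrism's actual training code:

```python
import copy
import numpy as np

class Encoder:
    """Toy encoder: one linear map, standing in for a video transformer."""
    def __init__(self, dim=4, seed=0):
        self.w = np.random.default_rng(seed).normal(size=(dim, dim))
    def __call__(self, x):
        return x @ self.w

stage1 = Encoder()                 # result of stage-1 contrastive training
teacher = copy.deepcopy(stage1)    # frozen: supplies MVM targets via forward passes
student = copy.deepcopy(stage1)    # stage-2 model, initialized from stage-1 weights

x = np.ones((2, 4))
target = teacher(x)                # stage-1 prediction = MVM regression target
student.w += 0.1                   # stage-2 training updates only the student
assert np.allclose(teacher(x), target)   # teacher outputs never drift
```

Two separate copies make the freeze explicit: gradient updates touch only `student`, so the regression target stays anchored to stage-1's feature space.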

Contrast with single-stage approaches

|  | Stage-1 contrastive only (CLIP-style) | Stage-2 masked only (MAE-style) | Two-stage (VideoPrism) |
| --- | --- | --- | --- |
| Appearance alignment to text | Strong | None | Strong (inherited from stage 1) |
| Temporal / motion features | Weak | Strong | Strong (learned in stage 2) |
| Requires paired text data | Yes (all of it) | No | Yes (stage 1 only; stage 2 doesn't need it) |
| Data efficiency | Bottlenecked by paired-data size | Bottlenecked by video-data size | Uses both; more flexible |
| Canonical wiki instance | systems/clip-embedding-model | (none ingested) | systems/videoprism |

Trade-offs

  • More training compute. Two stages end-to-end cost more than one, even though stage 2 is shorter.
  • More complex pipeline. Checkpoint handoff, objective switching, anti-shortcut mechanics, MVM target from stage-1's forward pass — more moving parts than a single loss.
  • Stage-1 quality is load-bearing. Stage-2 distills stage-1; if stage-1 is weak, stage-2 cannot recover. Motivates investing in corpus quality (the concepts/hybrid-clean-noisy-training-corpus) and the contrastive loss details.
  • Under-disclosed mechanics. Mask ratios, shuffle schedules, and the loss weighting between the global and token-wise MVM targets are critical knobs; the blog omits them, so they must be pulled from the arXiv paper.

When this fits

  • Video or other temporal-visual modalities where captions under-describe dynamics.
  • Foundation models where the goal is a general frozen encoder for many downstream tasks (see patterns/frozen-encoder-multi-task-adaptation), not task-specific fine-tuning.
  • Large-scale pre-training budgets where two-stage compute is affordable.

When it doesn't fit

  • Single-image modalities where stage-2 buys less — CLIP-style single-stage contrastive is enough because appearance is the primary signal in images.
  • Small budgets — two stages double the curriculum complexity with no guaranteed payoff under limited compute.
  • Workloads where the downstream objective is well-known at pre-training time — fine-tune a simpler model directly.

Seen in

  • sources/2024-05-09-google-videoprism-foundational-visual-encoder — VideoPrism names the two-stage pipeline explicitly, flags both the global + token-wise distillation target and the token-shuffle anti-shortcut mechanic. Motivates the split via "text descriptions often focus on what things look like, while the video content provides information about movement and visual dynamics."