CONCEPT

Hybrid clean/noisy training corpus

Definition

A hybrid clean/noisy training corpus is a pre-training dataset deliberately composed of two tiers:

  1. A smaller, high-quality tier — human-curated, expert-labeled, or professionally-captioned examples. Provides accurate alignment signal at the cost of corpus size (curation is the bottleneck).
  2. A larger, noisy tier — machine-generated, auto-transcribed, weakly-labeled, or crawled-with-imperfect-captions examples. Provides coverage and scale at the cost of per-example alignment quality.

The design bet is that both tiers contribute signal the other cannot: the clean tier teaches correct alignment; the noisy tier provides the coverage that any realistically-sized curated dataset cannot.
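The two-tier composition can be sketched as a batch sampler that deliberately oversamples the small curated tier. This is a minimal illustration only: the `clean_fraction` knob and the sampling scheme are assumptions for exposition, not details reported for VideoPrism.

```python
import random

def make_hybrid_batch(clean, noisy, batch_size, clean_fraction=0.2, rng=None):
    """Draw one training batch from a two-tier corpus.

    `clean` and `noisy` are lists of examples (plain strings here).
    `clean_fraction` oversamples the curated tier relative to its share
    of the raw corpus -- a hypothetical knob, not a published detail.
    """
    rng = rng or random.Random(0)
    n_clean = round(batch_size * clean_fraction)
    batch = [("clean", rng.choice(clean)) for _ in range(n_clean)]
    batch += [("noisy", rng.choice(noisy)) for _ in range(batch_size - n_clean)]
    rng.shuffle(batch)
    return batch

# Stand-ins for the 36M curated pairs and 582M noisy clips.
clean_tier = [f"curated-{i}" for i in range(36)]
noisy_tier = [f"crawled-{i}" for i in range(582)]
batch = make_hybrid_batch(clean_tier, noisy_tier, batch_size=32)
```

Sampling-time mixing like this keeps the clean tier from being drowned out by a 16× larger noisy tier, at the cost of repeating clean examples more often per epoch.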

Canonical instance — VideoPrism

Google's systems/videoprism pre-training corpus:

  • 36M high-quality video-text pairs (curated captions).
  • 582M clips with noisy or machine-generated parallel text (YT-Temporal-180M, InternVid, VideoCC, WTS-70M, etc.).
  • ~618M total. The noisy tier is ~16× larger than the clean tier.

(Source: sources/2024-05-09-google-videoprism-foundational-visual-encoder)

Author framing:

"Naturally most of these videos do not have perfect captions or descriptions, even imperfect text can provide useful information about the semantic content of the video... This includes 36 million carefully selected videos with high-quality captions, along with an additional 582 million clips with varying levels of noisy text (like auto-generated transcripts)."

The CLIP similarity score histogram across the VideoPrism corpus is explicitly flagged in the blog as showing "large variations... a byproduct of the various ways used to harvest the text." The variation is evidence of (not a bug in) the hybrid construction.

Why this shape shows up

Curation is expensive and does not scale the way compute does. Annotating a million high-quality captions requires humans; annotating a billion requires fundamentally different economics. Meanwhile:

  • Model capacity keeps growing. Billion-parameter foundation models have the capacity to absorb more data than any affordable curation budget can produce.
  • The long tail is where generalization lives. A clean tier covers the head of the distribution (popular concepts, common scenes); the noisy tier pulls in the tail (rare activities, niche subjects, multilingual transcripts).
  • Noise is tolerable in the right objective. Contrastive objectives are especially robust to noisy pairs — a wrong caption contributes a weakly-wrong gradient, not a catastrophically-wrong one. The aggregate still improves the model.
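The noise-tolerance claim can be made concrete with a minimal numpy sketch of a symmetric contrastive (InfoNCE-style) objective over a batch of paired embeddings. A mislabeled pair only corrupts its own row and column of the similarity matrix; the other in-batch pairings still supply correct signal. Function names are illustrative, not VideoPrism's implementation.

```python
import numpy as np

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: the diagonal of the (B, B) similarity
    matrix holds the 'correct pairing' targets; everything else serves as
    in-batch negatives. One wrong caption shifts one row/column of targets,
    not the whole gradient -- the weakly-wrong-gradient property above."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    v2t = -np.diag(_log_softmax(logits, axis=1)).mean()  # video -> text
    t2v = -np.diag(_log_softmax(logits, axis=0)).mean()  # text -> video
    return (v2t + t2v) / 2
```

As a sanity check, correctly paired embeddings should score a lower loss than the same embeddings with the pairing scrambled.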

Design decisions

  • Tier ratio. VideoPrism runs at ~16:1 noisy-to-clean. CLIP's ~400M pairs are often described as one undifferentiated "noisy" tier. There is no canonical ratio; it depends on (a) how much clean data exists at all, (b) the quality floor below which noisy data harms rather than helps, and (c) the objective's noise tolerance.
  • Objective design that tolerates the noisy tier. VideoPrism trains stage 1 (contrastive) on the clean and noisy tiers together, since the contrastive objective is noise-tolerant; stage 2 (masked video modeling) learns from the videos themselves, not the text, so text noise is irrelevant there. See patterns/two-stage-pretraining-contrastive-then-masked.
  • Implicit tier signaling. Some approaches weight the loss per example by a proxy for tier (e.g. CLIP-score thresholds, source dataset confidence); VideoPrism does not describe per-tier weighting in the blog.
  • Mixing at training time vs pre-filtering. VideoPrism mixes in training; a pre-filtering approach would discard the noisy tier below a similarity threshold. Mixing is the higher-capacity approach when the objective is noise-tolerant.
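The last two bullets describe two different ways of handling a quality proxy such as a CLIP score. A hedged sketch of both, with made-up scores (neither scheme is described in the VideoPrism blog; the threshold and floor values are arbitrary):

```python
def prefilter(examples, score, threshold=0.2):
    """Pre-filtering: discard noisy examples below a similarity threshold
    (e.g. a CLIP score). Data below the floor never reaches the model."""
    return [ex for ex in examples if score(ex) >= threshold]

def soft_weight(examples, score, floor=0.05):
    """Implicit tier signaling: keep everything, but attach a per-example
    loss weight so low-similarity pairs contribute less, not nothing."""
    return [(ex, max(score(ex), floor)) for ex in examples]

# Hypothetical CLIP-similarity scores for three video-text pairs.
clip_scores = {"good pair": 0.8, "weak pair": 0.3, "junk pair": 0.1}
kept = prefilter(clip_scores, clip_scores.get)        # drops "junk pair"
weighted = soft_weight(clip_scores, clip_scores.get)  # keeps all three
```

Pre-filtering trades coverage for a guaranteed quality floor; soft weighting preserves the tail and lets the objective's noise tolerance do the filtering implicitly.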

Contrast with single-tier approaches

|                       | Single-tier clean                | Single-tier noisy              | Hybrid clean + noisy                                            |
| --------------------- | -------------------------------- | ------------------------------ | --------------------------------------------------------------- |
| Corpus size           | Bounded by curation budget       | Bounded only by crawl coverage | Clean tier bounded, noisy tier unbounded; large total           |
| Alignment quality     | High per-example                 | Low per-example                | Bimodal per-example; aggregate sufficient for robust objectives |
| Coverage              | Narrow                           | Broad                          | Broad with a high-quality core                                  |
| Objective requirement | Any                              | Must tolerate noise            | Must tolerate noise on the large tier                           |
| Canonical example     | Ground-truth supervised datasets | Raw web-scrape pre-training    | VideoPrism, CLIP, most modern FMs                               |

When this fits

  • Foundation-model pre-training where corpus size dominates model quality and curated data is the binding constraint.
  • Contrastive or self-supervised objectives that are robust to per-example label noise.
  • Multimodal alignment specifically, where paired data is the rare signal and either side can be noisy (text captions of images / video).

When it doesn't fit

  • Fine-tuning / task-specific supervised learning — noisy labels meaningfully degrade final-task accuracy; use the clean tier only.
  • High-stakes deployment where downstream decisions depend on rare-tail accuracy — noisy-tier-driven rare-tail coverage may be wrong in domain-specific ways. Evaluate separately.
  • Small models — capacity is the bottleneck, not data; a clean tier alone saturates learning.
