CONCEPT

Hybrid clean/noisy training corpus

Definition

A hybrid clean/noisy training corpus is a pre-training dataset deliberately composed of two tiers:

  1. A smaller, high-quality tier — human-curated, expert-labeled, or professionally-captioned examples. Provides accurate alignment signal at the cost of corpus size (curation is the bottleneck).
  2. A larger, noisy tier — machine-generated, auto-transcribed, weakly-labeled, or crawled-with-imperfect-captions examples. Provides coverage and scale at the cost of per-example alignment quality.

The design bet is that both tiers contribute signal the other cannot: the clean tier teaches correct alignment; the noisy tier provides the coverage that any realistically-sized curated dataset cannot.
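The two-tier composition can be sketched as a batch sampler that deliberately oversamples the small curated tier. This is a minimal illustration only: the `clean_fraction` knob and the sampling scheme are assumptions for exposition, not details reported for VideoPrism.

```python
import random

def make_hybrid_batch(clean, noisy, batch_size, clean_fraction=0.2, rng=None):
    """Draw one training batch from a two-tier corpus.

    `clean` and `noisy` are lists of examples (plain strings here).
    `clean_fraction` oversamples the curated tier relative to its share
    of the raw corpus -- a hypothetical knob, not a published detail.
    """
    rng = rng or random.Random(0)
    n_clean = round(batch_size * clean_fraction)
    batch = [("clean", rng.choice(clean)) for _ in range(n_clean)]
    batch += [("noisy", rng.choice(noisy)) for _ in range(batch_size - n_clean)]
    rng.shuffle(batch)
    return batch

# Stand-ins for the 36M curated pairs and 582M noisy clips.
clean_tier = [f"curated-{i}" for i in range(36)]
noisy_tier = [f"crawled-{i}" for i in range(582)]
batch = make_hybrid_batch(clean_tier, noisy_tier, batch_size=32)
```

Sampling-time mixing like this keeps the clean tier from being drowned out by a 16× larger noisy tier, at the cost of repeating clean examples more often per epoch.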

Canonical instance — VideoPrism

Google's systems/videoprism pre-training corpus:

  • 36M high-quality video-text pairs (curated captions).
  • 582M clips with noisy or machine-generated parallel text (YT-Temporal-180M, InternVid, VideoCC, WTS-70M, etc.).
  • ~618M total. The noisy tier is ~16× larger than the clean tier.

(Source: sources/2024-05-09-google-videoprism-foundational-visual-encoder)

Author framing:

"Naturally most of these videos do not have perfect captions or descriptions, even imperfect text can provide useful information about the semantic content of the video... This includes 36 million carefully selected videos with high-quality captions, along with an additional 582 million clips with varying levels of noisy text (like auto-generated transcripts)."

The CLIP similarity score histogram across the VideoPrism corpus is explicitly flagged in the blog as showing "large variations... a byproduct of the various ways used to harvest the text." The variation is evidence of (not a bug in) the hybrid construction.

Why this shape shows up

Curation is expensive and does not scale the way compute does. Annotating a million high-quality captions requires humans; annotating a billion requires fundamentally different economics. Meanwhile:

  • Model capacity keeps growing. Billion-parameter foundation models have the capacity to absorb more data than any affordable curation budget can produce.
  • The long tail is where generalization lives. A clean tier covers the head of the distribution (popular concepts, common scenes); the noisy tier pulls in the tail (rare activities, niche subjects, multilingual transcripts).
  • Noise is tolerable in the right objective. Contrastive objectives are especially robust to noisy pairs — a wrong caption contributes a weakly-wrong gradient, not a catastrophically-wrong one. The aggregate still improves the model.
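The noise-tolerance claim can be made concrete with a minimal numpy sketch of a symmetric contrastive (InfoNCE-style) objective over a batch of paired embeddings. A mislabeled pair only corrupts its own row and column of the similarity matrix; the other in-batch pairings still supply correct signal. Function names are illustrative, not VideoPrism's implementation.

```python
import numpy as np

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: the diagonal of the (B, B) similarity
    matrix holds the 'correct pairing' targets; everything else serves as
    in-batch negatives. One wrong caption shifts one row/column of targets,
    not the whole gradient -- the weakly-wrong-gradient property above."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    v2t = -np.diag(_log_softmax(logits, axis=1)).mean()  # video -> text
    t2v = -np.diag(_log_softmax(logits, axis=0)).mean()  # text -> video
    return (v2t + t2v) / 2
```

As a sanity check, correctly paired embeddings should score a lower loss than the same embeddings with the pairing scrambled.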

Design decisions

  • Tier ratio. VideoPrism runs at ~16:1 noisy-to-clean. CLIP's ~400M pairs are often described as one undifferentiated "noisy" tier. There is no canonical ratio; it depends on (a) how much clean data exists at all, (b) the quality floor below which noisy data harms rather than helps, and (c) the objective's noise tolerance.
  • Objective design that tolerates the noisy tier. VideoPrism trains stage 1 (contrastive) on the clean and noisy tiers together, since the contrastive objective is noise-tolerant; stage 2 (masked video modeling) learns from the videos themselves, not the text, so text noise is irrelevant there. See patterns/two-stage-pretraining-contrastive-then-masked.
  • Implicit tier signaling. Some approaches weight the loss per example by a proxy for tier (e.g. CLIP-score thresholds, source dataset confidence); VideoPrism does not describe per-tier weighting in the blog.
  • Mixing at training time vs pre-filtering. VideoPrism mixes in training; a pre-filtering approach would discard the noisy tier below a similarity threshold. Mixing is the higher-capacity approach when the objective is noise-tolerant.
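The last two bullets describe two different ways of handling a quality proxy such as a CLIP score. A hedged sketch of both, with made-up scores (neither scheme is described in the VideoPrism blog; the threshold and floor values are arbitrary):

```python
def prefilter(examples, score, threshold=0.2):
    """Pre-filtering: discard noisy examples below a similarity threshold
    (e.g. a CLIP score). Data below the floor never reaches the model."""
    return [ex for ex in examples if score(ex) >= threshold]

def soft_weight(examples, score, floor=0.05):
    """Implicit tier signaling: keep everything, but attach a per-example
    loss weight so low-similarity pairs contribute less, not nothing."""
    return [(ex, max(score(ex), floor)) for ex in examples]

# Hypothetical CLIP-similarity scores for three video-text pairs.
clip_scores = {"good pair": 0.8, "weak pair": 0.3, "junk pair": 0.1}
kept = prefilter(clip_scores, clip_scores.get)        # drops "junk pair"
weighted = soft_weight(clip_scores, clip_scores.get)  # keeps all three
```

Pre-filtering trades coverage for a guaranteed quality floor; soft weighting preserves the tail and lets the objective's noise tolerance do the filtering implicitly.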

Contrast with single-tier approaches

|                       | Single-tier clean                | Single-tier noisy              | Hybrid clean + noisy                                            |
| --------------------- | -------------------------------- | ------------------------------ | --------------------------------------------------------------- |
| Corpus size           | Bounded by curation budget       | Bounded only by crawl coverage | Clean tier bounded, noisy tier unbounded; large total           |
| Alignment quality     | High per-example                 | Low per-example                | Bimodal per-example; aggregate sufficient for robust objectives |
| Coverage              | Narrow                           | Broad                          | Broad with a high-quality core                                  |
| Objective requirement | Any                              | Must tolerate noise            | Must tolerate noise on the large tier                           |
| Canonical example     | Ground-truth supervised datasets | Raw web-scrape pre-training    | VideoPrism, CLIP, most modern FMs                               |

When this fits

  • Foundation-model pre-training where corpus size dominates model quality and curated data is the binding constraint.
  • Contrastive or self-supervised objectives that are robust to per-example label noise.
  • Multimodal alignment specifically, where paired data is the rare signal and either side can be noisy (text captions of images / video).

When it doesn't fit

  • Fine-tuning / task-specific supervised learning — noisy labels meaningfully degrade final-task accuracy; use the clean tier only.
  • High-stakes deployment where downstream decisions depend on rare-tail accuracy — noisy-tier-driven rare-tail coverage may be wrong in domain-specific ways. Evaluate separately.
  • Small models — capacity is the bottleneck, not data; a clean tier alone saturates learning.
