
wav2vec 2.0 (Meta FAIR)

Definition

wav2vec 2.0 is a self-supervised speech representation learning model from Meta AI Research (FAIR), described in the 2020 NeurIPS paper "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" (Baevski et al., arXiv 2006.11477). The model pre-trains on large quantities of unlabeled audio using a masked contrastive objective over latent speech representations, producing an audio encoder whose features can be consumed by downstream tasks via fine-tuning or linear probing.

It is cited by this wiki in the context of MediaFM as the audio-modality sub-encoder: "the audio samples from the same shots are embedded using Meta FAIR's wav2vec2." (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).

Self-supervised pre-training (paper-level summary)

  • Input: raw audio waveform.
  • Quantiser: a learnable discrete codebook maps continuous latent representations to a finite vocabulary of speech units (product quantisation across multiple codebooks).
  • Masking: spans of consecutive latent time-steps are masked before entering the Transformer context network.
  • Objective: contrastive loss — identify the true quantised latent for each masked time-step from a set of distractors drawn from the same utterance.
  • Result: a Transformer-based audio encoder that has learned speech representations from unlabeled audio at scale, usable as a drop-in encoder for downstream audio tasks.
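The masked contrastive objective above can be sketched with toy tensors. This is an illustrative InfoNCE-style computation for a single masked time-step, not the paper's exact recipe; the dimensions, temperature, and distractor sampling here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16            # latent dimension (illustrative, not the paper's)
K = 5             # number of distractors per masked step

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Context-network output at one masked time-step.
c_t = rng.normal(size=D)
# True quantised latent for that step, plus K distractors drawn
# (in the real model) from other masked steps of the same utterance.
q_true = rng.normal(size=D)
q_distractors = rng.normal(size=(K, D))

candidates = np.vstack([q_true, q_distractors])   # index 0 is the positive
kappa = 0.1                                       # temperature (assumed)
logits = np.array([cosine(c_t, q) for q in candidates]) / kappa

# Contrastive loss: negative log softmax probability of the true latent.
loss = -(logits[0] - np.log(np.exp(logits).sum()))
print(round(float(loss), 4))
```

Minimising this loss pushes the context representation toward the true quantised latent and away from the distractors, which is what lets the encoder learn speech structure without transcripts.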

Role in MediaFM

Per shot, MediaFM extracts the audio samples for the shot's time range and passes them through wav2vec2 to produce a fixed-length audio embedding. This is concatenated with the SeqCLIP video embedding and the OpenAI text-embedding-3-large text embedding, and the result is unit-normalised into a single 2304-dim fused shot embedding that is fed to MediaFM's Transformer.
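The fusion step can be sketched as follows. Only the 2304-dim total is disclosed; the equal 768-per-modality split below is an assumption chosen so the dimensions sum correctly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality widths; Netflix discloses only the 2304 total.
D_VIDEO, D_AUDIO, D_TEXT = 768, 768, 768

video_emb = rng.normal(size=D_VIDEO)   # SeqCLIP shot embedding
audio_emb = rng.normal(size=D_AUDIO)   # wav2vec2 shot embedding
text_emb  = rng.normal(size=D_TEXT)    # text-embedding-3-large embedding

# Concatenate the three modalities, then unit-normalise the fused vector.
fused = np.concatenate([video_emb, audio_emb, text_emb])
fused = fused / np.linalg.norm(fused)

print(fused.shape)                              # (2304,)
print(round(float(np.linalg.norm(fused)), 6))   # 1.0
```

Whether normalisation is applied per modality before concatenation, after, or both is not stated; the sketch normalises only the fused vector.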

Per Netflix's footnote 2: "All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur." Audio is thus always present in the MediaFM input, unlike timed text; wav2vec2 is therefore a load-bearing dependency with no zero-pad fallback path.
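The footnote's asymmetry between modalities can be sketched as a per-shot assembly rule. The dimension and function name are hypothetical; only the zero-pad-for-missing-text behaviour comes from the source:

```python
import numpy as np

D_TEXT = 768  # hypothetical text-embedding width

def text_or_zero(text_emb):
    """Timed text may be absent, in which case its slot is zero-padded
    (per footnote 2). Audio and video have no such fallback: they are
    always present in the input."""
    if text_emb is None:
        return np.zeros(D_TEXT)
    return text_emb

missing = text_or_zero(None)
print(missing.shape, float(missing.sum()))  # (768,) 0.0
```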

See patterns/tri-modal-embedding-fusion and the broader multimodal shot fusion framing.

What's not disclosed in the Netflix use

  • Which wav2vec2 variant / checkpoint — XLS-R? Base? Large? English-only or multilingual?
  • How audio from a shot's time range is prepared (sample-rate resampling, channel mixing, duration normalisation).
  • How per-shot variable-length audio collapses to a fixed-length embedding (mean pool over time? attention pool? CLS-token?).
  • The individual output dimensionality of the wav2vec2 contribution to the 2304-dim per-shot fused vector.
  • Whether wav2vec2 weights are frozen, partially fine-tuned, or fully fine-tuned in MediaFM.
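On the pooling question above: one common way a variable-length sequence of encoder frames is collapsed to a fixed-length vector is mean pooling over time. Whether MediaFM does this (versus attention pooling or a CLS-style token) is undisclosed, so the sketch below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # hypothetical wav2vec2 feature width

def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of frame features to a (D,) vector.
    One plausible pooling choice; MediaFM's actual method is undisclosed."""
    return frames.mean(axis=0)

short_shot = rng.normal(size=(40, D))   # a short shot's frame features
long_shot  = rng.normal(size=(400, D))  # a much longer shot's features

# Shots of any duration map to the same fixed-length embedding shape.
assert mean_pool(short_shot).shape == mean_pool(long_shot).shape == (D,)
```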
