wav2vec 2.0 (Meta FAIR)¶
Definition¶
wav2vec 2.0 is a self-supervised speech representation learning model from Meta AI Research (FAIR) — described in the 2020 NeurIPS paper "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" (Baevski et al., arXiv 2006.11477). The model pre-trains on large quantities of unlabeled audio using a masked contrastive objective over latent speech representations, producing an audio encoder whose features can be consumed by downstream tasks via fine-tuning or linear probing.
It is cited by this wiki in the context of MediaFM as the audio-modality sub-encoder: "the audio samples from the same shots are embedded using Meta FAIR's wav2vec2." (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding).
Self-supervised pre-training (paper-level summary)¶
- Input: raw audio waveform.
- Quantiser: a learnable discrete codebook (via product quantisation) maps continuous latent representations to a finite vocabulary of speech units.
- Masking: some proportion of latent time-steps are masked.
- Objective: contrastive loss — identify the true quantised latent for each masked time-step from a set of distractors drawn from the same utterance.
- Result: a Transformer-based audio encoder that has learned speech representations from unlabeled audio at scale, usable as a drop-in encoder for downstream audio tasks.
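The masked contrastive objective above can be sketched as an InfoNCE-style loss: for each masked time-step, the Transformer's context vector must pick out the true quantised latent from a set of distractors by cosine similarity. A minimal numpy sketch follows; the temperature value and dimensions are illustrative, not taken from the paper's configuration.

```python
import numpy as np

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time-step, in the spirit of
    wav2vec 2.0's contrastive objective.

    context:     (d,)   Transformer output at the masked position.
    positive:    (d,)   true quantised latent for that position.
    distractors: (k, d) quantised latents sampled from other masked
                 positions in the same utterance.
    """
    candidates = np.vstack([positive[None, :], distractors])  # (k+1, d)
    # Cosine similarity between the context vector and each candidate.
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context) + 1e-8
    )
    logits = sims / temperature
    # Numerically stable softmax cross-entropy; the true latent is index 0.
    logits -= logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

rng = np.random.default_rng(0)
d, k = 16, 10
c = rng.standard_normal(d)
# When the positive equals the context, the loss should be small.
loss_easy = contrastive_loss(c, c, rng.standard_normal((k, d)))
```

In the real model the loss is summed over all masked positions and combined with a codebook diversity penalty; this sketch covers only the per-position contrastive term.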
Role in MediaFM¶
Per shot, MediaFM extracts the audio samples for the shot's time range and passes them through wav2vec2 to produce a fixed-length audio embedding. This is concatenated with the SeqCLIP video embedding and the OpenAI text-embedding-3-large text embedding, and the result is unit-normalised into a single 2304-dim fused shot embedding that is fed to MediaFM's Transformer.
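The concatenate-then-normalise step can be sketched as below. Note the per-modality widths are an assumption: Netflix does not disclose the split, and 768 × 3 = 2304 is just one plausible breakdown (see the undisclosed-details list further down).

```python
import numpy as np

# Hypothetical per-modality widths; Netflix does not disclose the split.
# 768 * 3 = 2304 is one plausible breakdown, assumed here for illustration.
D_VIDEO, D_AUDIO, D_TEXT = 768, 768, 768

def fuse_shot(video_emb, audio_emb, text_emb):
    """Concatenate the three per-shot embeddings and unit-normalise,
    yielding the 2304-dim fused vector fed to MediaFM's Transformer."""
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / (np.linalg.norm(fused) + 1e-8)

rng = np.random.default_rng(1)
shot = fuse_shot(
    rng.standard_normal(D_VIDEO),  # SeqCLIP video embedding
    rng.standard_normal(D_AUDIO),  # wav2vec2 audio embedding
    np.zeros(D_TEXT),              # zero-padded missing timed text
)
```

The zero-filled text slot mirrors Netflix's footnote about zero-padding missing timed text; audio and video are always present.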
Per Netflix's footnote 2: "All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur." — audio is always present in the MediaFM input, unlike text. wav2vec2 is therefore a load-bearing dependency with no zero-pad fallback path.
See patterns/tri-modal-embedding-fusion and the broader multimodal shot fusion framing.
What's not disclosed in the Netflix use¶
- Which wav2vec2 variant / checkpoint — XLS-R? Base? Large? English-only or multilingual?
- How audio from a shot's time range is prepared (sample-rate resampling, channel mixing, duration normalisation).
- How per-shot variable-length audio collapses to a fixed-length embedding (mean pool over time? attention pool? CLS-token?).
- The individual output dimensionality of the wav2vec2 contribution to the 2304-dim per-shot fused vector.
- Whether wav2vec2 weights are frozen, partially fine-tuned, or fully fine-tuned in MediaFM.
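On the pooling question above: wav2vec2 emits one feature vector per ~20 ms frame, so some pooling must collapse a variable number of frames into one shot-level vector. Mean pooling over time is one common choice, sketched below purely as an illustration; MediaFM's actual mechanism is undisclosed.

```python
import numpy as np

def mean_pool(frame_features):
    """Collapse variable-length (T, d) wav2vec2 frame features into a
    fixed-length (d,) shot embedding by averaging over time.
    One common choice only; Netflix does not say what MediaFM uses."""
    return frame_features.mean(axis=0)

rng = np.random.default_rng(2)
short_shot = mean_pool(rng.standard_normal((50, 768)))   # ~1 s of audio
long_shot = mean_pool(rng.standard_normal((500, 768)))   # ~10 s of audio
```

Whatever the pooling, shots of any duration end up with the same embedding width, which is what lets the fusion step produce a fixed 2304-dim vector.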
Seen in¶
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — named as the audio-modality sub-encoder inside MediaFM.
Related¶
- systems/netflix-mediafm — primary downstream consumer (in this wiki).
- concepts/vector-embedding — general concept.
- patterns/tri-modal-embedding-fusion — consumption pattern.
- companies/meta — owning lab (FAIR).