PATTERN
Tri-modal embedding fusion
Intent
Fuse three modalities (video + audio + text) into a single per-unit embedding by running each modality through its own pre-trained encoder, concatenating the three output vectors, and unit-normalising — producing a fixed-dimensional fused representation that a downstream sequence model can then contextualise across time.
The pattern is a pragmatic shape for multimodal representation learning when (a) you have strong per-modality encoders already available, and (b) the downstream sequence model will do the heavy lifting of cross-modal + cross-time interaction.
Canonical instance — Netflix MediaFM
MediaFM uses tri-modal fusion per-shot (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding):
shot_i:
frames → SeqCLIP → v_i
audio → wav2vec2 → a_i
text → text-embedding-3-large → t_i (zero if absent)
fused_i = unit_norm( concat(v_i, a_i, t_i) ) — 2304 dims
The three encoders are treated as frozen feature extractors; MediaFM trains only the Transformer layer sitting on top of the fused per-shot sequence. The fused 2304 dims are then projected down to the Transformer's hidden dimension by a linear input layer before entering the self-attention stack.
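The recipe can be sketched in a few lines of NumPy. The 768/768/768 split and the hidden dim `h` below are illustrative assumptions (the post gives only the 2304 total, not the per-encoder breakdown), with random vectors standing in for the frozen encoder outputs:

```python
# Sketch of the per-shot fusion recipe. The per-modality dims (768 each)
# are illustrative: the post only says the sub-vectors sum to 2304, not
# how they break down per encoder.
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=768)  # stand-in for SeqCLIP video embedding
a = rng.normal(size=768)  # stand-in for wav2vec2 audio embedding
t = rng.normal(size=768)  # stand-in for text-embedding-3-large embedding

fused = np.concatenate([v, a, t])      # 2304-dim fused vector
fused = fused / np.linalg.norm(fused)  # unit-normalise onto the sphere

# Trainable linear projection down to the Transformer hidden dim h.
h = 512
W = rng.normal(size=(h, fused.shape[0])) / np.sqrt(fused.shape[0])
x = W @ fused  # the token the self-attention stack actually consumes
```

Everything upstream of `W` is frozen at training time; only the projection and the Transformer above it receive gradients.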
Mechanism details
- Independent per-modality encoders. Each modality has its own pre-trained model with its own output dimensionality. MediaFM pulls its three from three different places: internal (SeqCLIP), Meta FAIR (wav2vec2), and the OpenAI API (text-embedding-3-large). The three are not co-trained; their output spaces are independent.
- Concatenation, not addition. Concatenation preserves each modality's full signal at known index ranges in the fused vector. Addition (or any summing fusion) would require dimension-matching + would conflate signals from different modalities into the same coordinate — losing the model's ability to route by modality.
- Unit-normalisation of the concatenation. Puts all fused vectors on a unit sphere — downstream cosine-distance losses (MediaFM uses cosine distance for its MSM objective) behave predictably; scale differences between modality sub-vectors are muted.
- Zero-padding for missing modality. A modality absent for a given unit is zero-padded to preserve concatenation shape. Netflix uses this for timed-text (absent in shots without dialogue); video + audio are always present.
- Linear projection to hidden dim before the sequence model consumes the fused vector. The 2304 → h reduction is trainable; the model learns which fused-vector coordinates to attend to.
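The concat-not-addition point can be checked concretely: after concat + unit-norm, each modality's sub-vector still occupies a known index range and is recoverable up to one shared scale factor. A NumPy sketch with illustrative equal dims (not Netflix's actual split):

```python
# Concatenation keeps each modality at a fixed index range; unit-norm
# rescales everything by one shared factor, so direction is preserved.
import numpy as np

dv, da, dt = 768, 768, 768  # illustrative dims
rng = np.random.default_rng(1)
v, a, t = rng.normal(size=dv), rng.normal(size=da), rng.normal(size=dt)

fused = np.concatenate([v, a, t])
fused = fused / np.linalg.norm(fused)

# The video sub-vector's direction survives fusion intact:
assert np.allclose(fused[:dv] / np.linalg.norm(fused[:dv]),
                   v / np.linalg.norm(v))

# Addition-style fusion (v + a + t) would mix all three modalities into
# the same coordinates; no sub-vector would be recoverable afterwards.
```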
Contrast with other multimodal fusion shapes
| Shape | Per-modality param sharing | Cross-modal interaction when | Canonical instance |
|---|---|---|---|
| Concat + unit-norm + downstream transformer (MediaFM) | None (frozen encoders) | In the downstream Transformer | MediaFM |
| Late attention fusion (cross-attention between modality streams) | None | In dedicated cross-attention layers | Perceiver IO, Flamingo |
| Early fusion (concat raw tokens, shared encoder) | Shared from the start | Everywhere in the encoder | PaLI, VideoLLaMA |
| Contrastive alignment (separate encoders, shared embedding space) | None; aligned via loss | Implicitly via distance | CLIP, ALIGN |
| Late fusion for classification (per-modality classifier + voting) | None | Only at the output | Per-modality-then-ensemble baselines |
Tri-modal fusion as Netflix does it is a "fuse at the input, contextualise in the encoder" choice — cheap to set up (no co-training), defers all the heavy lifting to the downstream Transformer.
Trade-offs
- Win — leverage existing pre-trained encoders. No need to train multimodal encoders from scratch; Netflix's three come from three different sources with no co-training.
- Win — modality-independent upgrades. Swap wav2vec2 for a newer audio encoder without retraining the video / text parts; each upstream sub-encoder is a dependency-injection point in the architecture.
- Cost — no learned cross-modal alignment at the input. If the three encoders' output spaces have wildly different geometry, the fused vector is a hodgepodge, and the Transformer has to learn both per-modality representations and cross-modal interactions from scratch. Unit-normalisation partially mitigates but doesn't eliminate this.
- Cost — fixed dimensionality allocation. Each modality's contribution is fixed at pipeline design time (SeqCLIP's output dim + wav2vec2's + text-3-large's = 2304 in MediaFM). Shifting relative importance requires re-choosing encoders.
- Cost — missing-modality handling is crude. Zero-padding is operationally simple but doesn't tell the model whether a zero means "modality absent" or "modality genuinely zero-valued"; modality-specific attention masks or gating would be more principled, but MediaFM uses neither.
- Cost — linear projection is a bottleneck. The 2304 → h projection is a single linear layer; significant information compression happens here and is not modality-aware.
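The "no learned alignment" cost is easy to demonstrate: unit-normalising the concatenation applies one shared scale factor, so it does not equalise the modality sub-vectors. If one frozen encoder emits much larger norms, it dominates the fused vector's energy. A NumPy sketch with an artificially large-scale video encoder (all numbers illustrative):

```python
# Unit-norm of the concat mutes overall scale but not relative scale:
# a modality whose encoder outputs large-norm vectors still dominates.
import numpy as np

d = 768  # illustrative per-modality dim
rng = np.random.default_rng(2)
v = 10.0 * rng.normal(size=d)  # video encoder with large output scale
a = rng.normal(size=d)         # audio encoder, scale ~1
t = rng.normal(size=d)         # text encoder, scale ~1

fused = np.concatenate([v, a, t])
fused = fused / np.linalg.norm(fused)

# Fraction of the fused vector's energy held by each modality; the video
# slice holds nearly all of it in this setup.
energy = [float(np.sum(fused[i * d:(i + 1) * d] ** 2)) for i in range(3)]
```

Per-modality unit-normalisation before concatenation is the usual remedy for this imbalance; the post does not say whether MediaFM applies it.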
Why MediaFM can get away with it
The reason Netflix's pragmatic choice works is that the downstream Transformer is powerful enough to sort out the per-modality contributions after the fact. MSM pre-training over 512-shot sequences gives the model plenty of capacity + signal to learn which fused-vector coordinates matter for which downstream tasks. If the downstream model were smaller or the pre-training objective weaker, the fusion shape would matter more.
Implementation checklist
- Choose three (or more) pre-trained encoders, one per modality.
- Profile each one's output dimensionality + compute cost.
- Decide on a missing-modality fallback (zero-pad, null-embedding, gating). Netflix chose zero-pad; others in literature use learnable null tokens.
- Concatenate + unit-normalise → fused vector.
- Add a linear projection to the downstream model's hidden dim.
- Train only the downstream model on top.
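The checklist collapses into a small fusion routine. A minimal NumPy sketch, assuming a hypothetical 768/768/768 split; only the concat, unit-norm, and zero-pad-for-missing-text steps come from the post:

```python
# Per-shot fusion with Netflix's missing-modality fallback: absent text
# is zero-padded so the concatenation shape stays fixed.
import numpy as np

DIMS = {"video": 768, "audio": 768, "text": 768}  # assumed split

def fuse_shot(video_emb, audio_emb, text_emb=None):
    """Concat + unit-norm; zero-pad the text slot when a shot has no dialogue."""
    if text_emb is None:
        text_emb = np.zeros(DIMS["text"])  # "modality absent" fallback
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(3)
fused = fuse_shot(rng.normal(size=768), rng.normal(size=768))  # dialogue-free shot
# Downstream: linear-project fused to the hidden dim and train only the
# sequence model on top; the three encoders stay frozen.
```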
When this fits
- Pre-trained encoders exist per modality. SeqCLIP / wav2vec2 / text-embedding-3-large covers all three for MediaFM; swap in whatever is best per modality.
- Downstream sequence model is capacity-rich. A BERT-style or larger Transformer that can learn per-modality features + cross-modal attention from the fused input.
- Modalities have natural per-unit alignment. In MediaFM, a shot's video + audio + timed-text are naturally co-temporal; stitching them per-shot is a clean boundary.
When it doesn't fit
- No strong per-modality pre-trained encoder exists — then start with contrastive alignment or full co-training.
- Modalities have wildly different temporal rates or granularities that don't pack into a common unit (shot, sentence, etc.).
- Downstream model is small and can't absorb the burden of learning modality-routing from scratch.
- Need for learned cross-modal interactions at fusion time — cross-attention architectures (Flamingo, Perceiver) are more expressive.
Caveats
- Netflix does not disclose how the per-modality output sub-vectors are sized (SeqCLIP's output + wav2vec2's output + OpenAI's output sum to 2304 but individual breakdown isn't given). Different breakdowns have different implications for each modality's effective contribution.
- Pooling within-modality (e.g. wav2vec2 over a shot's audio, SeqCLIP over frames) is an upstream design choice the post doesn't characterise.
- Ablations that would resolve the above (swap-out-a-modality, zero-out-a-modality-at-inference) are not reported — MediaFM's reported ablation compares tri-modal-context vs tri-modal-flat vs video-only-flat, not modality-wise.
Seen in
- sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding — canonical wiki source; MediaFM's per-shot fusion recipe (SeqCLIP + wav2vec2 + text-embedding-3-large → concat → unit-norm → 2304 dim → linear project to hidden dim → into BERT-style Transformer).
Related
- systems/netflix-mediafm — canonical consumer.
- systems/netflix-seqclip — video sub-encoder.
- systems/wav2vec2 — audio sub-encoder.
- systems/openai-text-embedding-3-large — text sub-encoder.
- concepts/shot-level-embedding — the atomic unit the fusion produces.
- concepts/vector-embedding — general concept.
- patterns/multimodal-content-understanding — adjacent ingestion-time pattern from Dropbox Dash, at scene granularity.