
PATTERN

Tri-modal embedding fusion

Intent

Fuse three modalities (video + audio + text) into a single per-unit embedding by running each modality through its own pre-trained encoder, concatenating the three output vectors, and unit-normalising — producing a fixed-dimensional fused representation that a downstream sequence model can then contextualise across time.

The pattern is a pragmatic shape for multimodal representation learning when (a) you have strong per-modality encoders already available, and (b) the downstream sequence model will do the heavy lifting of cross-modal + cross-time interaction.

Canonical instance — Netflix MediaFM

MediaFM uses tri-modal fusion per-shot (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding):

shot_i:
  frames → SeqCLIP                    → v_i
  audio  → wav2vec2                   → a_i
  text   → text-embedding-3-large     → t_i  (zero if absent)

  fused_i = unit_norm( concat(v_i, a_i, t_i) )   — 2304 dims

The three encoders are treated as frozen feature extractors; MediaFM trains only the Transformer layer sitting on top of the fused per-shot sequence. The fused 2304 dims are then projected down to the Transformer's hidden dimension by a linear input layer before entering the self-attention stack.
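
The per-shot recipe can be sketched in a few lines of numpy. The post does not disclose how the 2304 dims split across the three encoders, so the per-modality dims below (and the hidden dim `H`) are hypothetical; random vectors stand in for the frozen-encoder outputs.

```python
import numpy as np

# Hypothetical split of the 2304 fused dims (the real breakdown isn't disclosed)
D_V, D_A, D_T, H = 1024, 768, 512, 768

rng = np.random.default_rng(0)

def fuse(v, a, t):
    """Concatenate the three modality vectors, then unit-normalise."""
    fused = np.concatenate([v, a, t])      # each modality keeps known index ranges
    return fused / np.linalg.norm(fused)   # all fused vectors land on the unit sphere

# Stand-ins for the frozen encoders' per-shot outputs
v_i = rng.standard_normal(D_V)   # SeqCLIP(frames)
a_i = rng.standard_normal(D_A)   # wav2vec2(audio)
t_i = rng.standard_normal(D_T)   # text-embedding-3-large(timed text)

fused_i = fuse(v_i, a_i, t_i)    # (2304,)

# Trainable linear input layer: 2304 → H, applied before the self-attention stack
W_in = rng.standard_normal((H, D_V + D_A + D_T)) / np.sqrt(D_V + D_A + D_T)
x_i = W_in @ fused_i             # the token the downstream Transformer contextualises
```

Only `W_in` (and everything above it in the Transformer) would be trainable; the three encoders are feature extractors run offline.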

Mechanism details

  1. Independent per-modality encoders. Each modality has its own pre-trained model with its own output dimensionality. Netflix's MediaFM pulls three from three different places: internal (SeqCLIP), Meta FAIR (wav2vec2), OpenAI API (text-embedding-3-large). The three are not co-trained; their output spaces are independent.
  2. Concatenation, not addition. Concatenation preserves each modality's full signal at known index ranges in the fused vector. Addition (or any summing fusion) would require dimension-matching + would conflate signals from different modalities into the same coordinate — losing the model's ability to route by modality.
  3. Unit-normalisation of the concatenation. Puts all fused vectors on a unit sphere — downstream cosine-distance losses (MediaFM uses cosine distance for its MSM objective) behave predictably; scale differences between modality sub-vectors are muted.
  4. Zero-padding for missing modality. A modality absent for a given unit is zero-padded to preserve concatenation shape. Netflix uses this for timed-text (absent in shots without dialogue); video + audio are always present.
  5. Linear projection to hidden dim before the sequence model consumes the fused vector. The 2304 → h reduction is trainable; the model learns which fused-vector coordinates to attend to.

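Points 3 and 4 above are easy to make concrete. In this sketch the timed-text dim is a hypothetical 512 (the real split isn't disclosed); `None` marks a shot without dialogue, which is zero-padded so the fused shape stays fixed.

```python
import numpy as np

D_T = 512  # hypothetical timed-text dim

def fuse(v, a, t):
    """Concat + unit-norm, zero-padding the text slot when no dialogue exists."""
    if t is None:
        t = np.zeros(D_T)            # preserves the fixed fused shape
    fused = np.concatenate([v, a, t])
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(1)
v, a = rng.standard_normal(1024), rng.standard_normal(768)

with_text    = fuse(v, a, rng.standard_normal(D_T))
without_text = fuse(v, a, None)      # same shape; text coordinates are all zero

# On the unit sphere, cosine similarity reduces to a plain dot product,
# which is what makes a cosine-distance loss like MSM behave predictably.
cos = with_text @ without_text
```

Note the crudeness flagged later in the trade-offs: nothing in `without_text` distinguishes "text absent" from "text embedding happened to be zero".
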
Contrast with other multimodal fusion shapes

| Shape | Per-modality param sharing | Where cross-modal interaction happens | Canonical instance |
| --- | --- | --- | --- |
| Concat + unit-norm + downstream transformer | None (frozen encoders) | In the downstream Transformer | MediaFM |
| Late attention fusion (cross-attention between modality streams) | None | In dedicated cross-attention layers | Perceiver IO, Flamingo |
| Early fusion (concat raw tokens, shared encoder) | Shared from the start | Everywhere in the encoder | PaLI, VideoLLaMA |
| Contrastive alignment (separate encoders, shared embedding space) | None; aligned via loss | Implicitly via distance | CLIP, ALIGN |
| Late fusion for classification (per-modality classifier + voting) | None | Only at the output | Per-modality-then-ensemble baselines |

Tri-modal fusion as Netflix does it is a "fuse at the input, contextualise in the encoder" choice — cheap to set up (no co-training), defers all the heavy lifting to the downstream Transformer.

Trade-offs

  • Win — leverage existing pre-trained encoders. No need to train multimodal encoders from scratch; Netflix's three come from three different sources with no co-training.
  • Win — modality-independent upgrades. Swap wav2vec2 for a newer audio encoder without retraining the video / text parts; each of MediaFM's upstream sub-encoders is effectively dependency-injected at the architectural level.
  • Cost — no learned cross-modal alignment at the input. If the three encoders' output spaces have wildly different geometry, the fused vector is a hodgepodge, and the Transformer has to learn both per-modality representations and cross-modal interactions from scratch. Unit-normalisation partially mitigates but doesn't eliminate this.
  • Cost — fixed dimensionality allocation. Each modality's contribution is fixed at pipeline design time (SeqCLIP's output dim + wav2vec2's + text-3-large's = 2304 in MediaFM). Shifting relative importance requires re-choosing encoders.
  • Cost — missing-modality handling is crude. Zero-padding is operational but doesn't tell the model whether a zero means "modality absent" or "modality zero-valued"; modality-specific attention masks or gating would be more principled, but MediaFM doesn't use them.
  • Cost — linear projection is a bottleneck. The 2304 → h projection is a single linear layer; significant information compression happens here and is not modality-aware.

Why MediaFM can get away with it

The reason Netflix's pragmatic choice works is that the downstream Transformer is powerful enough to sort out the per-modality contributions after the fact. MSM pre-training over 512-shot sequences gives the model plenty of capacity + signal to learn which fused-vector coordinates matter for which downstream tasks. If the downstream model were smaller or the pre-training objective weaker, the fusion shape would matter more.

Implementation checklist

  • Choose three (or more) pre-trained encoders, one per modality.
  • Profile each one's output dimensionality + compute cost.
  • Decide on a missing-modality fallback (zero-pad, null-embedding, gating). Netflix chose zero-pad; others in literature use learnable null tokens.
  • Concatenate + unit-normalise → fused vector.
  • Add a linear projection to the downstream model's hidden dim.
  • Train only the downstream model on top.
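
The checklist can be walked end-to-end for a whole shot sequence. This is a sketch under stated assumptions: the per-modality vectors are pretended to be pre-computed by the frozen encoders, the dims are illustrative, and `W_in` stands in for the downstream model's trainable input layer.

```python
import numpy as np

rng = np.random.default_rng(2)
D_V, D_A, D_T, H = 1024, 768, 512, 768              # illustrative dims
W_in = rng.standard_normal((H, D_V + D_A + D_T)) / 48.0  # sqrt(2304); the trainable part

def fuse_shot(v, a, t):
    if t is None:
        t = np.zeros(D_T)                 # missing-modality fallback: zero-pad
    f = np.concatenate([v, a, t])
    return f / np.linalg.norm(f)          # unit-normalise the concatenation

# Pretend the frozen encoders already ran; every other shot lacks dialogue.
shots = [
    (rng.standard_normal(D_V), rng.standard_normal(D_A),
     rng.standard_normal(D_T) if i % 2 == 0 else None)
    for i in range(8)
]

fused = np.stack([fuse_shot(v, a, t) for v, a, t in shots])   # (8, 2304)
tokens = fused @ W_in.T                   # (8, H) → into the downstream sequence model
```

From here, only the sequence model on top of `tokens` would be trained; the fusion itself has no learned parameters beyond `W_in`.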

When this fits

  • Pre-trained encoders exist per modality. SeqCLIP / wav2vec2 / text-embedding-3-large covers all three for MediaFM; swap in whatever is best per modality.
  • Downstream sequence model is capacity-rich. A BERT-style or larger Transformer that can learn per-modality features + cross-modal attention from the fused input.
  • Modalities have natural per-unit alignment. In MediaFM, a shot's video + audio + timed-text are naturally co-temporal; stitching them per-shot is a clean boundary.

When it doesn't fit

  • No strong per-modality pre-trained encoder exists — then start with contrastive alignment or full co-training.
  • Modalities have wildly different temporal rates or granularities that don't pack into a common unit (shot, sentence, etc.).
  • Downstream model is small and can't absorb the burden of learning modality-routing from scratch.
  • Need for learned cross-modal interactions at fusion time — cross-attention architectures (Flamingo, Perceiver) are more expressive.

Caveats

  • Netflix does not disclose how the per-modality output sub-vectors are sized (SeqCLIP's output + wav2vec2's output + OpenAI's output sum to 2304 but individual breakdown isn't given). Different breakdowns have different implications for each modality's effective contribution.
  • Pooling within-modality (e.g. wav2vec2 over a shot's audio, SeqCLIP over frames) is an upstream design choice the post doesn't characterise.
  • Ablations that would resolve the above (swap-out-a-modality, zero-out-a-modality-at-inference) are not reported — MediaFM's reported ablation compares tri-modal-context vs tri-modal-flat vs video-only-flat, not modality-wise.
