
PATTERN

Tri-modal embedding fusion

Intent

Fuse three modalities (video + audio + text) into a single per-unit embedding by running each modality through its own pre-trained encoder, concatenating the three output vectors, and unit-normalising — producing a fixed-dimensional fused representation that a downstream sequence model can then contextualise across time.

The pattern is a pragmatic shape for multimodal representation learning when (a) you have strong per-modality encoders already available, and (b) the downstream sequence model will do the heavy lifting of cross-modal + cross-time interaction.

Canonical instance — Netflix MediaFM

MediaFM uses tri-modal fusion per-shot (Source: sources/2026-02-23-netflix-mediafm-the-multimodal-ai-foundation-for-media-understanding):

shot_i:
  frames → SeqCLIP                    → v_i
  audio  → wav2vec2                   → a_i
  text   → text-embedding-3-large     → t_i  (zero if absent)

  fused_i = unit_norm( concat(v_i, a_i, t_i) )   — 2304 dims

The three encoders are treated as frozen feature extractors; MediaFM trains only the Transformer layer sitting on top of the fused per-shot sequence. The fused 2304 dims are then projected down to the Transformer's hidden dimension by a linear input layer before entering the self-attention stack.
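
The per-shot recipe can be sketched in a few lines of numpy. The post does not disclose how the 2304 dims split across the three encoders, so the per-modality dims below (and the hidden dim `H`) are hypothetical; random vectors stand in for the frozen-encoder outputs.

```python
import numpy as np

# Hypothetical split of the 2304 fused dims (the real breakdown isn't disclosed)
D_V, D_A, D_T, H = 1024, 768, 512, 768

rng = np.random.default_rng(0)

def fuse(v, a, t):
    """Concatenate the three modality vectors, then unit-normalise."""
    fused = np.concatenate([v, a, t])      # each modality keeps known index ranges
    return fused / np.linalg.norm(fused)   # all fused vectors land on the unit sphere

# Stand-ins for the frozen encoders' per-shot outputs
v_i = rng.standard_normal(D_V)   # SeqCLIP(frames)
a_i = rng.standard_normal(D_A)   # wav2vec2(audio)
t_i = rng.standard_normal(D_T)   # text-embedding-3-large(timed text)

fused_i = fuse(v_i, a_i, t_i)    # (2304,)

# Trainable linear input layer: 2304 → H, applied before the self-attention stack
W_in = rng.standard_normal((H, D_V + D_A + D_T)) / np.sqrt(D_V + D_A + D_T)
x_i = W_in @ fused_i             # the token the downstream Transformer contextualises
```

Only `W_in` (and everything above it in the Transformer) would be trainable; the three encoders are feature extractors run offline.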

Mechanism details

  1. Independent per-modality encoders. Each modality has its own pre-trained model with its own output dimensionality. Netflix's MediaFM pulls three from three different places: internal (SeqCLIP), Meta FAIR (wav2vec2), OpenAI API (text-embedding-3-large). The three are not co-trained; their output spaces are independent.
  2. Concatenation, not addition. Concatenation preserves each modality's full signal at known index ranges in the fused vector. Addition (or any summing fusion) would require dimension-matching + would conflate signals from different modalities into the same coordinate — losing the model's ability to route by modality.
  3. Unit-normalisation of the concatenation. Puts all fused vectors on a unit sphere — downstream cosine-distance losses (MediaFM uses cosine distance for its MSM objective) behave predictably; scale differences between modality sub-vectors are muted.
  4. Zero-padding for missing modality. A modality absent for a given unit is zero-padded to preserve concatenation shape. Netflix uses this for timed-text (absent in shots without dialogue); video + audio are always present.
  5. Linear projection to hidden dim before the sequence model consumes the fused vector. The 2304 → h reduction is trainable; the model learns which fused-vector coordinates to attend to.

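Points 3 and 4 above are easy to make concrete. In this sketch the timed-text dim is a hypothetical 512 (the real split isn't disclosed); `None` marks a shot without dialogue, which is zero-padded so the fused shape stays fixed.

```python
import numpy as np

D_T = 512  # hypothetical timed-text dim

def fuse(v, a, t):
    """Concat + unit-norm, zero-padding the text slot when no dialogue exists."""
    if t is None:
        t = np.zeros(D_T)            # preserves the fixed fused shape
    fused = np.concatenate([v, a, t])
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(1)
v, a = rng.standard_normal(1024), rng.standard_normal(768)

with_text    = fuse(v, a, rng.standard_normal(D_T))
without_text = fuse(v, a, None)      # same shape; text coordinates are all zero

# On the unit sphere, cosine similarity reduces to a plain dot product,
# which is what makes a cosine-distance loss like MSM behave predictably.
cos = with_text @ without_text
```

Note the crudeness flagged later in the trade-offs: nothing in `without_text` distinguishes "text absent" from "text embedding happened to be zero".
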
Contrast with other multimodal fusion shapes

| Shape | Per-modality param sharing | Where cross-modal interaction happens | Canonical instance |
| --- | --- | --- | --- |
| Concat + unit-norm + downstream transformer | None (frozen encoders) | In the downstream Transformer | MediaFM |
| Late attention fusion (cross-attention between modality streams) | None | In dedicated cross-attention layers | Perceiver IO, Flamingo |
| Early fusion (concat raw tokens, shared encoder) | Shared from the start | Everywhere in the encoder | PaLI, VideoLLaMA |
| Contrastive alignment (separate encoders, shared embedding space) | None; aligned via loss | Implicitly via distance | CLIP, ALIGN |
| Late fusion for classification (per-modality classifier + voting) | None | Only at the output | Per-modality-then-ensemble baselines |

Tri-modal fusion as Netflix does it is a "fuse at the input, contextualise in the encoder" choice — cheap to set up (no co-training), defers all the heavy lifting to the downstream Transformer.

Trade-offs

  • Win — leverage existing pre-trained encoders. No need to train multimodal encoders from scratch; Netflix's three come from three different sources with no co-training.
  • Win — modality-independent upgrades. Swap wav2vec2 for a newer audio encoder without retraining the video / text parts; each of MediaFM's upstream sub-encoders is effectively dependency-injected at the architectural level.
  • Cost — no learned cross-modal alignment at the input. If the three encoders' output spaces have wildly different geometry, the fused vector is a hodgepodge, and the Transformer has to learn both per-modality representations and cross-modal interactions from scratch. Unit-normalisation partially mitigates but doesn't eliminate this.
  • Cost — fixed dimensionality allocation. Each modality's contribution is fixed at pipeline design time (SeqCLIP's output dim + wav2vec2's + text-3-large's = 2304 in MediaFM). Shifting relative importance requires re-choosing encoders.
  • Cost — missing-modality handling is crude. Zero-padding is operational but doesn't tell the model whether a zero means "modality absent" or "modality zero-valued"; modality-specific attention masks or gating would be more principled, but MediaFM doesn't use them.
  • Cost — linear projection is a bottleneck. The 2304 → h projection is a single linear layer; significant information compression happens here and is not modality-aware.

Why MediaFM can get away with it

The reason Netflix's pragmatic choice works is that the downstream Transformer is powerful enough to sort out the per-modality contributions after the fact. MSM pre-training over 512-shot sequences gives the model plenty of capacity + signal to learn which fused-vector coordinates matter for which downstream tasks. If the downstream model were smaller or the pre-training objective weaker, the fusion shape would matter more.

Implementation checklist

  • Choose three (or more) pre-trained encoders, one per modality.
  • Profile each one's output dimensionality + compute cost.
  • Decide on a missing-modality fallback (zero-pad, null-embedding, gating). Netflix chose zero-pad; others in literature use learnable null tokens.
  • Concatenate + unit-normalise → fused vector.
  • Add a linear projection to the downstream model's hidden dim.
  • Train only the downstream model on top.
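
The checklist can be walked end-to-end for a whole shot sequence. This is a sketch under stated assumptions: the per-modality vectors are pretended to be pre-computed by the frozen encoders, the dims are illustrative, and `W_in` stands in for the downstream model's trainable input layer.

```python
import numpy as np

rng = np.random.default_rng(2)
D_V, D_A, D_T, H = 1024, 768, 512, 768              # illustrative dims
W_in = rng.standard_normal((H, D_V + D_A + D_T)) / 48.0  # sqrt(2304); the trainable part

def fuse_shot(v, a, t):
    if t is None:
        t = np.zeros(D_T)                 # missing-modality fallback: zero-pad
    f = np.concatenate([v, a, t])
    return f / np.linalg.norm(f)          # unit-normalise the concatenation

# Pretend the frozen encoders already ran; every other shot lacks dialogue.
shots = [
    (rng.standard_normal(D_V), rng.standard_normal(D_A),
     rng.standard_normal(D_T) if i % 2 == 0 else None)
    for i in range(8)
]

fused = np.stack([fuse_shot(v, a, t) for v, a, t in shots])   # (8, 2304)
tokens = fused @ W_in.T                   # (8, H) → into the downstream sequence model
```

From here, only the sequence model on top of `tokens` would be trained; the fusion itself has no learned parameters beyond `W_in`.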

When this fits

  • Pre-trained encoders exist per modality. SeqCLIP / wav2vec2 / text-embedding-3-large covers all three for MediaFM; swap in whatever is best per modality.
  • Downstream sequence model is capacity-rich. A BERT-style or larger Transformer that can learn per-modality features + cross-modal attention from the fused input.
  • Modalities have natural per-unit alignment. In MediaFM, a shot's video + audio + timed-text are naturally co-temporal; stitching them per-shot is a clean boundary.

When it doesn't fit

  • No strong per-modality pre-trained encoder exists — then start with contrastive alignment or full co-training.
  • Modalities have wildly different temporal rates or granularities that don't pack into a common unit (shot, sentence, etc.).
  • Downstream model is small and can't absorb the burden of learning modality-routing from scratch.
  • Need for learned cross-modal interactions at fusion time — cross-attention architectures (Flamingo, Perceiver) are more expressive.

Caveats

  • Netflix does not disclose how the per-modality output sub-vectors are sized (SeqCLIP's output + wav2vec2's output + OpenAI's output sum to 2304 but individual breakdown isn't given). Different breakdowns have different implications for each modality's effective contribution.
  • Pooling within-modality (e.g. wav2vec2 over a shot's audio, SeqCLIP over frames) is an upstream design choice the post doesn't characterise.
  • Ablations that would resolve the above (swap-out-a-modality, zero-out-a-modality-at-inference) are not reported — MediaFM's reported ablation compares tri-modal-context vs tri-modal-flat vs video-only-flat, not modality-wise.
