
Attention-Based Fusion (Multimodal)

Attention-based fusion is the multimodal-ML strategy of learning dynamic per-example weights across modalities and over time, typically via self- or cross-attention between modality encoder outputs. Unlike static concatenation or prediction-space averaging, the fusion mechanism itself is learned and context-dependent: the weight a given modality gets can shift per input, per time step, or per attention head.
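
As a concrete illustration (not drawn from the ingested source), here is a minimal PyTorch sketch of cross-attention fusion between two modality encoders. The class and variable names (CrossAttentionFusion, notes, wearable), the dimensions, and the residual-plus-norm choice are illustrative assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse modality A tokens with modality B tokens via cross-attention.

    Modality A queries modality B, so the weight each B token receives is
    learned and re-computed per example rather than fixed by concatenation.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, a_tokens: torch.Tensor, b_tokens: torch.Tensor):
        # a_tokens: (batch, len_a, dim)  e.g. clinical-note token embeddings
        # b_tokens: (batch, len_b, dim)  e.g. wearable time-step embeddings
        fused, attn_weights = self.cross_attn(
            query=a_tokens, key=b_tokens, value=b_tokens
        )
        # Residual + norm keeps modality A's own signal when B is uninformative.
        return self.norm(a_tokens + fused), attn_weights


# Usage: per-example attention weights come back alongside the fused tokens.
fusion = CrossAttentionFusion()
notes = torch.randn(8, 32, 256)      # 8 patients, 32 note tokens
wearable = torch.randn(8, 128, 256)  # 8 patients, 128 wearable time steps
fused, weights = fusion(notes, wearable)  # weights: (8, 32, 128)
```

The returned weights differ per input example and per query token, which is what makes the fusion context-dependent rather than a static combination.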

When it survives production

"Time matters (wearables + longitudinal notes, repeated imaging) and interactions are complex." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai.)

The trigger is the combination of temporal dynamics and complex cross-modal interaction, neither of which the static fusion strategies handle well:

  • Temporal alignment — wearables stream continuously, notes are episodic, imaging is rare; attention over time lets the model learn to look at the modality that carries signal at a given timestep (see the sketch after this list).
  • Interaction learning — cross-attention between modality tokens lets the model learn "when does genomics matter for this phenotype" rather than bake it in.
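
A hedged sketch of the temporal-alignment and interaction-learning ideas above, assuming mixed-modality tokens share one timeline with a mask marking missing observations. Every name here (TemporalModalityAttention, fusion_query, modality_embed) is an illustrative assumption rather than anything the source specifies.

```python
import torch
import torch.nn as nn


class TemporalModalityAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, n_modalities: int = 3):
        super().__init__()
        # One learned query vector summarises the whole multimodal timeline.
        self.fusion_query = nn.Parameter(torch.randn(1, 1, dim))
        # A learned type embedding tells attention which modality a token is,
        # so cross-modal interactions can be learned rather than baked in.
        self.modality_embed = nn.Embedding(n_modalities, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, modality_ids, missing_mask):
        # tokens:       (batch, steps, dim)   wearable/note/imaging embeddings
        # modality_ids: (batch, steps) long   0=wearable, 1=note, 2=imaging
        # missing_mask: (batch, steps) bool   True where no observation exists
        keys = tokens + self.modality_embed(modality_ids)
        query = self.fusion_query.expand(tokens.size(0), -1, -1)
        summary, weights = self.attn(
            query, keys, keys, key_padding_mask=missing_mask
        )
        # weights: (batch, 1, steps) -- how much each time step / modality
        # contributed for this particular patient.
        return summary.squeeze(1), weights.squeeze(1)
```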

Tradeoffs called out in ingested sources

"Harder to validate; requires careful controls to avoid spurious correlations." (Source: sources/2026-04-22-databricks-multimodal-data-integration-production-architectures-for-healthcare-ai.)

Attention weights are a tempting interpretability surface but a notoriously unreliable one; the validation overhead is larger than for any other fusion strategy because the model's behaviour is conditional on input and hard to summarise.
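
One possible way to give that validation work a concrete surface, assuming per-example attention weights are exposed by a fusion layer like the sketches above: summarise attention mass by modality per example and monitor its distribution across a validation cohort. The helper below (modality_attention_mass) is hypothetical, not a source recommendation.

```python
import torch


def modality_attention_mass(weights: torch.Tensor,
                            modality_ids: torch.Tensor,
                            n_modalities: int = 3) -> torch.Tensor:
    # weights:      (batch, steps)  attention over the multimodal timeline
    # modality_ids: (batch, steps)  which modality produced each step
    mass = torch.zeros(weights.size(0), n_modalities)
    for m in range(n_modalities):
        mass[:, m] = (weights * (modality_ids == m)).sum(dim=1)
    # Rows sum to ~1; cohort-level shifts in this summary are one signal of
    # the model leaning on a modality for spurious reasons.
    return mass
```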

Contrast with other fusion strategies

Attention-based fusion is often layered inside intermediate fusion — modality encoders feed a cross-attention layer, which is itself an intermediate-fusion instance.

See patterns/fusion-strategy-selection-by-deployment-reality for the decision-framework framing.
