CONCEPT
Multimodal annotation intersection¶
Definition¶
Multimodal annotation intersection is the operation of fusing annotations produced by independent models of different modalities (vision, audio, text) that are simultaneously true of the same segment of media, into a single queryable record.
Canonical wiki instance: Netflix's multimodal video-search pipeline fuses overlapping per-model detections (e.g. "character Joey" from a character-recognition model and "kitchen" from a scene-detection model co-occurring at second 4) into a single record indexed in Elasticsearch (Source: sources/2026-04-04-netflix-powering-multimodal-intelligence-for-video-search).
Why intersect¶
Per-modality models produce annotations in their own vocabularies over their own time ranges. Answering a query like "scenes with Joey cooking in the kitchen" requires combining them:
- "Joey" is a character-recognition output.
- "kitchen" is a scene-detection output.
- "cooking" might be an activity-detection output.
A naïve approach would run three separate queries and intersect asset-ID-and-timestamp lists client-side. This doesn't scale to many modalities or many assets, and it can't express co-occurrence at a finer granularity than whole-asset overlap.
Multimodal annotation intersection moves the fusion into the ingest pipeline: after bucket discretization, each bucket collects all annotations true for that bucket across every modality. A query then becomes a single filter over the fused record.
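The fusion step can be sketched as follows. This is a minimal illustration, not Netflix's actual code: field names (`asset_id`, `start_ns`, `end_ns`, `annotation_type`, `label`) mirror the record shape described in this note, and the 1-second bucket size follows the "second 4" example.

```python
from collections import defaultdict

BUCKET_NS = 1_000_000_000  # 1-second buckets, as in the "second 4" example

def fuse(annotations):
    """Group per-model annotations into one record per (asset_id, bucket).

    An annotation lands in every bucket its continuous-time interval
    overlaps; the original interval is preserved inside each bucket record.
    (Illustrative schema, not Netflix's actual one.)
    """
    buckets = defaultdict(list)
    for ann in annotations:
        first = ann["start_ns"] // BUCKET_NS
        last = (ann["end_ns"] - 1) // BUCKET_NS
        for b in range(first, last + 1):
            buckets[(ann["asset_id"], b * BUCKET_NS)].append(ann)
    return {
        key: {
            "asset_id": key[0],
            "time_bucket_start_ns": key[1],
            "time_bucket_end_ns": key[1] + BUCKET_NS,
            "source_annotations": anns,  # native intervals survive fusion
        }
        for key, anns in buckets.items()
    }

anns = [
    {"asset_id": "a1", "start_ns": 3_500_000_000, "end_ns": 6_000_000_000,
     "annotation_type": "CHARACTER_SEARCH", "label": "Joey"},
    {"asset_id": "a1", "start_ns": 4_000_000_000, "end_ns": 5_000_000_000,
     "annotation_type": "SCENE_SEARCH", "label": "kitchen"},
]
fused = fuse(anns)
# The bucket starting at second 4 holds both "Joey" and "kitchen".
record = fused[("a1", 4_000_000_000)]
```

A query for "Joey in the kitchen" then reduces to a filter over a single record rather than a client-side intersection of two result lists.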
Shape of the intersection record¶
Netflix's intersection record (shown after Figure 4 in the source):
```json
{
  "associated_ids": {
    "MOVIE_ID": "81686010",
    "ASSET_ID": "01325120-7482-11ef-b66f-0eb58bc8a0ad"
  },
  "time_bucket_start_ns": 4000000000,
  "time_bucket_end_ns": 5000000000,
  "source_annotations": [
    {
      "annotation_id": "7f5959b4-...",
      "annotation_type": "CHARACTER_SEARCH",
      "label": "Joey",
      "time_range": { ... }
    },
    {
      "annotation_id": "c9d59338-...",
      "annotation_type": "SCENE_SEARCH",
      "label": "kitchen",
      "time_range": { ... },
      "embedding_vector": [...]
    }
  ]
}
```
Key properties:
- Bucket is the anchor: one record per `(asset_id, time_bucket)`.
- Original interval preserved on each `source_annotation`: the fusion doesn't destroy the native continuous-time shape.
- Heterogeneous payloads: character annotations carry `label`; scene annotations additionally carry `embedding_vector`. The intersection record is the envelope; child-annotation shapes vary per modality.
Ingest vs query altitudes¶
Multimodal annotation intersection happens at ingest in the offline-fusion stage, not at query. This trades ingest cost + index storage for query latency + expressiveness:
- Ingest pays once per bucket per model re-run, bounded by the composite-key upsert's idempotency guarantee.
- Query gets O(1) look-up of all modalities that co-occurred at a given (asset, bucket), plus the nested-document shape for cross-annotation filtering in a single Elasticsearch query.
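What such a single cross-annotation query could look like, sketched as an Elasticsearch query body built in Python. This assumes `source_annotations` is mapped as a `nested` field; the helper, the index name, and the exact field mappings are illustrative assumptions, not Netflix's published query.

```python
def cooccurrence_query(pairs):
    """Build one bool/must query with a nested clause per (type, label) pair.

    Each nested clause requires BOTH the annotation_type and the label to
    match on the SAME child annotation, which is what nested-document
    indexing buys over flattened fields. (Field names are illustrative.)
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "nested": {
                            "path": "source_annotations",
                            "query": {
                                "bool": {
                                    "must": [
                                        {"term": {"source_annotations.annotation_type": atype}},
                                        {"term": {"source_annotations.label": label}},
                                    ]
                                }
                            },
                        }
                    }
                    for atype, label in pairs
                ]
            }
        }
    }

body = cooccurrence_query([("CHARACTER_SEARCH", "Joey"), ("SCENE_SEARCH", "kitchen")])
# es.search(index="annotation-buckets", body=body)  # hypothetical client call
```

One search request returns exactly the buckets where both annotations co-occur, instead of two searches intersected client-side.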
Seen in¶
- sources/2026-04-04-netflix-powering-multimodal-intelligence-for-video-search — canonical wiki instance. Netflix's three-stage pipeline moves intersection from query time to ingest time, producing a second-by-second multi-modal index with character-recognition and scene-detection examples.
Caveats¶
- Fusion cost grows with the number of modalities and models; each bucket record size grows with the number of co-occurring annotations.
- Re-runs of a single model that update a subset of buckets require idempotent upsert semantics; Netflix uses `(asset_id, time_bucket)` composite-key upsert into Elasticsearch to guarantee this.
- Intersection doesn't give semantic fusion: the record has "Joey" and "kitchen" side-by-side but no combined "Joey-cooking-in-kitchen" embedding. That's a different axis (see MediaFM's shot-level multimodal embedding fusion).
- Query-time expressiveness bounded by what survives into the fused record; anything not annotated is unsearchable.
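The composite-key upsert caveat above can be illustrated with a minimal in-memory sketch of the key scheme only: a deterministic document id derived from `(asset_id, time_bucket)` makes a model re-run overwrite the bucket's record instead of duplicating it. A real pipeline would pass this id to Elasticsearch's index API (and would merge, not replace, annotations from other models); the helper names here are hypothetical.

```python
def doc_id(asset_id, bucket_start_ns):
    # Deterministic id from the composite key (asset_id, time_bucket),
    # so re-emitting a bucket targets the same document.
    return f"{asset_id}:{bucket_start_ns}"

index = {}  # stand-in for the Elasticsearch index

def upsert(record):
    index[doc_id(record["asset_id"], record["time_bucket_start_ns"])] = record

# First run, then a re-run of the same model over the same bucket:
upsert({"asset_id": "a1", "time_bucket_start_ns": 4_000_000_000,
        "source_annotations": ["v1"]})
upsert({"asset_id": "a1", "time_bucket_start_ns": 4_000_000_000,
        "source_annotations": ["v2"]})
# Still one record per (asset_id, bucket); the re-run replaced, not appended.
```

This is what makes ingest cost "pay once per bucket per model re-run" rather than grow without bound across re-runs.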
Related¶
- concepts/temporal-bucket-discretization
- concepts/nested-document-indexing
- concepts/composite-key-upsert
- patterns/temporal-bucketed-intersection
- patterns/offline-fusion-via-event-bus
- patterns/three-stage-ingest-fusion-index
- systems/netflix-marken
- systems/netflix-mediafm (different fusion axis — embedding-level, shot-granularity)
- systems/elasticsearch