

Multimodal annotation intersection

Definition

Multimodal annotation intersection is the operation of fusing annotations produced by independent models of different modalities (vision, audio, text) that hold simultaneously over the same segment of media into a single queryable record.

Canonical wiki instance: Netflix's multimodal video-search pipeline fuses overlapping per-model detections — e.g. "character Joey" from a character-recognition model and "kitchen" from a scene-detection model co-occurring at second 4 — into a single record indexed in Elasticsearch (Source: sources/2026-04-04-netflix-powering-multimodal-intelligence-for-video-search).

Why intersect

Per-modality models produce annotations in their own vocabularies over their own time ranges. Answering a query like "scenes with Joey cooking in the kitchen" requires combining them:

  • "Joey" is a character-recognition output.
  • "kitchen" is a scene-detection output.
  • "cooking" might be an activity-detection output.

A naïve approach would run three separate queries and intersect the resulting asset-ID-and-timestamp lists client-side. This doesn't scale to many modalities or many assets, and it can't express matches at any granularity finer than whole-asset overlap.
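To make the cost concrete, the naïve client-side approach can be sketched as follows. The function name and result shapes are illustrative, not any real API:

```python
# Naive client-side fusion: run one query per modality, collect
# (asset_id, second) hits, then intersect them in application code.
def naive_intersection(results_by_modality):
    """Each entry is a list of (asset_id, second) hits from one modality."""
    sets = [set(hits) for hits in results_by_modality]
    common = set.intersection(*sets)  # one pass per modality, per query
    return sorted(common)

joey    = [("81686010", 4), ("81686010", 5), ("81686010", 9)]
kitchen = [("81686010", 4), ("81686010", 5)]
cooking = [("81686010", 4)]

print(naive_intersection([joey, kitchen, cooking]))  # [('81686010', 4)]
```

Every new modality adds another round-trip and another set to intersect, and the whole result set for each modality must travel to the client first.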

Multimodal annotation intersection moves the fusion into the ingest pipeline: after bucket discretization, each bucket collects all annotations true for that bucket across every modality. Query becomes a single filter over the fused record.
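A minimal sketch of the ingest-time step, assuming 1-second buckets (as in the record below) and annotations carrying nanosecond time ranges; field names are illustrative:

```python
from collections import defaultdict

BUCKET_NS = 1_000_000_000  # 1-second bucket discretization

def fuse(annotations):
    """Group annotations from all modalities by (asset_id, time_bucket)."""
    records = defaultdict(list)
    for ann in annotations:
        first = ann["start_ns"] // BUCKET_NS
        last = (ann["end_ns"] - 1) // BUCKET_NS
        for bucket in range(first, last + 1):  # an annotation may span many buckets
            records[(ann["asset_id"], bucket)].append(ann)
    return records

anns = [
    {"asset_id": "A", "annotation_type": "CHARACTER_SEARCH", "label": "Joey",
     "start_ns": 3_500_000_000, "end_ns": 6_000_000_000},
    {"asset_id": "A", "annotation_type": "SCENE_SEARCH", "label": "kitchen",
     "start_ns": 4_000_000_000, "end_ns": 5_000_000_000},
]
fused = fuse(anns)
# Bucket 4 now holds both modalities' annotations side by side.
print(sorted(a["label"] for a in fused[("A", 4)]))  # ['Joey', 'kitchen']
```

Each value in `records` is the `source_annotations` payload for one fused bucket record, with the original continuous-time intervals preserved on the child annotations.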

Shape of the intersection record

Netflix's intersection record (shown after Figure 4 in the source post):

{
  "associated_ids": {
    "MOVIE_ID": "81686010",
    "ASSET_ID": "01325120-7482-11ef-b66f-0eb58bc8a0ad"
  },
  "time_bucket_start_ns": 4000000000,
  "time_bucket_end_ns": 5000000000,
  "source_annotations": [
    {
      "annotation_id": "7f5959b4-...",
      "annotation_type": "CHARACTER_SEARCH",
      "label": "Joey",
      "time_range": { ... }
    },
    {
      "annotation_id": "c9d59338-...",
      "annotation_type": "SCENE_SEARCH",
      "label": "kitchen",
      "time_range": { ... },
      "embedding_vector": [...]
    }
  ]
}

Key properties:

  • Bucket is the anchor: one record per (asset_id, time_bucket).
  • Original interval preserved on each source_annotation — the fusion doesn't destroy the native continuous-time shape.
  • Heterogeneous payloads: character annotations carry label; scene annotations additionally carry embedding_vector. The intersection record is the envelope; child-annotation shapes vary per modality.

Ingest vs query altitudes

Multimodal annotation intersection happens at ingest, in the offline-fusion stage, not at query time. This trades extra ingest cost and index storage for lower query latency and greater expressiveness:

  • Ingest pays once per bucket per model re-run, bounded by the composite-key upsert's idempotency guarantee.
  • Query gets an O(1) lookup of all modalities that co-occurred at a given (asset, bucket), plus a nested-document shape that supports cross-annotation filtering in a single Elasticsearch query.
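A hedged sketch of what that single-query filter could look like as an Elasticsearch request body, built here as a Python dict; the exact field and type names in Netflix's index are not documented in the source, so these are assumptions modeled on the record shape above:

```python
# Two independent nested clauses against the same bucket record: a hit means
# "Joey" and "kitchen" co-occurred in one (asset, time_bucket).
query = {
    "bool": {
        "filter": [
            {"nested": {
                "path": "source_annotations",
                "query": {"bool": {"filter": [
                    {"term": {"source_annotations.annotation_type": "CHARACTER_SEARCH"}},
                    {"term": {"source_annotations.label": "Joey"}},
                ]}},
            }},
            {"nested": {
                "path": "source_annotations",
                "query": {"bool": {"filter": [
                    {"term": {"source_annotations.annotation_type": "SCENE_SEARCH"}},
                    {"term": {"source_annotations.label": "kitchen"}},
                ]}},
            }},
        ]
    }
}
```

Because `source_annotations` is a nested field, each clause matches type and label within a single child annotation, rather than accidentally matching `CHARACTER_SEARCH` from one child and "kitchen" from another.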

Caveats

  • Fusion cost grows with the number of modalities and models; each bucket record size grows with the number of co-occurring annotations.
  • Re-runs of a single model that update a subset of buckets require idempotent upsert semantics; Netflix uses (asset_id, time_bucket) composite-key upsert into Elasticsearch to guarantee this.
  • Intersection doesn't give semantic fusion — the record has "Joey" and "kitchen" side-by-side but no combined "Joey-cooking-in-kitchen" embedding. That's a different axis (see MediaFM's shot-level multimodal embedding fusion).
  • Query-time expressiveness bounded by what survives into the fused record; anything not annotated is unsearchable.
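The composite-key upsert in the second caveat can be sketched minimally; the dict below stands in for the Elasticsearch index (a real pipeline would derive a deterministic document `_id` the same way):

```python
# Idempotent upsert keyed on (asset_id, time_bucket): re-running a model
# overwrites the same document instead of creating duplicates.
def composite_key(record: dict) -> str:
    ids = record["associated_ids"]
    return f'{ids["ASSET_ID"]}:{record["time_bucket_start_ns"]}'

index: dict[str, dict] = {}  # stand-in for the Elasticsearch index

def upsert(record: dict) -> None:
    index[composite_key(record)] = record  # overwrite, never duplicate

rec = {
    "associated_ids": {"ASSET_ID": "01325120-7482-11ef-b66f-0eb58bc8a0ad"},
    "time_bucket_start_ns": 4_000_000_000,
}
upsert(rec)
upsert(rec)  # idempotent: the second run leaves exactly one document
print(len(index))  # 1
```

The guarantee depends entirely on the key being deterministic per (asset, bucket); any randomness in the document id would break idempotency across model re-runs.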