NETFLIX 2026-04-04 Tier 1

Synchronizing the Senses — Powering Multimodal Intelligence for Video Search

Summary

Netflix Search Engineering describes the ingestion and fusion pipeline that turns raw per-frame model output (character recognition, scene detection, etc.) into a searchable multi-modal index over the Netflix media catalog. The pipeline is a decoupled three-stage process: (1) transactional persistence of raw annotations in the Marken Annotation Service backed by Apache Cassandra, (2) offline data fusion triggered by Kafka events that discretizes annotations into fixed-size temporal buckets and computes cross-model intersections, and (3) indexing for real-time search into Elasticsearch as nested documents via composite-key upserts. The architectural point is that decoupling heavy intersection computation from ingest keeps intake responsive, and that bucket-based discretization is what makes multi-model temporal intersection tractable at Netflix catalog scale.
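
The decoupled flow described above can be sketched in miniature. Everything below is an illustrative in-memory stand-in (a dict for Marken/Cassandra, a deque for the Kafka topic, a dict for Elasticsearch), not Netflix's actual APIs:

```python
from collections import defaultdict, deque

# In-memory stand-ins for the three stages; names are illustrative only.
raw_store = defaultdict(list)   # stage 1: transactional persistence (Marken/Cassandra)
event_bus = deque()             # stage 2 trigger: event bus (Kafka topic)
search_index = {}               # stage 3: searchable index (Elasticsearch)

def ingest(asset_id, annotation):
    """Fast intake path: persist the raw annotation, publish an event,
    return immediately; no heavy computation happens here."""
    raw_store[asset_id].append(annotation)
    event_bus.append({"asset_id": asset_id})

def run_offline_fusion():
    """Asynchronous consumer: does the heavy lifting out-of-band, then
    writes the searchable result."""
    while event_bus:
        asset_id = event_bus.popleft()["asset_id"]
        # Real fusion would bucket and intersect annotations here.
        search_index[asset_id] = {"source_annotations": list(raw_store[asset_id])}

ingest("asset-1", {"annotation_type": "CHARACTER_SEARCH", "label": "Joey"})
ingest("asset-1", {"annotation_type": "SCENE_SEARCH", "label": "kitchen"})
run_offline_fusion()
```

The point the sketch makes is structural: `ingest` only appends and publishes, so intake latency is independent of how expensive the fusion consumer's work becomes.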

Key takeaways

  1. Three-stage decoupled pipeline: transactional persistence → offline fusion → indexing-for-search. "To ensure system resilience and scalability, the transition from raw model output to searchable intelligence follows a decoupled, three-stage process" (Source).

  2. Marken is the transactional gate: raw per-model annotations land in the Marken annotation service over Cassandra for "data integrity and high-speed write throughput, guaranteeing that every piece of model output is safely captured" (Source). See systems/apache-cassandra for the storage substrate.

  3. Kafka triggers the offline fusion: "the system publishes an event via Apache Kafka to trigger an asynchronous processing job … the offline pipeline handles the heavy computational lifting out-of-band" (Source). Canonical patterns/offline-fusion-via-event-bus instance.

  4. Fixed-size time-bucket discretization: continuous detections are segmented into discrete intervals. The worked example: a "Joey" character span from seconds 2-8 is mapped into seven distinct one-second buckets. A "kitchen" scene from seconds 4-9 overlaps the "Joey" span from second 4 onward; in the worked bucket covering seconds 4-5, the system fuses the two annotations into a single record (concepts/temporal-bucket-discretization; concepts/multimodal-annotation-intersection).

  5. Enriched records go back to Cassandra: "These newly enriched records are written back to Cassandra as distinct entities. This creates a highly optimized, second-by-second index of multi-modal intersections, perfectly associating every fused annotation with its source asset" (Source).

  6. Elasticsearch indexed via composite-key upsert: "the pipeline executes upsert operations using a composite key (asset ID + time bucket) as the unique document identifier. If a temporal bucket already exists for a specific second of video … the system intelligently updates the existing record rather than generating a duplicate. This mechanism establishes a single, unified source of truth for every second of footage" (Source). Canonical concepts/composite-key-upsert instance.

  7. Nested-document shape enables cross-annotation queries: "the pipeline structures each temporal bucket as a nested document. The root level captures the overarching asset context, while associated child documents house the specific, multi-modal annotation data. This hierarchical data model is precisely what empowers users to execute highly efficient, cross-annotation queries at scale" (concepts/nested-document-indexing; patterns/nested-elasticsearch-for-multimodal-query).
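
The composite-key upsert in takeaway 6 can be sketched as follows. A plain dict stands in for Elasticsearch, and the key format and field names are illustrative assumptions, not the post's actual schema:

```python
# A dict stands in for the Elasticsearch index; doc_id is the composite
# key (asset ID + time bucket) used as the unique document identifier.
index = {}

def upsert_bucket(asset_id, bucket_start_ns, annotation):
    doc_id = f"{asset_id}:{bucket_start_ns}"
    # Upsert: create the bucket document on first sight, otherwise update
    # the existing record rather than generating a duplicate.
    doc = index.setdefault(doc_id, {
        "asset_id": asset_id,
        "time_bucket_start_ns": bucket_start_ns,
        "source_annotations": [],
    })
    doc["source_annotations"].append(annotation)

upsert_bucket("01325120", 4_000_000_000,
              {"annotation_type": "CHARACTER_SEARCH", "label": "Joey"})
upsert_bucket("01325120", 4_000_000_000,
              {"annotation_type": "SCENE_SEARCH", "label": "kitchen"})
# Second 4 now holds a single unified document carrying both modalities.
```

Because the document identity is derived from (asset, bucket) rather than from the write, replaying the same annotations converges on the same index state: the "single, unified source of truth for every second of footage."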

Architecture / numbers

Sample raw scene-search annotation (Figure 2 in the post):

{
  "type": "SCENE_SEARCH",
  "time_range": { "start_time_ns": 4000000000, "end_time_ns": 9000000000 },
  "embedding_vector": [-0.036, -0.33, -0.29, ...],
  "label": "kitchen",
  "confidence_score": 0.72
}

Bucket-mapping steps (Figure 3):

  1. Bucket Mapping — continuous detections segmented into discrete intervals. "Joey" seconds 2-8 → seven one-second buckets.
  2. Annotation Intersection — overlapping buckets from different models fused into a single comprehensive record. "Joey" + "kitchen" co-occur in second 4.
  3. Optimized Persistence — enriched records written back to Cassandra as distinct entities; a second-by-second index.
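
Steps 1 and 2 can be sketched as below. The inclusive end-bucket convention is an assumption chosen to match the worked example ("Joey" at seconds 2-8 → seven buckets), and the one-second bucket size is the example's convention, not a disclosed production value:

```python
def to_buckets(start_ns, end_ns, bucket_ns=1_000_000_000):
    """Map a continuous detection span onto every fixed-size bucket it
    touches, inclusive of the bucket containing end_ns (assumption: this
    is what makes "Joey" at seconds 2-8 land in seven buckets)."""
    return set(range(start_ns // bucket_ns, end_ns // bucket_ns + 1))

# Figure 3's worked example, in nanoseconds as in the sample annotations.
joey    = to_buckets(2_000_000_000, 8_000_000_000)   # buckets 2..8 -> seven
kitchen = to_buckets(4_000_000_000, 9_000_000_000)   # buckets 4..9

# Step 2, annotation intersection: buckets where both modalities co-occur.
# The Figure 4 record is the one for the bucket starting at second 4.
co_occurring = joey & kitchen
```

Discretization is what keeps this tractable: intersecting bucket sets is set membership per second, rather than pairwise interval-overlap tests across every model's continuous spans.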

Sample intersection record (Figure 4):

{
  "associated_ids": {
    "MOVIE_ID": "81686010",
    "ASSET_ID": "01325120-7482-11ef-b66f-0eb58bc8a0ad"
  },
  "time_bucket_start_ns": 4000000000,
  "time_bucket_end_ns": 5000000000,
  "source_annotations": [
    {
      "annotation_id": "7f5959b4-5ec7-11f0-b475-122953903c43",
      "annotation_type": "CHARACTER_SEARCH",
      "label": "Joey",
      "time_range": { "start_time_ns": 2000000000, "end_time_ns": 8000000000 }
    },
    {
      "annotation_id": "c9d59338-842c-11f0-91de-12433798cf4d",
      "annotation_type": "SCENE_SEARCH",
      "time_range": { "start_time_ns": 4000000000, "end_time_ns": 9000000000 },
      "label": "kitchen",
      "embedding_vector": [0.9001, 0.00123, ...]
    }
  ]
}

Elasticsearch document shape (Figure 5): nested — root asset context, child documents for per-modality annotation data.
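
A hypothetical rendering of that nested shape as Elasticsearch-style mapping and query dicts. Field names follow the Figure 4 sample; the actual production mapping is not disclosed, so this is an assumption:

```python
# Mapping for the Figure 5 shape: root-level asset context plus a nested
# field holding per-modality child annotation documents.
mapping = {
    "properties": {
        "associated_ids": {"properties": {"MOVIE_ID": {"type": "keyword"},
                                          "ASSET_ID": {"type": "keyword"}}},
        "time_bucket_start_ns": {"type": "long"},
        "source_annotations": {
            "type": "nested",   # children indexed as separate Lucene docs
            "properties": {
                "annotation_type": {"type": "keyword"},
                "label": {"type": "keyword"},
            },
        },
    }
}

# Cross-annotation query ("Joey in a kitchen"): two nested clauses that
# must each match some child annotation within the same temporal bucket.
query = {
    "bool": {
        "must": [
            {"nested": {"path": "source_annotations",
                        "query": {"term": {"source_annotations.label": "Joey"}}}},
            {"nested": {"path": "source_annotations",
                        "query": {"term": {"source_annotations.label": "kitchen"}}}},
        ]
    }
}
```

Because the bucket itself is the document, the co-occurrence constraint comes for free: both nested clauses must be satisfied inside one document, i.e. one second of footage.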

Caveats

  • Architecture-overview voice — no scale numbers (annotations/sec, fleet size, ES shard count, bucket cardinality), no latency percentiles, no concrete bucket-size disclosure (the one-second bucket is a worked-example convention; production-bucket size is not stated).
  • The Cassandra schema underneath Marken is not re-disclosed here; it is only referenced via the annotation-service link to the 2021 "Scalable Annotation Service: Marken" post and the "high-availability pipelines" link to the 2020 "Data ingestion pipeline with operation management" post.
  • Fusion-job scheduling (batch cadence, backfill semantics, re-run behaviour on model-version change) is not specified. The composite-key upsert implies idempotent re-indexing, but cross-bucket atomicity of a model re-run isn't described.
  • Nested-document cost — Elasticsearch nested documents carry a known query- and reindex-cost penalty versus fully denormalized (flattened) documents; the post doesn't discuss that trade-off or shard sizing.
  • Query side is out of scope — this post describes ingest / fusion / indexing; how the multimodal query language surfaces ("find scenes with Joey in a kitchen") is deferred.
  • Embedding vectors appear in sample annotations (scene embedding, potentially other modalities) but the post doesn't describe how vector search integrates with the temporal-bucket text-filter indexing — whether vectors live on the nested child doc for ANN, or are retrieved separately by asset and joined.
  • Relationship to MediaFM (2026-02-23 Netflix ingest) is not explicit — MediaFM produces shot-level multi-modal embeddings; this post describes bucket-level intersection of per-model annotations. The two almost certainly share upstream model output but Netflix doesn't draw the line in either post.
