NETFLIX 2026-04-04 Tier 1

Synchronizing the Senses — Powering Multimodal Intelligence for Video Search

Summary

Netflix Search Engineering describes the ingestion and fusion pipeline that turns raw per-frame model output (character recognition, scene detection, etc.) into a searchable multi-modal index over the Netflix media catalog. The pipeline is a decoupled three-stage process: (1) transactional persistence of raw annotations in the Marken Annotation Service backed by Apache Cassandra, (2) offline data fusion triggered by Kafka events that discretizes annotations into fixed-size temporal buckets and computes cross-model intersections, and (3) indexing for real-time search into Elasticsearch as nested documents via composite-key upserts. The architectural point is that decoupling heavy intersection computation from ingest keeps intake responsive, and that bucket-based discretization is what makes multi-model temporal intersection tractable at Netflix catalog scale.
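
The decoupled flow described above can be sketched in miniature. Everything below is an illustrative in-memory stand-in (a dict for Marken/Cassandra, a deque for the Kafka topic, a dict for Elasticsearch), not Netflix's actual APIs:

```python
from collections import defaultdict, deque

# In-memory stand-ins for the three stages; names are illustrative only.
raw_store = defaultdict(list)   # stage 1: transactional persistence (Marken/Cassandra)
event_bus = deque()             # stage 2 trigger: event bus (Kafka topic)
search_index = {}               # stage 3: searchable index (Elasticsearch)

def ingest(asset_id, annotation):
    """Fast intake path: persist the raw annotation, publish an event,
    return immediately; no heavy computation happens here."""
    raw_store[asset_id].append(annotation)
    event_bus.append({"asset_id": asset_id})

def run_offline_fusion():
    """Asynchronous consumer: does the heavy lifting out-of-band, then
    writes the searchable result."""
    while event_bus:
        asset_id = event_bus.popleft()["asset_id"]
        # Real fusion would bucket and intersect annotations here.
        search_index[asset_id] = {"source_annotations": list(raw_store[asset_id])}

ingest("asset-1", {"annotation_type": "CHARACTER_SEARCH", "label": "Joey"})
ingest("asset-1", {"annotation_type": "SCENE_SEARCH", "label": "kitchen"})
run_offline_fusion()
```

The point the sketch makes is structural: `ingest` only appends and publishes, so intake latency is independent of how expensive the fusion consumer's work becomes.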

Key takeaways

  1. Three-stage decoupled pipeline: transactional persistence → offline fusion → indexing-for-search. "To ensure system resilience and scalability, the transition from raw model output to searchable intelligence follows a decoupled, three-stage process" (Source).

  2. Marken is the transactional gate: raw per-model annotations land in the Marken annotation service over Cassandra for "data integrity and high-speed write throughput, guaranteeing that every piece of model output is safely captured" (Source). See systems/apache-cassandra for the storage substrate.

  3. Kafka triggers the offline fusion: "the system publishes an event via Apache Kafka to trigger an asynchronous processing job … the offline pipeline handles the heavy computational lifting out-of-band" (Source). Canonical patterns/offline-fusion-via-event-bus instance.

  4. Fixed-size time-bucket discretization: continuous detections are segmented into discrete intervals. The worked example: a "Joey" character span from seconds 2-8 is mapped into seven distinct one-second buckets. A "kitchen" scene from seconds 4-9 overlaps the "Joey" span from second 4 onward; in the worked bucket covering seconds 4-5, the system fuses the two annotations into a single record (concepts/temporal-bucket-discretization; concepts/multimodal-annotation-intersection).

  5. Enriched records go back to Cassandra: "These newly enriched records are written back to Cassandra as distinct entities. This creates a highly optimized, second-by-second index of multi-modal intersections, perfectly associating every fused annotation with its source asset" (Source).

  6. Elasticsearch indexed via composite-key upsert: "the pipeline executes upsert operations using a composite key (asset ID + time bucket) as the unique document identifier. If a temporal bucket already exists for a specific second of video … the system intelligently updates the existing record rather than generating a duplicate. This mechanism establishes a single, unified source of truth for every second of footage" (Source). Canonical concepts/composite-key-upsert instance.

  7. Nested-document shape enables cross-annotation queries: "the pipeline structures each temporal bucket as a nested document. The root level captures the overarching asset context, while associated child documents house the specific, multi-modal annotation data. This hierarchical data model is precisely what empowers users to execute highly efficient, cross-annotation queries at scale" (concepts/nested-document-indexing; patterns/nested-elasticsearch-for-multimodal-query).
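
The composite-key upsert in takeaway 6 can be sketched as follows. A plain dict stands in for Elasticsearch, and the key format and field names are illustrative assumptions, not the post's actual schema:

```python
# A dict stands in for the Elasticsearch index; doc_id is the composite
# key (asset ID + time bucket) used as the unique document identifier.
index = {}

def upsert_bucket(asset_id, bucket_start_ns, annotation):
    doc_id = f"{asset_id}:{bucket_start_ns}"
    # Upsert: create the bucket document on first sight, otherwise update
    # the existing record rather than generating a duplicate.
    doc = index.setdefault(doc_id, {
        "asset_id": asset_id,
        "time_bucket_start_ns": bucket_start_ns,
        "source_annotations": [],
    })
    doc["source_annotations"].append(annotation)

upsert_bucket("01325120", 4_000_000_000,
              {"annotation_type": "CHARACTER_SEARCH", "label": "Joey"})
upsert_bucket("01325120", 4_000_000_000,
              {"annotation_type": "SCENE_SEARCH", "label": "kitchen"})
# Second 4 now holds a single unified document carrying both modalities.
```

Because the document identity is derived from (asset, bucket) rather than from the write, replaying the same annotations converges on the same index state: the "single, unified source of truth for every second of footage."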

Architecture / numbers

Sample raw scene-search annotation (Figure 2 in the post):

{
  "type": "SCENE_SEARCH",
  "time_range": { "start_time_ns": 4000000000, "end_time_ns": 9000000000 },
  "embedding_vector": [-0.036, -0.33, -0.29, ...],
  "label": "kitchen",
  "confidence_score": 0.72
}

Bucket-mapping steps (Figure 3):

  1. Bucket Mapping — continuous detections segmented into discrete intervals. "Joey" seconds 2-8 → seven one-second buckets.
  2. Annotation Intersection — overlapping buckets from different models fused into a single comprehensive record. "Joey" + "kitchen" co-occur in second 4.
  3. Optimized Persistence — enriched records written back to Cassandra as distinct entities; a second-by-second index.
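
Steps 1 and 2 can be sketched as below. The inclusive end-bucket convention is an assumption chosen to match the worked example ("Joey" at seconds 2-8 → seven buckets), and the one-second bucket size is the example's convention, not a disclosed production value:

```python
def to_buckets(start_ns, end_ns, bucket_ns=1_000_000_000):
    """Map a continuous detection span onto every fixed-size bucket it
    touches, inclusive of the bucket containing end_ns (assumption: this
    is what makes "Joey" at seconds 2-8 land in seven buckets)."""
    return set(range(start_ns // bucket_ns, end_ns // bucket_ns + 1))

# Figure 3's worked example, in nanoseconds as in the sample annotations.
joey    = to_buckets(2_000_000_000, 8_000_000_000)   # buckets 2..8 -> seven
kitchen = to_buckets(4_000_000_000, 9_000_000_000)   # buckets 4..9

# Step 2, annotation intersection: buckets where both modalities co-occur.
# The Figure 4 record is the one for the bucket starting at second 4.
co_occurring = joey & kitchen
```

Discretization is what keeps this tractable: intersecting bucket sets is set membership per second, rather than pairwise interval-overlap tests across every model's continuous spans.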

Sample intersection record (Figure 4):

{
  "associated_ids": {
    "MOVIE_ID": "81686010",
    "ASSET_ID": "01325120-7482-11ef-b66f-0eb58bc8a0ad"
  },
  "time_bucket_start_ns": 4000000000,
  "time_bucket_end_ns": 5000000000,
  "source_annotations": [
    {
      "annotation_id": "7f5959b4-5ec7-11f0-b475-122953903c43",
      "annotation_type": "CHARACTER_SEARCH",
      "label": "Joey",
      "time_range": { "start_time_ns": 2000000000, "end_time_ns": 8000000000 }
    },
    {
      "annotation_id": "c9d59338-842c-11f0-91de-12433798cf4d",
      "annotation_type": "SCENE_SEARCH",
      "time_range": { "start_time_ns": 4000000000, "end_time_ns": 9000000000 },
      "label": "kitchen",
      "embedding_vector": [0.9001, 0.00123, ...]
    }
  ]
}

Elasticsearch document shape (Figure 5): nested — root asset context, child documents for per-modality annotation data.
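
A hypothetical rendering of that nested shape as Elasticsearch-style mapping and query dicts. Field names follow the Figure 4 sample; the actual production mapping is not disclosed, so this is an assumption:

```python
# Mapping for the Figure 5 shape: root-level asset context plus a nested
# field holding per-modality child annotation documents.
mapping = {
    "properties": {
        "associated_ids": {"properties": {"MOVIE_ID": {"type": "keyword"},
                                          "ASSET_ID": {"type": "keyword"}}},
        "time_bucket_start_ns": {"type": "long"},
        "source_annotations": {
            "type": "nested",   # children indexed as separate Lucene docs
            "properties": {
                "annotation_type": {"type": "keyword"},
                "label": {"type": "keyword"},
            },
        },
    }
}

# Cross-annotation query ("Joey in a kitchen"): two nested clauses that
# must each match some child annotation within the same temporal bucket.
query = {
    "bool": {
        "must": [
            {"nested": {"path": "source_annotations",
                        "query": {"term": {"source_annotations.label": "Joey"}}}},
            {"nested": {"path": "source_annotations",
                        "query": {"term": {"source_annotations.label": "kitchen"}}}},
        ]
    }
}
```

Because the bucket itself is the document, the co-occurrence constraint comes for free: both nested clauses must be satisfied inside one document, i.e. one second of footage.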

Caveats

  • Architecture-overview voice — no scale numbers (annotations/sec, fleet size, ES shard count, bucket cardinality), no latency percentiles, no concrete bucket-size disclosure (the one-second bucket is a worked-example convention; production-bucket size is not stated).
  • The Cassandra schema underneath Marken is not re-disclosed here; it is only referenced via the annotation-service link to the 2021 "Scalable Annotation Service: Marken" post and the "high-availability pipelines" link to the 2020 "Data ingestion pipeline with operation management" post.
  • Fusion-job scheduling (batch cadence, backfill semantics, re-run behaviour on model-version change) is not specified. The composite-key upsert implies idempotent re-indexing, but cross-bucket atomicity of a model re-run isn't described.
  • Nested-document cost — Elasticsearch nested documents carry a known query- and reindex-cost penalty versus fully denormalized (flattened) documents; the post doesn't discuss that trade-off or shard sizing.
  • Query side is out of scope — this post describes ingest / fusion / indexing; how the multimodal query language surfaces ("find scenes with Joey in a kitchen") is deferred.
  • Embedding vectors appear in sample annotations (scene embedding, potentially other modalities) but the post doesn't describe how vector search integrates with the temporal-bucket text-filter indexing — whether vectors live on the nested child doc for ANN, or are retrieved separately by asset and joined.
  • Relationship to MediaFM (2026-02-23 Netflix ingest) is not explicit — MediaFM produces shot-level multi-modal embeddings; this post describes bucket-level intersection of per-model annotations. The two almost certainly share upstream model output but Netflix doesn't draw the line in either post.
