
NETFLIX 2026-05-04


Netflix — Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Summary

A 2026-05-04 Netflix TechBlog post describing the Metadata Service (MDS), Netflix's centralized model lifecycle graph. MDS unifies fragmented ML metadata (models, features, pipelines, datasets, A/B tests, owners) from previously siloed source systems into a single queryable graph that powers the AIP Portal for ML practitioners. The post walks through MDS's five-stage ingestion pipeline using a concrete worked example (linking a model instance to its A/B tests via multi-hop inference):

  1. Event Ingestion: source systems (Pipeline Orchestration, Model Registry, Feature Store, Experimentation Platform, Datasets, Identity Platform) emit "thin events" containing only an identifier and an event type, delivered via Kafka and AWS SNS/SQS.
  2. Entity Enrichment: for each event MDS validates the schema, calls back to the source system's API to fetch the complete current state, and transforms the response into a normalized entity. This source-of-truth hydration design has the load-bearing property that "the order of events doesn't matter. MDS always fetches the latest facts from the source of truth", which makes the system robust to dropped or out-of-order events at the cost of additional read load on source APIs (mitigated via deliberate rate-limiting, caching, and backoff in enrichment workers).
  3. Data Transformation and Normalization: heterogeneous source schemas are normalized into a unified entity model with standardized fields (platform-specific IDs become global AIP URIs like aip://model/registry/ranking-model-v5-20XX0101; foreign keys become entity references; owner_emails becomes resolved owners URIs; labels becomes tags).
  4. Storage and Indexing: normalized entities are written synchronously first to Datomic (which serves both as system-of-record and graph database, leveraging its immutable fact model to store all relationships as reified edges and support continuous edge addition without losing original entity state), then immediately indexed to Elasticsearch (single unified entities index differentiated by entityType field; separate owners index; relevance boosting on exact name matches; tags as key-value pairs like team::personalization, env::production, model.state::released).
  5. Knowledge Enrichment and Graph Formation: scheduled background jobs scan Datomic for entities marked uncached or with unresolved references, hydrate relationships from source-of-truth APIs, materialize cross-system edges as new Datomic facts, trigger Elasticsearch re-indexing, and mark entities as enriched to prevent reprocessing.

This async relationship inference is what derives multi-hop relationships invisible to any single source system. The post's worked example walks the chain Model Instance → Pipeline Run → A/B Test Cell → A/B Test, materializes a direct Model Instance ↔ A/B Test edge from the transitive walk, then re-indexes so that the associatedAbTests GraphQL field on the model resolves in a single query instead of the prior 4-step manual lookup across Model Registry → Pipeline Orchestrator → Experimentation Platform.

The split storage rationale: Elasticsearch is the discovery surface (free-text search entry point in the AIP Portal), and Datomic is the navigation surface (relationship-heavy multi-hop queries: "starting from this model instance, show me all upstream datasets and downstream experiments"; "given this feature, list all consuming models and their owning teams"). This is the canonical instance of the new patterns/dual-store-graph-plus-search-index pattern.

The post acknowledges enrichment is not real-time: "newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it's safe to rely on a particular relationship for debugging or impact analysis."

The article closes with four named open challenges: tool proliferation (plugin architectures for new ML tools), domain-specific visualizations (per-entity-type tailored UI), metadata quality (automated validation when source systems fail to emit events or ownership becomes stale), and advanced relationship inference (recommending features based on similar pipelines, detecting models that serve similar purposes).

The post is written in an architecture-and-discipline voice: explicit pipeline stages with a worked example, named source systems, named storage substrates (Datomic + Elasticsearch), and the exact GraphQL query shape, but no QPS, no entity count, no fleet size, no enrichment-job latency, and no daily-event-volume figure.
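The thin-event hydration step described in the summary can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Netflix's implementation: `FakeRegistry`, `MetadataStore`, `handle_event`, and the `model_instance_updated` event type are hypothetical names, and an in-memory dict stands in for the source system's API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRegistry:
    """Hypothetical stand-in for a source system's read API."""
    instances: dict = field(default_factory=dict)

    def get_instance(self, instance_id: str) -> dict:
        # Always returns the current state, never a historical one.
        return dict(self.instances[instance_id])


@dataclass
class MetadataStore:
    """Hypothetical stand-in for the MDS entity store."""
    entities: dict = field(default_factory=dict)


def handle_event(event: dict, registry: FakeRegistry, store: MetadataStore) -> None:
    """Treat the event as a trigger only and fetch full state from the source."""
    if not {"event_type", "instance_id"} <= event.keys():
        raise ValueError(f"malformed thin event: {event!r}")
    instance_id = event["instance_id"]
    store.entities[instance_id] = registry.get_instance(instance_id)


registry = FakeRegistry({"ranking-model-v5-20XX0101": {"state": "released"}})
store = MetadataStore()

# Two events arrive out of order: "updated" before "created".
events = [
    {"event_type": "model_instance_updated", "instance_id": "ranking-model-v5-20XX0101"},
    {"event_type": "model_instance_created", "instance_id": "ranking-model-v5-20XX0101"},
]
for event in events:
    handle_event(event, registry, store)

# The stored entity matches the source of truth regardless of event order.
print(store.entities["ranking-model-v5-20XX0101"])
```

Because the handler never reads state from the event payload, replaying, reordering, or duplicating events leaves the stored entity equal to whatever the source reports at fetch time.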

Key takeaways

  1. The forcing function is fragmentation across purpose-built source systems. "Source systems are purpose-built and don't know about entities in other domains." The Model Registry doesn't track experiments; the Experimentation Platform doesn't track which pipeline produced a given model; the Pipeline Orchestrator doesn't surface ownership uniformly. Before MDS, answering "Which A/B tests are using this model?" required four manual lookups: "(1) Looking up the model in the Model Registry; (2) Finding which pipeline produced it; (3) Checking the Pipeline Orchestrator for A/B test tags; (4) Querying the Experimentation Platform for test details." MDS's value is not building a new source system but deriving cross-system knowledge that no single source system has — the cross-system relationships are the product (Source).
  2. Thin events + source-of-truth hydration is the canonical ingestion shape. "Source systems emit thin events that include an identifier and an event type." Example: {"event_type": "model_instance_created", "instance_id": "ranking-model-v5-20XX0101"}. "This design keeps producers simple. Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements." On receipt, MDS calls back: GET /api/v1/instances/ranking-model-v5-20XX0101 and the registry responds with the full descriptor. The structural property: "the order of events doesn't matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes." Canonical instance of the new patterns/thin-event-plus-source-hydration pattern — structurally distinct from event-sourcing (where the event stream is the state) by virtue of the event stream being purely a change-notification trigger while authoritative state lives in source systems (Source).
  3. The hydration tradeoff is read load on source systems, addressed deliberately. "The tradeoff is that we place additional read load on source systems during hydration and need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don't overload them." The pattern moves complexity from producers (who would otherwise need to assemble full payloads) to MDS (which absorbs the read amplification + must implement rate-limiting). Producers stay simple; consumer is the integration point. This is a conscious push of integration cost into the metadata layer, exchanging producer-side simplicity for centralized read-side discipline (Source).
  4. Normalization to a global URI namespace is the precondition for graph traversal across systems. "The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references." Example URI: aip://model/registry/ranking-model-v5-20XX0101. The URI scheme aip://<entity-type>/<source-system>/<source-id> encodes both the entity's source system (so MDS can hydrate from the right API) and a globally-unique handle (so cross-system edges are pointers, not opaque strings). Without this step, downstream consumers would "need to understand every source system's schema" — normalization makes queries and relationships work across all entity types uniformly. Canonical instance of the concepts/entity-uri-namespace primitive (Source).
  5. Datomic's immutable-fact model is load-bearing for the graph layer. "Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state." What's stored: "All entity attributes as facts; Entity references (foreign keys that may point to entities not yet fully resolved); All relationships as reified edges (added by enrichment processes); Entity lifecycle state (tracking which entities are fully enriched vs awaiting hydration)." The reified edges point is the architectural payoff: relationships aren't computed from entity attributes, they're first-class facts that enrichment jobs append to the same fact store. This enables "complex graph traversals (Navigate from a model to its features to their data sources in a single query)", "entity relationships (Join across multiple domains without N+1 query problems)", "flexible schema evolution (Easy to add new entity types and attributes as the catalog grows)", and "progressive enrichment (Background jobs efficiently identify and process entities requiring additional hydration, enabling gradual graph completion without reprocessing fully enriched entities)." Canonical instance of reified edges in an immutable-fact substrate (Source).
  6. Two stores, two access patterns — discovery vs navigation. "In practice, we use Datomic for relationship-heavy, navigational queries such as: Starting from this model instance, show me all upstream datasets and downstream experiments. Given this feature, list all consuming models and their owning teams. These queries often span multiple hops in the graph and benefit from Datomic's immutable fact model and efficient joins across entity relationships." In contrast: "Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page." The split: ES handles find (typed-name → candidate entity), Datomic handles explore (entity → its neighborhood). Indexing is synchronous on the write path: "Once normalized, entities are persisted to Datomic, which serves as both a local cache and a graph database. Immediately after writing to Datomic, entities are indexed in Elasticsearch." Both stores reflect the same canonical state; ES is a derived view tuned for free-text + faceted query, Datomic for graph traversal. Canonical instance of the new patterns/dual-store-graph-plus-search-index pattern (Source).
  7. Async enrichment is the multi-hop relationship-derivation engine. "Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships." The workflow: "Identify candidates: Find entities marked as uncached or with unresolved references; Hydrate relationships: Query source-of-truth systems to fetch related entity details; Materialize edges: Write discovered relationships back to Datomic; Re-index: Trigger Elasticsearch indexing for updated entities; Mark as enriched: Update entity status to prevent redundant processing." The worked example: when MDS processes a new model instance, the model references a pipeline_run_id. An enrichment job hydrates the pipeline run via GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101 and discovers ab_test_cells, then queries GET /api/v1/tests/12345 for test details, derives the chain Model Instance was produced by Pipeline Run → Pipeline Run was executed for A/B Test Cell #2 → A/B Test Cell #2 belongs to A/B Test "Ranking Model v5 vs v4", and writes the inferred Model Instance ↔ A/B Test edge back to Datomic + triggers re-indexing. "MDS doesn't just store what it's told; it derives new knowledge by walking the graph in the background." Canonical instance of the new patterns/async-graph-enrichment-job pattern + the concepts/multi-hop-relationship-materialization primitive (Source).
  8. Async enrichment has a staleness budget that's surfaced to users. "Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it's safe to rely on a particular relationship for debugging or impact analysis." The system's honesty about its own staleness is the operator-facing primitive: rather than pretending edges are real-time, MDS makes the freshness gradient queryable per-entity, letting practitioners decide whether the data is fresh enough for their use case (debugging an active incident vs planning a deprecation). Distinct from typical CDC pipelines that hide replication lag inside the pipe; here the lag is a first-class metadata field (Source).
  9. The unified GraphQL query is the user-facing payoff. Before MDS, the "Which A/B tests are using this model?" query was a 4-system manual walk. With MDS, the same query is a single GraphQL traversal:
    query {
      model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
        name
        owners { name }
        currentInstance {
          version
          pipeline { name, owners { name } }
          features {
            edges { node { name, data { edges { node { name } } } } }
          }
          associatedAbTests {
            name
            cells { number, name }
          }
        }
      }
    }
    
    "The reverse query also works: 'What models are being tested in experiment 12345?'" The bidirectionality is structural — reified edges in Datomic are queryable from both endpoints, so neither direction has a preferred axis. The query shape collapses what was an N-system orchestration problem into a single graph walk against MDS — the canonical worked example of why a unified metadata graph beats federated query across source-system APIs (Source).
  10. Single ES entities index + per-domain entityType discriminator is the indexing decision. "Single entities index: All entity types (models, features, pipelines, etc.) are indexed in one unified index, differentiated by the entityType field. Separate owners index: Dedicated index for users and groups to enable cross-entity owner searches." The single-index choice is the search-side dual of the unified URI namespace: a free-text search across all entity types is one query against one index, with the entityType filter narrowing the results. Owners get a separate index because "cross-entity owner searches" (for example, "who owns anything called ranking") is a different query shape than "find entities matching name ranking-v5". Tags are stored as key-value pairs (team::personalization, env::production, model.state::released) so a single tag query can filter across any entity type uniformly. Relevance boosting ensures "exact name matches score significantly higher" than fuzzy/related-metadata matches, which is the canonical search-quality lever for catalog UX (Source).
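The single-index layout from takeaway 10 can be illustrated with a query body built as a plain dictionary. This is a sketch only: the post names the `entityType` discriminator and the key-value tag format, while the field names `name`, `name.keyword`, `description`, and `tags`, the boost value, and the helper `build_entity_search` are assumptions made for illustration.

```python
import json


def build_entity_search(text, entity_type=None, tags=None):
    """Build an Elasticsearch query body for one unified entities index."""
    filters = []
    if entity_type is not None:
        # Narrow to one entity type via the discriminator field.
        filters.append({"term": {"entityType": entity_type}})
    for tag in tags or []:
        # Tags are key-value strings such as "env::production".
        filters.append({"term": {"tags": tag}})
    return {
        "query": {
            "bool": {
                # Free-text match over assumed name and description fields.
                "must": [{"multi_match": {"query": text, "fields": ["name", "description"]}}],
                # Exact name matches get an extra score boost (value is arbitrary).
                "should": [{"term": {"name.keyword": {"value": text, "boost": 10.0}}}],
                "filter": filters,
            }
        }
    }


body = build_entity_search("ranking-model-v5", entity_type="model", tags=["env::production"])
print(json.dumps(body, indent=2))
```

Omitting `entity_type` searches every entity type with the same query, which is the practical benefit of keeping all types in one index.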

Architectural numbers + operational notes (from source)

  • No QPS / no event volume / no entity count disclosed. The post is architecture-driven; quantitative production data is absent.
  • Pipeline stages: 5 — (1) Event Ingestion → (2) Entity Enrichment → (3) Data Transformation and Normalization → (4) Storage and Indexing → (5) Knowledge Enrichment and Graph Formation.
  • Source systems integrated (named): Pipeline Orchestration, Model Registry, Feature Store, Experimentation Platform, Datasets, Identity Platform — six categories of upstream metadata producers.
  • Event ingestion substrate: Kafka + AWS SNS/SQS. "MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real-time."
  • Storage substrate (named): Datomic (graph + system-of-record) + Elasticsearch (search + discovery).
  • Datomic role split: "both the system of record for MDS and the working dataset for enrichment processes" — single fact store, two access patterns (synchronous writes from ingestion + asynchronous edge appends from enrichment).
  • Elasticsearch index layout: Single entities index (all types, entityType discriminator) + separate owners index. Tags as key-value pairs.
  • Enrichment latency: "typically minutes rather than seconds" between underlying entity creation and discovered-relationship visibility. Last-enriched timestamp is surfaced in the AIP Portal.
  • Worked-example URIs (illustrative, not literal production IDs): aip://model/registry/ranking-model-v5-20XX0101; aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101; aip://user/identity/alice.
  • Worked-example A/B test ID: 12345, cell #2, cell name treatment_ranking_v5, control control_ranking_v4.
  • Open challenges (named): tool proliferation (plugin architecture); domain-specific visualizations (per-entity-type UI); metadata quality (automated validation + anomaly detection); advanced relationship inference (recommending features, detecting similar-purpose models).
  • Acknowledged authors (Netflix AI Platform org): Emma Carney, Megan Ren, Nadeem Ahmad, Pat Olenik, Prateek Agarwal, Tigran Hakobyan, Yinglao Liu.
  • Nothing disclosed about: production fleet size, number of entities in the graph, number of edges, daily event throughput, enrichment-job parallelism, hydration-API call rate against source systems, source-system rate-limit budgets, AIP Portal traffic, user count, GraphQL query latency, Datomic cluster topology, Elasticsearch cluster topology, deployment platform, staleness-timestamp UI shape, exact enrichment job cadence, retry/backoff policy specifics, the "central management service" implementation.
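The last-enriched timestamp noted above lets a consumer decide whether a relationship is fresh enough for a given task. The sketch below shows one way such a check might look; the function name `is_fresh_enough` and both thresholds are hypothetical, since the post does not describe how practitioners or tools consume the timestamp.

```python
from datetime import datetime, timedelta, timezone


def is_fresh_enough(last_enriched_at, max_age, now=None):
    """Return True if the entity was enriched within max_age of now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last_enriched_at) <= max_age


# Fixed clock so the example is deterministic.
now = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
last_enriched_at = now - timedelta(minutes=7)

# Debugging an active incident tolerates little staleness (assumed 2 minutes).
incident_ok = is_fresh_enough(last_enriched_at, timedelta(minutes=2), now)
# Planning a deprecation tolerates much more (assumed 24 hours).
planning_ok = is_fresh_enough(last_enriched_at, timedelta(hours=24), now)

print(incident_ok, planning_ok)
```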

Systems extracted

New wiki pages:

  • systems/netflix-mds — Netflix's centralized Metadata Service / Model Lifecycle Graph. Five-stage ingestion pipeline (events → enrichment → normalization → dual-store → async graph formation). Consumes thin events from Pipeline Orchestration, Model Registry, Feature Store, Experimentation Platform, Datasets, Identity Platform via Kafka + SNS/SQS. Hydrates from source-of-truth APIs. Stores normalized entities in Datomic (graph) + Elasticsearch (search). Async background jobs materialize multi-hop relationships as reified edges. Canonical wiki home for Netflix's ML metadata graph; surfaces to ML practitioners via AIP Portal.
  • systems/netflix-aip-portal — Netflix's AI Platform Portal, the unified UI surface for the ML lifecycle graph. Free-text search box (backed by Elasticsearch) → entity page (description + owners + domains + tags + relationships panel) → click-through navigation to neighbors (upstream datasets, downstream experiments, sibling model versions). New entity types automatically get baseline search + entity pages + relationship navigation; domain-specific visualizations layered on top per type. First canonical wiki page.
  • systems/datomic: Immutable-fact relational/graph database developed by Cognitect (Rich Hickey). Stores all data as immutable facts with time-travel queries. Used at Netflix MDS as the system-of-record + working dataset for enrichment, with relationships modeled as reified edges. Canonical wiki home for Datomic with the immutable-fact + reified-edge framing.

Extended (cross-link added):

  • systems/kafka — adds Netflix MDS as another canonical wiki instance of Kafka as the change-notification substrate for thin events feeding a downstream metadata service. Reinforces Kafka's role as the event-bus substrate for thin-event + source-hydration architectures.
  • systems/aws-sns — adds Netflix MDS as another upstream consumer ingesting via SNS-fanout for source-system change events.
  • systems/aws-sqs — adds Netflix MDS as another consumer ingesting source-system change events via SQS for queue-buffered processing.
  • systems/elasticsearch — adds Netflix MDS as a canonical wiki instance of the single-index + entityType-discriminator indexing pattern, with key-value tags and exact-match relevance boosting for catalog/discovery UX.

Concepts extracted

New wiki pages:

  • concepts/notification-of-change-event — the architectural framing of an event stream as a sequence of change-notifications, not a log of state changes. The event carries only a change identifier; the consumer hydrates the actual state from the source of truth. Distinguishing property from event-sourcing: event order doesn't matter; replay/replay-from-zero converges on whatever the source of truth says now.
  • concepts/source-of-truth-hydration — the ingestion-side primitive of calling back to the source system on each event to fetch the complete current state, rather than embedding state in the event payload. The decoupling property: the event stream becomes a trigger; correctness is anchored in source APIs. Tradeoff: read amplification on source systems.
  • concepts/async-relationship-inference — the metadata-graph primitive of deriving cross-system relationships in background jobs rather than computing them at ingestion time. Enables progressive graph completion, multi-hop edge derivation, retry on transient hydration failures, and bounded fan-out from any single ingestion event.
  • concepts/multi-hop-relationship-materialization — the discipline of walking N-step paths in a graph and writing back direct N=1 edges as new facts, so future queries hit a single edge instead of an N-hop walk. Trades enrichment-time compute for query-time latency.
  • concepts/reified-edge-graph — the data-modeling pattern of treating relationships as first-class records in an immutable fact store, instead of computing them from entity attributes or storing them as graph-DB native edges. Enables continuous edge addition without losing original entity state, schema evolution, and provenance tracking on relationships.
  • concepts/entity-uri-namespace — the catalog-design primitive of assigning every entity a globally-unique URI of the form <scheme>://<entity-type>/<source-system>/<source-id>. Encodes both routing (which API to hydrate from) and identity (cross-system reference). Precondition for cross-system graph traversal.
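The entity URI namespace and the field renames from the normalization stage can be sketched together. The URI shape and the renames (owner_emails to owners, labels to tags, pipeline_run_id to an entity reference) come from the post; everything else is an assumption for illustration, including the names `EntityUri`, `parse_entity_uri`, and `normalize_model_instance`, the output field names, and resolving an owner by the local part of the email address.

```python
from typing import NamedTuple

PREFIX = "aip://"


class EntityUri(NamedTuple):
    entity_type: str
    source_system: str
    source_id: str

    def __str__(self) -> str:
        return f"{PREFIX}{self.entity_type}/{self.source_system}/{self.source_id}"


def parse_entity_uri(uri: str) -> EntityUri:
    """Split aip://<entity-type>/<source-system>/<source-id> into its parts."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not an AIP URI: {uri!r}")
    parts = uri[len(PREFIX):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed AIP URI: {uri!r}")
    return EntityUri(*parts)


def normalize_model_instance(raw: dict) -> dict:
    """Map a registry-shaped record onto a unified entity shape (assumed fields)."""
    return {
        "uri": str(EntityUri("model", "registry", raw["instance_id"])),
        "entityType": "model",
        # Assumption: the user ID is the local part of the email address.
        "owners": [
            str(EntityUri("user", "identity", email.split("@")[0]))
            for email in raw.get("owner_emails", [])
        ],
        "tags": list(raw.get("labels", [])),
        # The foreign key becomes a reference to another entity.
        "pipelineRun": str(EntityUri("pipeline-run", "orchestrator", raw["pipeline_run_id"])),
    }


raw = {
    "instance_id": "ranking-model-v5-20XX0101",
    "owner_emails": ["alice@example.com"],
    "labels": ["team::personalization"],
    "pipeline_run_id": "train-weekly-ranking-20XX0101",
}
entity = normalize_model_instance(raw)
parsed = parse_entity_uri(entity["pipelineRun"])
print(entity["uri"], parsed.source_system)
```

Parsing the reference back out recovers the source system, which is what lets a consumer route a hydration call to the right API.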

Extended (cross-link added):

  • concepts/knowledge-graph — adds Netflix MDS as a canonical wiki instance of a knowledge graph used for ML asset lineage + impact analysis rather than retrieval ranking. Distinct from Dropbox Dash's retrieval-substrate framing.
  • concepts/data-lineage — adds Netflix MDS as a canonical wiki instance of a lineage system built from change events + source-of-truth hydration, distinct from static-analysis-driven lineage (Meta Zoncolan) and runtime-trace-driven lineage.

Patterns extracted

New wiki pages:

  • patterns/thin-event-plus-source-hydration — the canonical event-bus pattern where producers emit minimal change-notification events and consumers hydrate full state from source-of-truth APIs. Resilient to dropped/out-of-order events. Trades source-API read amplification for producer simplicity + state-consistency decoupling.
  • patterns/async-graph-enrichment-job — the canonical pattern of using scheduled background jobs to discover, materialize, and re-index cross-entity relationships in a metadata graph. Identifies uncached/unresolved entities, hydrates from source APIs, writes new edges to the fact store, triggers search re-indexing, marks entities as enriched to prevent reprocessing.
  • patterns/dual-store-graph-plus-search-index — the canonical storage pattern for catalog/metadata services that need both free-text discovery (Elasticsearch / search engine) and multi-hop relationship navigation (graph DB / immutable fact store). Synchronous write to both on the ingestion path; ES is a derived view; both stores reflect the same canonical state.
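The enrichment workflow (identify candidates, hydrate, materialize edges, mark as enriched) can be sketched with in-memory stand-ins. This is an illustration under assumptions, not the MDS implementation: the dictionaries replace the source APIs and Datomic, the re-index step is omitted, and the edge label `associatedAbTest` and the URI `aip://ab-test/experimentation/12345` are hypothetical.

```python
# Hypothetical stand-ins for the Pipeline Orchestrator and Experimentation APIs.
PIPELINE_RUNS = {
    "train-weekly-ranking-20XX0101": {"ab_test_cells": [{"test_id": "12345", "cell": 2}]},
}
AB_TESTS = {"12345": {"id": "12345", "name": "Ranking Model v5 vs v4"}}

MODEL_URI = "aip://model/registry/ranking-model-v5-20XX0101"


def fetch_pipeline_run(run_id: str) -> dict:
    return PIPELINE_RUNS[run_id]


def fetch_test(test_id: str) -> dict:
    return AB_TESTS[test_id]


def enrich_pending(entities: dict, edges: set) -> list:
    """Walk model -> pipeline run -> cell -> test and write a direct edge."""
    enriched = []
    for uri, entity in entities.items():
        if entity["status"] != "uncached":
            continue  # Already enriched; skip to avoid redundant work.
        run = fetch_pipeline_run(entity["pipeline_run_id"])
        for cell in run["ab_test_cells"]:
            test = fetch_test(cell["test_id"])
            test_uri = f"aip://ab-test/experimentation/{test['id']}"
            # Materialize the multi-hop result as a single direct edge.
            edges.add((uri, "associatedAbTest", test_uri))
        entity["status"] = "enriched"
        enriched.append(uri)
    return enriched


entities = {
    MODEL_URI: {"pipeline_run_id": "train-weekly-ranking-20XX0101", "status": "uncached"},
}
edges = set()

first_pass = enrich_pending(entities, edges)
second_pass = enrich_pending(entities, edges)  # Nothing left to process.
print(first_pass, second_pass, sorted(edges))
```

Storing edges as (subject, label, object) tuples keeps them queryable from either end, and the status flag is what makes the second pass a no-op.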

Caveats

  1. Architecture-and-discipline post, not a measurement post. No QPS, no entity count, no fleet size, no daily event throughput, no enrichment-job parallelism, no enrichment-cycle latency in absolute terms (only "minutes rather than seconds" qualitatively). The five-stage architecture and the worked example are the post's contribution; quantitative validation is absent.
  2. Source systems are referenced by category, not by name. "Pipeline Orchestration", "Model Registry", "Feature Store", "Experimentation Platform", "Datasets", "Identity Platform" — these are categorical names. The actual Netflix systems behind each are not disclosed (e.g., is the Pipeline Orchestration system Maestro? — likely, but not stated). Inference from other Netflix posts is plausible but the article doesn't name them.
  3. The worked example uses placeholder IDs (ranking-model-v5-20XX0101, train-weekly-ranking-20XX0101, A/B test 12345). These are illustrative, not production identifiers; the 20XX suffix is a deliberate redaction.
  4. No discussion of source-system rate-limit handling specifics. The post acknowledges "we need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don't overload them" but doesn't quantify the rate-limit budgets, retry policies, or how MDS handles a source system being down.
  5. No discussion of the AIP Portal as a system in itself. The portal is described as the consumer surface but its rendering layer, framework, deployment topology, and traffic shape are not detailed.
  6. No discussion of multi-tenancy / access control in the metadata graph itself. Owners and tags are surfaced as data, but the post doesn't address whether MDS imposes any access-control policy on graph reads (can any practitioner see any model's lineage? Or are there confidentiality boundaries?).
  7. The "Open Challenges" section is forward-looking, not retrospective. Tool proliferation, domain-specific viz, metadata quality, and advanced relationship inference are explicitly framed as "the future opportunities for us" — they are not currently solved problems.
  8. Datomic adoption rationale is implicit, not justified against alternatives. The post asserts Datomic's immutable fact model is well-suited but doesn't compare against Neo4j, JanusGraph, TigerGraph, Dgraph, or a hand-rolled graph layer on Cassandra/Postgres. The choice may be driven by team familiarity (the AI Platform org may have prior Datomic experience) more than against-alternatives benchmarking.
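Since caveat 4 notes that the post gives no retry or backoff specifics, the following is only a generic sketch of capped exponential backoff with jitter around a hydration call. It is not drawn from the post: the helper `call_with_backoff`, its default parameters, and the choice of `ConnectionError` as the retryable failure are all assumptions.

```python
import random
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fetch(), retrying on ConnectionError with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; surface the failure to the caller.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: sleep a random fraction of the ceiling to spread retries.
            sleep(ceiling * rng())


calls = {"count": 0}


def flaky_fetch():
    """Hypothetical source API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source API unavailable")
    return {"instance_id": "ranking-model-v5-20XX0101"}


# Inject a recording sleep and a fixed rng so the demo runs instantly.
delays = []
result = call_with_backoff(flaky_fetch, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)
```

Injecting `sleep` and `rng` keeps the helper testable without real waiting; production code would use the defaults.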

Source
