Netflix MDS (Metadata Service / Model Lifecycle Graph)

MDS is Netflix's centralized metadata service for the ML lifecycle. It unifies fragmented metadata across previously-siloed ML source systems (Pipeline Orchestration, Model Registry, Feature Store, Experimentation Platform, Datasets, Identity Platform) into a single queryable knowledge graph of models, features, pipelines, datasets, A/B tests, and owners. It surfaces to ML practitioners through the AIP Portal.

First documented in the 2026-05-04 Netflix TechBlog post sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph.

Five-stage ingestion pipeline

[Source systems] --thin events--> Kafka / SNS+SQS
       |
       v
[1. Event Ingestion]   schema-validate the event
       |
       v
[2. Entity Enrichment] hydrate from source-of-truth API
       |
       v
[3. Normalization]     map to AIP URI + tag schema
       |
       v
[4. Storage]           Datomic (sync) + Elasticsearch (sync)
       |
       v
[5. Async Graph Formation]  background jobs walk multi-hop
                            paths, materialize reified edges,
                            re-index into Elasticsearch

1. Event ingestion

Source systems emit thin events containing only an identifier and event type:

{
  "event_type": "model_instance_created",
  "instance_id": "ranking-model-v5-20XX0101"
}

The event bus is Kafka + AWS SNS/SQS (the post is explicit: "MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real-time"). Each source system has dedicated event handlers in MDS:

  • Pipeline Orchestration — pipeline execution events, node definitions, schedules, requests, job attempts.
  • Model Registry — model deployments, configurations, version updates.
  • Feature Store — feature definitions and versions.
  • Experimentation Platform — A/B test configurations and allocations.
  • Datasets — ML datasets and versions.
  • Identity Platform — ownership and team membership.

Producers stay simple: "Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements." Canonical instance of concepts/notification-of-change-event — the event stream is a change-notification trigger, not a state log.
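
A minimal sketch of the ingestion step, assuming a per-event-type handler registry. Only the event type model_instance_created and the field instance_id come from the post's example; the registry, the decorator, the handler body, and the choice to reject extra fields are hypothetical illustrations, not the MDS implementation.

```python
from typing import Callable

# Hypothetical registry: event type -> (required ID field, handler).
HANDLERS: dict[str, tuple[str, Callable[[str], str]]] = {}

def register(event_type: str, id_field: str):
    """Decorator that registers a handler for one event type."""
    def wrap(fn):
        HANDLERS[event_type] = (id_field, fn)
        return fn
    return wrap

@register("model_instance_created", "instance_id")
def on_model_instance_created(instance_id: str) -> str:
    # A real handler would enqueue the ID for enrichment; this one echoes it.
    return f"enqueue hydration for {instance_id}"

def ingest(event: dict) -> str:
    """Validate a thin event and route it to its handler."""
    event_type = event.get("event_type")
    if event_type not in HANDLERS:
        raise ValueError(f"unknown event_type: {event_type!r}")
    id_field, handler = HANDLERS[event_type]
    entity_id = event.get(id_field)
    if not isinstance(entity_id, str) or not entity_id:
        raise ValueError(f"missing or empty {id_field!r}")
    # Assumption: thin events carry only type + identifier, so reject extras.
    extra = set(event) - {"event_type", id_field}
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return handler(entity_id)

result = ingest({"event_type": "model_instance_created",
                 "instance_id": "ranking-model-v5-20XX0101"})
```

Because the event carries no state, validation reduces to checking the type and the identifier; everything else is deferred to enrichment.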

2. Entity enrichment (source-of-truth hydration)

For each event MDS:

  1. Validates the event schema.
  2. Calls back to the source system's API to fetch the complete current state.
  3. Transforms the response into a normalized entity.

For a model_instance_created event, MDS calls GET /api/v1/instances/ranking-model-v5-20XX0101 and the registry returns the full descriptor. The structural property:

"This design has a crucial property: the order of events doesn't matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes."

Tradeoff: read amplification on source systems, since every event triggers a callback. To mitigate it, the team is "deliberate about rate limiting, caching, and backoff in our enrichment workers." Canonical instance of concepts/source-of-truth-hydration / patterns/thin-event-plus-source-hydration.
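
The order-independence property can be illustrated with a small simulation. This is a sketch under stated assumptions: the in-memory registry dict stands in for the source system's API, and the model_instance_updated event type is hypothetical. A real worker would issue the GET call described above rather than read a dict.

```python
# Hypothetical stand-in for the Model Registry (the source of truth).
registry = {"ranking-model-v5-20XX0101": {"state": "created"}}

def fetch_current_state(instance_id: str) -> dict:
    """Simulates GET /api/v1/instances/<id>: always returns the latest state."""
    return dict(registry[instance_id])

mds_store: dict[str, dict] = {}

def handle(event: dict) -> None:
    """Use only the ID from the event, then re-read the source of truth."""
    instance_id = event["instance_id"]
    mds_store[instance_id] = fetch_current_state(instance_id)

created = {"event_type": "model_instance_created",
           "instance_id": "ranking-model-v5-20XX0101"}
updated = {"event_type": "model_instance_updated",  # hypothetical event type
           "instance_id": "ranking-model-v5-20XX0101"}

# The source moves from "created" to "released" before any event is processed.
registry["ranking-model-v5-20XX0101"]["state"] = "released"

# Deliver out of order: "updated" arrives before "created".
handle(updated)
handle(created)
final_state = mds_store["ranking-model-v5-20XX0101"]["state"]
```

Whichever event arrives last, the stored state matches the source of truth, because neither event carries state of its own. Dropping either event would give the same result.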

3. Data transformation and normalization

Source-system schemas are mapped to a unified entity model:

  • Platform-specific IDs become global AIP URIs of form aip://<entity-type>/<source-system>/<source-id> — see concepts/entity-uri-namespace. Examples from the post: aip://model/registry/ranking-model-v5-20XX0101, aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101, aip://user/identity/alice.
  • owner_emails becomes owners with resolved user URIs.
  • labels becomes tags with key-value structure (e.g. {tag: "team", value: "personalization"}).
  • Foreign keys (pipeline_run_id, etc.) become entity references to other AIP URIs.

Without normalization "downstream consumers would need to understand every source system's schema."
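
The mapping rules above can be sketched as a single function. The input field names (owner_emails, labels, pipeline_run_id) and the URI examples follow the post; the function names, the output key names, and the use of the email local part as the user ID are assumptions made for illustration.

```python
def aip_uri(entity_type: str, source_system: str, source_id: str) -> str:
    """Build a global URI of form aip://<entity-type>/<source-system>/<source-id>."""
    return f"aip://{entity_type}/{source_system}/{source_id}"

def normalize_model_instance(raw: dict) -> dict:
    """Map a registry-shaped record onto a unified entity (hypothetical shape)."""
    return {
        "uri": aip_uri("model", "registry", raw["instance_id"]),
        # owner_emails -> owners as user URIs. Taking the email local part is
        # an assumed stand-in for real identity resolution.
        "owners": [aip_uri("user", "identity", email.split("@")[0])
                   for email in raw.get("owner_emails", [])],
        # labels -> tags with explicit key-value structure.
        "tags": [{"tag": key, "value": value}
                 for key, value in sorted(raw.get("labels", {}).items())],
        # Foreign key -> reference to another AIP URI.
        "pipeline_run": aip_uri("pipeline-run", "orchestrator",
                                raw["pipeline_run_id"]),
    }

raw = {  # hypothetical source payload using the post's example values
    "instance_id": "ranking-model-v5-20XX0101",
    "owner_emails": ["alice@example.com"],
    "labels": {"team": "personalization"},
    "pipeline_run_id": "train-weekly-ranking-20XX0101",
}
entity = normalize_model_instance(raw)
```

After this step, downstream code only needs to understand AIP URIs and the tag structure, not each source system's field names.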

4. Storage and indexing

Two stores written synchronously within the event-processing flow:

Datomic — system of record + working dataset

"Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state."

What's stored:

  • All entity attributes as facts.
  • Entity references (foreign keys that may point to entities not yet fully resolved).
  • All relationships as reified edges (added by enrichment processes). See concepts/reified-edge-graph.
  • Entity lifecycle state (which entities are fully enriched vs awaiting hydration).

Used for navigational queries:

  • "Starting from this model instance, show me all upstream datasets and downstream experiments."
  • "Given this feature, list all consuming models and their owning teams."
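
A minimal in-memory sketch of why reified edges make these navigational queries symmetric. It does not use Datomic or its query language; the Edge record, the relation names, and the dataset and A/B test URIs are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """A relationship stored as its own record rather than as a pointer."""
    source: str
    relation: str
    target: str

# Hypothetical URIs; only the model URI appears in the post.
MODEL = "aip://model/registry/ranking-model-v5-20XX0101"
DATASET = "aip://dataset/datasets/ranking-train"
ABTEST = "aip://ab-test/experimentation/12345"

edges = [
    Edge(DATASET, "feeds", MODEL),
    Edge(MODEL, "tested_in", ABTEST),
]

def downstream(uri: str) -> list[str]:
    """Entities reached by following edges forward from uri."""
    return [e.target for e in edges if e.source == uri]

def upstream(uri: str) -> list[str]:
    """Entities reached by following edges backward into uri."""
    return [e.source for e in edges if e.target == uri]

model_upstream = upstream(MODEL)
model_downstream = downstream(MODEL)
```

Since each edge is a record with both endpoints, the same data answers "what feeds this model" and "which models does this experiment test" without a second index per direction.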

Elasticsearch — discovery surface

Written immediately after the Datomic write, within the same event-processing flow. MDS keeps a single unified entities index with an entityType discriminator, plus a separate owners index. What's indexed:

  • Primary fields: entity name, description, entity type, owner names.
  • Relationship metadata: names of related entities (a model's features, pipelines, A/B tests) stored in related.
  • Tags as key-value pairs: team::personalization, env::production, model.state::released.

Capabilities:

  • Multi-field text search across names, descriptions, tags, related metadata.
  • Relevance boosting: exact name matches score significantly higher.
  • Filter by entity type, ownership, tags, domain attributes.
  • Fuzzy matching for typos / partial queries.
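
A sketch of what an index document and a search request body might look like, built as plain dictionaries in Elasticsearch query DSL. The entityType discriminator, the related field, and the key::value tag format come from the post; the other field names, the boost values, and the name.keyword subfield are assumptions, and no Elasticsearch client is called.

```python
from typing import Optional

def to_search_doc(entity: dict) -> dict:
    """Flatten a normalized entity into a document for the entities index."""
    return {
        "entityType": entity["type"],
        "name": entity["name"],
        "description": entity.get("description", ""),
        "owners": entity.get("owners", []),
        "related": entity.get("related", []),
        "tags": [f'{t["tag"]}::{t["value"]}' for t in entity.get("tags", [])],
    }

def search_body(text: str, entity_type: Optional[str] = None) -> dict:
    """Build a query body: fuzzy multi-field match plus optional type filter."""
    bool_query = {
        "must": [{
            "multi_match": {
                "query": text,
                # ^3 weights name matches above the other fields (assumed value).
                "fields": ["name^3", "description", "tags", "related"],
                "fuzziness": "AUTO",  # tolerate small typos
            }
        }],
        # Extra score for an exact name match; assumes a keyword subfield.
        "should": [{"term": {"name.keyword": {"value": text, "boost": 10}}}],
        "filter": [],
    }
    if entity_type:
        bool_query["filter"].append({"term": {"entityType": entity_type}})
    return {"query": {"bool": bool_query}}

doc = to_search_doc({
    "type": "model",
    "name": "ranking-model-v5",
    "tags": [{"tag": "team", "value": "personalization"}],
})
body = search_body("ranking-model-v5", entity_type="model")
```

Filters sit in the bool query's filter clause so they narrow results without affecting relevance scores, while the should clause only adds score when the name matches exactly.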

"Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page."

The Datomic+ES split is a canonical instance of patterns/dual-store-graph-plus-search-index: find via ES, explore via Datomic.

5. Async knowledge enrichment and graph formation

Scheduled background processes scan Datomic for entities marked uncached or with unresolved references, hydrate from source-of-truth APIs, and materialize edges as new Datomic facts.

Workflow:

  1. Identify candidates — entities marked uncached / partially resolved.
  2. Hydrate relationships — query source-of-truth systems for related entity details.
  3. Materialize edges — write discovered relationships back to Datomic.
  4. Re-index — trigger Elasticsearch indexing for updated entities.
  5. Mark as enriched — update entity status to prevent redundant processing.

This is multi-hop relationship inference. Worked example from the post: connecting a model instance to its A/B tests.

Step | Action                                             | Edge derived
1    | Hydrate model instance → discovers pipeline_run_id | Model Instance → Pipeline Run
2    | Hydrate pipeline run → discovers ab_test_cells     | Pipeline Run → A/B Test Cell
3    | Hydrate A/B test → resolves test details           | A/B Test Cell → A/B Test
4    | Walk transitive chain + write back direct edge     | Model Instance ↔ A/B Test

"MDS doesn't just store what it's told; it derives new knowledge by walking the graph in the background."

Canonical instance of patterns/async-graph-enrichment-job + concepts/multi-hop-relationship-materialization.
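
The worked example can be sketched as a single enrichment pass over in-memory data. The field names pipeline_run_id and ab_test_cells come from the post; the short IDs, the ab_test_id field, the relation names, and the last_enriched map are hypothetical, and the dict lookup stands in for calls to source-of-truth APIs.

```python
# Hypothetical hydrated records, keyed by short IDs for readability.
source_of_truth = {
    "model-instance": {"pipeline_run_id": "pipeline-run"},
    "pipeline-run": {"ab_test_cells": ["cell-1", "cell-2"]},
    "cell-1": {"ab_test_id": "ab-test-12345"},
    "cell-2": {"ab_test_id": "ab-test-12345"},
}

# Edges as (source, relation, target); a set keeps re-runs idempotent.
edges: set[tuple[str, str, str]] = set()
last_enriched: dict[str, int] = {}

def enrich_model_instance(instance_id: str, now: int) -> None:
    """Walk instance -> run -> cells -> tests and write back a direct edge."""
    run_id = source_of_truth[instance_id]["pipeline_run_id"]
    edges.add((instance_id, "produced_by", run_id))
    for cell_id in source_of_truth[run_id]["ab_test_cells"]:
        edges.add((run_id, "allocated_to", cell_id))
        test_id = source_of_truth[cell_id]["ab_test_id"]
        edges.add((cell_id, "cell_of", test_id))
        # Derived shortcut: no single source system stores this relationship.
        edges.add((instance_id, "tested_in", test_id))
    # Record when enrichment ran so staleness can be surfaced to users.
    last_enriched[instance_id] = now

enrich_model_instance("model-instance", now=1000)
direct_tests = sorted(target for source, relation, target in edges
                      if source == "model-instance" and relation == "tested_in")
```

Running the pass again adds nothing new, which mirrors the "mark as enriched" step: the job can be retried or rescheduled without duplicating edges.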

Staleness budget: "newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness."

The user-facing payoff: unified GraphQL traversal

Pre-MDS, "Which A/B tests are using this model?" required four manual lookups across Model Registry → Pipeline Orchestrator → Experimentation Platform. With MDS:

query {
  model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
    name
    owners { name }
    currentInstance {
      version
      pipeline { name owners { name } }
      features {
        edges { node { name data { edges { node { name } } } } }
      }
      associatedAbTests {
        name
        cells { number name }
      }
    }
  }
}

The reverse query also works: "What models are being tested in experiment 12345?" Because relationships are stored as reified edges, they can be queried from either end.

Use cases the graph enables

  • Lineage queries — full lineage from training data → production experiments.
  • Impact analysis — which models are affected if I change this feature?
  • Usage discovery — which A/B tests are using this model?
  • Dependency mapping — what data sources does my pipeline transitively depend on?
  • Deprecation planning — which entities are no longer being used and can be retired?

Open challenges (per the post)

  • Tool proliferation — plugin architectures for new ML tools.
  • Domain-specific visualizations — per-entity-type tailored UI (model deployment history, feature lineage, pipeline execution history, dataset versioning timelines).
  • Metadata quality — automated validation when source systems fail to emit events / ownership becomes stale / entities lack descriptions.
  • Advanced relationship inference — recommending features based on similar pipelines; detecting models that serve similar purposes via shared features. "We are in the early stages of exploring these ideas."

What is not documented

The post deliberately stays at the level of architecture and engineering discipline. Not disclosed:

  • QPS, daily event volume, entity count, edge count.
  • Source-system rate-limit budgets / hydration-call rate.
  • Enrichment job parallelism / cadence specifics.
  • Datomic cluster topology, Elasticsearch cluster topology.
  • AIP Portal traffic / user count / GraphQL query latency.
  • The actual names of the source systems behind each category.
  • Access-control / multi-tenancy on the metadata graph itself.
  • The "central management service" implementation (referenced but not detailed).
