NETFLIX Tier 1

Read original ↗

Netflix — Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

Netflix's Content Engineering org introduces UDA — Unified Data Architecture — an in-house knowledge-graph platform that sits between business concepts and the many data systems where those concepts live (Enterprise GraphQL Gateway, asset management, media computing, Data Mesh, Iceberg warehouse, …). The bet: model core concepts like actor or movie once, as data in a knowledge graph, then project those definitions outward — generating GraphQL / Avro / SQL / RDF / Java schemas, enforcing consistency across systems, and powering discovery — so that the conceptual model "becomes part of the control plane".

Authors: Alex Hutter, Alexandre Bertails, Claire Wang, Haoyuan He, Kishore Banala, Peter Royal, Shervin Afshar.

Summary

Netflix's products have expanded from film / series into games / live events / ads. The same business entities (actor, movie, asset) are re-modelled by every system that touches them — in slightly different, uncoordinated ways. That produces duplicated and inconsistent models, inconsistent terminology, data-quality discrepancies across services, and effectively-zero connectivity across systems. UDA's response is to treat this as a data-integration + semantic-integration problem at the same time: build a knowledge graph whose nodes are business concepts, whose edges connect concepts to the data containers that hold instance data (GraphQL type resolvers, Data Mesh sources, Iceberg tables, Java APIs), and whose schema languages (GraphQL, Avro, SQL, RDF, Java) are generated from a single upstream domain model via a transpiler family. The metamodel driving all of this — "the model for all models" — is Upper, a bootstrapping self-referencing upper ontology built on a carefully restricted subset of W3C semantic technologies (RDF + RDFS + OWL + SHACL). Two production consumers are named: PDM (Primary Data Management) for authoritative reference data and taxonomies, and Sphere, a self-service operational-reporting tool that walks the knowledge graph to generate SQL against the warehouse.

Key takeaways

  1. "Model once, represent everywhere" is the thesis. "We need new foundations that allow us to define a model once, at the conceptual level, and reuse those definitions everywhere. But it isn't enough to just document concepts; we need to connect them to real systems and data. And more than just connect, we have to project those definitions outward, generating schemas and enforcing consistency across systems. The conceptual model must become part of the control plane." This is UDA's headline design pattern (patterns/model-once-represent-everywhere) — the conceptual model is promoted from documentation artifact to control-plane artifact (concepts/control-plane-data-plane-separation). (Source: sources/2025-06-14-netflix-model-once-represent-everywhere-uda)

  2. Four pain points motivated UDA. "Duplicated and Inconsistent Models … Inconsistent Terminology … Data Quality Issues … Limited Connectivity" — within systems, relationships are constrained by what each system supports; across systems, "they are effectively non-existent." Where identifiers and foreign keys do exist, "they are inconsistently modeled and poorly documented, requiring manual work from domain experts to find and fix any data issues."

  3. UDA is a knowledge graph unifying catalog + schema registry. "We needed a data catalog unified with a schema registry, but with a hard requirement for semantic integration. Connecting business concepts to schemas and data containers in a graph-like structure, grounded in strong semantic foundations, naturally led us to consider a knowledge graph approach." New framing on the wiki: the knowledge graph is both the schema registry and the data catalog. Contrast with Dropbox Dash's knowledge-graph framing (concepts/knowledge-graph, previously documented wiki instance): Dash uses the graph as a retrieval relevance substrate for agents; UDA uses it as an enterprise data-integration substrate for schemas and pipelines.

  4. RDF + SHACL chosen — with caveats. UDA picked "RDF and SHACL as the foundation for UDA's knowledge graph", but the post enumerates four operational gaps, broken out as the next four takeaways:

  5. "RDF lacked a usable information model." Standard follow-your-nose mechanisms like owl:imports apply only to ontologies, not to named graphs — UDA needed a generalised dependency/resolution mechanism.
  6. "SHACL is not a modeling language for enterprise data." Designed to validate native RDF, it assumes globally unique URIs + one data graph; enterprise data is structured around local schemas and typed keys (as in GraphQL / Avro / SQL).
  7. "Teams lacked shared authoring practices." Subtle style differences broke semantic interoperability + made transpilation inconsistent.
  8. "Ontology tooling lacked support for collaborative modeling." Unlike GraphQL Federation, ontology frameworks had no modular contribution, team ownership, or safe federation primitives. Authors of domain models "found the tools and concepts unfamiliar."

  9. Named-graph-first information model. "UDA adopts a named-graph-first information model. Each named graph conforms to a governing model, itself a named graph in the knowledge graph." This is the structural primitive that gives UDA resolution, modularity, and governance across the entire graph — the missing layer on top of standard RDF (concepts/named-graph).

  10. Upper is the language + metamodel. "Upper is a language for formally describing domains — business or system — and their concepts … organized into domain models: controlled vocabularies that define classes of keyed entities, their attributes, and their relationships to other entities." Keyed concepts can be organised into taxonomies of types; they can be "contributed monotonically" — attributes + relationships added in a conservative way across domain models. Upper ships "a rich set of datatypes for attribute values, which can also be customized per domain." (systems/netflix-upper, concepts/domain-model.)

  11. Upper is self-referencing / self-describing / self-validating — a bootstrapping upper ontology. "Upper is the metamodel for Connected Data in UDA — the model for all models. It is designed as a bootstrapping upper ontology, which means that Upper is self-referencing, because it models itself as a domain model; self-describing, because it defines the very concept of a domain model; and self-validating, because it conforms to its own model." This is load-bearing: "Upper itself is projected into a generated Jena-based Java API and GraphQL schema used in [a] GraphQL service federated into Netflix's Enterprise GraphQL gateway. These same generated APIs are then used by the projections and the UI." The metamodel is its own first customer (patterns/self-referencing-metamodel-bootstrap, concepts/upper-ontology).

  12. Conservative extension is the composition rule. "Because all domain models are conservative extensions of Upper, other system domain models — including those for GraphQL, Avro, Data Mesh, and Mappings — integrate seamlessly into the same runtime, enabling consistent data semantics and interoperability across schemas." The conservative-extension property is the algebraic guarantee that lets Netflix add system-specific domains without breaking the global semantics — the mathematical analog of backward-compatible schema evolution.

  13. Upper domain models are data. Not code. "Upper domain models are data. They are expressed as conceptual RDF and organized into named graphs, making them introspectable, queryable, and versionable within the UDA knowledge graph." This makes them first-class citizens of every tool that speaks RDF / SPARQL — versioning, diffing, querying domain models is free.

  14. Transpilation: one source → many target schemas. UDA transpiles domain models "into schema definition languages like GraphQL, Avro, SQL, RDF, and Java, while preserving semantics." (patterns/schema-transpilation-from-domain-model.) The same model feeds schema generation and pipeline generation — data movement between containers (e.g. federated GraphQL → Data Mesh, CDC → joinable Iceberg data products) is auto-provisioned from the model + mappings, not hand-plumbed.

  15. Data containers as first-class graph citizens. "Data container representations are data. They are faithful interpretations of the members of data systems as graph data. UDA captures the definition of these systems as their own domain models, the system domains." System domains exist alongside business domains: the same Upper language describes both "what a movie is" and "what a Data Mesh source is". (concepts/data-container.)

  16. Upper extends W3C stack without exposing it. "Upper raises the level of abstraction above traditional ontology languages: it defines a strict subset of semantic technologies from the W3C tailored and generalized for domain modeling. It builds on ontology frameworks like RDFS, OWL, and SHACL so domain authors can model effectively without even needing to learn what an ontology is." The W3C stack is a load-bearing implementation dependency, deliberately hidden behind a simpler façade — the authoring audience is domain experts, not ontologists.

  17. PDM — the first production consumer. "Primary Data Management (PDM) is our platform for managing authoritative reference data and taxonomies. PDM turns domain models into flat or hierarchical taxonomies that drive a generated UI for business users. These taxonomy models are projected into Avro and GraphQL schemas, automatically provisioning data products in the Warehouse and GraphQL APIs in the Enterprise Gateway." PDM is the "reference data + taxonomies" instance of UDA (systems/netflix-pdm).

  18. Sphere — the second production consumer; walks the graph to generate SQL. "Sphere is our self-service operational reporting tool for business users. Sphere uses UDA to catalog and relate business concepts across systems, enabling discovery through familiar terms like 'actor' or 'movie.' Once concepts are selected, Sphere walks the knowledge graph and generates SQL queries to retrieve data from the warehouse, no manual joins or technical mediation required." Canonical wiki instance of patterns/graph-walk-sql-generation — the graph path from concept to data container encodes the join the human would otherwise have to write (systems/netflix-sphere).

  19. Multiple introspection surfaces. "Programmatically introspect the knowledge graph using Java, GraphQL, or SPARQL." Java API (generated from Upper) for embedded use; GraphQL federation for apps; SPARQL for raw graph queries. Same model, three entry points — consistent with UDA's model-once thesis applied to the runtime API surface too.
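
The named-graph-first primitive in takeaway 9 can be made concrete with a minimal sketch, assuming a toy store of named graphs (the names and the `conforms_to` field are invented here, not UDA's API): every graph declares the governing model it conforms to, that model is itself a named graph in the same store, and the chain terminates at self-governing Upper.

```python
# Hypothetical sketch of a named-graph-first store. Every named graph
# declares a governing model; the governing model is itself a named graph.
GRAPHS = {
    # "upper" governs itself: the self-referencing metamodel bootstrap.
    "upper": {"conforms_to": "upper",
              "triples": [("upper", "defines", "DomainModel")]},
    # A business domain model governed by Upper (the post's worked example).
    "onepiece": {"conforms_to": "upper",
                 "triples": [("Character", "relatedTo", "DevilFruit"),
                             ("DevilFruit", "hasType", "DevilFruitType")]},
    # Instance data governed by the onepiece domain model.
    "onepiece-data": {"conforms_to": "onepiece",
                      "triples": [("Luffy", "relatedTo", "GomuGomuNoMi")]},
}

def governing_chain(name):
    """Resolve the chain of governing models until it reaches a fixed point."""
    chain = [name]
    while True:
        parent = GRAPHS[chain[-1]]["conforms_to"]
        if parent == chain[-1]:   # self-governing graph (Upper) ends the chain
            return chain
        chain.append(parent)

print(governing_chain("onepiece-data"))  # ['onepiece-data', 'onepiece', 'upper']
```

The fixed point at `upper` mirrors the bootstrapping property in takeaway 11: because the metamodel governs itself, resolution always terminates.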
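
Takeaway 12's conservative-extension rule can be caricatured as a mechanical check, sketched here under invented names: an extending domain model may contribute new classes and attributes monotonically, but any redefinition of an upstream definition is rejected — the schema-evolution analogy made literal.

```python
# Hypothetical sketch of the conservative-extension rule: an extension may
# only add definitions, never alter what the base model already defines.
def is_conservative(base, extension):
    """True iff `extension` only adds to `base` without redefining anything."""
    for cls, attrs in extension.items():
        if cls in base:
            for attr, typ in attrs.items():
                if attr in base[cls] and base[cls][attr] != typ:
                    return False   # redefinition of an upstream attribute
    return True

UPPER = {"DomainModel": {"name": "String"}}

ok  = {"DomainModel": {"owner": "String"},       # new attribute: fine
       "AvroSchema":  {"namespace": "String"}}   # new class: fine
bad = {"DomainModel": {"name": "Int"}}           # redefines `name`: rejected

print(is_conservative(UPPER, ok), is_conservative(UPPER, bad))  # True False
```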
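
The "one source → many target schemas" pattern of takeaway 14 reduces, in miniature, to walking one domain-model structure and emitting each target language from it. The model shape, type mappings, and emitters below are invented for illustration — the post names the targets but does not describe UDA's transpiler pipeline.

```python
import json

# Hypothetical miniature of "model once, represent everywhere": a single
# domain model projected into two schema languages (GraphQL SDL and Avro).
DOMAIN_MODEL = {
    "Movie": {
        "key": "movieId",
        "attributes": {"movieId": "String", "title": "String"},
        "relationships": {"cast": "Actor"},
    },
}

def to_graphql(model):
    """Emit a GraphQL type per concept; the key attribute becomes non-null."""
    out = []
    for name, spec in model.items():
        fields = [f"  {a}: {t}!" if a == spec["key"] else f"  {a}: {t}"
                  for a, t in spec["attributes"].items()]
        fields += [f"  {r}: [{tgt}]" for r, tgt in spec["relationships"].items()]
        out.append(f"type {name} {{\n" + "\n".join(fields) + "\n}")
    return "\n\n".join(out)

def to_avro(model):
    """Emit an Avro record per concept, mapping the shared datatype names."""
    TYPES = {"String": "string"}
    return [{"type": "record", "name": name,
             "fields": [{"name": a, "type": TYPES[t]}
                        for a, t in spec["attributes"].items()]}
            for name, spec in model.items()]

print(to_graphql(DOMAIN_MODEL))
print(json.dumps(to_avro(DOMAIN_MODEL)))
```

"Preserving semantics" is the hard part elided here: both emitters read the same keyed-entity structure, so key-ness and relationships mean the same thing in every target.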
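
Takeaway 18's graph-walk-to-SQL pattern can be sketched as a breadth-first search over a concept graph whose edges carry the warehouse joins that realise each relationship. The tables, columns, and join conditions below are invented — the post does not show Sphere's internals.

```python
from collections import deque

# Hypothetical concept graph: nodes are business concepts, and each edge
# records the link table plus the join conditions that realise it.
EDGES = {
    ("actor", "movie"): ("appearance",
                         "appearance.actor_id = actor.id",
                         "appearance.movie_id = movie.id"),
}
NEIGHBORS = {"actor": ["movie"], "movie": ["actor"]}

def path(src, dst):
    """BFS over the concept graph; returns the concepts to traverse."""
    queue, seen = deque([[src]]), {src}
    while queue:
        p = queue.popleft()
        if p[-1] == dst:
            return p
        for n in NEIGHBORS.get(p[-1], []):
            if n not in seen:
                seen.add(n)
                queue.append(p + [n])
    return None

def to_sql(src, dst):
    """Compile the graph path into a SQL query, one JOIN pair per edge."""
    p = path(src, dst)
    joins = []
    for a, b in zip(p, p[1:]):
        link, on_a, on_b = EDGES[(a, b)]
        joins.append(f"JOIN {link} ON {on_a}")
        joins.append(f"JOIN {b} ON {on_b}")
    return f"SELECT * FROM {src}\n" + "\n".join(joins)

print(to_sql("actor", "movie"))
```

The manual join disappears because the path encodes it: selecting "actor" and "movie" yields the `appearance` join without the user ever naming that table.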

Architecture at a glance

                    ┌─────────────────────────────────────┐
                    │          Upper (metamodel)          │
                    │  self-referencing · self-describing │
                    │          self-validating            │
                    │  ↳ RDF + RDFS + OWL + SHACL subset  │
                    └──────────────────┬──────────────────┘
                                       │ conservative extension
       ┌───────────────┬──────────────┼──────────────┬───────────────┐
       ▼               ▼              ▼              ▼               ▼
  Business       System domain  System domain  System domain   Mappings
  domain models   for GraphQL    for Avro       for Data Mesh   (concept →
  (actor, movie,                                                 container)
  asset, …)

                                       │ transpile + project
   ┌─────────┐  ┌──────┐  ┌─────┐  ┌─────┐  ┌──────┐  ┌───────────┐
   │ GraphQL │  │ Avro │  │ SQL │  │ RDF │  │ Java │  │ Pipelines │
   │ schemas │  │      │  │     │  │     │  │ APIs │  │ (Mesh,    │
   └─────────┘  └──────┘  └─────┘  └─────┘  └──────┘  │  CDC,     │
                                                      │  Iceberg) │
                                                      └───────────┘

                 ┌────────────────────┐
                 │   UDA knowledge    │  ← named-graph-first
                 │   graph            │    info model
                 │ (all of the above  │    (each named graph
                 │  AS DATA)          │     conforms to a
                 └─────────┬──────────┘     governing model)
       ┌───────────────────┼────────────────────┐
       ▼                   ▼                    ▼
  ┌─────────┐        ┌──────────┐        ┌──────────┐
  │  PDM    │        │  Sphere  │        │ Anything │
  │ (ref    │        │ (self-   │        │ else —   │
  │  data + │        │  serve   │        │ Java /   │
  │  taxo-  │        │  SQL     │        │ GraphQL /│
  │  nomies)│        │  from    │        │ SPARQL   │
  │         │        │  concepts│        │ intro-   │
  │         │        │  via     │        │ spection)│
  │         │        │  graph   │        │          │
  │         │        │  walk)   │        │          │
  └─────────┘        └──────────┘        └──────────┘

Operational notes

  • Knowledge-graph foundation: RDF + RDFS + OWL + SHACL subset, restricted and generalised as Upper
  • Information model: named-graph-first — every named graph conforms to a governing named graph
  • Domain-model representation: data — expressed as conceptual RDF, organised into named graphs
  • Metamodel bootstrap: Upper is its own domain model (self-referencing / self-describing / self-validating)
  • Composition rule: all domain models are conservative extensions of Upper
  • Schema generation targets: GraphQL, Avro, SQL, RDF, Java (the transpiler family)
  • Runtime / API surface: generated Jena-based Java API + federated GraphQL service on the Enterprise GraphQL Gateway + SPARQL
  • Named production consumers: PDM (authoritative reference data + taxonomies) → Avro + GraphQL; Sphere (self-service reporting) → SQL via graph walk
  • Worked domain example: onepiece — Characters are related to Devil Fruits; a Devil Fruit has a type. Full Turtle definition: github.com/Netflix-Skunkworks/uda/blob/…/onepiece.ttl
  • Connected systems named: Enterprise GraphQL Gateway, Domain Graph Service framework, Data Mesh, Iceberg tables, CDC sources

Caveats and omissions

  • Architecture-overview voice. No fleet sizes, no graph cardinalities, no number of onboarded domains, no SPARQL / Jena / GraphQL QPS, no Upper-runtime latency / memory numbers, no transpiler compile-time numbers. This is the introduction post in a series — the Netflix-Skunkworks/uda repo is the only concrete artefact linked.
  • Transpiler mechanics undisclosed. The post names the target languages (GraphQL / Avro / SQL / RDF / Java) and says "semantics are preserved", but does not describe the transpiler pipeline, AST layers, conflict resolution when system-domain mappings disagree, or evolution/versioning of generated schemas under domain-model change.
  • Mappings model underspecified. Mappings connect domain concepts to data containers, but the mappings domain model itself (its classes + attributes + Upper extension shape) is mentioned without being shown.
  • Named-graph resolution mechanism unnamed. The post criticises owl:imports for applying only to ontologies + states UDA needed a generalised mechanism, but does not name or describe UDA's replacement.
  • Governance boundaries gestured at, not defined. "Manage ontology ownership, or define governance boundaries" are named as gaps SHACL + vanilla RDF don't cover, but the post doesn't specify UDA's governance + ownership primitives.
  • No adoption metrics. Number of domain models registered, number of systems projected, number of PDM taxonomies shipped, Sphere user count or query volume — all undisclosed.
  • No quantitative lift. No before/after numbers on the four named pain points (duplication / terminology / quality / connectivity). Qualitative wins only.
  • RDF + Jena + SPARQL throughput caveats not addressed. Industry scuttlebutt: SPARQL + triple-stores have historically under-performed property-graph / warehouse queries at scale. How UDA handles this at Netflix's scale isn't discussed (Sphere specifically compiles the walk down to SQL against the warehouse — potentially a deliberate answer to this — but the post doesn't frame it that way).
  • Federation story incomplete. Upper is projected into a GraphQL service federated into the Enterprise GraphQL Gateway, but per-domain ownership, PR workflows, and how conflicting model-change proposals are reconciled across teams are not described.
  • First in a series. The post is explicitly an introduction to UDA's foundations. Information-infrastructure details (UI, authoring, pipelines, production scale) are deferred to later posts.

Source

sources/2025-06-14-netflix-model-once-represent-everywhere-uda