Graph-walk SQL generation

Pattern

Give analysts a concept-level query interface — "I want actors joined with the movies they appeared in and the territories those movies aired in" — then walk a knowledge graph whose nodes are business concepts + data containers and whose edges are mappings, to generate the SQL that answers the query against the underlying warehouse.

The graph path is the JOIN specification. The user picks the concept endpoints; the system figures out the tables, columns, and join keys.

Netflix names this directly in the UDA post (sources/2025-06-14-netflix-model-once-represent-everywhere-uda):

"Sphere is our self-service operational reporting tool for business users. Sphere uses UDA to catalog and relate business concepts across systems, enabling discovery through familiar terms like 'actor' or 'movie.' Once concepts are selected, Sphere walks the knowledge graph and generates SQL queries to retrieve data from the warehouse, no manual joins or technical mediation required."

Mechanism

The required structural ingredients are:

  1. A concept graph. Business concepts (concepts/domain-model) are named nodes: actor, movie, territory, etc.

  2. A container graph. Storage-layer schemas (concepts/data-container) are named nodes — warehouse tables, Iceberg tables, views, Avro records.

  3. A mapping graph. Edges connect each concept to the container(s) holding its instance data; each edge is annotated with the join keys / ID encodings / type mappings needed to translate concept-level references into SQL expressions.

  4. Graph traversal → SQL algebra. A walk from a set of selected concepts through the mapping graph collects:

     • the relevant tables (container endpoints),
     • the join predicates (from the mapping edges),
     • the column expressions (from the concept-to-container attribute mapping).

     The traversal composes these into a query plan, emitted as SQL against the warehouse.

  5. Execution by the native engine. The generated SQL runs on the warehouse's native execution engine (Iceberg / Spark / Presto / etc.), not against the knowledge graph — the graph is the planner, the warehouse is the executor.
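The steps above can be sketched in a few lines. This is a minimal illustration, not the UDA implementation: the table names, key columns, and the assumption of a linear concept path are all invented.

```python
# Minimal sketch of graph-walk SQL generation. Concept names, tables,
# and join keys below are invented for illustration.

# Each concept maps to the container holding its instances, plus its key column.
CONCEPT_TO_CONTAINER = {
    "actor":     ("dw.actors", "actor_id"),
    "movie":     ("dw.movies", "movie_id"),
    "territory": ("dw.territories", "territory_id"),
}

# Mapping edges between concepts carry the join spec: bridge table + key columns.
MAPPING_EDGES = {
    ("actor", "movie"):     ("dw.appearances", "actor_id", "movie_id"),
    ("movie", "territory"): ("dw.airings", "movie_id", "territory_id"),
}

def walk_to_sql(concepts):
    """Compose a JOIN chain by walking consecutive concept pairs."""
    first_table, _ = CONCEPT_TO_CONTAINER[concepts[0]]
    sql = ["SELECT *", f"FROM {first_table} AS {concepts[0]}"]
    for left, right in zip(concepts, concepts[1:]):
        bridge, left_key, right_key = MAPPING_EDGES[(left, right)]
        _, left_col = CONCEPT_TO_CONTAINER[left]
        right_table, right_col = CONCEPT_TO_CONTAINER[right]
        sql.append(f"JOIN {bridge} ON {bridge}.{left_key} = {left}.{left_col}")
        sql.append(f"JOIN {right_table} AS {right} "
                   f"ON {right}.{right_col} = {bridge}.{right_key}")
    return "\n".join(sql)

print(walk_to_sql(["actor", "movie", "territory"]))
```

The user supplies only the concept list; every table and predicate comes from the graph, which is the crux of "the graph path is the JOIN specification."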

Why this beats classical "semantic layers"

The pattern is a modern-enterprise-scale reimagining of the old BI semantic layer idea (Cognos / BusinessObjects / Looker LookML / dbt Semantic Layer), with a key difference: the semantic layer here is a first-class knowledge graph with shared domain models, not a per-BI-tool proprietary format.

  • One model, many tools. Because the graph is UDA, the same concept-graph powers Sphere, the warehouse, the authoring UI, any future reporting tool.
  • Semantic correctness by construction. Domain concepts + their container mappings have been authored + reviewed in UDA; Sphere cannot accidentally invent a wrong JOIN because the JOIN is dictated by the graph.
  • Access control on edges. Row/column/concept-level permissions can ride on mapping edges without Sphere-side bespoke code.
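The last point can be made concrete with a tiny sketch (edge metadata and role names invented): if permissions ride on mapping edges, pruning happens before planning, and the reporting tool needs no bespoke access logic.

```python
# Hypothetical sketch: role-based permissions carried on mapping edges.
# Edge specs and roles are invented for illustration.
MAPPING_EDGES = {
    ("actor", "movie"):      {"bridge": "dw.appearances", "roles": {"analyst", "admin"}},
    ("movie", "deal_terms"): {"bridge": "dw.deals",       "roles": {"admin"}},
}

def visible_edges(role):
    """Filter the mapping graph so the planner only ever sees permitted edges."""
    return {pair: edge for pair, edge in MAPPING_EDGES.items()
            if role in edge["roles"]}

# An analyst's walk can never generate a JOIN through dw.deals:
# the edge is gone before SQL generation starts.
assert ("movie", "deal_terms") not in visible_edges("analyst")
```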

Contrast with federated SPARQL

An alternative would be to execute the query as SPARQL over the knowledge graph itself, traversing the container instance data as RDF. UDA's Sphere deliberately doesn't do that — it compiles down to SQL against the warehouse. That keeps the warehouse's native performance (columnar storage, vectorised execution, join planning) and sidesteps SPARQL's historical scale limitations. The knowledge graph is a metadata substrate, not an execution substrate — an important design choice for an enterprise-scale deployment.

Caveats

  • Graph quality is the ceiling. If mappings are wrong, ambiguous, or missing, Sphere produces wrong or empty queries. UDA invests heavily in making the mapping graph correct.
  • Ambiguous paths. Multiple walks between the same two concepts mean multiple possible JOINs with different semantics; the user + UI have to pick. The UDA post doesn't cover this disambiguation.
  • Not every query is graph-walkable. Aggregates spanning many concepts + filters on derived attributes may not have a clean graph-walk encoding; complex reports still need hand-written SQL.
  • Performance of the generated SQL. Graph-derived JOINs can produce expensive query plans; planner-side optimisation becomes its own concern.
  • Netflix discloses no numbers. User count, query QPS, generated-SQL complexity profile — all undisclosed.
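The ambiguous-path caveat is easy to demonstrate. With an invented concept graph offering two routes between the same endpoints, simple-path enumeration surfaces the choice the user or UI must make:

```python
# Sketch: enumerating all simple paths between two concepts to surface
# JOIN ambiguity. The graph below is invented for illustration.
CONCEPT_GRAPH = {
    "actor":     ["movie", "contract"],
    "movie":     ["territory"],
    "contract":  ["territory"],   # a second, semantically different route
    "territory": [],
}

def all_paths(graph, start, goal, path=None):
    """Depth-first enumeration of simple paths from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    return [p for nxt in graph[start] if nxt not in path
              for p in all_paths(graph, nxt, goal, path)]

paths = all_paths(CONCEPT_GRAPH, "actor", "territory")
# Two candidate JOIN chains ("where movies aired" vs "where contracts apply"):
# the system must ask the user, or apply a cost/semantic heuristic.
assert len(paths) == 2
```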

Seen in
