
PATTERN Cited by 1 source

Knowledge graph for MDM modeling

The pattern

In an MDM project, do not author the logical data model of the golden record directly. Instead, author a knowledge graph built from the following elements:

  • System — name of one source system.
  • Table — a table from a particular system.
  • Column — one column with its schema type info.
  • Concept — a business concept (e.g. Address, Business Partner).
  • Attribute — a property of a concept.
  • Relationship — a directed, typed edge between two concepts.

Domain experts map each source column to a Concept / Attribute / Relationship role, optionally marking the mapping as direct (1:1 copy) or indirect (requires a transformation algorithm). A generator walks the graph and emits both (1) the logical data model of the golden record and (2) the transformation data model per source system.
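
A minimal sketch of how such mappings could be held in memory and turned into a per-system transformation model. The class names, field names, and sample systems (`crm`, `erp`) are hypothetical illustrations, not taken from Zalando's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One source column, identified by system and table (hypothetical shape)."""
    system: str
    table: str
    name: str
    sql_type: str


@dataclass(frozen=True)
class Mapping:
    """A domain expert's decision: which role a source column plays."""
    column: Column
    role: str            # "concept", "attribute" or "relationship"
    target: str          # e.g. "BusinessPartner.name"
    direct: bool = True  # False means a transformation algorithm is needed


def transformation_model(mappings):
    """Group mappings by source system into (source, target, kind) entries."""
    model = defaultdict(list)
    for m in mappings:
        source = f"{m.column.table}.{m.column.name}"
        kind = "direct" if m.direct else "indirect"
        model[m.column.system].append((source, m.target, kind))
    return dict(model)


# Assumed sample data for illustration only.
MAPPINGS = [
    Mapping(Column("crm", "partners", "partner_name", "varchar(200)"),
            role="attribute", target="BusinessPartner.name"),
    Mapping(Column("erp", "vendor", "addr_line", "varchar(500)"),
            role="attribute", target="Address.street", direct=False),
]

MODEL = transformation_model(MAPPINGS)
```

Here `MODEL["erp"]` holds a single entry tagged `indirect`, which signals that `vendor.addr_line` cannot be copied 1:1 into the golden record.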

Why it works

  1. Decouples authoring from output. The manual effort concentrates where expertise lives (domain experts naming / mapping concepts); the mechanical effort (emitting schemas) is automated.
  2. Scales with column count, not table count. Manual diagramming was intractable "for tens of tables and hundreds of columns"; with graph-driven generation, the human effort is limited to mapping each column, and the schemas are regenerated rather than redrawn.
  3. Single source of truth for logical model + lineage + transformation. All three deliverables fall out of the same graph. No parallel maintenance of SQL DDL, transformation specs, and lineage docs.
  4. Visualisable. The graph can be rendered as a diagram (Zalando uses Neo4j for this) that non-technical domain experts can read. See [[patterns/visual-graph-for-business-engineering-alignment]].

Concrete shape at Zalando

  • Python script walks the graph and outputs SQL DDL for the golden record: one table per Concept (columns = its Attributes + internal ID), one join table per Relationship (FKs = source + target Concept IDs) (sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).
  • The same walk also emits the transformation data model: per source system, a list of (source column → target concept/attribute/relationship) entries tagged as direct or indirect.
  • Stored + visualised in Neo4j. The visualisation is the primary business-engineering communication artifact.
  • Scale: "tens of tables and hundreds of columns" — the point at which manual diagram maintenance breaks down but a single Python generator still suffices.
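
The DDL rule described above (one table per Concept, one join table per Relationship) can be sketched as follows. This is an illustration under assumptions, not Zalando's script: the function name, the input shapes, the `BIGINT` ID type, and the `source_`/`target_` column prefixes are all hypothetical choices.

```python
def golden_record_ddl(concepts, relationships):
    """Emit SQL DDL for the golden record.

    concepts:      {concept_name: {attribute_name: sql_type}}
    relationships: [(relationship_name, source_concept, target_concept)]
    """
    statements = []

    # One table per Concept: an internal ID plus one column per Attribute.
    for concept, attributes in concepts.items():
        cols = [f"  {concept}_id BIGINT PRIMARY KEY"]
        cols += [f"  {attr} {sql_type}" for attr, sql_type in attributes.items()]
        statements.append(f"CREATE TABLE {concept} (\n" + ",\n".join(cols) + "\n);")

    # One join table per Relationship: FKs to the source and target Concept IDs.
    # The prefixes keep column names distinct for self-referencing relationships.
    for name, source, target in relationships:
        cols = [
            f"  source_{source}_id BIGINT NOT NULL REFERENCES {source} ({source}_id)",
            f"  target_{target}_id BIGINT NOT NULL REFERENCES {target} ({target}_id)",
        ]
        statements.append(f"CREATE TABLE {name} (\n" + ",\n".join(cols) + "\n);")

    return "\n\n".join(statements)


# Assumed sample graph content for illustration only.
CONCEPTS = {
    "business_partner": {"name": "VARCHAR(200)"},
    "address": {"street": "VARCHAR(200)", "city": "VARCHAR(100)"},
}
RELATIONSHIPS = [("has_address", "business_partner", "address")]

DDL = golden_record_ddl(CONCEPTS, RELATIONSHIPS)
```

In a real setup the `concepts` and `relationships` inputs would be read from the graph store rather than hard-coded; they are inlined here so the sketch stays self-contained.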

Trade-offs

  • Requires upfront investment in the graph model. The Concept / Attribute / Relationship vocabulary must be agreed before domain experts can contribute mappings. Zalando names the coordination requirement: "the exact same name for concepts, attributes, and relationships is required. This is done by cross-referencing system's business concepts and unifying their wording."
  • Handles structural mapping, not entity resolution. The pattern answers "which column maps to which concept" but not "does this row in System A correspond to the same business partner as that row in System B" — the match-and-merge problem is orthogonal.
  • Design-time pattern, not runtime. The graph is queried by generators at design time. Runtime MDM operations (ingest, consolidation, serving) use the generated schemas directly.
  • Property-graph scale limits. At Zalando's tens-of-tables scale, Neo4j handles authoring + viz comfortably. Large enterprise MDM deployments may hit the same concerns Dropbox cited when rejecting graph DBs for agentic retrieval (concepts/knowledge-graph).
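
The naming requirement in the first trade-off lends itself to a simple pre-check before mappings enter the graph. The sketch below flags names that differ only in case, spacing, or punctuation across systems; the function names and sample data are hypothetical, and the check does not detect true synonyms (for example "Vendor" versus "Business Partner"), which still need a domain expert.

```python
from collections import defaultdict


def normalise(name):
    """Reduce a name to lowercase letters and digits for comparison."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def inconsistent_names(names_by_system):
    """Return {normalised_key: sorted spellings} where spellings disagree."""
    spellings = defaultdict(set)
    for names in names_by_system.values():
        for name in names:
            spellings[normalise(name)].add(name)
    return {
        key: sorted(variants)
        for key, variants in spellings.items()
        if len(variants) > 1
    }


# Assumed concept names as entered per source system (illustration only).
NAMES = {
    "crm": ["Business Partner", "Address"],
    "erp": ["business_partner", "Address", "Vendor"],
}

CONFLICTS = inconsistent_names(NAMES)
```

With this sample input, `CONFLICTS` reports the two spellings of the business-partner concept so they can be unified before generation.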

Contrast with adjacent patterns

  • patterns/model-once-represent-everywhere (Netflix UDA) — also uses a knowledge graph as single source of truth for many downstream schemas, but at enterprise scale with RDF + SHACL + named graphs, with a metamodel governing the authoring language. Zalando's property-graph approach is lower-ceremony and scoped to one MDM project.
  • patterns/schema-transpilation-from-domain-model (Netflix UDA) — transpiles domain models into GraphQL / Avro / SQL / RDF / Java. Zalando's pattern transpiles into just two targets (logical data model + transformation data model) but embeds a mapping from external source schemas as a first-class node type.
  • patterns/mapping-driven-schema-generation — the generalisation: any workflow where the mapping is authoritative and both the target schema and the transformation code are derivatives. Zalando's MDM instance is one concrete realisation.

Seen in
