PATTERN Cited by 1 source
Knowledge graph for MDM modeling¶
The pattern¶
In an MDM project, do not author the logical data model of the golden record directly. Instead, author a knowledge graph whose nodes are:
- System — name of one source system.
- Table — a table from a particular system.
- Column — one column with its schema type info.
- Concept — a business concept (e.g. Address, Business Partner).
- Attribute — a property of a concept.
- Relationship — a directed, typed edge between two concepts.
Domain experts map each source column to a Concept / Attribute / Relationship role, optionally marking the mapping as direct (1:1 copy) or indirect (requires a transformation algorithm). A generator walks the graph and emits both (1) the logical data model of the golden record and (2) the transformation data model per source system.
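A minimal sketch of how the node types and mappings above could be represented in Python. The class names, field names, and sample columns are illustrative assumptions, not Zalando's actual schema; System and Table are folded into fields of `Column` to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class MappingKind(Enum):
    DIRECT = "direct"      # 1:1 copy
    INDIRECT = "indirect"  # requires a transformation algorithm


@dataclass(frozen=True)
class Column:
    system: str    # name of the source system
    table: str     # table within that system
    name: str      # column name
    sql_type: str  # schema type info


@dataclass(frozen=True)
class Attribute:
    concept: str   # owning business concept
    name: str


@dataclass(frozen=True)
class Relationship:
    source: str    # source concept
    name: str      # edge type
    target: str    # target concept


@dataclass(frozen=True)
class Mapping:
    column: Column
    target: Union[Attribute, Relationship]
    kind: MappingKind = MappingKind.DIRECT


# Hypothetical mappings a domain expert might author for one source table.
mappings = [
    Mapping(Column("crm", "partners", "street", "VARCHAR(255)"),
            Attribute("Address", "street")),
    Mapping(Column("crm", "partners", "full_name", "VARCHAR(255)"),
            Attribute("Business Partner", "name"), MappingKind.INDIRECT),
    Mapping(Column("crm", "partners", "address_id", "BIGINT"),
            Relationship("Business Partner", "has", "Address")),
]

# Columns that still need a transformation algorithm.
indirect = [m.column.name for m in mappings if m.kind is MappingKind.INDIRECT]
print(indirect)
```

Keeping the mapping as its own record, rather than a property on the column, lets one column feed several targets if the experts decide it should.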
Why it works¶
- Decouples authoring from output. The manual effort concentrates where expertise lives (domain experts naming / mapping concepts); the mechanical effort (emitting schemas) is automated.
- Scales with column count, not table count. Manual diagramming was intractable "for tens of tables and hundreds of columns"; with graph-driven generation, human effort is limited to mapping each column once, and every schema is emitted mechanically from those mappings.
- Single source of truth for logical model + lineage + transformation. All three deliverables fall out of the same graph. No parallel maintenance of SQL DDL, transformation specs, and lineage docs.
- Visualisable. The graph can be rendered as a diagram (Zalando uses Neo4j for this) that non-technical domain experts can read. See patterns/visual-graph-for-business-engineering-alignment.
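To illustrate the single-source-of-truth point, lineage can be read off the same mappings that drive schema generation. The sketch below assumes mappings are stored as plain tuples; the system, table, and column names are hypothetical.

```python
from collections import defaultdict

# (system, table, column, target) tuples; all names are made up for illustration.
mappings = [
    ("crm", "partners", "addr_line", "Address.street"),
    ("erp", "vendor", "street", "Address.street"),
    ("erp", "vendor", "city", "Address.city"),
]


def lineage(mappings):
    """Index source columns by the golden-record attribute they feed."""
    index = defaultdict(list)
    for system, table, column, target in mappings:
        index[target].append(f"{system}.{table}.{column}")
    return dict(index)


upstream = lineage(mappings)
print(upstream["Address.street"])  # ['crm.partners.addr_line', 'erp.vendor.street']
```

No separate lineage document is maintained here: the index is rebuilt from the mappings whenever it is needed.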
Concrete shape at Zalando¶
- Python script walks the graph and outputs SQL DDL for the golden record: one table per Concept (columns = its Attributes + internal ID), one join table per Relationship (FKs = source + target Concept IDs) (sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).
- The same walk also emits the transformation data model: per source system, a list of (source column → target concept/attribute/relationship) entries tagged as direct or indirect.
- Stored + visualised in Neo4j. The visualisation is the primary business-engineering communication artifact.
- Scale: "tens of tables and hundreds of columns" — the point at which manual diagram maintenance breaks down but a single Python generator still suffices.
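The two outputs described above can be sketched as a pair of small generator functions. This is not Zalando's script: the input shapes, the `sql_name` helper, the `BIGINT` ID type, the join-table naming scheme, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def sql_name(label):
    """Turn a graph label such as 'Business Partner' into 'business_partner'."""
    return label.strip().lower().replace(" ", "_")


def golden_record_ddl(concepts, relationships):
    """Emit one table per Concept and one join table per Relationship.

    concepts: {concept: [(attribute, sql_type), ...]}
    relationships: [(source_concept, relationship_name, target_concept), ...]
    """
    statements = []
    for concept, attributes in concepts.items():
        columns = ["id BIGINT PRIMARY KEY"]  # internal ID (type is an assumption)
        columns += [f"{sql_name(attr)} {sql_type}" for attr, sql_type in attributes]
        statements.append(
            f"CREATE TABLE {sql_name(concept)} (\n  " + ",\n  ".join(columns) + "\n);"
        )
    for source, name, target in relationships:
        table = f"{sql_name(source)}_{sql_name(name)}_{sql_name(target)}"
        # Generic column names avoid a clash when source and target are the same concept.
        columns = [
            f"source_id BIGINT NOT NULL REFERENCES {sql_name(source)}(id)",
            f"target_id BIGINT NOT NULL REFERENCES {sql_name(target)}(id)",
        ]
        statements.append(f"CREATE TABLE {table} (\n  " + ",\n  ".join(columns) + "\n);")
    return statements


def transformation_model(mappings):
    """Group (system, table, column, target, kind) mappings by source system."""
    per_system = defaultdict(list)
    for system, table, column, target, kind in mappings:
        per_system[system].append(
            {"source": f"{table}.{column}", "target": target, "kind": kind}
        )
    return dict(per_system)


# Hypothetical graph content.
concepts = {
    "Business Partner": [("name", "TEXT")],
    "Address": [("street", "TEXT"), ("city", "TEXT")],
}
relationships = [("Business Partner", "has", "Address")]
mappings = [
    ("crm", "partners", "full_name", "Business Partner.name", "direct"),
    ("crm", "partners", "addr_line", "Address.street", "indirect"),
    ("erp", "vendor", "city", "Address.city", "direct"),
]

ddl = golden_record_ddl(concepts, relationships)
model = transformation_model(mappings)
print("\n\n".join(ddl))
```

Both functions read the same inputs, so a renamed concept or a new mapping changes the DDL and the transformation model together.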
Trade-offs¶
- Requires upfront investment in the graph model. The Concept / Attribute / Relationship vocabulary must be agreed before domain experts can contribute mappings. Zalando names the coordination requirement: "the exact same name for concepts, attributes, and relationships is required. This is done by cross-referencing system's business concepts and unifying their wording."
- Handles structural mapping, not entity resolution. The pattern answers "which column maps to which concept" but not "does this row in System A correspond to the same business partner as that row in System B" — the match-and-merge problem is orthogonal.
- Design-time pattern, not runtime. The graph is queried by generators at design time. Runtime MDM operations (ingest, consolidation, serving) use the generated schemas directly.
- Property-graph scale limits. At Zalando's tens-of-tables scale, Neo4j handles authoring + viz comfortably. Large enterprise MDM deployments may hit the same concerns Dropbox cited when rejecting graph DBs for agentic retrieval (concepts/knowledge-graph).
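The naming requirement in the first trade-off lends itself to a mechanical check before mappings enter the graph. The sketch below is one possible approach, not something the source describes: it treats names that differ only in case, spacing, or punctuation as the same intended concept and reports the spellings that disagree. The sample names are hypothetical.

```python
import re
from collections import defaultdict


def inconsistent_names(names):
    """Return groups of names that collapse to the same normalised key
    but were submitted with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        key = re.sub(r"[^a-z0-9]", "", name.casefold())
        groups[key].add(name)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# Concept names as different domain experts might have typed them.
submitted = ["Business Partner", "business_partner", "Address", "BusinessPartner", "Address"]
conflicts = inconsistent_names(submitted)
print(conflicts)
```

A check like this only catches spelling variants; synonyms such as "Vendor" and "Supplier" still need the manual cross-referencing the source describes.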
Contrast with adjacent patterns¶
- patterns/model-once-represent-everywhere (Netflix UDA) — also uses a knowledge graph as single source of truth for many downstream schemas, but at enterprise scale with RDF + SHACL + named graphs, with a metamodel governing the authoring language. Zalando's property-graph approach is lower-ceremony and scoped to one MDM project.
- patterns/schema-transpilation-from-domain-model (Netflix UDA) — transpiles domain models into GraphQL / Avro / SQL / RDF / Java. Zalando's pattern transpiles into just two targets (logical data model + transformation data model) but embeds a mapping from external source schemas as a first-class node type.
- patterns/mapping-driven-schema-generation — the generalisation: any workflow where the mapping is authoritative and both the target schema and the transformation code are derivatives. Zalando's MDM instance is one concrete realisation.
Seen in¶
- sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition — Zalando's MDM team uses this pattern to generate the logical data model and transformation data model for their consolidated-style MDM tool. First canonical wiki instance.
Related¶
- concepts/knowledge-graph — the structural substrate
- concepts/master-data-management — the problem domain
- concepts/logical-data-model · concepts/transformation-data-model — the two generated schemas
- concepts/semantic-layer-of-business-concepts — the middle layer the pattern creates
- concepts/data-lineage — the side-effect capability
- systems/zalando-mdm-system — canonical wiki instance
- systems/neo4j — the graph tooling used
- patterns/mapping-driven-schema-generation — the generalised pattern
- patterns/visual-graph-for-business-engineering-alignment — the communication pattern that pairs with it