
ZALANDO 2021-07-28


Zalando — Knowledge Graph Technologies Accelerate and Improve the Data Model Definition for Master Data

Summary

Zalando describes how they use knowledge graph technologies to accelerate the design of a Master Data Management (MDM) tool: specifically, how to derive the logical data model of the consolidated "golden record" and the transformation data model (per-source-column mappings) from a manually curated graph of business concepts, attributes, and relationships. The central move is to split the manual work (domain experts map each source column to a business concept, attribute, or relationship) from the generated artifacts (the golden-record schema and the per-system transformation schema, both produced by a Python script walking the graph). The resulting graph, visualised in Neo4j, doubles as a communication artifact with non-technical domain experts who can't read SQL schemas or spreadsheets. The post is scoped to tens of tables and hundreds of columns, early-phase MDM, and a consolidated-style implementation (as opposed to registry-style or coexistence-style MDM). Two advantages are claimed: (1) accelerated business-engineering dialogue about the golden-record model, and (2) queryable deliverables (logical model + mapping) in place of error-prone manual maintenance.

Key takeaways

  1. The knowledge graph is the source-of-truth; the schemas are derivatives. Zalando stores system tables, system columns, business concepts, attributes, and relationships as nodes in a named directed graph. The logical data model and the transformation data model are generated artifacts — "systematically created (via a Python script) from the concepts, attributes, and relationships". Each concept becomes a table (columns = its attributes + internal identifier); each relationship becomes a join table (foreign keys = source and target concept identifiers). This reverses the usual MDM workflow, where the logical model is authored directly and the mappings accumulate as textual sidecars (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).

  2. Six node types, three manual, three derivable. The graph schema is:

     • System: the name of one source system.
     • Table: a table from a particular system.
     • Column: one column of a table, with its data type.
     • Concept: a business concept (e.g. Address, Business Partner).
     • Attribute: a property of a concept (e.g. street name on Address).
     • Relationship: a directed edge between two concepts (e.g. Business Partner "has contact" Address).

     The first three (System / Table / Column) are derivable from the source system schemas. The latter three (Concept / Attribute / Relationship) are the manual contribution, authored by domain experts. Every column is mapped to at least one concept and optionally to an attribute or relationship direction (source or target) (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).

  3. Mappings are either direct or indirect. A column maps directly (one-to-one) to a concept's attribute when no transformation is required (e.g. System B's zip_code → Address.postal code). A column maps indirectly (one-to-many) when ingestion requires a transformation algorithm (e.g. System A's address_line_1, _2, _3 must be parsed into structured street / city / postal-code attributes). The graph labels both edge types, which lets the transformation-data-model generator emit either a 1:1 column copy or a call into a transformation function per source column (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).

  4. Neo4j is used for visualisation, not storage semantics. Zalando names Neo4j explicitly: "We are using Neo4j to create these human-readable images about the mappings, since it has, in our opinion, the best look-and-feel in the current landscape of knowledge graph technologies". The post's justification for Neo4j is explicitly UX / domain-expert communication ("most domain experts can read these images much better than the above mentioned data model definition files"), not query-path or scale. The choice is an auto-generation story: "creating images manually would generate more manual and error-prone work" given tens of tables and hundreds of columns (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).

  5. Business-engineering alignment is named as the #1 return. The post is unusually explicit that the primary benefit is dialogue quality, not technology: "The dialogue between business and technology in designing the golden record logical data model has improved and accelerated the process of creating a correct model." The visual graph is the artifact both sides can read; the domain expert sees their business concepts in the same picture as the engineer's columns, and SQL schemas / spreadsheets no longer gatekeep understanding. This is positioned as the fix for the "limited business know-how" problem named in the MDM drawbacks list (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).

  6. Data lineage falls out of the graph for free. Because every source column → concept / attribute / relationship edge is recorded, the graph "keep[s] a record of data lineage from each system to the golden record". Any golden-record field can be queried back to every contributing source column across every source system, and vice versa. This is an ancillary benefit, not the motivating use case, but it is a capability most MDM projects pay for separately via a lineage tool (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).

  7. The post scopes itself to consolidated-style MDM in early phase. Zalando states: "At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style." Consolidated style = ingest from source systems, process through match-and-merge, cleanse and quality-assure, then store centrally per a canonical model, with the golden record "published back to the source systems for consideration and possible correction." The post explicitly does not describe the match-and-merge subsystem, the data-quality layer, or the publish-back mechanism; it covers only the data-model-definition phase (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).
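
As a rough illustration of how the direct / indirect edge labels in the takeaways could drive a transformation-data-model generator, here is a minimal Python sketch. It is not Zalando's script, which the post does not show. The `Column` and `Mapping` classes, the table names `partner` and `address`, the `op` values, and the `parse_<table>_<column>` function-naming convention are all assumptions made for the example; only the column names and the direct / indirect distinction come from the post.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Column:
    system: str
    table: str   # table names below are hypothetical
    name: str


@dataclass(frozen=True)
class Mapping:
    column: Column
    concept: str
    attribute: Optional[str]  # None: the column feeds the concept as a whole
    kind: Literal["direct", "indirect"]


def transformation_model(mappings, system):
    """Emit one row per mapped source column of the given system."""
    rows = []
    for m in mappings:
        if m.column.system != system:
            continue
        source = f"{m.column.table}.{m.column.name}"
        target = m.concept if m.attribute is None else f"{m.concept}.{m.attribute}"
        if m.kind == "direct":
            # Direct mapping: a plain 1:1 column copy.
            rows.append({"source": source, "target": target, "op": "copy"})
        else:
            # Indirect mapping: reference a transformation function by name.
            # The naming scheme is an assumption, not taken from the post.
            rows.append({
                "source": source,
                "target": target,
                "op": "transform",
                "function": f"parse_{m.column.table}_{m.column.name}",
            })
    return rows


MAPPINGS = [
    Mapping(Column("system_a", "partner", "address_line_1"), "Address", None, "indirect"),
    Mapping(Column("system_b", "address", "zip_code"), "Address", "postal_code", "direct"),
]

rows_a = transformation_model(MAPPINGS, "system_a")
rows_b = transformation_model(MAPPINGS, "system_b")
```

Keeping the output as plain rows (rather than executable code) mirrors the post's framing of the transformation data model as a deliverable that is generated and queried, with the transformation functions themselves authored elsewhere.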

Systems extracted

  • systems/neo4j — the property-graph database / visualisation engine Zalando uses to render the mapping graph for domain experts. First wiki system page for Neo4j.
  • systems/zalando-mdm-system — Zalando's in-house Master Data Management component (in-design as of mid-2021). Uses the knowledge graph as the model-authoring substrate from which the logical data model + transformation data model are generated.

Concepts extracted

  • concepts/master-data-management: the discipline "in which business and Information Technology work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets." First wiki concept page.
  • concepts/golden-record: the MDM-specific term for "a common, shared, and trusted view on data for a particular domain," produced by consolidation over multiple source systems.
  • concepts/logical-data-model: the schema-of-tables-and-columns deliverable. In Zalando's graph workflow it is generated from concepts, attributes, and relationships rather than directly authored.
  • concepts/transformation-data-model: the per-system mapping artifact showing how each source column maps (directly or indirectly) to the golden-record schema. Required alongside the logical model in any consolidated MDM.
  • concepts/semantic-layer-of-business-concepts: the graph of Concept / Attribute / Relationship nodes sitting between source-system tables and the golden-record schema. This is the "shared conceptual vocabulary" that makes business-engineering alignment tractable.
  • concepts/knowledge-graph: extended with Zalando MDM as its third canonical instance, in which business concepts, attributes, and relationships are nodes in a named directed graph used as a data-modeling substrate for MDM.
  • concepts/data-lineage: extended. Zalando's MDM graph records lineage as a side effect of the column→concept mappings; golden-record fields are traceable back to source columns.
  • concepts/domain-model · concepts/semantic-interoperability: Zalando's usage extends these concepts out of Netflix UDA framing into a practical MDM setting at smaller scale.
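
To make the data-lineage entry concrete, the following is a small Python sketch of lineage lookups in both directions over column→attribute edges. It assumes the edges have already been exported from the graph as plain tuples; the table names and the exact split of System A's address lines across attributes are illustrative guesses, since the post does not spell them out.

```python
from collections import defaultdict

# (system, table, column) -> (concept, attribute); sample values are illustrative.
EDGES = [
    (("system_a", "partner", "address_line_1"), ("Address", "street_name")),
    (("system_a", "partner", "address_line_3"), ("Address", "postal_code")),
    (("system_a", "partner", "address_line_3"), ("Address", "city")),
    (("system_b", "address", "street"), ("Address", "street_name")),
    (("system_b", "address", "zip_code"), ("Address", "postal_code")),
]


def build_lineage(edges):
    """Index the mapping edges in both directions."""
    forward = defaultdict(set)   # source column -> golden-record fields
    backward = defaultdict(set)  # golden-record field -> source columns
    for column, field in edges:
        forward[column].add(field)
        backward[field].add(column)
    return dict(forward), dict(backward)


forward, backward = build_lineage(EDGES)


def sources_of(field):
    """All source columns contributing to one golden-record field."""
    return backward.get(field, set())


def targets_of(column):
    """All golden-record fields fed by one source column."""
    return forward.get(column, set())
```

For example, `sources_of(("Address", "postal_code"))` returns the contributing columns from both systems, which is the "golden record back to source" direction the post describes.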

Patterns extracted

  • patterns/knowledge-graph-for-mdm-modeling: store System / Table / Column / Concept / Attribute / Relationship in a graph; generate the logical data model and transformation data model from the graph rather than authoring them directly.
  • patterns/mapping-driven-schema-generation: the broader pattern, in which the mapping (source → target) is authoritative and both the target schema and the transformation code are derivatives.
  • patterns/visual-graph-for-business-engineering-alignment: using auto-generated graph visualisations (Neo4j-rendered images) as the primary communication artifact with non-technical domain experts, in place of SQL DDL or spreadsheets.
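
The schema-generation rule behind these patterns (each concept becomes a table with its attributes plus an internal identifier; each relationship becomes a join table keyed by the source and target concept identifiers) can be sketched in a few lines of Python. This is an illustrative sketch only: the snake_case naming, the `_id` suffix, the `source_` / `target_` prefixes, and the `name` attribute on Business Partner are assumptions, and column data types are left out because the post does not describe them.

```python
def to_snake(name):
    """Normalise a business name such as 'Business Partner' to 'business_partner'."""
    return name.strip().lower().replace(" ", "_")


def logical_model(concepts, relationships):
    """Derive table -> column-name lists from concepts and relationships.

    concepts:      {concept name: [attribute names]}
    relationships: [(source concept, relationship name, target concept)]
    """
    tables = {}
    for concept, attributes in concepts.items():
        table = to_snake(concept)
        # One table per concept: internal identifier first, then its attributes.
        tables[table] = [f"{table}_id"] + [to_snake(a) for a in attributes]
    for source, name, target in relationships:
        join = f"{to_snake(source)}_{to_snake(name)}_{to_snake(target)}"
        # One join table per relationship; the prefixes keep the two foreign
        # keys distinct even when a concept relates to itself.
        tables[join] = [
            f"source_{to_snake(source)}_id",
            f"target_{to_snake(target)}_id",
        ]
    return tables


CONCEPTS = {
    "Address": ["street name", "postal code"],
    "Business Partner": ["name"],  # attribute assumed for the example
}
RELATIONSHIPS = [("Business Partner", "has contact", "Address")]

tables = logical_model(CONCEPTS, RELATIONSHIPS)
```

Because the tables are derived from the concept and relationship definitions alone, renaming a concept in the graph changes the generated schema everywhere at once, which is the point of treating the mapping as authoritative.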

Operational numbers

Zalando discloses one concrete scale figure:

  • "Currently, we are mapping tens of tables and hundreds of columns" — the project was at early-phase scope, large enough that manual diagram maintenance was intractable but small enough to be handled by a single Python generator script.

No latency, throughput, or QPS numbers are given. The post also gives no counts of source systems or golden-record entities, and no Neo4j cluster size. The address-concept worked example names exactly two source systems (System A with three address_line_N columns, System B with street / zip_code / city / country_code columns).

Caveats

  • Not a production deployment. The post describes the data-model-definition phase. The match-and-merge, cleansing, data-quality, and publish-back subsystems are named as separate components but are not discussed; whether they are built yet is unclear.
  • Neo4j is used for visualisation, not query. The post never describes runtime queries against the graph at ingest / consolidation time. The graph appears to be used at design time — Python walks it to generate schemas and the visualiser renders it for humans. Operational MDM at runtime is out of scope.
  • No open-source release. Unlike systems/zalando-postgres-operator, systems/skipper-proxy, or systems/randomizer-swift, this knowledge-graph + MDM component is not released. It's an internal technique post, not a tool post.
  • Small-scale; not generalised. "Tens of tables and hundreds of columns" is well below the scale at which graph-DB query performance becomes a concern (contrast Dropbox Dash's rejection of graph DBs for latency reasons, or Netflix UDA's RDF+SHACL substrate at enterprise scale). The Zalando solution may not generalise to the scale where the graph itself is performance-critical.
  • No match-and-merge semantics. The post avoids the hard part of MDM: entity resolution. Given System A's business_partner_id=42 and System B's id=99, how does the system decide they refer to the same business partner? Zalando's post handles the structural mapping (which column maps to which concept), not the instance-level match.
  • Transformation functions are not detailed. Indirect mappings imply per-source transformation functions (e.g. how address_line_1/2/3 becomes structured attributes). How these functions are authored, stored, tested, or versioned is undiscussed.
  • Graph-authoring workflow unstated. The domain-expert contribution flow is named at high level ("a domain expert can provide us with these definitions and some coordination that the exact same name for concepts, attributes, and relationships is required") but the concrete tooling — UI, form, spreadsheet-import, Cypher scripts? — isn't specified.
  • Recruiting post sub-message. The post closes with a Data Engineer hiring link; a small fraction of the surface area is recruiting. Still passes scope (the architecture content is >75% of the body).
  • Low-ceremony knowledge-graph framing. The term "knowledge graph" is used loosely — Zalando's graph does not have RDF, SHACL, ontology governance, or an upper metamodel in the sense of Netflix UDA (systems/netflix-uda). It is a property graph in Neo4j. This is a legitimate, accessible usage, but the reader should not expect semantic-web tooling.

Source
