SYSTEM Cited by 1 source

Zalando MDM system

What it is

Zalando's in-house Master Data Management component — in early-phase design as of mid-2021 (sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition). Implements consolidated-style MDM: ingest from source systems → match-and-merge → cleanse → store centrally per a canonical data model → publish the consolidated golden record back to source systems for correction.

"At Zalando we are at an early phase of realising MDM for our internal data assets and we have chosen to do it in a consolidated style." — Zalando 2021-07-28 (Source: sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition)

The post describes the data-model-definition phase of this system — specifically, how the logical data model and transformation data model are derived from a knowledge graph. Match-and-merge, data-quality, and publish-back subsystems are acknowledged but not described.

Knowledge-graph-based modeling

The defining architectural choice: the data model is not authored directly. Instead, Zalando authors a knowledge graph in Neo4j with the following node types:

  • System — name of one source system.
  • Table — a table from a particular system.
  • Column — one column with schema type info.
  • Concept — a business concept (e.g. Address, Business Partner).
  • Attribute — a property of a concept (e.g. street name).
  • Relationship — a typed directed edge between two concepts (e.g. Business Partner "has contact" Address).

Domain experts map each source-system Column to a Concept, optionally to an Attribute of that Concept, and optionally to one end of a Relationship (source or target). Mappings are either direct (1-to-1) or indirect (1-to-many, requiring a transformation algorithm).
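As a rough illustration of what one such mapping carries, the sketch below models it with plain Python dataclasses. It is an in-memory stand-in, not Zalando's Neo4j schema: the class names, field names, and the table name "address" are all hypothetical, and the assignment of address_line_1 to the street attribute is assumed for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """A source column, identified by system, table, and column name."""
    system: str
    table: str   # hypothetical table names are used in the example below
    name: str


@dataclass(frozen=True)
class Mapping:
    """One expert-authored mapping from a source column to the business model."""
    column: Column
    concept: str                            # always required
    attribute: Optional[str] = None         # optional
    relationship_end: Optional[str] = None  # optional: "source" or "target"
    kind: str = "direct"                    # "direct" or "indirect"

    def __post_init__(self) -> None:
        if self.kind not in ("direct", "indirect"):
            raise ValueError(f"unknown mapping kind: {self.kind!r}")
        if self.relationship_end not in (None, "source", "target"):
            raise ValueError(f"unknown relationship end: {self.relationship_end!r}")


# Assumed example: System B's street maps directly, while System A's
# free-form address line needs a transformation (indirect mapping).
mappings = [
    Mapping(Column("System B", "address", "street"), "Address", "street"),
    Mapping(Column("System A", "address", "address_line_1"), "Address", "street",
            kind="indirect"),
]

# Columns that will need a transformation algorithm.
indirect = [m.column.name for m in mappings if m.kind == "indirect"]
```

Validating the kind and the relationship end at construction time keeps malformed mappings out of any later generation step.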

A Python script then walks the graph and generates:

  1. The logical data model of the golden record: one table per Concept (columns = its Attributes + an internal ID) and one join table per Relationship (foreign keys to the source and target Concept IDs).
  2. The transformation data model: per source system, how each column maps (directly or indirectly) to the logical model, emitting either a 1:1 copy or a transformation-function call.
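The two generated artifacts can be sketched as follows. This is a minimal illustration under stated assumptions, not the script described in the post: it reads plain Python structures instead of querying Neo4j, and the table-naming scheme, the source_/target_ column prefixes, and the transform_<concept>_<attribute> function names are all hypothetical.

```python
def to_snake(name: str) -> str:
    """Turn a business name such as 'Business Partner' into 'business_partner'."""
    return name.lower().replace(" ", "_")


def logical_model(concepts, relationships):
    """One table per Concept, one join table per Relationship."""
    tables = {}
    for concept, attributes in concepts.items():
        tables[to_snake(concept)] = ["id"] + [to_snake(a) for a in attributes]
    for source, name, target in relationships:
        join_table = f"{to_snake(source)}_{to_snake(name)}_{to_snake(target)}"
        # Prefixes keep the two keys distinct even for self-relationships.
        tables[join_table] = [
            f"source_{to_snake(source)}_id",
            f"target_{to_snake(target)}_id",
        ]
    return tables


def transformation_model(mappings):
    """Per source system: (target field, expression) pairs."""
    rules = {}
    for system, column, concept, attribute, kind in mappings:
        target = f"{to_snake(concept)}.{to_snake(attribute)}"
        if kind == "direct":
            expression = column  # 1:1 copy
        else:
            # Hypothetical naming for the transformation-function call.
            expression = f"transform_{to_snake(concept)}_{to_snake(attribute)}({column})"
        rules.setdefault(system, []).append((target, expression))
    return rules


# Concepts and the relationship come from the examples in the text;
# which source column feeds which attribute is assumed here.
concepts = {
    "Address": ["street", "zip code", "city", "country code"],
    "Business Partner": [],
}
relationships = [("Business Partner", "has contact", "Address")]
mappings = [
    ("System B", "street", "Address", "street", "direct"),
    ("System A", "address_line_1", "Address", "street", "indirect"),
]

tables = logical_model(concepts, relationships)
rules = transformation_model(mappings)
```

Because both functions read the same inputs, adding a mapping or an attribute in one place changes the golden-record schema and the per-system rules together.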

Architectural properties

  • Mapping is source-of-truth. The column→concept mapping graph is authoritative; both the golden-record schema and the per-system transformation schema are derivatives (patterns/mapping-driven-schema-generation).
  • Business-engineering alignment as first-class concern. The Neo4j-rendered graph doubles as the communication artifact between domain experts and engineers, replacing SQL DDL / spreadsheets (patterns/visual-graph-for-business-engineering-alignment).
  • Data lineage as side effect. Every golden-record field is traceable back to every contributing source column across every source system — no separate lineage tool needed (concepts/data-lineage).
  • Design-time graph, not runtime graph. The graph is queried by the Python generator at design time, not at MDM ingest / consolidation time. Runtime architecture is out of scope in the post.
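The lineage property falls out of inverting the mapping: grouping source columns by the golden-record field they feed. The sketch below shows the idea on an in-memory list rather than a Neo4j query. The function name, the tuple layout, the table name "address", and the choice of which columns map to which attribute are assumptions made for illustration.

```python
from collections import defaultdict


def lineage(mappings):
    """Index source columns by the (concept, attribute) field they contribute to."""
    index = defaultdict(list)
    for system, table, column, concept, attribute in mappings:
        index[(concept, attribute)].append(f"{system}.{table}.{column}")
    return dict(index)


# Assumed mappings, loosely based on the two-system worked example.
mappings = [
    ("System A", "address", "address_line_1", "Address", "street"),
    ("System B", "address", "street", "Address", "street"),
    ("System B", "address", "zip_code", "Address", "zip code"),
]

index = lineage(mappings)
street_sources = index[("Address", "street")]
```

Here street_sources lists every source column behind the golden record's street field, across both systems, without consulting any artifact other than the mapping itself.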

Operational disclosure

  • Scale: "tens of tables and hundreds of columns" — large enough that manual diagram maintenance is intractable, small enough for a single Python generator.
  • Source systems in worked example: 2 (System A with 3 address_line_N columns; System B with street / zip_code / city / country_code).
  • Match-and-merge subsystem: named but not described.
  • Data-quality layer: named but not described.
  • Publish-back mechanism: named but not described.
