PATTERN
Mapping-driven schema generation¶
The pattern¶
Make the mapping from source schemas to a conceptual layer the authoritative artifact, and derive both (1) the target schema and (2) the transformation code from it. Do not author the target schema directly.
This inverts the traditional workflow — where the target schema is the authoritative deliverable and mappings accrete as textual specifications — because the mapping is the only artifact that natively captures both what to store and how to populate it from sources.
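The inversion can be sketched in a few lines. This is a minimal, hypothetical illustration (the mapping format, entity/column names, and helper functions are invented, not taken from any cited implementation): the mapping is the only authored artifact, and both the target DDL and the per-source transformation SQL are derived from it.

```python
# Hypothetical mapping: each target field lists the source columns that
# populate it, keyed by source system. This dict is the authored artifact.
MAPPING = {
    "customer": {  # target entity
        "email": {"crm": "contacts.email_addr", "shop": "users.mail"},
        "name":  {"crm": "contacts.full_name",  "shop": "users.display_name"},
    },
}

def emit_target_ddl(mapping):
    """Derive deliverable (1): the target schema, one table per entity."""
    ddl = []
    for entity, fields in mapping.items():
        cols = ",\n  ".join(f"{field} TEXT" for field in fields)
        ddl.append(f"CREATE TABLE {entity} (\n  {cols}\n);")
    return "\n".join(ddl)

def emit_transform_sql(mapping, source):
    """Derive deliverable (2): per-source transformation code."""
    stmts = []
    for entity, fields in mapping.items():
        pairs = [(f, spec[source]) for f, spec in fields.items() if source in spec]
        if not pairs:
            continue
        targets = ", ".join(f for f, _ in pairs)
        # Simplifying assumption: one entity draws from one table per source.
        table = {col.split(".")[0] for _, col in pairs}.pop()
        selects = ", ".join(col.split(".")[1] for _, col in pairs)
        stmts.append(f"INSERT INTO {entity} ({targets}) SELECT {selects} FROM {table};")
    return "\n".join(stmts)

print(emit_target_ddl(MAPPING))
print(emit_transform_sql(MAPPING, "crm"))
```

Neither the DDL nor the SQL is ever edited by hand; changing the mapping and regenerating is the only supported workflow.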
When to use it¶
- You have multiple source systems with heterogeneous schemas that must be consolidated into one target schema.
- Domain experts (not engineers) are the source of truth for the conceptual model.
- Both the target schema and the per-source transformation code are required deliverables (canonical examples: MDM, data-warehouse ETL, enterprise data-integration platforms).
- The target schema is expected to evolve; coupling it directly to source-system schemas would make refactoring painful.
Implementations¶
- patterns/knowledge-graph-for-mdm-modeling (Zalando MDM, 2021) — mappings stored as Neo4j graph edges, Python generator emits SQL DDL for the golden record's logical data model and per-source-system transformation data model (sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition).
- patterns/schema-transpilation-from-domain-model (Netflix UDA) — domain models authored in Upper; transpilers emit GraphQL / Avro / SQL / RDF / Java schemas. Mappings to data containers are first-class graph edges; pipelines are auto-provisioned.
- Data-build tools (external — dbt, Coalesce) — SQL-based realisations where the mapping between source and target tables is the authored artifact and target schemas are derived from model definitions.
Why it works¶
- Single source of truth. The mapping captures target schema and data provenance simultaneously.
- Composable over source changes. When a source system gains a new column, the mapping gets one new entry; the target schema and transformation code regenerate without human intervention.
- Data lineage is free. Every target field is traceable to every contributing source column.
- Reduces drift. Target-schema-first workflows tend to develop inconsistencies between the schema, the transformation code, and the documentation; mapping-driven generation eliminates that entire category of inconsistency, because all three are projections of one artifact.
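The "lineage is free" point follows directly from the mapping's shape: tracing a target field back to its contributing source columns is a lookup, not an analysis. A self-contained sketch, with a hypothetical mapping structure:

```python
# Hypothetical mapping structure: target field -> {source system: source column}.
MAPPING = {
    "customer": {
        "email": {"crm": "contacts.email_addr", "shop": "users.mail"},
    },
}

def lineage(mapping, entity, field):
    """Answer 'which source columns feed this target field?' by inverting
    the mapping entry — no parsing of transformation code required."""
    return sorted(mapping[entity][field].items())

print(lineage(MAPPING, "customer", "email"))
# → [('crm', 'contacts.email_addr'), ('shop', 'users.mail')]
```

In a target-schema-first workflow, the same question requires reverse-engineering hand-written transformation SQL.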
Trade-offs¶
- Upfront investment in the mapping language / vocabulary. The conceptual layer must be stable and well-understood before mappings can be authored.
- Generator quality is load-bearing. Bugs in the generator propagate everywhere. Test coverage of the generator matters more than test coverage of hand-written schemas.
- Doesn't solve entity resolution. The pattern handles structural mapping; instance-level identity (match-and-merge in MDM, de-duplication in ETL) is orthogonal.
- Target-schema optimisations are hard. Hand-written schemas can carry ad-hoc indexes, denormalisations, and storage tweaks. Generator-emitted schemas default to a mirror of the conceptual layer; optimisations require generator-side annotations.
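One way the last trade-off is typically mitigated is to move optimisations into the mapping as annotations, so they survive regeneration instead of living in hand-edited DDL. A hypothetical sketch (the `_annotations` key and index emission are invented for illustration):

```python
# Hypothetical annotated mapping: storage tweaks are declared alongside
# the fields, so the generator emits them and regeneration preserves them.
MAPPING = {
    "customer": {
        "_annotations": {"indexes": ["email"]},
        "fields": {"email": {"crm": "contacts.email_addr"}},
    },
}

def emit_ddl(mapping):
    out = []
    for entity, spec in mapping.items():
        cols = ",\n  ".join(f"{field} TEXT" for field in spec["fields"])
        out.append(f"CREATE TABLE {entity} (\n  {cols}\n);")
        # Emit annotation-driven optimisations after the table itself.
        for col in spec.get("_annotations", {}).get("indexes", []):
            out.append(f"CREATE INDEX idx_{entity}_{col} ON {entity} ({col});")
    return "\n".join(out)

print(emit_ddl(MAPPING))
```

The cost is that every class of optimisation must first be taught to the generator; truly ad-hoc tweaks have no home.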
Seen in¶
- sources/2021-07-28-zalando-knowledge-graph-technologies-accelerate-and-improve-the-data-model-definition — Zalando's MDM team uses this pattern to derive both the logical data model and the transformation data model from a knowledge graph of column-to-concept mappings.
Related¶
- concepts/logical-data-model · concepts/transformation-data-model — the two deliverables this pattern generates in the MDM instance
- concepts/knowledge-graph — a common substrate for storing the mappings
- concepts/semantic-layer-of-business-concepts — the intermediate concept layer the mappings target
- concepts/master-data-management — the canonical problem domain
- systems/zalando-mdm-system — canonical wiki instance
- patterns/knowledge-graph-for-mdm-modeling — the specific MDM realisation
- patterns/schema-transpilation-from-domain-model — the enterprise-scale realisation at Netflix UDA