CONCEPT Cited by 2 sources
Data lineage¶
Definition¶
Data lineage is the graph of relationships between data assets that tracks "where did this data come from" and "where does this data flow to" — source → sink relationships — across systems. Lineage graphs are built via static code analysis, runtime logging, query parsing, and post-processing.
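As a minimal sketch of the definition above, a lineage graph can be modelled as a directed graph whose edges point from source to sink, with "where did this come from" and "where does this flow to" answered by traversing it in either direction. The `LineageGraph` class and the asset names are hypothetical, chosen only for illustration; real lineage systems store far richer metadata per edge.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of data assets; an edge src -> dst means data flows from src to dst."""

    def __init__(self):
        self._down = defaultdict(set)  # asset -> direct sinks
        self._up = defaultdict(set)    # asset -> direct sources

    def add_flow(self, source, sink):
        self._down[source].add(sink)
        self._up[sink].add(source)

    def _reach(self, start, adjacency):
        # Breadth-first traversal; the `seen` set keeps cycles from looping forever.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, asset):
        """Every asset this one was (transitively) derived from."""
        return self._reach(asset, self._up)

    def downstream(self, asset):
        """Every asset this one (transitively) flows into."""
        return self._reach(asset, self._down)


# Hypothetical assets for illustration.
graph = LineageGraph()
graph.add_flow("web_logs", "sessions_table")
graph.add_flow("sessions_table", "daily_report")
graph.add_flow("crm_export", "daily_report")

print(sorted(graph.upstream("daily_report")))   # where did this data come from?
print(sorted(graph.downstream("web_logs")))     # where does this data flow to?
```

Keeping both adjacency maps makes upstream and downstream queries symmetric, at the cost of storing each edge twice.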
Uses¶
- Data governance — tracking PII propagation across warehouses and services.
- Pipeline debugging — "why did this dashboard change?"
- Compliance discovery — finding all downstream consumers of a regulated data set.
- Policy rollout — in Meta's 2024-08-31 Privacy Aware Infrastructure (PAI) post, lineage is used as the discovery primitive inside Policy Zone Manager (PZM) Step 2: the requirement owner queries lineage to find all sinks downstream of an annotated source, then decides how to remediate each.
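The discovery query in the last use case can be sketched as a traversal that returns, for each downstream asset, the path by which data reaches it, so an owner can judge how to remediate each one. This is an illustrative assumption, not Meta's tooling: the `LINEAGE` edges, the asset names, and `downstream_paths` are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: asset -> direct downstream assets.
LINEAGE = {
    "user_profile_db": ["profile_etl"],
    "profile_etl": ["warehouse.users", "search_index"],
    "warehouse.users": ["ads_features", "support_dashboard"],
}


def downstream_paths(edges, annotated_source):
    """Return {asset: shortest path from annotated_source} for every reachable asset."""
    paths = {annotated_source: [annotated_source]}
    queue = deque([annotated_source])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in paths:  # first visit in BFS order is a shortest path
                paths[nxt] = paths[node] + [nxt]
                queue.append(nxt)
    del paths[annotated_source]  # the source itself is not a remediation item
    return paths


worklist = downstream_paths(LINEAGE, "user_profile_db")
for sink, path in sorted(worklist.items()):
    print(f"{sink}: {' -> '.join(path)}")
```

Every entry in `worklist` is an asset somebody still has to look at, which is the discovery half of the problem only.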
Lineage as enforcement: insufficient at scale¶
The 2024-08-31 Meta post is explicit that lineage alone is not a sufficient enforcement primitive:
"The combination of point checking and data lineage, while viable at a small scale, leads to significant operational overhead as point checking still requires auditing many individual assets."
Lineage gives you the graph, but enforcing a purpose-limitation requirement still means auditing each asset's point-check code to confirm it respects the propagated permission. Meta's conclusion: lineage is necessary for discovery, but information flow control (IFC), in the form of Policy Zones, is needed for enforcement. PZM retains lineage inside the tool but delegates enforcement to Policy Zones.
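To make the contrast concrete, here is a toy sketch of the general IFC idea: a label travels with the data and is checked at every flow boundary, so no per-asset audit is needed once the source is annotated. It is not a description of how Policy Zones is implemented; `Asset`, `flow`, `FlowViolation`, and the purpose names are hypothetical.

```python
class FlowViolation(Exception):
    """Raised when data would flow to a sink whose purpose the label does not permit."""


class Asset:
    def __init__(self, name, purpose, allowed_purposes=None):
        self.name = name
        self.purpose = purpose  # what this asset is used for
        # None means unannotated: no restriction is carried yet.
        self.allowed_purposes = allowed_purposes


def flow(source, sink):
    """Move data from source to sink, checking the label at the flow boundary."""
    label = source.allowed_purposes
    if label is None:
        return  # nothing to enforce for unannotated data
    if sink.purpose not in label:
        raise FlowViolation(f"{source.name} -> {sink.name}: purpose {sink.purpose!r} not permitted")
    # Propagate the restriction so it keeps applying further downstream.
    if sink.allowed_purposes is None:
        sink.allowed_purposes = label
    else:
        sink.allowed_purposes = sink.allowed_purposes & label


# Hypothetical assets: only the original source is annotated by hand.
messages = Asset("messages", "messaging", frozenset({"messaging", "integrity"}))
spam_model = Asset("spam_model", "integrity")
ads_ranking = Asset("ads_ranking", "ads")

flow(messages, spam_model)  # permitted, and spam_model inherits the label

try:
    flow(spam_model, ads_ranking)
    blocked = False
except FlowViolation as err:
    blocked = True
    print(err)
```

The second flow is rejected even though nobody annotated or audited `spam_model` directly, which is the property the lineage-plus-point-checking combination lacks.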
Discovery techniques named at Meta¶
- Static code analysis — e.g. Meta's Zoncolan (Hack static analyser; cited in the PAI post).
- Logging and post-query processing — runtime trace reconstruction of data flows.
- Implicitly: SQL query parsing for batch pipelines (Presto / Spark lineage).
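As a rough illustration of the query-parsing technique, the sketch below pulls source → sink edges out of a single `INSERT INTO ... SELECT` statement. It is a deliberately naive regex approach, assumed here for brevity: it ignores subqueries, CTEs, quoted identifiers, and comments, all of which a real SQL parser would need to handle. The table names and `lineage_edges` are hypothetical.

```python
import re

# Naive patterns: the write target, and every table named after FROM or JOIN.
_TARGET = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
_SOURCES = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def lineage_edges(sql):
    """Return a set of (source_table, sink_table) edges for one INSERT ... SELECT."""
    target = _TARGET.search(sql)
    if target is None:
        return set()  # statement writes nothing, so it contributes no edges
    sink = target.group(1)
    return {(src, sink) for src in _SOURCES.findall(sql)}


query = """
INSERT INTO analytics.daily_orders
SELECT o.order_id, c.region
FROM raw.orders o
JOIN raw.customers c ON o.customer_id = c.id
"""

edges = lineage_edges(query)
for src, dst in sorted(edges):
    print(f"{src} -> {dst}")
```

Running this over a query log and unioning the results yields table-level lineage; column-level lineage needs a full parse of the select list.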
Seen in¶
- sources/2024-08-31-meta-enforces-purpose-limitation-via-privacy-aware-infrastructure — framed as the discovery primitive inside PZM but explicitly rejected as a sufficient enforcement primitive at Meta scale.
- sources/2025-10-28-redpanda-governed-autonomy-the-path-to-enterprise-agentic-ai — the 2025-10-28 ADP announcement conflates lineage with the audit trail as a "unified audit and lineage envelope" at the agent-interaction altitude. The lineage axis answers "what data flowed into this agent decision?" (retrieved context + tool-call outputs) while the audit axis answers "who did what, when, with what result?"; the ADP positions the streaming-log substrate as the joint source for both view-shapes. See patterns/durable-event-log-as-agent-audit-envelope. The post does not unpack the lineage mechanism: "complete lineage" is a property claim, not a traced graph.
- — lineage as a side-effect capability of knowledge-graph-based MDM data-model-definition. Because Zalando records every source column → Concept / Attribute / Relationship mapping as graph edges, lineage from golden-record field back to every contributing source column across every source system falls out for free: "this enables us to keep a record of data lineage from each system to the golden record." Zalando's framing sits at a complementary altitude to the Meta / Redpanda instances above — lineage as a design-time byproduct of the data-modeling substrate, not a runtime-tracing or governance-primitive story.
- sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph — Netflix MDS as a lineage system built **from change events + source-of-truth hydration** rather than static analysis or query parsing. Six ML source systems emit thin notification-of-change events over Kafka + SNS / SQS; MDS hydrates each entity's full state from the source API and walks foreign-key references. Async enrichment jobs derive multi-hop transitive lineage edges (e.g. Model Instance → Pipeline Run → Dataset collapsed into a direct Model Instance ↔ Dataset materialized edge). Distinct from Meta's Zoncolan static-analysis approach (lineage from code) and from CDC-trace approaches (lineage from data plane); this is **lineage from app-emitted change events with API callback hydration**.
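The event-plus-hydration approach in the Netflix entry can be sketched roughly as follows. This is an assumption-laden toy, not Netflix's MDS: `SOURCE_API` stands in for the source systems' APIs, and the entity ids, the `FOREIGN_KEYS` tuple, and both functions are hypothetical.

```python
# Hypothetical source-of-truth API: entity id -> full state, including foreign keys.
SOURCE_API = {
    "model_instance:42": {"type": "model_instance", "pipeline_run": "pipeline_run:7"},
    "pipeline_run:7": {"type": "pipeline_run", "dataset": "dataset:orders_v3"},
    "dataset:orders_v3": {"type": "dataset"},
}
FOREIGN_KEYS = ("pipeline_run", "dataset")


def hydrate(entity_id, edges, seen=None):
    """Fetch full state for entity_id and follow its foreign keys, recording direct edges."""
    seen = set() if seen is None else seen
    if entity_id in seen:
        return
    seen.add(entity_id)
    state = SOURCE_API[entity_id]  # stands in for a call back to the source API
    for key in FOREIGN_KEYS:
        ref = state.get(key)
        if ref is not None:
            edges.add((entity_id, ref))
            hydrate(ref, edges, seen)


def materialize_model_dataset_edges(edges):
    """Collapse model_instance -> pipeline_run -> dataset into direct two-hop edges."""
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, set()).add(dst)
    derived = set()
    for src, mids in by_source.items():
        if not src.startswith("model_instance:"):
            continue
        for mid in mids:
            for dst in by_source.get(mid, ()):
                if dst.startswith("dataset:"):
                    derived.add((src, dst))
    return derived


# A thin notification says only which entity changed, not what changed.
event = {"entity_id": "model_instance:42"}

direct_edges = set()
hydrate(event["entity_id"], direct_edges)
derived_edges = materialize_model_dataset_edges(direct_edges)
print(sorted(direct_edges))
print(sorted(derived_edges))
```

The derived edge is stored here as a single tuple; a store that needs it navigable in both directions would index it from either end.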
Related¶
- concepts/information-flow-control — the enforcement successor.
- concepts/point-checking-controls — the approach lineage was meant to augment.
- concepts/purpose-limitation — the requirement class Meta was trying to enforce via lineage before adopting IFC.
- concepts/data-annotation — the IFC primitive that replaces the point-check-plus-lineage combination.
- systems/meta-policy-zones — the IFC system.
- systems/meta-policy-zone-manager — lineage's UX home at Meta.
- companies/meta
- concepts/knowledge-graph — substrate that makes lineage a design-time byproduct in Zalando MDM.
- concepts/master-data-management — the problem domain in the Zalando instance.
- systems/zalando-mdm-system — Zalando MDM canonical wiki instance.
- patterns/knowledge-graph-for-mdm-modeling — the pattern that delivers lineage as a side effect.