CONCEPT Cited by 2 sources

Data lineage

Definition

Data lineage is the graph of relationships between data assets that tracks "where did this data come from" and "where does this data flow to" — source → sink relationships — across systems. Lineage graphs are built via static code analysis, runtime logging, query parsing, and post-processing.

Uses

  • Data governance — tracking PII propagation across warehouses and services.
  • Pipeline debugging — "why did this dashboard change?"
  • Compliance discovery — finding all downstream consumers of a regulated data set.
  • Policy rollout — in Meta's 2024-08-31 PAI post, lineage is used as the discovery primitive inside PZM Step 2 — the requirement owner queries lineage to find all sinks downstream of an annotated source, then decides how to remediate each.
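
The downstream-discovery use above can be sketched as a breadth-first walk over lineage edges. This is a minimal illustration only: the asset names, the `EDGES` list, and the `downstream_assets` helper are hypothetical and not taken from the Meta post.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges as (source_asset, sink_asset) pairs.
EDGES = [
    ("users_raw", "users_clean"),
    ("users_clean", "marketing_export"),
    ("users_clean", "weekly_dashboard"),
    ("orders_raw", "weekly_dashboard"),
]


def downstream_assets(edges, source):
    """Return every asset reachable from `source`, in breadth-first order."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen, order = {source}, []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Everything that may carry data derived from the annotated source.
print(downstream_assets(EDGES, "users_raw"))
```

Reversing each edge before the walk answers the upstream question ("where did this data come from?") with the same traversal.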

Lineage as enforcement: insufficient at scale

The 2024-08-31 Meta post is explicit that lineage alone is not a sufficient enforcement primitive:

"The combination of point checking and data lineage, while viable at a small scale, leads to significant operational overhead as point checking still requires auditing many individual assets."

Lineage gives you the graph, but to enforce a purpose-limitation requirement you still need to audit each asset's point-check code to ensure it respects the propagated permission. Meta's conclusion: lineage is necessary for discovery, but information flow control (IFC), in the form of Policy Zones, is needed for enforcement. PZM retains lineage inside the tool but delegates enforcement to Policy Zones.
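
To make the contrast concrete, here is a toy sketch of the information-flow-control idea: purpose labels travel with the data, and one check runs at every flow into a sink, so no per-asset audit of hand-written point checks is required. All names (`Labeled`, `combine`, `write_to_sink`, `FlowViolation`) are hypothetical; this is not Meta's Policy Zones API and makes no claim about how it is implemented.

```python
class FlowViolation(Exception):
    """Raised when data flows to a sink whose purpose it does not permit."""


class Labeled:
    """A value bundled with the set of purposes it is allowed to serve."""

    def __init__(self, value, allowed_purposes):
        self.value = value
        self.allowed_purposes = frozenset(allowed_purposes)


def combine(a, b, fn):
    """Derive a new value; the result is allowed only where both inputs are."""
    return Labeled(fn(a.value, b.value), a.allowed_purposes & b.allowed_purposes)


def write_to_sink(data, sink_purpose):
    """Single enforcement point, evaluated on every flow into a sink."""
    if sink_purpose not in data.allowed_purposes:
        raise FlowViolation(f"data not permitted for purpose {sink_purpose!r}")
    return data.value


email = Labeled("a@example.com", {"account_security"})
region = Labeled("EU", {"account_security", "analytics"})
joined = combine(email, region, lambda e, r: f"{e}|{r}")

write_to_sink(joined, "account_security")  # permitted by both inputs
try:
    write_to_sink(joined, "analytics")  # the email label forbids this purpose
except FlowViolation as err:
    print("blocked:", err)
```

The design point is that the restriction on `email` survives the join automatically, whereas a lineage-plus-point-check approach would need someone to find `joined` in the graph and verify its check by hand.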

Discovery techniques named at Meta

  • Static code analysis — e.g. Meta's Zoncolan (Hack static analyser; cited in the PAI post).
  • Logging and post-query processing — runtime trace reconstruction of data flows.
  • Implicitly: SQL query parsing for batch pipelines (Presto / Spark lineage).
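
As a rough illustration of the query-parsing technique, the sketch below pulls (source table, target table) edges out of a single INSERT ... SELECT statement. It is deliberately naive and rests on stated assumptions: the regular expressions, the `lineage_edges` helper, and the table names are hypothetical, and a production system would use a real SQL parser to handle CTEs, subqueries, and dialect differences.

```python
import re

# Naive patterns, for illustration only; they ignore CTEs, comments, and quoting.
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def lineage_edges(sql):
    """Return sorted (source_table, target_table) pairs for one statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []  # not a write, so it contributes no lineage edge
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return [(src, target.group(1)) for src in sources]


query = """
INSERT INTO analytics.daily_revenue
SELECT o.day, SUM(o.amount)
FROM sales.orders o
JOIN sales.refunds r ON r.order_id = o.id
GROUP BY o.day
"""
print(lineage_edges(query))
```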

Seen in

  • sources/2024-08-31-meta-enforces-purpose-limitation-via-privacy-aware-infrastructure — framed as the discovery primitive inside PZM but explicitly rejected as a sufficient enforcement primitive at Meta scale.
  • sources/2025-10-28-redpanda-governed-autonomy-the-path-to-enterprise-agentic-ai — the 2025-10-28 ADP announcement conflates lineage with the audit trail as a "unified audit and lineage envelope" at the agent-interaction altitude. The lineage axis answers "what data flowed into this agent decision?" (retrieved context + tool-call outputs) while the audit axis answers "who did what, when, with what result?"; the ADP positions the streaming-log substrate as the joint source for both view-shapes. See patterns/durable-event-log-as-agent-audit-envelope. The post does not unpack the lineage mechanism: "complete lineage" is a property claim, not a traced graph.
  • — lineage as a side-effect capability of knowledge-graph-based MDM data-model-definition. Because Zalando records every source column → Concept / Attribute / Relationship mapping as graph edges, lineage from golden-record field back to every contributing source column across every source system falls out for free: "this enables us to keep a record of data lineage from each system to the golden record." Zalando's framing sits at a complementary altitude to the Meta / Redpanda instances above — lineage as a design-time byproduct of the data-modeling substrate, not a runtime-tracing or governance-primitive story.
  • sources/2026-05-04-netflix-democratizing-machine-learning-building-the-model-lifecycle-graph — Netflix MDS as a lineage system built from change events + source-of-truth hydration rather than static analysis or query parsing. Six ML source systems emit thin notification-of-change events over Kafka + SNS / SQS; MDS hydrates each entity's full state from the source API and walks foreign-key references. Async enrichment jobs derive multi-hop transitive lineage edges (e.g. Model Instance → Pipeline Run → Dataset collapsed into a direct Model Instance ↔ Dataset materialized edge). Distinct from Meta's Zoncolan static-analysis approach (lineage from code) and from CDC-trace approaches (lineage from data plane); this is lineage from app-emitted change events with API callback hydration.
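
The change-event approach in the Netflix entry can be sketched as follows: a thin event names an entity, the handler hydrates its full state, walks foreign-key references to record edges, and a separate enrichment pass collapses a two-hop path into a direct edge. Everything here is an assumption made for illustration: the in-memory `SOURCE_OF_TRUTH` dict stands in for the source APIs, and the field names, entity IDs, and function names are hypothetical rather than taken from MDS.

```python
# Stand-in for the source systems' APIs (hypothetical data and field names).
SOURCE_OF_TRUTH = {
    ("model_instance", "mi-1"): {"pipeline_run_id": "run-7"},
    ("pipeline_run", "run-7"): {"dataset_id": "ds-3"},
    ("dataset", "ds-3"): {},
}

# Which fields are foreign keys, and which entity type each points to.
FOREIGN_KEYS = {"pipeline_run_id": "pipeline_run", "dataset_id": "dataset"}


def hydrate(entity_type, entity_id):
    """Fetch full entity state; a real system would call the source API here."""
    return SOURCE_OF_TRUTH[(entity_type, entity_id)]


def handle_change_event(event, edges):
    """Process a thin change event: hydrate the entity, then walk references."""
    pending = [(event["type"], event["id"])]
    visited = set()
    while pending:
        node = pending.pop()
        if node in visited:
            continue
        visited.add(node)
        state = hydrate(*node)
        for field, target_type in FOREIGN_KEYS.items():
            if field in state:
                target = (target_type, state[field])
                edges.add((node, target))
                pending.append(target)


def derive_model_dataset_edges(edges):
    """Enrichment pass: collapse model_instance -> pipeline_run -> dataset."""
    derived = set()
    for a, b in edges:
        for c, d in edges:
            if b == c and a[0] == "model_instance" and d[0] == "dataset":
                derived.add((a, d))
    return derived


edges = set()
handle_change_event({"type": "model_instance", "id": "mi-1"}, edges)
print(derive_model_dataset_edges(edges))
```

Note that the event carries only a type and an ID; the lineage edges come entirely from the hydrated state, which is what separates this from approaches that infer lineage from code or from the data plane.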