CONCEPT Cited by 1 source

Data lineage

Definition

Data lineage is the graph of relationships between data assets that tracks "where did this data come from" and "where does this data flow to" — source → sink relationships — across systems. Lineage graphs are built via static code analysis, runtime logging, query parsing, and post-processing.
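
A minimal sketch of such a graph, assuming assets are identified by plain string names (the asset names and the `LineageGraph` class below are hypothetical, not taken from any Meta tool). Each edge records one source → sink flow, and the two traversals answer the two questions in the definition.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of data assets; an edge means data flows source -> sink."""

    def __init__(self):
        self._down = defaultdict(set)  # asset -> its direct sinks
        self._up = defaultdict(set)    # asset -> its direct sources

    def add_flow(self, source, sink):
        self._down[source].add(sink)
        self._up[sink].add(source)

    def _reach(self, start, neighbours):
        # Breadth-first traversal; `seen` also protects against cycles.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt not in seen and nxt != start:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def upstream(self, asset):
        """Where did this data come from?"""
        return self._reach(asset, self._up)

    def downstream(self, asset):
        """Where does this data flow to?"""
        return self._reach(asset, self._down)


g = LineageGraph()
g.add_flow("events_raw", "events_clean")
g.add_flow("events_clean", "daily_metrics")
g.add_flow("users", "daily_metrics")
g.add_flow("daily_metrics", "dashboard")

print(sorted(g.upstream("dashboard")))
print(sorted(g.downstream("events_raw")))
```

Keeping both a forward and a reverse adjacency map costs extra memory but makes upstream and downstream queries equally cheap.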

Uses

  • Data governance: tracking PII propagation across warehouses and services.
  • Pipeline debugging: answering "why did this dashboard change?"
  • Compliance discovery: finding all downstream consumers of a regulated data set.
  • Policy rollout: in Meta's 2024-08-31 Privacy Aware Infrastructure (PAI) post, lineage is the discovery primitive inside Step 2 of Policy Zone Manager (PZM). The requirement owner queries lineage to find all sinks downstream of an annotated source, then decides how to remediate each.
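
The discovery query in the last item can be sketched as a breadth-first search that returns, for every downstream sink, one flow path explaining how the data gets there. This is an illustration under assumed inputs, not PZM's actual interface: the edge list, the asset names, and `downstream_paths` are all hypothetical.

```python
from collections import deque


def downstream_paths(edges, source):
    """Map each asset reachable from `source` to one shortest flow path to it."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    paths = {source: [source]}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in children.get(node, []):
            if nxt not in paths:  # first visit is the shortest path in BFS
                paths[nxt] = paths[node] + [nxt]
                queue.append(nxt)

    del paths[source]  # the source itself is not a downstream sink
    return paths


# Hypothetical lineage edges (source, sink).
edges = [
    ("user_profile.email", "signup_log"),
    ("signup_log", "marketing_export"),
    ("signup_log", "fraud_features"),
    ("orders", "revenue_report"),
]

worklist = downstream_paths(edges, "user_profile.email")
for sink, path in sorted(worklist.items()):
    print(sink, "<-", " -> ".join(path))
```

Returning the path rather than only the sink name gives the reviewer the context needed to decide how each sink should be remediated.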

Lineage as enforcement: insufficient at scale

The 2024-08-31 Meta post is explicit that lineage alone is not a sufficient enforcement primitive:

"The combination of point checking and data lineage, while viable at a small scale, leads to significant operational overhead as point checking still requires auditing many individual assets."

Lineage gives you the graph, but to enforce a purpose-limitation requirement you still need to audit each asset's point-check code to ensure it respects the propagated permission. Meta's conclusion: lineage is necessary for discovery, but information flow control (IFC), which Meta implements as Policy Zones, is needed for enforcement. PZM keeps lineage as its discovery step but delegates enforcement to Policy Zones.
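
To make the contrast concrete, here is a toy sketch of the IFC idea: labels travel with the data and a single check runs at the sink boundary, so no per-asset check code has to be audited. It is not Meta's Policy Zones API; `Labeled`, `combine`, `write`, and the label and sink names are all hypothetical.

```python
class FlowViolation(Exception):
    """Raised when data carries a label the sink is not allowed to receive."""


class Labeled:
    """A value together with the set of policy labels it has accumulated."""

    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)


def combine(fn, *inputs):
    """Derive a new value; the result carries the union of all input labels."""
    labels = frozenset().union(*(item.labels for item in inputs))
    return Labeled(fn(*(item.value for item in inputs)), labels)


def write(sink_name, allowed_labels, data):
    """Single enforcement point: reject labels the sink does not permit."""
    extra = data.labels - frozenset(allowed_labels)
    if extra:
        raise FlowViolation(f"{sink_name} may not receive {sorted(extra)}")
    return (sink_name, data.value)


email = Labeled("a@example.com", {"CONTACT_INFO"})
country = Labeled("NZ")  # unlabeled data
record = combine(lambda e, c: {"email": e, "country": c}, email, country)

write("account_service", {"CONTACT_INFO"}, record)  # permitted flow

try:
    write("ads_training_table", set(), record)  # sink permits no labels
except FlowViolation as err:
    print(err)
```

Under point checking, each of the sinks found via lineage would need its own check and its own audit; in this sketch the derived `record` inherits `CONTACT_INFO` automatically and the rule is enforced in one place.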

Discovery techniques named at Meta

  • Static code analysis — e.g. Meta's Zoncolan (Hack static analyser; cited in the PAI post).
  • Logging and post-query processing — runtime trace reconstruction of data flows.
  • Implicitly: SQL query parsing for batch pipelines (Presto / Spark lineage).
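
As a rough illustration of the last two techniques, the sketch below derives lineage edges from a runtime access log and from a single SQL statement. Both functions, the log record shape `(job_id, op, asset)`, and the table names are assumptions made for this example. The regex approach is deliberately naive; a real system would use a proper SQL parser, since this one ignores CTEs, subquery aliases, comments, and expressions such as `EXTRACT(... FROM ...)`.

```python
import re
from collections import defaultdict


def edges_from_access_log(events):
    """Treat every asset a job read as a source of every asset that job wrote.

    `events` is an iterable of (job_id, op, asset) with op 'read' or 'write'.
    """
    reads, writes = defaultdict(set), defaultdict(set)
    for job_id, op, asset in events:
        if op == "read":
            reads[job_id].add(asset)
        elif op == "write":
            writes[job_id].add(asset)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return {
        (src, dst)
        for job_id, sinks in writes.items()
        for src in reads.get(job_id, ())
        for dst in sinks
    }


_INSERT = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
_SOURCES = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def edges_from_sql(statement):
    """Naive lineage for one `INSERT INTO t SELECT ... FROM a JOIN b` statement."""
    target = _INSERT.search(statement)
    if target is None:
        return set()
    return {(src, target.group(1)) for src in _SOURCES.findall(statement)}


log = [
    ("job1", "read", "events_raw"),
    ("job1", "read", "users"),
    ("job1", "write", "daily_metrics"),
    ("job2", "read", "daily_metrics"),
    ("job2", "write", "dashboard"),
]
print(sorted(edges_from_access_log(log)))

sql = """
INSERT INTO warehouse.daily_metrics
SELECT e.day, COUNT(*) FROM warehouse.events e
JOIN warehouse.users u ON e.user_id = u.id
GROUP BY e.day
"""
print(sorted(edges_from_sql(sql)))
```

Log-based edges over-approximate: a job that reads two tables and writes two tables yields four edges even if each output depends on only one input. Query parsing can be more precise, but only covers flows that are expressed in SQL.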

Seen in
