CONCEPT Cited by 1 source
Granular usage attribution¶
Definition¶
Granular usage attribution is the observability primitive of tracking who reads which data at a granularity finer than the table or bucket level, typically per table, per partition, per entity, and per time window.
The granularity matters: per-bucket counts answer "is this bucket hot or cold"; per-table counts answer "is this table active"; per-partition + per-entity counts answer "who is reading the cold partitions of this active table, and what shape is their access pattern".
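To make the difference between grains concrete, here is a minimal sketch that rolls the same read events up at bucket, table, and partition-plus-entity level. The event tuples, table names, and role names are invented for illustration and do not come from the Yelp post.

```python
from collections import Counter

# Hypothetical read events: (bucket, table, partition, entity).
events = [
    ("lake", "db.orders", "dt=2026-05-15", "role/dashboard"),
    ("lake", "db.orders", "dt=2026-05-15", "role/dashboard"),
    ("lake", "db.orders", "dt=2024-01-01", "role/backfill"),
    ("lake", "db.users", "dt=2026-05-15", "role/dashboard"),
]


def rollup(events, key):
    """Count reads after projecting each event onto the chosen grain."""
    return Counter(key(e) for e in events)


# Coarse grains: "is this bucket hot?" and "is this table active?"
per_bucket = rollup(events, lambda e: e[0])
per_table = rollup(events, lambda e: e[1])

# Fine grain: who reads which partition of which table.
per_partition_entity = rollup(events, lambda e: (e[1], e[2], e[3]))
```

Only the last roll-up reveals that the single read of the old `dt=2024-01-01` partition came from a different entity than the reads of the recent partition.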
The Yelp argument: granularity is the gate¶
"In large analytics environments, data teams often struggle to answer deceptively simple questions, like who their stakeholders are and how their data is being used. ... Without data-driven usage attribution, teams rely on stakeholder conversations, documentation, and word of mouth — sources that quickly become out of date. This makes it difficult for data owners to effectively steward their data and limits the support that platform teams can autonomously offer. Granular usage attribution solves this problem by enabling clear insight into how data is consumed and unlocks opportunities for significant cost efficiencies."
The post argues structurally that observability of consumption is a precondition for storage cost optimisation. Without granular attribution, every cost-cutting decision (delete this partition? move this table to cold storage? migrate that table to Iceberg first?) carries an unbounded "we might break a stakeholder we don't know about" risk premium that defeats the cost-cutting business case.
(Source: sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33)
What "granular" means in practice¶
Yelp's (table, partition, iam_role, event_time) four-tuple is the canonical wiki shape:
- Table — the catalog identity (database + table name) the data belongs to.
- Partition — the partition value (e.g. dt=2026-05-15) the read accessed. Crucial for concepts/partition-access-pattern signature recognition.
- Entity — the identity that performed the read. AWS-native: the IAM role from the SAL requester field. See patterns/iam-role-attribution-from-s3-access-logs.
- Event time — when the read happened. Typically aggregated to daily granularity to match the SAL compaction cadence.
This four-tuple is the load-bearing aggregate that drives all five named downstream use cases (stakeholder discovery, storage-class decisioning, retention, Iceberg migration prioritisation, incident response).
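As a rough sketch of how the four-tuple might be derived and aggregated daily, the code below works on already-parsed access-log records. Several things here are assumptions, not details from the Yelp post: the `warehouse/<database>/<table>/<partition>/<file>` key layout, the record field names, the requester being an STS assumed-role ARN, and the timestamp arriving with its surrounding brackets already stripped.

```python
from collections import Counter
from datetime import datetime


def role_from_requester(requester):
    """Return the role name from an assumed-role ARN, else the raw value."""
    marker = ":assumed-role/"
    if marker in requester:
        return requester.split(marker, 1)[1].split("/", 1)[0]
    return requester


def table_and_partition(key, prefix="warehouse/"):
    """Split a key under the assumed layout into (db.table, partition)."""
    if not key.startswith(prefix):
        return None
    parts = key[len(prefix):].split("/")
    if len(parts) < 4 or "=" not in parts[2]:
        return None
    return f"{parts[0]}.{parts[1]}", parts[2]


def to_four_tuple(record):
    """Map one parsed record to (table, partition, iam_role, day)."""
    located = table_and_partition(record["key"])
    if located is None:
        return None  # not a partitioned table object under the assumed layout
    table, partition = located
    when = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    role = role_from_requester(record["requester"])
    return table, partition, role, when.date().isoformat()


def daily_counts(records):
    """Count reads per four-tuple, skipping records that do not map."""
    tuples = (to_four_tuple(r) for r in records)
    return Counter(t for t in tuples if t is not None)


# Hypothetical sample records.
ARN = "arn:aws:sts::123456789012:assumed-role/analytics-batch/session-1"
records = [
    {"requester": ARN, "time": "15/May/2026:08:30:00 +0000",
     "key": "warehouse/sales/orders/dt=2026-05-15/part-0.parquet"},
    {"requester": ARN, "time": "15/May/2026:09:00:00 +0000",
     "key": "warehouse/sales/orders/dt=2026-05-15/part-1.parquet"},
    {"requester": ARN, "time": "15/May/2026:09:05:00 +0000",
     "key": "logs/unrelated.txt"},
]
counts = daily_counts(records)
```

Truncating the event time to a date before counting is what produces the daily grain; a finer window would only change that one step.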
The unlock thesis: three pillars of platform-team value¶
Yelp frames granular usage attribution as unlocking three pillars of value for a data-platform team:
- Customer value — "a new feature to definitively track table consumers and usage patterns, giving data owners confident answers about who consumes their tables."
- Business value — "reducing S3 storage cost by 33%" via confidence to expand deletion-based retention and assign cost-effective storage classes.
- Platform efficiency — "focus our migration efforts on active tables and partitions" — Apache Iceberg adoption prioritised by actual consumption.
The structural insight: one observability investment amortises across many platform uses. The same per-partition per-entity aggregate serves stakeholder discovery, migration prioritisation, storage-class routing, retention policy, and incident response.
(Source: sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33)
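One way the business-value pillar could translate into a decision rule is sketched below: each partition gets a suggested action based on how long ago it was last read. The thresholds (90 and 365 days), the action labels, and the function names are illustrative assumptions, not the policy Yelp describes.

```python
from datetime import date


def days_since_last_read(last_read, today):
    """Idle days for a partition, or None if no read was ever observed."""
    return None if last_read is None else (today - last_read).days


def suggest_action(last_read, today, cold_after_days=90, delete_after_days=365):
    """Suggest a storage action from observed reads (hypothetical thresholds)."""
    idle = days_since_last_read(last_read, today)
    if idle is None or idle >= delete_after_days:
        return "review-for-deletion"
    if idle >= cold_after_days:
        return "move-to-colder-storage-class"
    return "keep"


today = date(2026, 5, 21)
recent = suggest_action(date(2026, 5, 15), today)
stale = suggest_action(date(2026, 1, 1), today)
ancient = suggest_action(date(2025, 1, 1), today)
never_read = suggest_action(None, today)
```

The rule is only as trustworthy as the attribution behind it: a "never read" verdict is meaningful only if the access data covers every read path for the whole lookback window.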
Distinct from data lineage¶
Granular usage attribution is about reads — who consumes data. concepts/data-lineage is about flows — where data comes from and where it goes. Both are graphs, both serve governance, but they answer different questions:
- Lineage: "which dashboards / pipelines depend on this table?" (graph traversal of declared dependencies).
- Usage attribution: "who actually accessed this table's partitions, in what shape, in the last week?" (aggregation of observed reads).
Lineage gives discovery without certainty (a declared dependency may be dead code); usage attribution gives certainty without discovery of the application logic (the SQL author may not be the ultimate business owner of the dashboard).
In practice, they complement each other — Yelp's post explicitly nominates lineage as a future investment alongside continued expansion of granular usage attribution.
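The complementarity can be illustrated with plain set operations over declared consumers (from lineage) and observed readers (from usage attribution). The consumer names are made up, and the sketch assumes both sources can be mapped onto one shared identifier namespace, which is often the hard part in practice.

```python
# Hypothetical consumers of one table, from two different sources.
declared = {"dashboard-revenue", "pipeline-daily-export", "pipeline-legacy-report"}
observed = {"dashboard-revenue", "pipeline-daily-export", "notebook-adhoc"}


def compare(declared, observed):
    """Split consumers by whether lineage and observed reads agree."""
    return {
        "confirmed": declared & observed,            # declared and actually reading
        "declared_but_unread": declared - observed,  # possibly dead dependencies
        "read_but_undeclared": observed - declared,  # consumers lineage missed
    }


result = compare(declared, observed)
```

Here `declared_but_unread` surfaces candidates for dead dependencies, while `read_but_undeclared` surfaces stakeholders that lineage alone would not have found.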
Substrate options¶
- AWS-native: concepts/s3-server-access-logs (Yelp's choice; cheapest at fleet scale).
- Pricier AWS alternative: systems/aws-cloudtrail Data Events ("$1 per million data events — orders of magnitude higher!" per the 2025-09-26 Yelp SAL post).
- Catalog-mediated: query-engine logs (Athena, Presto, Spark) — give SQL-level attribution but miss out-of-band reads (e.g. direct S3 reads from notebooks).
- Application-level instrumentation: per-system code paths emit consumption telemetry — high signal, high integration cost.
The Yelp argument for SAL: it is out-of-band (no application changes), comprehensive (all reads, including those that don't go through a query engine), and cheap at scale (the daily compaction layer makes TiB-per-day SAL Athena-queryable).
Seen in¶
- sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33 — canonical wiki argument for granular usage attribution as a gating observability primitive for data-platform efficiency wins.
- (Substrate) sources/2025-09-26-yelp-s3-server-access-logs-at-scale — the SAL pipeline that makes this attribution affordable at fleet scale.
Related¶
- concepts/partition-access-pattern — the signature-recognition capability granular attribution unlocks.
- concepts/s3-server-access-logs — the AWS-native substrate.
- concepts/data-lineage — the complementary observability primitive (where data flows vs who reads it).
- systems/yelp-s3-sal-pipeline — Yelp's substrate.
- systems/yelp-partition-access-visualization — Yelp's consumption layer.
- patterns/access-pattern-visualization-for-data-stewardship — pattern.
- patterns/iam-role-attribution-from-s3-access-logs — entity coordinate.