PATTERN Cited by 1 source
Access pattern visualisation for data stewardship¶
Problem¶
Data owners cannot effectively steward their tables without knowing who consumes them. The historical fallbacks (stakeholder conversations, internal documentation, word-of-mouth folklore) "quickly become out of date" and produce enough uncertainty that retention, cold-tiering, and migration decisions are blocked by an unbounded risk premium: "we might break a stakeholder we don't know about."
Per-bucket and per-table access counts are too coarse to surface the load-bearing signal: the shape of the consumption (daily batch, periodic backfill, or ad hoc inspection) and the entity (team or service) generating it.
Pattern¶
Plot partition values accessed (y) against access-event time (x), coloured by accessing entity (IAM role).
The visualisation surfaces three canonical partition-access-pattern signatures with no manual labelling:
- Diagonal y=x — daily batch consumer.
- Vertical line — backfill scanning many partitions at one moment.
- Scatter — ad hoc inspection (typically internal engineering).
Together with the IAM-role colour axis (per patterns/iam-role-attribution-from-s3-access-logs), the chart replaces stakeholder conversations and stale documentation with an objective, continuously-updated artifact that any data owner can read directly.
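The three signatures can also be told apart programmatically. The following is a minimal sketch, not Yelp's method: `classify_signature`, its thresholds, and the 0.8 coverage ratio are all hypothetical, and it assumes both the partition value and the event time have already been reduced to calendar dates for a single IAM role.

```python
from collections import defaultdict
from datetime import date
from statistics import pstdev


def classify_signature(points, min_points=5, max_lag_spread_days=1.0):
    """Label a list of (event_day, partition_day) date pairs for one IAM role.

    Hypothetical heuristic: thresholds are illustrative assumptions.
    """
    if len(points) < min_points:
        return "scatter"

    partitions_by_event_day = defaultdict(set)
    for event_day, partition_day in points:
        partitions_by_event_day[event_day].add(partition_day)

    # Vertical line: a single event day touches most of the distinct partitions.
    all_partitions = {partition_day for _, partition_day in points}
    busiest = max(len(s) for s in partitions_by_event_day.values())
    if len(all_partitions) >= min_points and busiest / len(all_partitions) >= 0.8:
        return "vertical"

    # Diagonal: the lag between event day and partition day stays
    # near-constant across many distinct event days.
    lags = [(event_day - partition_day).days for event_day, partition_day in points]
    if (len(partitions_by_event_day) >= min_points
            and pstdev(lags) <= max_lag_spread_days):
        return "diagonal"

    return "scatter"


# A daily job that reads yesterday's partition on each of ten days.
daily = [(date(2026, 1, d + 1), date(2026, 1, d)) for d in range(1, 11)]
# A backfill that reads twenty partitions on a single day.
backfill = [(date(2026, 2, 1), date(2026, 1, d)) for d in range(1, 21)]
```

A heuristic like this is only a triage aid; the chart itself remains the artifact a data owner reads.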
Canonical shape (Yelp 2026-05-21)¶
1. Compute the per-event aggregate¶
A daily SQL aggregation over compacted S3 server access logs:
    INSERT INTO table_usage_aggregated
    SELECT
        COUNT(1) AS ct,
        requester AS iam_role,
        "timestamp" AS event_time,
        KEY_TO_TABLE_NAME(key) AS table_name,
        KEY_TO_PARTITION_VALUE(key) AS partition_value
    FROM
        s3_server_access_logs_compacted
    WHERE
        bucket_name IN ('BUCKET1', 'BUCKET2')
        AND "timestamp" = 'yyyy/mm/dd'
        AND operation = 'REST.GET.OBJECT'
        AND key LIKE 'prefix_to_include%'
    GROUP BY
        2, 3, 4, 5
The four-tuple (table, partition, iam_role, event_time) is the load-bearing aggregate.
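To make the aggregate concrete, here is a small in-memory sketch of the same grouping. It is an illustration under assumptions, not the source's implementation: `key_to_table_and_partition` is a hypothetical stand-in for the `KEY_TO_TABLE_NAME` and `KEY_TO_PARTITION_VALUE` UDFs, and it assumes a Hive-style key layout of `<prefix>/<table>/<column>=<value>/<file>`, which a real deployment may not follow. Unlike the SQL, it skips keys that do not match that layout.

```python
from collections import Counter


def key_to_table_and_partition(key):
    """Hypothetical UDF stand-in: return (table, partition_value) or None.

    Assumes the first '<column>=<value>' path segment is the partition
    and the segment immediately before it is the table name.
    """
    parts = key.split("/")
    for i, part in enumerate(parts):
        if i > 0 and "=" in part:
            return parts[i - 1], part.split("=", 1)[1]
    return None


def aggregate_usage(log_rows, buckets, prefix):
    """Count GET requests per (table, partition, iam_role, event_time)."""
    counts = Counter()
    for row in log_rows:
        if row["bucket_name"] not in buckets:
            continue
        if row["operation"] != "REST.GET.OBJECT":
            continue
        if not row["key"].startswith(prefix):
            continue
        parsed = key_to_table_and_partition(row["key"])
        if parsed is None:
            continue  # key does not follow the assumed layout
        table_name, partition_value = parsed
        counts[(table_name, partition_value,
                row["requester"], row["timestamp"])] += 1
    return counts
```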
2. Plot two complementary charts¶
- Partitions Accessed Vs Time — y = partition value, x = event time, colour = IAM role. Surfaces signature shapes.
- Partitions Accessed (over a window) — per-partition access count over a configurable window, answers "what are the access patterns for table X in the last N months?"
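The windowed chart reduces to a per-partition sum over the aggregated rows. A minimal sketch follows, assuming rows shaped like `table_usage_aggregated` with `event_time` already parsed into a `datetime.date`; the function name `partitions_accessed` and the choice of a trailing window ending at `as_of` are illustrative assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def partitions_accessed(rows, table_name, as_of, window_days):
    """Sum access counts per partition for one table over a trailing window.

    The window is (as_of - window_days, as_of], i.e. it excludes the
    start day and includes as_of.
    """
    start = as_of - timedelta(days=window_days)
    totals = Counter()
    for row in rows:
        if row["table_name"] != table_name:
            continue
        if start < row["event_time"] <= as_of:
            totals[row["partition_value"]] += row["ct"]
    # Sorted by partition value so the result is ready to plot as bars.
    return dict(sorted(totals.items()))
```

Partitions that never appear in the result received no reads in the window, which is the signal a retention or tiering decision would look for.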
3. Optional join against catalog / inventory¶
- Catalog metadata: replace the KEY_TO_TABLE_NAME / KEY_TO_PARTITION_VALUE UDFs with a join against database + table + partition-spec for catalog-aware attribution.
- S3 Inventory: links access signal to current storage class, completing the observability loop for storage-class routing.
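One way the inventory join could feed storage-class routing is sketched below. This is an assumption-laden illustration rather than the source's implementation: it presumes inventory rows have already been mapped to `table_name` and `partition_value`, that access totals are keyed by the same pair, and that "no reads in the window" is the routing criterion. The function name `cold_tier_candidates` is hypothetical.

```python
def cold_tier_candidates(access_totals, inventory_rows, hot_class="STANDARD"):
    """List (table, partition) pairs still in the hot class with zero reads.

    access_totals: dict mapping (table_name, partition_value) -> access count
                   over the chosen window (missing key means no reads).
    inventory_rows: iterable of dicts with table_name, partition_value and
                    storage_class (assumed pre-mapped from inventory keys).
    """
    candidates = []
    for row in inventory_rows:
        pair = (row["table_name"], row["partition_value"])
        unread = access_totals.get(pair, 0) == 0
        if row["storage_class"] == hot_class and unread:
            candidates.append(pair)
    return sorted(candidates)
```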
(Source: sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33)
Forces that make the pattern work¶
- Temporal partition keys are already a coordinate. The y-axis maps to a date; the x-axis is event time; both are in the same units, so a y=x diagonal has literal semantic meaning.
- Three primitive consumer types map to three visually distinct shapes. Diagonal (batch) ≠ vertical (backfill) ≠ scatter (ad hoc) — the inverse mapping is unambiguous.
- IAM-role colour coding distinguishes which batch / which backfill / which engineer.
- The substrate amortises across other use cases. Yelp's 2025-09-26 S3 server access log (SAL) pipeline was justified by other use cases first (cost attribution, access-based retention); this pattern is a downstream consumer that doesn't need its own substrate.
Forces that limit the pattern¶
- Time-partitioned tables only. Hash-partitioned, category-partitioned, or bucketed schemes do not yield the same signatures.
- Fleet-scale SAL volume requires compaction. Without a compacted Parquet substrate (à la systems/yelp-s3-sal-pipeline), the SQL aggregation is prohibitively expensive at TiB-per-day SAL volumes.
- The catalog-key mapping is non-trivial. UDFs or joins must reflect the deployment's actual S3-key convention.
Downstream wins (Yelp's three-pillar framing)¶
The pattern unlocks the three platform-team value pillars documented in concepts/granular-usage-attribution:
- Customer value — data owners get definitive consumer answers.
- Business value — Yelp reported 33% S3 cost reduction driven by storage-class routing and DAR adoption that the visualisation enabled.
- Platform efficiency — Apache Iceberg migration prioritised by usage data; see patterns/usage-driven-migration-prioritization.
Anti-pattern: stakeholder conversations only¶
Yelp's framing names the failure mode the pattern replaces:
"Without data-driven usage attribution, teams rely on stakeholder conversations, documentation, and word of mouth — sources that quickly become out of date. This makes it difficult for data owners to effectively steward their data and limits the support that platform teams can autonomously offer."
The anti-pattern's failure is structural: stakeholder conversations require ongoing operator effort, miss out-of-band consumers, and cannot keep up with table-fleet scale.
Seen in¶
- sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33 — canonical Yelp disclosure.
Related¶
- concepts/partition-access-pattern — the signatures the visualisation surfaces.
- concepts/granular-usage-attribution — the meta-concept this pattern instantiates.
- concepts/s3-server-access-logs — substrate.
- systems/yelp-partition-access-visualization — canonical implementation.
- systems/yelp-s3-sal-pipeline — substrate pipeline.
- patterns/iam-role-attribution-from-s3-access-logs — the entity-coordinate primitive.
- patterns/usage-driven-migration-prioritization — downstream use case.
- patterns/iam-policy-gated-cold-tier-access — downstream policy primitive (Default Access Retention) driven by the visualisation's access-window decision.