Skip to content

SYSTEM Cited by 1 source

Yelp Partition Access Visualization

Yelp's internal partition-access-visualisation tooling — a SQL-aggregate + chart layer on top of the Yelp S3 SAL pipeline that exposes who (IAM role) is reading what (table + partition value) at when (event time), so data owners can read consumer signatures off a chart instead of via stakeholder conversations or stale docs.

Role

Canonical home for Yelp's usage-attribution-as-data-stewardship substrate. Single-system surface for four downstream use cases:

  1. Stakeholder discovery"who consumes this table" answered off a chart per data owner.
  2. Storage-class routing — feeds the decision to put a table on Intelligent Tiering or onto Default Access Retention.
  3. Apache Iceberg migration prioritisationpatterns/usage-driven-migration-prioritization uses the same data to rank the migration backlog by active-table value.
  4. Incident response / cost attribution — slice by IAM role, table, partition, time window.

(Source: sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33)

Architecture

The system is a batch-driven SQL aggregation that consumes the output of the daily SAL → Parquet compaction job documented in the 2025-09-26 SAL-pipeline post.

Aggregation SQL (verbatim)

INSERT INTO table_usage_aggregated
SELECT
    COUNT(1)                    AS ct,
    requester                   AS iam_role,
    "timestamp"                 AS event_time,
    KEY_TO_TABLE_NAME(key)      AS table_name,
    KEY_TO_PARTITION_VALUE(key) AS partition_value
FROM
    s3_server_access_logs_compacted
WHERE
    bucket_name IN ('BUCKET1', 'BUCKET2')
    AND "timestamp" = 'yyyy/mm/dd'
    AND operation = 'REST.GET.OBJECT'
    AND key LIKE 'prefix_to_include%'
GROUP BY
    2, 3, 4, 5
  • Filter on REST.GET.OBJECT: confines attribution to actual reads. Other operations (PUT, DELETE, lifecycle expirations) are not relevant to consumer attribution.
  • requester: the SAL-supplied IAM principal — stable, human- meaningful entity coordinate.
  • KEY_TO_TABLE_NAME(key) / KEY_TO_PARTITION_VALUE(key): UDF stubs that map the S3 key to catalog identity. Alternative: "join against your catalog metadata containing database name, table name, table location, and partition spec."
  • Grouping by (iam_role, event_time, table_name, partition_value) produces the four-tuple aggregate that drives all visualisations and downstream dashboards.

Optional join against S3 Inventory

"You may find it additionally useful to join against S3 Inventory to understand how usage relates to the defined storage classes for your analytics tables." — links the "what was accessed" signal back to "what storage tier are those objects on," completing the observability loop for storage- class decisioning.

Visualisations

Two named charts:

1. Partitions Accessed Vs Time

X-axis: access event time. Y-axis: partition value (e.g. dt=yyyy-mm-dd). Each access is a point; points are coloured by accessing IAM role.

The chart surfaces the three partition-access-pattern signatures with no manual labelling:

  • Diagonal y=x — daily batch consumer (today's job reads today's partition, yesterday's reads yesterday's, …).
  • Vertical line — backfill (one job at one moment scanning many partitions).
  • Scatter, no pattern — ad hoc inspection (typically Yelp internal engineering teams checking that data meets expectations).

2. Partitions Accessed (over a window)

Answers "what are the access patterns for table X in the last N months?" — a per-partition access-count rollup over a configurable window. Used for retention decisions: a partition that hasn't been read in the last N months is a candidate for deletion / cold-tiering / Default Access Retention.

Substrate dependencies

  • systems/yelp-s3-sal-pipeline — load-bearing upstream. The daily SAL → Parquet compaction (Tron, Athena INSERT, Glue partition projection, idempotent self-LEFT-JOIN on requestid) is what makes the aggregation SQL queryable at TiB-per-day SAL volume. Without the compaction layer, Athena would hit S3 API rate limits attempting to read raw SAL.
  • systems/amazon-athena — the query engine for both the aggregation INSERT and the downstream visualisation queries.
  • systems/aws-iam — provides the entity coordinate (the SAL requester is an IAM principal ARN; Yelp's role-naming convention links roles back to services / teams).
  • systems/s3-inventory — optional join-side for storage-class awareness.

Downstream consumers

  • Data owners — read partition-access-pattern charts directly, identify their consumers, decide retention.
  • Default Access Retention workflow — consumes the per-partition access-count to decide which partitions sit beyond the access window and should be IAM-gated. See patterns/iam-policy-gated-cold-tier-access.
  • Iceberg migration backlog — see patterns/usage-driven-migration-prioritization. Most-accessed tables and partitions migrate first.
  • Cost attribution — same aggregate, rolled up to IAM-role grain, drives chargeback and incident investigation.

Reported outcomes

The visualisation tooling, combined with the storage-class strategy and Default Access Retention adoption it enabled, produced 33% reduction in S3 storage cost on Yelp's petabyte-scale data lake, plus Iceberg migration prioritisation focused on active tables. (Source: sources/2026-05-21-yelp-how-partition-access-visualizations-reduced-our-data-lake-s3-cost-by-33)

The post does not disclose the cost decomposition (deletion vs IT adoption vs DAR adoption), the time horizon, or the absolute baseline.

Caveats

  • Time-partitioned-tables only — the diagonal y=x signature requires the partition key to be time-shaped (e.g. dt=yyyy-mm-dd). Hash-partitioned or category-partitioned tables would not yield the same visual signatures.
  • Catalog-key mapping is not standardisedKEY_TO_TABLE_NAME / KEY_TO_PARTITION_VALUE are UDF stubs in the post; real implementations either implement these UDFs against Yelp's S3-key convention or join against catalog metadata.
  • Same-account same-region SAL is the implicit substrate configuration (per the 2025-09-26 SAL pipeline disclosure).

Seen in

Last updated · 542 distilled / 1,571 read