Yelp — How Partition Access Visualizations Reduced our Data Lake S3 Cost by 33%
Summary
Yelp Engineering post (2026-05-21) by the data-platform team disclosing the partition access visualization technique they built on top of the Yelp S3 SAL pipeline (canonicalised in the 2025-09-26 SAL post) and the three-pillar platform-team payoff that visualisation unlocked: (1) customer value: data owners get a definitive answer to "who consumes this table and how"; (2) business value: a 33% reduction in S3 storage cost on Yelp's petabyte-scale data lake, driven by the confidence to expand deletion-based retention and assign more cost-effective S3 storage classes; (3) platform efficiency: Apache Iceberg migration prioritised toward the most-accessed tables and partitions so the biggest customer-facing latency wins land first.

The visualisation shape is load-bearing: plot time-based partition keys (e.g. dt=yyyy-mm-dd) on one axis against access-event timestamps on the other, with the accessing IAM role colour-coded, and three signature shapes emerge with no manual labelling. Diagonal y=x lines identify daily batch consumers (today's job reads today's partition). Vertical lines identify backfills (one job scanning many partitions at one moment). Scatter with no pattern identifies ad hoc inspection by internal engineering teams.

The implementation is a SQL aggregation over S3 server access logs grouping by (table, partition, iam_role, event_time), filtered to operation = 'REST.GET.OBJECT'. The per-bucket per-prefix key field is parsed via the KEY_TO_TABLE_NAME(key) / KEY_TO_PARTITION_VALUE(key) UDFs, which map S3 keys to catalog identities (alternatively, the logs can be joined against catalog metadata holding database/table/location/partition-spec). Joining against S3 Inventory also gives the current storage class per object: the bridge from "who accessed this" to "what does keeping this cost".

The post then canonicalises two storage-class strategy disclosures: (A) S3 Intelligent Tiering (IT) is the default for unpredictable access patterns because "cost scales down automatically with reduced access, and there is no penalty if access patterns change", with concrete savings of 40% for objects not accessed for 30 days and 81% for objects not accessed for 90 days ("the latter approaches the cost of S3 Glacier!"); (B) colder classes like S3 Glacier "impose minimum storage durations and retrieval fees that can negate savings if you access data more than you expected to": the cold-storage minimum-duration tax.

Yelp's third primitive is the new Default Access Retention: "data beyond the Default Access Retention period remains in S3 but is gated behind a restrictive S3 bucket IAM policy that requires an explicit process to gain access. The data consumer raises a Terraform PR to request temporary access to restricted partitions and estimates the associated cost using a dashboard built on S3 Inventory." Two named benefits: (i) unexpected query patterns "do not reset the flow of objects through Intelligent Tiering tiers", so storage cost is guaranteed to decrease after the initial 30-day Intelligent Tiering period; (ii) data consumers "acknowledge associated costs of reading data from cold Intelligent Tiers, ensuring that it is justified by the business value of the analysis", with explicit disclosure of the structural failure mode: "for our largest tables, full table scans could add significant S3 costs by accessing PBs of data from cheap Intelligent Tiers like Archive Instant Access. This is not obvious to users who are writing SQL to inspect data!"

The post is architecture-and-discipline voice: one large production number (33% S3 cost reduction), the IT savings curve (40% / 81%), one batch-oriented architecture diagram (linked but not described in detail), one SQL transformation, and one anchor link to the prior 2025-09-26 SAL-at-scale post. Not disclosed: QPS, fleet size, dollar amounts, a breakdown of which storage-class moves contributed how much to the 33%, an Iceberg migration speed-up number, the number of tables migrated, the number of partitions covered, operator count, a time horizon for the 33% reduction, or architecture diagram details for the access-based retention dashboard.
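
The three signature shapes lend themselves to a mechanical check. The following is a minimal Python sketch, not Yelp's implementation (the post reads the shapes off a chart by eye). It assumes each access by one IAM role has already been reduced to a (partition_date, event_date) pair; the function name classify_access_shape and the min_points threshold are hypothetical choices for illustration.

```python
from datetime import date


def classify_access_shape(points, min_points=3):
    """Label one role's accesses as 'diagonal', 'vertical', or 'scatter'.

    points: iterable of (partition_date, event_date) pairs.
    min_points: hypothetical threshold below which no shape is claimed.
    """
    points = list(points)
    if len(points) < min_points:
        return "scatter"

    # Lag in days between the access event and the partition that was read.
    lags = {(event - part).days for part, event in points}
    event_days = {event for _, event in points}
    partitions = {part for part, _ in points}

    # Vertical line: many partitions read on a single event day (backfill).
    if len(event_days) == 1 and len(partitions) >= min_points:
        return "vertical"

    # Diagonal line: the same lag repeated across several event days
    # (daily batch; lag 0 is the y=x case).
    if len(lags) == 1 and len(event_days) >= min_points:
        return "diagonal"

    return "scatter"


# Illustrative inputs (made-up dates, not data from the post).
daily_batch = [(date(2026, 1, d), date(2026, 1, d)) for d in range(1, 6)]
backfill = [(date(2026, 1, d), date(2026, 1, 10)) for d in range(1, 6)]
ad_hoc = [
    (date(2026, 1, 1), date(2026, 1, 9)),
    (date(2026, 1, 4), date(2026, 1, 10)),
    (date(2026, 1, 2), date(2026, 1, 20)),
]
```

A real consumer rarely produces a perfectly constant lag or a single-day backfill, so a production version would likely need tolerances; the exact-match rules here only mirror the idealised shapes described above.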
Key takeaways
- Stakeholder discovery is the gating problem for data-platform efficiency wins. "In large analytics environments, data teams often struggle to answer deceptively simple questions, like who their stakeholders are and how their data is being used." Without stakeholder data, "teams rely on stakeholder conversations, documentation, and word of mouth — sources that quickly become out of date," which "makes it difficult for data owners to effectively steward their data and limits the support that platform teams can autonomously offer." Granular usage attribution is the unlock — once data owners can confidently answer "who consumes this table" they gain agency over retention, storage class, and migration decisions they otherwise can't make. The post structurally argues that observability of consumption is a precondition for storage cost optimisation — a granular-usage-attribution thesis (Source).
- Plotting partition value vs access time reveals consumer signatures with no labelling. "Given an analytics table partitioned by date (dt=yyyy-mm-dd), we can plot partitions accessed against event time. Daily batch consumers present a diagonal line showing y=x. Today's job reads today's partition, yesterday's job reads yesterday's partition, and so on. Backfill events that scan many partitions present a vertical line. Ad hoc queries appear as scatter points with no clear pattern. We commonly see this type of access from roles attributed to internal engineering teams." The visualisation is the load-bearing primitive: shape recognition replaces explicit metadata. The diagonal-vs-vertical-vs-scatter distinction is cheap to compute (an aggregate-then-plot SQL query) and human-readable (any data owner can read the chart without ML or labelling). Canonical instance of the new concepts/partition-access-pattern concept and the new patterns/access-pattern-visualization-for-data-stewardship pattern (Source).
- The IAM role accessing the data is the entity coordinate. "We include the entity accessing the data in the visualization. Since Yelp runs on AWS, our entities are IAM roles, with role names linked to services or teams." The IAM role doubles as a stable identity primitive (the SAL requester field is an AWS IAM principal; see SAL line format) and a human-meaningful label (Yelp role names link to services/teams), avoiding the problem of attributing access to opaque IPs or ephemeral compute instances. Canonical instance of the new patterns/iam-role-attribution-from-s3-access-logs pattern, a refinement of the IAM-role grouping already documented in the 2025-09-26 SAL pipeline disclosure for cost attribution, generalised here to consumer attribution and to consumption-pattern fingerprinting (Source).
- The implementation is a single SQL aggregation over compacted SAL. The post discloses the SQL transformation verbatim:

      INSERT INTO table_usage_aggregated
      SELECT
        COUNT(1) AS ct,
        requester AS iam_role,
        "timestamp" AS event_time,
        KEY_TO_TABLE_NAME(key) AS table_name,
        KEY_TO_PARTITION_VALUE(key) AS partition_value
      FROM s3_server_access_logs_compacted
      WHERE bucket_name IN ('BUCKET1', 'BUCKET2')
        AND "timestamp" = 'yyyy/mm/dd'
        AND operation = 'REST.GET.OBJECT'
        AND key LIKE 'prefix_to_include%'
      GROUP BY 2, 3, 4, 5

  "Instead of functions KEY_TO_TABLE_NAME and KEY_TO_PARTITION_VALUE, you may join against your catalog metadata containing database name, table name, table location, and partition spec." The transformation rides on the existing 2025-09-26 SAL-to-Parquet compaction pipeline (systems/yelp-s3-sal-pipeline); visualization is one consumer of the SAL data lake among several (the prior post named permission-debugging, cost attribution, incident response, and access-based retention as the others). "You may find it additionally useful to join against S3 Inventory to understand how usage relates to the defined storage classes for your analytics tables." (Source).
- The platform-team-payoff thesis is a three-pillar structure: customer value × business value × platform efficiency. "Partition-level usage data would deliver meaningful improvements across the three main pillars of a platform team." Customer value: "a new feature to definitively track table consumers and usage patterns, giving data owners confident answers about who consumes their tables." Business value: "reduced S3 storage cost by 33%." Platform efficiency: "focus our migration efforts on active tables and partitions that would add the most customer value. This enabled the team to provide Apache Iceberg's read performance benefits to the most valuable use cases first." The 33% number is the headline; the migration prioritisation is the structural insight: usage data is simultaneously a cost-optimisation input and a migration-prioritisation input, sharing the same observability substrate. Canonical instance of the new patterns/usage-driven-migration-prioritization pattern (Source).
- S3 Intelligent Tiering is the cost-effective default for unpredictable access. "We default to S3 Intelligent Tiering when datasets have unpredictable access patterns. Its key advantage is that cost scales down automatically with reduced access, and there is no penalty if access patterns change." The savings curve is verbatim: "objects not accessed for 30 days decrease in cost by 40%; objects not accessed for 90 days decrease in cost by 81%. The latter approaches the cost of S3 Glacier!" The structural argument is risk-symmetric: IT tiers objects down without operator action and moves them back up if access resumes, so the data owner does not need to model future access correctly to capture savings. Cold storage classes, by contrast, carry the minimum-duration tax (Source).
- Cold storage classes carry a structural minimum-duration + retrieval-fee tax that punishes uncertain access. "This is in contrast to cold storage classes (e.g., S3 Glacier) that impose minimum storage durations and retrieval fees that can negate savings if you access data more than you expected to." The verbatim framing matters: cold storage is a bet against access (you pay if you access more than expected), whereas Intelligent Tiering is bet-neutral (cost moves with actual access). For data with unknown future access, the bet-neutral choice is the safer one: IT extracts most of the cost reduction (81% at 90 days) without the retrieval-and-duration tax. Canonical instance of the new concepts/cold-storage-minimum-duration-tax concept, the cold-tier failure mode that explains why Yelp explicitly preferred IT over Glacier in the absence of access-pattern data (Source).
- Default Access Retention is Yelp's middle-ground primitive when neither deletion nor cold-tier is acceptable. "We introduced a middle ground for cases where data owners could not further expand deletion-based retention or cold storage due to uncertain future requirements: define an expected access window and implement access-based retention. Data beyond the Default Access Retention period remains in S3 but is gated behind a restrictive S3 bucket IAM policy that requires an explicit process to gain access. The data consumer raises a Terraform PR to request temporary access to restricted partitions and estimates the associated cost using a dashboard built on S3 Inventory. Certain approval levels are required based on the magnitude of the cost." This is structurally distinct from both deletion (data is gone) and cold-tier-by-default (data is still freely readable but more expensive to read). DAR keeps the data, makes it cheaper to keep (because IT can finish tiering down without disruption), and forces consumption decisions through a human-gated process with up-front cost disclosure. Canonical instance of the new concepts/default-access-retention concept and the new patterns/iam-policy-gated-cold-tier-access pattern (Source).
- DAR's first benefit: unexpected queries don't reset Intelligent Tiering's tiering clock. "Unexpected or accidental query patterns do not reset the flow of objects through Intelligent Tiering tiers. Storage cost is guaranteed to decrease after the initial 30 day period of Intelligent Tiering." The IT auto-tiering mechanism re-promotes objects to higher (more expensive) tiers when accessed; the DAR IAM gate prevents accidental access, which preserves the cost trajectory. The disclosure is precise: cost is guaranteed to decrease past the initial 30-day window, and the load-bearing assumption — that no one will accidentally read these objects — is enforced by the IAM policy itself, not by hope-based discipline (Source).
- DAR's second benefit: explicit cost acknowledgement before cold-tier scans, with disclosure that PB-scale full-table-scans on Archive Instant Access are non-obvious cost bombs. "Data consumers acknowledge associated costs of reading data from cold Intelligent Tiers, ensuring that it is justified by the business value of the analysis. If a consumer accesses a partition that is beyond the Default Access Retention period, their query fails with an Access Denied exception. For our largest tables, full table scans could add significant S3 costs by accessing PBs of data from cheap Intelligent Tiers like Archive Instant Access. This is not obvious to users who are writing SQL to inspect data!" The Access Denied exception is the forcing function: a SQL author who didn't think about cost is interrupted by a permission failure that points at the Terraform PR process and the S3 Inventory cost dashboard. The PB-scale Archive Instant Access full-table-scan is the canonical cost bomb DAR was designed to prevent: a query that "just works" against a hot table can add significant S3 cost if it ranges over cold-but-accessible petabytes (the post gives no dollar figure) (Source).
- Iceberg migration prioritisation is a side-product of the same usage substrate. "By identifying the most frequently accessed data, the data platform team was able to focus our migration efforts on active tables and partitions that would add the most customer value. This enabled the team to provide Apache Iceberg's read performance benefits to the most valuable use cases first." The post's framing is "we set out to identify active tables and partitions to prioritize our adoption of Apache Iceberg, but we soon realized that the opportunity was larger": usage attribution was originally an Iceberg-migration-prioritisation tool, and the storage-class / DAR / cost-reduction wins emerged as a second-order effect. The implication: single-purpose observability investments compound into multi-purpose platform substrates; the same (table, partition, iam_role, event_time) aggregate serves Iceberg migration prioritisation, storage-class routing, retention policy, and incident response. Canonical instance of the new patterns/usage-driven-migration-prioritization pattern with the active-table-discovery as the prioritisation key (Source).
- The architecture rides on the previously-canonicalised SAL pipeline; no new infrastructure was needed. The architecture diagram is described as "batch-driven": a single SQL transformation aggregates compacted SAL into a usage table, then visualisations and dashboards are built on top. The substrate is the same SAL → Parquet → Athena chain documented in sources/2025-09-26-yelp-s3-server-access-logs-at-scale (the post explicitly anchors to that earlier disclosure: "check out our Engineering Blog post on S3 server access logs to learn how to enable this cost-effectively at scale"). The architectural lesson: observability investments amortise across follow-on use cases. The 2025-09-26 SAL pipeline was justified by storage-class retention and cost attribution, but it also unblocked partition access visualisation, Iceberg migration prioritisation, DAR enforcement, and stakeholder discovery, all without new substrate (Source).
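
The post names the KEY_TO_TABLE_NAME and KEY_TO_PARTITION_VALUE functions but does not show their bodies. As an illustration only, the Python sketch below assumes a hypothetical key layout of warehouse/<database>/<table>/dt=yyyy-mm-dd/<file>. The layout, the warehouse/ prefix, and both function bodies are assumptions, not Yelp's UDFs.

```python
import re

# Assumed partition segment: dt=yyyy-mm-dd as a full path component.
_PARTITION_RE = re.compile(r"(?:^|/)dt=(\d{4}-\d{2}-\d{2})(?:/|$)")


def key_to_partition_value(key):
    """Return the dt= partition value embedded in an S3 key, or None."""
    match = _PARTITION_RE.search(key)
    return match.group(1) if match else None


def key_to_table_name(key, prefix="warehouse/"):
    """Return 'database.table' for keys under the assumed prefix, else None."""
    if not key.startswith(prefix):
        return None
    parts = key[len(prefix):].split("/")
    # Expect at least: database, table, and one further path segment.
    if len(parts) < 3:
        return None
    database, table = parts[0], parts[1]
    return f"{database}.{table}"


# Hypothetical example key, following the assumed layout.
example_key = "warehouse/analytics/reviews/dt=2026-05-01/part-0000.parquet"
```

Path parsing like this breaks as soon as table locations stop following one convention, which is presumably why the post offers the catalog-metadata join as the alternative.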
Architectural numbers + operational notes (from source)
- Headline result: "reducing S3 storage cost by 33%" — Yelp's petabyte-scale analytics data lake. Time horizon, baseline, and decomposition (how much from deletion-based retention vs storage-class moves vs DAR adoption) not disclosed.
- S3 Intelligent Tiering savings curve: "objects not accessed for 30 days decrease in cost by 40%; objects not accessed for 90 days decrease in cost by 81%."
- Glacier comparison: "the latter [81%] approaches the cost of S3 Glacier!" — Yelp's quoted argument against using Glacier directly when IT captures most of the savings without the minimum-duration / retrieval-fee tax.
- DAR enforcement mechanism: restrictive S3 bucket IAM policy; consumer raises a Terraform PR to gain temporary access; approval levels depend on the magnitude of the cost (S3 Inventory dashboard estimates the cost based on the data in scope and current storage classes).
- DAR failure mode: "their query fails with an Access Denied exception" — the forcing function for cost acknowledgement.
- Substrate: rides on the Yelp S3 SAL pipeline (canonical 2025-09-26). Single SQL aggregation. SQL operation filter: operation = 'REST.GET.OBJECT'. UDF stubs: KEY_TO_TABLE_NAME(key) and KEY_TO_PARTITION_VALUE(key); alternatively, join against catalog metadata.
- Visualisation primitives: two charts are named: (a) "Partitions Accessed Vs Time" (the time-vs-partition signature view) and (b) "Partitions Accessed" (the historical per-partition access count over a window: "what are the access patterns for table X in the last N months?"). Image URLs are referenced but content not described.
- Entity granularity: IAM role (the SAL requester field). Linked to services / teams via Yelp role-naming convention.
- Three signature shapes: diagonal y=x line (daily batch); vertical line (backfill); scatter (ad hoc inspection, typically internal engineering).
- Three platform-team pillars: customer value, business value, platform efficiency — explicit.
- Iceberg migration prioritisation: "focus our migration efforts on active tables and partitions" — no number of tables migrated, no migration speed-up disclosed.
- Future work: "Inspired by the success of this effort, we are investing in other areas of our data infrastructure to further enhance lineage and granular usage attribution."
- Acknowledged contributors: Rishi Madan (development); Yelp Infrastructure Security team — Vincent Thibault, Quentin Long, Nurdan Almazbekov (enabling SAL across Yelp's AWS infrastructure).
- Nothing disclosed about: dollar baseline / saved amount, total table count, table size distribution, IT vs DAR cost split, the Terraform PR review SLA, the cost-magnitude approval thresholds, the S3 Inventory cost-estimation formula, the visualisation tooling (BI tool name, dashboard platform), the migration backlog size, the schedule of the SQL aggregation (presumed daily based on the SAL pipeline's daily Tron compaction cadence), per-team consumer counts, the pre-DAR baseline retention period, the post-DAR access window typical length, what proportion of tables were already on IT before this effort, what proportion are on DAR after.
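
To make the quoted savings curve concrete, here is a small arithmetic sketch in Python. It uses only the two percentages quoted in the post (40% after 30 idle days, 81% after 90 idle days); the function names and the sample byte distribution are hypothetical, and the result is a relative storage cost, not a dollar figure and not an explanation of the 33% headline.

```python
def relative_storage_cost(idle_days):
    """Storage cost as a fraction of the frequent-access cost.

    Uses the curve quoted in the post: 40% cheaper after 30 days
    without access, 81% cheaper after 90 days without access.
    """
    if idle_days >= 90:
        return 1.0 - 0.81
    if idle_days >= 30:
        return 1.0 - 0.40
    return 1.0


def blended_relative_cost(buckets):
    """Size-weighted relative cost for (idle_days, size) pairs."""
    total = sum(size for _, size in buckets)
    if total == 0:
        return 1.0
    weighted = sum(size * relative_storage_cost(days) for days, size in buckets)
    return weighted / total


# Made-up distribution: 100 units hot, 100 units idle 45 days,
# 200 units idle 120 days.
sample = [(0, 100), (45, 100), (120, 200)]
# (100 * 1.0 + 100 * 0.6 + 200 * 0.19) / 400 = 0.495
```

With this made-up mix the dataset costs about half of what it would if everything sat in the frequent-access tier, which is why keeping accidental reads away from idle objects matters for the cost trajectory.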
Systems extracted
New wiki pages:
- systems/yelp-partition-access-visualization — Yelp's partition-access visualisation tooling (canonical home). Sits on top of the systems/yelp-s3-sal-pipeline as a downstream consumer: it produces the (table, partition, iam_role, event_time) aggregate and the two named charts (Partitions Accessed Vs Time and Partitions Accessed). Usage substrate for stakeholder discovery, storage-class routing, Default Access Retention decisioning, and Apache Iceberg migration prioritisation. Canonical instance of patterns/access-pattern-visualization-for-data-stewardship.
- systems/aws-s3-intelligent-tiering — AWS storage class with automatic tiering between Frequent / Infrequent / Archive Instant Access tiers based on observed access age. "Cost scales down automatically with reduced access, and there is no penalty if access patterns change." The Yelp post canonicalises the savings curve: 30 days no-access → 40% off; 90 days no-access → 81% off ("approaches the cost of S3 Glacier!"). Yelp's default storage class for analytics datasets with unpredictable access patterns. The Archive Instant Access tier specifically is named as the structural source of the "PBs of data from cheap Intelligent Tiers" full-table-scan cost bomb that Default Access Retention guards against.
- systems/aws-s3-glacier — AWS cold-storage class family. Cited in the post as the comparison point that motivates IT-by-default — "cold storage classes (e.g., S3 Glacier) impose minimum storage durations and retrieval fees that can negate savings if you access data more than you expected to." Canonical wiki instance of the cold-storage minimum-duration tax.
Extended (cross-link added):
- systems/aws-s3 — adds this source; reinforces SAL as a multi-purpose observability substrate. Adds canonical references to the new IT and Glacier pages and to the partition-access-visualisation pattern. Storage-class strategy disclosure (IT default + DAR middle ground + Glacier comparison) joins the existing storage-properties section.
- systems/apache-iceberg — adds this source; canonical instance of patterns/usage-driven-migration-prioritization applied to Iceberg adoption — Yelp prioritised migration toward most-accessed tables and partitions to deliver Iceberg read-performance wins to highest-value consumers first.
- systems/s3-inventory — adds this source; second canonical seen-in. Inventory's role here: (a) joined against SAL access logs to tag accessed-vs-unaccessed objects at storage-class granularity; (b) cost-estimation dashboard for the DAR Terraform PR cost acknowledgement gate.
- systems/yelp-s3-sal-pipeline — adds this source; canonical instance of patterns/access-pattern-visualization-for-data-stewardship consuming the pipeline output. Second downstream consumer documented for the SAL pipeline, reinforcing that the 2025-09-26 substrate amortises across multiple use cases.
- systems/aws-iam — adds note that the IAM role is the canonical entity coordinate for usage attribution in S3-centric data lakes (via the SAL requester field), generalising the IAM-role-as-subject role from access control to consumption fingerprinting.
- companies/yelp — sixth axis on the wiki: data-platform / storage-cost-engineering. Builds on the 2025-09-26 SAL-axis foundation: same substrate, new use case.
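
The migration prioritisation noted for systems/apache-iceberg above can be sketched as a simple ranking over the usage aggregate. This is a hedged Python illustration: the table_name and ct column names come from the post's SQL, but ranking by total GET count is an assumption, since the post does not say how "most frequently accessed" was measured, and the function name and sample rows are hypothetical.

```python
from collections import Counter


def rank_tables_for_migration(usage_rows, top_n=None):
    """Rank tables by total access count, most-accessed first.

    usage_rows: iterable of dicts with 'table_name' and 'ct' keys,
    mirroring columns of the table_usage_aggregated output.
    Ties are broken alphabetically so the order is deterministic.
    """
    totals = Counter()
    for row in usage_rows:
        totals[row["table_name"]] += row["ct"]
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_n is None else ranked[:top_n]


# Made-up rows for illustration.
rows = [
    {"table_name": "analytics.reviews", "ct": 120},
    {"table_name": "analytics.checkins", "ct": 40},
    {"table_name": "analytics.reviews", "ct": 30},
    {"table_name": "analytics.photos", "ct": 40},
]
```

A weighting by distinct consumers or by bytes read would be an equally plausible key; raw request count is used here only because it is the one measure the disclosed aggregate already carries.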
Concepts extracted
New wiki pages:
- concepts/partition-access-pattern — a consumption signature visible when partitions accessed (y) are plotted against access-event time (x): diagonal y=x lines indicate daily batch consumers; vertical lines indicate backfills (one job, many partitions); scatter indicates ad hoc inspection. The visualisation makes consumer roles legible without manual labelling. Canonical wiki disclosure.
- concepts/granular-usage-attribution — per-table per-partition per-entity tracking of data access. The Yelp post argues this is the gating observability primitive for data-platform efficiency wins — without it, retention / cold-tiering / migration decisions are blocked by stakeholder uncertainty. Distinct from coarser per-table or per-bucket access counts because the per-partition + per-entity granularity enables both shape recognition (partition-access-pattern) and accountable cost attribution.
- concepts/cold-storage-minimum-duration-tax — the structural failure mode of cold-storage classes (S3 Glacier, S3 Glacier Deep Archive) when access patterns are uncertain: minimum storage durations + per-access retrieval fees combine to negate savings if you read the data more than you expected to. Yelp's argument for IT-by-default rests on this property. Bet-asymmetric (loses money on uncertainty) vs IT's bet-symmetric tiering. Generalises beyond AWS — same tax exists for Azure Archive, GCS Archive.
- concepts/default-access-retention — Yelp's named middle-ground primitive between deletion-based retention and cold-tier-by-default. Data outside the access window stays in S3 but behind a restrictive bucket IAM policy that gates access through a Terraform PR + cost-acknowledgement workflow. Two named benefits: (1) unexpected queries don't reset IT's tiering clock; (2) consumers acknowledge cost before cold-tier scans, preventing PB-scale Archive Instant Access cost bombs from hot SQL. Canonical wiki disclosure.
Extended (cross-link added):
- concepts/s3-server-access-logs — second canonical use case (after access-based retention from 2025-09-26): partition-access visualisation. Reinforces SAL's role as the load-bearing substrate for multi-tenant fleet usage attribution at AWS-native cost.
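
The minimum-duration tax described under concepts/cold-storage-minimum-duration-tax can be shown with a toy cost model. The Python sketch below is a simplification under stated assumptions: data removed before the minimum duration is still billed for the full minimum, and retrieval is charged per unit read. All prices are placeholder numbers in abstract cost units, not AWS prices, and both function names are hypothetical.

```python
def cold_tier_cost(size, months_stored, size_retrieved,
                   price_per_unit_month, min_months, retrieval_fee_per_unit):
    """Cost of a cold tier with a minimum duration and retrieval fees."""
    # Early removal is still billed up to the minimum storage duration.
    billed_months = max(months_stored, min_months)
    storage = size * billed_months * price_per_unit_month
    retrieval = size_retrieved * retrieval_fee_per_unit
    return storage + retrieval


def flat_tier_cost(size, months_stored, price_per_unit_month):
    """Cost of a tier with no minimum duration and no retrieval fee."""
    return size * months_stored * price_per_unit_month


# Placeholder prices: warm tier 10 per unit-month; cold tier 2 per
# unit-month, 3-month minimum, retrieval fee 30 per unit read.
# Scenario A: 1000 units kept 12 months and never read.
never_read_cold = cold_tier_cost(1000, 12, 0, 2, 3, 30)      # 24000
never_read_warm = flat_tier_cost(1000, 12, 10)               # 120000
# Scenario B: 1000 units kept 1 month and read in full twice.
surprise_cold = cold_tier_cost(1000, 1, 2000, 2, 3, 30)      # 66000
surprise_warm = flat_tier_cost(1000, 1, 10)                  # 10000
```

The same cold tier wins in scenario A and loses badly in scenario B, which is the "access data more than you expected to" case from the post.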
Patterns extracted
New wiki pages:
- patterns/access-pattern-visualization-for-data-stewardship — the visualisation-as-tooling pattern. SQL-aggregate (table, partition, iam_role, event_time) over compacted access logs, plot partition value vs event time coloured by IAM role; signature shapes (diagonal / vertical / scatter) become the metadata. Replaces stakeholder conversations / documentation / word-of-mouth as the source of "who consumes this table" with an objective, continuously-updated artifact. Canonical Yelp instance.
- patterns/iam-role-attribution-from-s3-access-logs — using the SAL requester field (an IAM principal ARN) as the entity coordinate for usage fingerprinting. Pre-extracted in the 2025-09-26 SAL pipeline for cost attribution; canonicalised here as the general pattern for consumer attribution and pattern recognition. Stable identity (services/teams own roles), human-meaningful labelling (Yelp role names map to teams), avoids ephemeral-IP / ephemeral-instance attribution failure modes.
- patterns/iam-policy-gated-cold-tier-access — Default Access Retention's enforcement primitive. A restrictive S3 bucket IAM policy blocks reads to data beyond an access window; consumer must raise a Terraform PR with cost acknowledgement (S3 Inventory–derived cost dashboard) to gain temporary access. Two structural benefits: prevents accidental queries from resetting IT's tiering clock; forces explicit cost acknowledgement before PB-scale cold-tier scans. Distinct from deletion (data preserved) and from cold-tier-by-default (data preserved + cheaper to keep + harder to read). Canonical Yelp instance.
- patterns/usage-driven-migration-prioritization — using usage attribution data to prioritise migration backlogs toward most-accessed tables and partitions. Same observability substrate that drives storage-class decisions also drives migration ordering — the highest-value migration wins land first because they're identified by the same signal. Canonical Yelp instance: prioritising Apache Iceberg adoption toward active tables to deliver read-performance wins to most-valuable consumers first.
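
The post does not publish the policy behind patterns/iam-policy-gated-cold-tier-access. The Python sketch below shows one way such a gate could be expressed, as an assumption rather than Yelp's method: a Deny statement on s3:GetObject for partition prefixes older than the access window, with temporarily approved roles exempted through an ArnNotLike condition on aws:PrincipalArn. The bucket name, prefix layout, role ARN, and function names are all hypothetical.

```python
from datetime import date, timedelta


def restricted_partition_prefixes(table_prefix, partition_dates, today, retention_days):
    """Prefixes of dt= partitions older than the access window."""
    cutoff = today - timedelta(days=retention_days)
    return [
        f"{table_prefix}dt={d.isoformat()}/"
        for d in sorted(partition_dates)
        if d < cutoff
    ]


def build_deny_statement(bucket, prefixes, exempt_role_arns):
    """Bucket-policy statement denying reads on the restricted prefixes."""
    statement = {
        "Sid": "DenyReadsBeyondDefaultAccessRetention",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": [f"arn:aws:s3:::{bucket}/{p}*" for p in prefixes],
    }
    if exempt_role_arns:
        # Roles granted temporary access are excluded from the Deny.
        statement["Condition"] = {
            "ArnNotLike": {"aws:PrincipalArn": sorted(exempt_role_arns)}
        }
    return statement


# Hypothetical inputs: a 90-day window evaluated on 2026-05-21.
prefixes = restricted_partition_prefixes(
    "warehouse/analytics/reviews/",
    [date(2026, 1, 1), date(2026, 3, 1), date(2026, 5, 1)],
    today=date(2026, 5, 21),
    retention_days=90,
)
statement = build_deny_statement(
    "example-bucket",
    prefixes,
    ["arn:aws:iam::111122223333:role/approved-analyst"],
)
```

Listing one resource per partition would run into bucket-policy size limits on large tables, so a real deployment would likely need a more compact encoding; the sketch is only meant to show where the Access Denied comes from and how an approved exception could be carved out.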
Caveats
- Yelp is a Tier-3 source; this post passes scope on infrastructure + storage architecture + production cost reduction grounds despite being on the platform-team-blog rim of the engineering blog space. Per AGENTS.md: "Tier-3: Skip unless the post explicitly covers: distributed systems internals, scaling trade-offs, infrastructure architecture, production incidents, storage/networking/streaming design, or similar." This post covers three of those (storage design, scaling trade-offs, infrastructure architecture) and adds a real production cost reduction.
- The 33% reduction is a single aggregate number, with no breakdown of how much came from deletion-based retention expansion, how much from IT-by-default migration, and how much from DAR adoption. The post argues the three contributors qualitatively but doesn't decompose them quantitatively — a reader can't independently rank the levers.
- No baseline or absolute dollar figures. Petabyte-scale data lake is named, but pre-effort cost and post-effort cost are not. "Reducing S3 storage cost by 33%" is the only number — the post is structurally a methodology disclosure, not a cost retrospective.
- No time horizon for the 33% reduction. The migration to IT for objects with newly-discovered access patterns implies a multi-quarter timeline (IT itself takes 90 days to land the 81% deeper-tier savings), but no explicit duration is given.
- DAR enforcement specifics are sparse. "Certain approval levels are required based on the magnitude of the cost" is the only disclosure of the approval taxonomy — no thresholds, no SLAs, no operator headcount, no PR cycle time. "S3 Inventory–based dashboard" names the cost-estimation surface but doesn't disclose its formula or precision.
- Iceberg migration prioritisation outcome is qualitative only. "Focus our migration efforts on active tables and partitions" and "provide Apache Iceberg's read performance benefits to the most valuable use cases first" — but no numbers on table count migrated, partition coverage, latency improvement at the table-grain, or migration speed-up vs the unprioritised counterfactual.
- Visualization tooling is not named. Two charts are linked but not described in detail; no BI platform, no dashboard URL, no code for the partition-access-pattern detection.
- The SAL pipeline this rides on is opaque to readers without the 2025-09-26 anchor post. The post relies on the prior disclosure for substrate (s3_server_access_logs_compacted, Tron, Athena, Glue, partition projection, idempotent INSERTs); a reader without that context will not understand how the SQL aggregation actually runs cost-effectively at TiB-per-day SAL volume.
- Generalisability to non-time-partitioned tables is implicit. The y=x diagonal signature requires time-based partition keys (e.g. dt=yyyy-mm-dd). Hash-partitioned or category-partitioned tables would not produce the same visual signature; the technique's reach beyond date-partitioned analytics tables is not addressed.
- Implicit dependency on Yelp's same-account same-region SAL configuration (the AWS constraint canonicalised in 2025-09-26). Multi-account or cross-region access patterns require additional substrate, not discussed here.
- The Default Access Retention failure-mode disclosure is one-sided. The post argues the value of forcing explicit access for cold-tier reads but does not discuss false-positive costs — legitimate users blocked by an Access Denied for legitimate-but-unanticipated queries, the operational cost of the Terraform PR review backlog, the political cost of a data steward gating access to historical data. These tradeoffs are real and absent.
Source
- Original: https://engineeringblog.yelp.com/2026/05/partition-access-visualizations.html
- Raw markdown: raw/yelp/2026-05-21-how-partition-access-visualizations-reduced-our-data-lake-s3-3ea1f516.md
Related
- companies/yelp — sixth Yelp axis (data-platform / storage-cost-engineering); builds on 2025-09-26 SAL-axis substrate.
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale — the load-bearing substrate this post rides on. Canonical Yelp SAL pipeline disclosure.
- systems/yelp-s3-sal-pipeline — substrate.
- systems/yelp-partition-access-visualization — canonical home for the visualisation tooling.
- systems/aws-s3-intelligent-tiering — Yelp's default storage class for analytics tables.
- systems/aws-s3-glacier — comparison point; cold-storage minimum-duration tax canonical instance.
- systems/apache-iceberg — migration-prioritisation target for the usage data.
- systems/s3-inventory — DAR cost-estimation source.
- concepts/partition-access-pattern — canonical concept extracted from this post.
- concepts/granular-usage-attribution — canonical concept extracted from this post.
- concepts/cold-storage-minimum-duration-tax — canonical concept extracted from this post.
- concepts/default-access-retention — canonical concept extracted from this post.
- patterns/access-pattern-visualization-for-data-stewardship — canonical pattern.
- patterns/iam-role-attribution-from-s3-access-logs — canonical pattern.
- patterns/iam-policy-gated-cold-tier-access — canonical pattern.
- patterns/usage-driven-migration-prioritization — canonical pattern.