YELP 2025-09-26 Tier 3

Yelp — S3 server access logs at scale

Summary

Yelp Engineering post (2025-09-26) by the SRE / Storage team on operationalising S3 Server Access Logs (SAL) at Yelp's fleet scale — "TiBs of S3 server access logs per day" across many buckets and multiple AWS accounts. Core architectural move: a daily raw-to-Parquet compaction job run on Tron (Yelp's in-house batch processing system) that converts the continuous stream of tiny raw-text SAL objects into a small number of columnar Parquet objects queried via Athena. Reported compaction outcomes: 85% storage reduction and 99.99% object-count reduction. Once the data is in Parquet, a second-tier access-based table joins S3 Inventory with a week of SAL to determine unused prefixes, driving access-based object deletion at prefix granularity. Load-bearing subsystem disclosures span log-format partitioning (S3 PartitionedPrefix + EventTime delivery vs the default flat SimplePrefix), Glue partition projection with enum type (vs managed partitions' MSCK REPAIR overhead), idempotent Athena insertions via LEFT JOIN ... target.requestid IS NULL, object-tagging-driven lifecycle expiration (the only scalable per-object delete primitive when per-object DELETE API calls don't scale), Athena's shared-resource TooManyRequestsException contention model, user-controlled log fields as a regex hazard (double-quote injection, shell injection, and SQLi payloads in request_uri / user_agent / referrer break naive regexes), and the S3 key URL-encoding idiosyncrasy (most operations encode keys twice; lifecycle operations encode once; a naive url_decode(url_decode(key)) can over-decode user-controlled percent characters in keys). Canonical first-party retrospective on running AWS-native S3 object-access logging at production scale — every paragraph names a concrete operational problem and the fix.

Key takeaways

  • Raw SAL at scale is unqueryable. Continuous stream of tiny text objects at the destination bucket → S3 API rate-limits on Athena reads → "impossible to query in Athena due to S3 API rate limits. As there's no way to slow down reads from Athena, data partitioning is one of the ways to solve the issue." Compaction to Parquet yields 85% storage reduction + 99.99% object-count reduction. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

  • CloudTrail Data Events are the alternative — and orders of magnitude more expensive. "AWS provides a ready solution to log object-level events as CloudTrail Data Events, but you are charged per data event so it's substantially more expensive: $1 per million data events — that could be orders of magnitude higher!" SAL-to-Parquet compaction is Yelp's cost-efficient alternative to CloudTrail for object-level access logging. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

  • PartitionedPrefix + EventTime delivery replaces SimplePrefix. Default SAL TargetObjectKeyFormat is SimplePrefix ([TargetPrefix][YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString]), a flat namespace that accumulates to Athena-unqueryable volumes. PartitionedPrefix produces [TargetPrefix][SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/..., partitioned for Athena pruning. Yelp made it the Terraform module default and migrated existing targets. Delivery option: EventTime ("attributes the log to the event time") over the alternative log-arrival-time option. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

  • Glue partition projection with enum type beats managed partitions. Managed partitions require MSCK REPAIR or ALTER TABLE to refresh and suffer query-planning latency as partition count grows. Partition projection names partitions from the known prefix template ('storage.location.template'='s3://<dest>/0123.../${bucket_name}/${timestamp}'). Yelp picked enum for bucket_name (enumerated, list of all buckets kept in Glue by a Lambda that reads an SQS queue populated by EventBridge rules) + day-granularity for timestamp ("encompasses an entire day, that is yyyy/MM/dd, which accelerates query pruning time because we typically query a day's worth of logs" — finer granularity would cause over-partitioning). enum cap: 1,000,000 partitions over long time windows unless constrained in WHERE. Access-based tables use injected type instead (user must name value in WHERE). (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
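
The projection setup above can be sketched as the table properties it produces. A minimal sketch, assuming hypothetical bucket/account names; the projection.* and storage.location.template key names follow Athena's documented partition-projection convention, but the exact values Yelp uses are not disclosed:

```python
def projection_properties(dest_bucket, account_id, bucket_names, start="2024/01/01"):
    """Build Glue TBLPROPERTIES for partition projection over
    (bucket_name, timestamp): enum type for bucket_name, date type at
    day granularity for timestamp. Hypothetical helper, not Yelp's code."""
    return {
        "projection.enabled": "true",
        # enum type: every source bucket must be enumerated up front
        # (Yelp keeps this list fresh via a Lambda fed by EventBridge -> SQS)
        "projection.bucket_name.type": "enum",
        "projection.bucket_name.values": ",".join(sorted(bucket_names)),
        # day granularity (yyyy/MM/dd) matches the PartitionedPrefix layout;
        # finer granularity would over-partition
        "projection.timestamp.type": "date",
        "projection.timestamp.format": "yyyy/MM/dd",
        "projection.timestamp.range": f"{start},NOW",
        "projection.timestamp.interval": "1",
        "projection.timestamp.interval.unit": "DAYS",
        "storage.location.template": (
            f"s3://{dest_bucket}/{account_id}/${{bucket_name}}/${{timestamp}}"
        ),
    }
```

The enum list is what triggers the 1M-partition cap over long unconstrained time windows, which is why WHERE clauses should always pin bucket_name and timestamp.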

  • Regex parsing of SAL is hostile-input territory. The first seven fields (file_bucket, remoteip, requester, requestid, operation, key, size — roughly) are not user-controlled; the later fields (request_uri, referrer, user_agent) are arbitrary, unencoded bytes. "Any regex pattern for parsing S3 server access logs can be broken by a counter example that includes delimiters." Concrete hostile cases disclosed: curl -H 'Referer: "a b" "c"' produces log fields with unmatched quote nesting; SQLi probes and shell injection ("() { ignored; }; echo Content-Type: ...") in user_agent break naive regexes. Yelp's workaround: wrap the user-controlled fields in an optional non-capturing group (?:<rest>)? and always accept the first seven fields; the regex ends in .*$ to tolerate future SAL schema additions. An empty row in the parsed output is the failure signal: "if a row is empty, then the regex has failed to parse that row — so don't ignore those!" (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
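
The optional-group strategy can be illustrated with a simplified pattern (this is a sketch of the technique, not Yelp's actual regex, and it covers only a subset of SAL's 25+ fields): the machine-generated leading fields are mandatory, the user-controlled tail sits in an optional non-capturing group, and a permissive .*$ absorbs anything the group cannot match.

```python
import re

# Leading fields are machine-generated; everything after the key can contain
# arbitrary bytes, so it lives in one optional group plus a permissive tail.
SAL_RE = re.compile(
    r'^(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+)'
    r'(?: "(?P<request_uri>[^"]*)" (?P<status>\d{3}|-))?'  # user-controlled from here
    r'.*$'  # tolerate future SAL schema additions
)

def parse(line):
    m = SAL_RE.match(line)
    if m is None:
        return None  # the "empty row" failure signal -- don't ignore it!
    return m.groupdict()
```

On a hostile line the optional group simply fails to match and the mandatory fields are still recovered, with the tail fields left as None.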

  • Idempotent Athena insertion via self-LEFT JOIN. Yelp's canonical insertion shape:

INSERT INTO "catalog_target"."database"."table_region"
SELECT source.*
FROM "catalog_source"."database"."table_region" source
LEFT JOIN "catalog_target"."database"."table_region" target
ON target.requestid = source.requestid
AND target.bucket_name = 'foo-bucket'
AND target.timestamp = '2024/01/01'
WHERE source.bucket_name = 'foo-bucket'
AND source.timestamp = '2024/01/01'
AND target.requestid IS NULL

Duplicating the bucket_name / timestamp filters in both ON and WHERE is "prudent because queries start taking longer to run after each insertion." For tables without a unique column, Yelp checksums row values. Net property: the compaction job is retry-safe — any failure is recovered by re-running over a day's data with no duplicate writes. Canonicalised as patterns/idempotent-athena-insertion-via-left-join. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
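
The shape generalises to a small templating helper plus the checksum fallback for tables without a unique column. A sketch under assumed catalog/table names; real code would use parameterised identifiers rather than f-strings:

```python
import hashlib

def idempotent_insert_sql(bucket, day):
    """Render the idempotent insertion shape for one (bucket, day).
    Partition filters appear in BOTH the ON and WHERE clauses so the
    target side stays pruned as it grows."""
    return f"""
INSERT INTO "catalog_target"."database"."table_region"
SELECT source.*
FROM "catalog_source"."database"."table_region" source
LEFT JOIN "catalog_target"."database"."table_region" target
ON target.requestid = source.requestid
AND target.bucket_name = '{bucket}'
AND target.timestamp = '{day}'
WHERE source.bucket_name = '{bucket}'
AND source.timestamp = '{day}'
AND target.requestid IS NULL
""".strip()

def row_checksum(row):
    """For tables without a unique column, derive one by hashing every
    column value in stable order (a sketch of the checksum fallback)."""
    payload = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```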

  • Object tagging is the only scalable per-object deletion primitive. "That's the only scalable way to delete per object without needing to modify lifecycle policy each time or issuing delete API calls." Shape: after a successful Athena insertion, tag every source SAL object for expiration; the bucket's lifecycle policy deletes all tagged objects. Per-object DELETE API requests don't scale even with batch-delete. S3 Batch Operations doesn't support Delete as an action — but it does support PutObjectTagging, so Yelp uses Batch Operations to issue the tagging at scale. Caveat: S3 Batch Operations carries a flat $0.25 per job per bucket fee — the biggest cost contributor for buckets with low SAL volume. Yelp's policy: "we directly tag objects for most buckets that generate low order access logs" (bypassing Batch Ops) and only use Batch Ops on the high-volume buckets. Canonicalised as patterns/object-tagging-for-lifecycle-expiration. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
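
The tag-then-expire shape can be sketched as the two payloads involved: a tag set applied per object (directly, or fanned out via Batch Operations) and a single bucket lifecycle rule keyed on that tag. Tag names, rule IDs, and the expiry window here are assumptions, not Yelp's values; the dict shapes mirror boto3's put_object_tagging / put_bucket_lifecycle_configuration arguments:

```python
# Hypothetical tag marking an object for lifecycle expiration.
EXPIRE_TAG = {"Key": "expire", "Value": "true"}

def tagging_for(key):
    """Per-object tag set (put_object_tagging shape). Direct tagging is
    preferred for low-volume buckets to avoid the $0.25 Batch Ops job fee."""
    return {"TagSet": [EXPIRE_TAG]}

def lifecycle_rule(days=1):
    """One bucket lifecycle rule (put_bucket_lifecycle_configuration shape)
    that expires any object carrying the expire=true tag after `days` days,
    so the policy never needs per-batch modification."""
    return {
        "ID": "expire-compacted-sal",
        "Status": "Enabled",
        "Filter": {"Tag": EXPIRE_TAG},
        "Expiration": {"Days": days},
    }
```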

  • Straggler SAL logs are ignored. SAL delivery is best-effort (see concepts/best-effort-log-delivery) and Yelp measured: "less than 0.001% of SAL logs arrive more than a couple of days later. For example, we witnessed some logs arriving 9 days later." Design choice: ignore stragglers so the compaction job delivers business value in a timely fashion. They can be inserted later, after the tagged source objects have been expired. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

  • Access-based retention via inventory ⋈ SAL. Weekly access-based tables compute prefix access by joining S3 Inventory with a week of SAL. The prefix is defined as "only immediate keys under it, segmented by slash (/), and removing trailing slash": array_join(slice(split(rtrim(key, '/'), '/'), 1, cardinality(split(rtrim(key, '/'), '/')) - 1), '/'). Equality joins beat LIKE joins by a dramatic margin — "in a query over ~70,000 rows, simply switching from a LIKE operator to an equality (=) operator reduced execution time from over 5 minutes to just 2 seconds—a dramatic improvement" — because LIKE forces Athena's distributed planner into a broadcast cross-join. Yelp's comfort with best-effort SAL delivery is predicated on deletion at prefix granularity: "deletions are based on prefixes—so missing all logs for a given prefix would only occur for truly inactive data." Canonicalised as patterns/s3-access-based-retention. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
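
The Athena prefix expression is easier to read as its Python equivalent (a sketch mirroring the quoted array_join/slice/split/rtrim chain term by term):

```python
def immediate_prefix(key):
    """Python equivalent of
    array_join(slice(split(rtrim(key, '/'), '/'), 1, cardinality(...) - 1), '/'):
    strip any trailing slash, then drop the last path segment, keeping only
    the immediate parent prefix."""
    parts = key.rstrip("/").split("/")
    return "/".join(parts[:-1])
```

Because the expression yields a plain string, the inventory ⋈ SAL join can be an equality join on the extracted prefix rather than a LIKE match.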

  • Athena is a shared resource — retry-on-TooManyRequestsException. "Athena is a shared resource so a query may be killed any time due to the cluster being overloaded, or occasionally hitting S3 API limits. Given our scale, we would encounter such errors on a regular basis." Yelp addressed this by reducing concurrent queries + requesting per-account / per-region DML query limit increases. Queue processing modelled as "asynchronous functions" to exploit the isolation-per-bucket of insertions. Canonicalised as concepts/athena-shared-resource-contention. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
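
The retry posture implied above can be sketched as jittered exponential backoff around query submission. This is an illustration of the pattern, not Yelp's code: production code would catch botocore's ClientError and inspect the error code instead of matching on the message string.

```python
import random
import time

def run_with_retry(start_query, max_attempts=5, base_delay=1.0):
    """Call `start_query` (any callable that submits an Athena query),
    retrying on TooManyRequestsException with exponential backoff and
    multiplicative jitter; re-raise anything else or the final failure."""
    for attempt in range(max_attempts):
        try:
            return start_query()
        except Exception as exc:
            last_try = attempt == max_attempts - 1
            if "TooManyRequestsException" not in str(exc) or last_try:
                raise
            # back off: base * 2^attempt, jittered to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Per-bucket insertions are independent, which is what lets Yelp model queue processing as asynchronous functions around exactly this kind of wrapper.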

  • S3 Batch Operations manifests: header row + URL-encoding gotchas. Athena query results include a header row that S3 Batch Operations interprets as a bucket name "causing job failures" — Yelp recreates manifest files in memory without headers. Object keys must be URL-encoded to quote_plus(key, safe="/") equivalence before passing to Batch Ops. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
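
Both gotchas fit in one small manifest builder. A sketch assuming rows arrive as (bucket, key) tuples with Athena's header as the first row; the encoding deliberately matches the quote_plus(key, safe="/") equivalence the post names:

```python
import csv
import io
from urllib.parse import quote_plus

def build_manifest(rows):
    """Build an S3 Batch Operations CSV manifest (bucket,key per line):
    drop the header row Athena emits (Batch Ops would read it as a bucket
    name and fail the job), and URL-encode keys with '/' kept literal."""
    body = io.StringIO()
    writer = csv.writer(body)
    for bucket, key in rows[1:]:  # rows[0] is Athena's header row
        writer.writerow([bucket, quote_plus(key, safe="/")])
    return body.getvalue()
```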

  • S3 key URL-encoding idiosyncrasy. Most operations double-url-encode the key field in SAL logs; "for some operations, such as BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT, the key is url-encoded once". Yelp's working theory is that lifecycle operations come from an internal service. Naive blanket double-decoding is unsafe: the logged key foo-%2525Y decodes cleanly twice to foo-%Y, but the actual object key could be either foo-%Y (if double-encoded) or foo-%25Y (if single-encoded). Canonicalised as concepts/url-encoding-idiosyncrasy-s3-keys. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
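
The per-operation decode rule can be made concrete (a sketch: the single-encoded operation list below is only the two the post names, and the fallback assumes everything else is double-encoded):

```python
from urllib.parse import unquote

# Operations whose SAL key field is url-encoded once (lifecycle-originated).
SINGLE_ENCODED_OPS = {"BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"}

def decode_sal_key(key, operation):
    """Decode the SAL key field according to the logging operation:
    one unquote for lifecycle operations, two for everything else.
    Blindly decoding twice would over-decode user-controlled '%' runs."""
    decoded = unquote(key)
    if operation not in SINGLE_ENCODED_OPS:
        decoded = unquote(decoded)
    return decoded
```

The same logged bytes thus resolve to different object keys depending on the operation column, which is exactly the ambiguity the bullet describes.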

  • Glue Data Catalog cross-account pattern. A single "querying" AWS account registers Glue Data Catalogs from every source account (via ListDataCatalogs), letting one IAM role run queries FROM "catalog"."database"."table_region" across the fleet without role-pivoting. Requires cross-account Glue permissions + S3 read on each source bucket. Yelp embeds account and region into table / database names "to be explicit; that occasionally backfires, when a query produces 0 results and you realize that the bucket exists in a different account or region!" (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Operational numbers canonicalised

  • TiBs of S3 server access logs generated per day (fleet-wide, scale-setter for the compaction problem).
  • 85% storage reduction + 99.99% object-count reduction via raw-text-to-Parquet compaction (headline outcome).
  • < 0.001% of SAL logs arrive more than a couple of days late (measured straggler tail; max observed ~9 days).
  • 1 M partition cap on enum-type partition projection if not constrained in WHERE.
  • $0.25 flat per-bucket-per-job S3 Batch Operations fee — the biggest cost contributor for low-SAL-volume buckets (drives the direct-tag vs batch-ops dispatch rule).
  • $1 per million data events for CloudTrail Data Events (the rejected alternative, cited verbatim as "orders of magnitude higher!").
  • ~70,000 rows / 5 min → 2 sec — equality-join vs LIKE-join speedup disclosed for the inventory ⋈ SAL access-based-table computation.
  • 11-Nov 2023 — AWS shipped date-based partitioning for SAL (the unblocker Yelp cites for why object-level logging became tractable).

Systems

  • systems/aws-s3 — the storage layer being monitored + the destination of the logs + the substrate of both raw and compacted data. Same-region same-account destination bucket constraint surfaced ("must exist in the same account, as well as region to eliminate cross-region data charges") and S3 Batch Operations + S3 Inventory added as named S3-family systems.
  • systems/amazon-athena — the query engine over Parquet-compacted SAL; the canonical shared-resource contention surface of this post; $path pseudo-column for extracting S3 object locations; GetQueryRuntimeStatistics API for post-query row counts.
  • systems/aws-glue — Data Catalog + table metadata store; partition projection (enum / injected); Terraform-managed cross-account catalog registration.
  • systems/apache-parquet — the compacted-output format; "It includes metadata that allows skipping row groups or pages based on filter criteria which reduces data scanned."
  • systems/s3-batch-operations — the per-bucket batch job primitive used for PutObjectTagging fanout. Flat $0.25 per job per bucket; doesn't support Delete.
  • systems/s3-inventory — the daily object-listing used in the access-based retention join.
  • systems/yelp-s3-sal-pipeline — the named Yelp system (this post's subject): daily Tron job that compacts SAL to Parquet, tags source objects for lifecycle expiration, and maintains access-based tables for prefix retention.
  • systems/tron — Yelp's in-house batch processing system (open-source: github.com/Yelp/Tron); orchestrates the daily compaction and the weekly access-based-table build.

Concepts

  • concepts/s3-server-access-logs — the AWS primitive ("best-effort" delivery; log line format with 25+ fields; destination-bucket-in-same-region-and-account constraint; SimplePrefix vs PartitionedPrefix vs EventTime delivery).
  • concepts/partition-projection — Glue/Athena partitioning primitive that avoids managed-partition MSCK REPAIR / metastore lookup overhead; enum vs injected types; 1M partition cap on enum if unconstrained.
  • concepts/best-effort-log-delivery — the reliability model Yelp accepts (measured < 0.001% stragglers at the 2-day threshold) and the design knobs that make it acceptable (retention windows longer than the measured tail; deletion at prefix granularity).
  • concepts/athena-shared-resource-contention — Athena's shared-cluster model; TooManyRequestsException; per-account / per-region DML concurrency quota.
  • concepts/user-controlled-log-fields — the regex-hazard shape: when arbitrary user input (HTTP headers, URIs) lands unescaped in log fields, every regex has a breaking counter-example.
  • concepts/url-encoding-idiosyncrasy-s3-keys — most S3 operations double-url-encode keys in SAL; lifecycle operations single-encode; naive url_decode(url_decode(key)) can over-decode user-controlled percent characters.

Patterns

  • patterns/raw-to-columnar-log-compaction — convert a continuous stream of tiny raw-text log objects into a small number of columnar objects (Parquet) in a daily batch. Compression + columnar predicate-pushdown + object-count collapse drive both cost and query-performance wins. Canonical Yelp instance: 85% storage + 99.99% object-count reduction.
  • patterns/object-tagging-for-lifecycle-expiration — to delete millions of objects without modifying the lifecycle policy per-batch or issuing per-object DELETEs, tag each object and let the bucket's tag-based lifecycle policy expire them. Composes with S3 Batch Operations' PutObjectTagging for high-volume buckets and direct per-object tagging for low-volume buckets (to avoid the flat $0.25 per-job fee).
  • patterns/idempotent-athena-insertion-via-left-join — make an INSERT ... SELECT retry-safe by self-LEFT-JOINing on the target's unique column (requestid here) and filtering target.requestid IS NULL; duplicate partition-column filters in both ON and WHERE to keep the join planner pruning correctly as the target grows.
  • patterns/projection-partitioning-over-managed-partitions — choose Glue partition projection over managed partitions when you have a known prefix template; avoids MSCK REPAIR / ALTER TABLE refresh churn and metastore-lookup query-planning latency as partition count grows.
  • patterns/s3-access-based-retention — join S3 Inventory with SAL at prefix granularity to identify unused prefixes; delete at prefix granularity so best-effort SAL delivery doesn't over-delete active data. Cheaper + safer than blanket TTL-based lifecycle on multi-tenant buckets.

Architecture highlights (verbatim content references)

Compaction architecture

"Tron, Yelp's in-house batch processing system, runs daily and converts the previous day's logs to parquet format via Athena insertion queries. A Lambda is used to enumerate all possible bucket names because we opted to use a projection partition using enum type."

Queue processing (compaction)

Steps per bucket, per day:

  1. Location query — retrieve source-object S3 paths for later tagging: SELECT split_part(substr("$path", 6), '/', 1), substr("$path", 7 + length(...)) FROM ... GROUP BY "$path". Locations gathered first "because it saves us from expiring uninserted logs."
  2. Idempotent insertion (LEFT JOIN shape above).
  3. Count verification — compare inserted rows against source row count via GetQueryRuntimeStatistics (for initial insertions) or distinct-count on target (for re-insertions, because duplicates won't enter).
  4. Tag for lifecycle expiration — build an S3 Batch Operations manifest (header-stripped, quote_plus(key, safe="/") URL-encoded) OR directly tag objects (for low-volume buckets, to avoid the $0.25 per-job fee).
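
The four steps can be sketched as one per-bucket, per-day function. All helper names (source_locations, idempotent_insert, counts_match, tag_for_expiration) are hypothetical stand-ins, not Yelp's API; only the step ordering is from the post — in particular, locations are gathered before inserting so a later failure never expires uninserted logs:

```python
def compact_bucket_day(bucket, day, athena, s3):
    """Run one bucket/day through the compaction queue:
    locate -> insert -> verify -> tag. Any step may raise; because the
    insertion is idempotent, the whole flow is safe to re-run."""
    locations = athena.source_locations(bucket, day)      # step 1: $path query
    athena.idempotent_insert(bucket, day)                 # step 2: LEFT JOIN insert
    if not athena.counts_match(bucket, day):              # step 3: count verification
        raise RuntimeError(f"count mismatch for {bucket}/{day}")
    s3.tag_for_expiration(locations)                      # step 4: tag-and-expire
    return len(locations)
```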

Access-based retention flow

Weekly:

  1. Join S3 Inventory with a week of SAL at prefix granularity (equality join on extracted prefix; not LIKE).
  2. Build an access-based table keyed on (bucket, prefix, last_access).
  3. Join back to S3 Inventory to translate accessed-prefixes to full S3 object keys for Batch Ops manifest.
  4. Tag-and-expire unused objects via the same patterns/object-tagging-for-lifecycle-expiration pipeline.
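
In-memory, the core of steps 1–2 reduces to a set difference on extracted prefixes. A sketch only: the real computation is an Athena equality join over S3 Inventory and a week of SAL, but the semantics are the same:

```python
def unused_prefixes(inventory_keys, accessed_keys):
    """Prefixes present in S3 Inventory but absent from a week of SAL.
    Both sides reduce each key to its immediate parent prefix; the set
    difference stands in for the equality join (which the post notes beats
    LIKE by minutes-to-seconds on ~70,000 rows)."""
    def prefix(key):
        parts = key.rstrip("/").split("/")
        return "/".join(parts[:-1])
    return {prefix(k) for k in inventory_keys} - {prefix(k) for k in accessed_keys}
```

Operating at prefix granularity is the safety property: a prefix is only deleted when no key under it appears in a week of access logs.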

Caveats disclosed

  • Straggler SAL is ignored. Design choice — re-insertion after tagged objects have expired would require re-fetching. Acceptable because deletion is at prefix granularity.
  • No manual intervention target — the post emphasises the robustness work: "Most of the effort was spent on making the process robust so that no manual intervention is needed for them."
  • Regex breakage is inevitable. "Essentially, any regex pattern for parsing S3 server access logs can be broken by a counter example that includes delimiters." Empty parsed rows are the failure signal.
  • Backup buckets confound access-based deletion. "Backup buckets that store incremental changes (CDC) to a table that may appear to contain objects that haven't been accessed, when in fact it indicates that the data has not changed for the table." Future exemption axis.
  • Future Splunk forwarding. Yelp wants to forward SAL to Splunk for engineer-facing troubleshooting, "reducing data volume to meet the target daily ingestion limit and applying shorter retention periods." Not shipped.
  • $0.25 per-bucket-per-job Batch Ops fee dominates for low-SAL-volume buckets; hence the direct-tag-vs-batch-ops dispatch rule.
  • Cross-account catalog-name embedding occasionally causes 0-result queries on wrong-account/region; Yelp accepts it as explicitness-over-brevity cost.

Scope-decision rationale

Tier-3 decisively on-scope. First-party production retrospective on AWS-native S3 object-access logging at fleet scale (TiBs/day). Load-bearing architectural disclosures across: (1) storage / data-engineering (raw-to-Parquet compaction, 85% + 99.99% reductions), (2) query engine (Athena partition projection, shared-resource contention, idempotent self-LEFT-JOIN insertion), (3) lifecycle management (object-tagging for scalable per-object deletion, access-based retention via inventory ⋈ SAL), (4) log-format parsing hazards (user-controlled fields, URL-encoding idiosyncrasy). Zero product-marketing content; concrete code snippets and numbers throughout. Opens Yelp's storage / data-engineering axis on the wiki, distinct from the 2025-02-04 LLM / search-serving-infra axis, the 2025-02-19 financial-systems / data-platform axis, and the 2025-04-15 corporate-security / networking axis.
