
Yelp S3 SAL pipeline

Yelp's internal pipeline for operationalising S3 Server Access Logs (SAL) at fleet scale. The pipeline ingests "TiBs of S3 server access logs per day", compacts the raw-text objects into columnar Parquet on a daily cadence, exposes the result via Athena for debugging + cost attribution + incident response, and runs a weekly access-based retention job that joins S3 Inventory with a week of SAL to tag unused objects for lifecycle expiration.

Role

Canonical Yelp system for S3-native object-access logging at scale. Preferred over AWS's per-event-priced CloudTrail Data Events ("orders of magnitude higher!" per the post) on cost grounds. Consumed by:

  • Permission-debugging queries — bucket + timestamp + key slices to confirm identity / access results.
  • Cost attribution — regexp_extract(requester, '.*assumed-role/([^/]+)', 1) group-by to find the IAM role generating the call volume.
  • Incident response — slice by IP / user-agent / requester role to size a compromise.
  • Data retention — access-based deletion of unused prefixes.
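The cost-attribution slice above is a plain Athena GROUP BY over the extracted role. A minimal Python sketch of the same extraction and grouping, with hypothetical requester ARNs standing in for SAL records (the real query runs regexp_extract server-side in Athena):

```python
import re
from collections import Counter

# Same capture as the Athena call: regexp_extract(requester, '.*assumed-role/([^/]+)', 1)
ROLE_RE = re.compile(r".*assumed-role/([^/]+)")

def requester_role(requester: str):
    """Pull the IAM role name out of an assumed-role requester ARN."""
    m = ROLE_RE.match(requester)
    return m.group(1) if m else None

# Hypothetical requester values as they would appear in SAL records.
requesters = [
    "arn:aws:sts::123456789012:assumed-role/batch-worker/i-0abc",
    "arn:aws:sts::123456789012:assumed-role/batch-worker/i-0def",
    "arn:aws:sts::123456789012:assumed-role/web-frontend/i-0123",
    "anonymous",  # unauthenticated requests carry no role to attribute
]

# Group by role, mirroring the Athena GROUP BY, to find the noisy caller.
by_role = Counter(r for r in (requester_role(x) for x in requesters) if r)
print(by_role.most_common(1))  # → [('batch-worker', 2)]
```

The `[^/]+` capture stops at the session-name slash, so instances of the same role aggregate together.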

(Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Architecture

  1. Log generation — each monitored bucket has a destination bucket for SAL, same-account + same-region (AWS constraint; also eliminates cross-region data-transfer cost). Format: PartitionedPrefix with EventTime delivery — Yelp migrated off the default SimplePrefix after the accumulated flat namespace became Athena-unqueryable.
  2. Daily compaction on Tron — converts the previous day's raw SAL objects to Parquet via Athena INSERT queries. Idempotent via self-LEFT-JOIN on requestid (patterns/idempotent-athena-insertion-via-left-join).
  3. Lambda enumerator — keeps the Glue enum-type partition list for bucket_name up-to-date. Reads from an SQS queue populated by periodic EventBridge rules; queue is necessary "to avoid concurrent reads and writes" on the partition list.
  4. Object-tagging lifecycle — after successful insertion, the source raw SAL objects are tagged (patterns/object-tagging-for-lifecycle-expiration); each destination bucket has a tag-based lifecycle policy that deletes tagged objects. High-volume buckets use S3 Batch Operations PutObjectTagging; low-volume buckets bypass Batch Ops (flat $0.25 / bucket / job) and tag directly.
  5. Weekly access-based table — joins S3 Inventory with a week of SAL at prefix granularity (equality-join on the extracted prefix, not LIKE — "~70,000 rows… from over 5 minutes to just 2 seconds"). Drives access-based retention of unused objects.
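The idempotency trick in step 2 can be sketched with SQLite standing in for Athena (table and column names are illustrative): the INSERT … SELECT left-joins the candidate rows against the destination table on requestid and keeps only rows with no match, so re-running the same day's job inserts nothing twice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE raw_sal (requestid TEXT, bucket TEXT, key TEXT);
    CREATE TABLE sal_parquet (requestid TEXT, bucket TEXT, key TEXT);
    INSERT INTO raw_sal VALUES
        ('r1', 'logs', 'a/1'), ('r2', 'logs', 'a/2'), ('r3', 'logs', 'b/1');
""")

INSERT_MISSING = """
    INSERT INTO sal_parquet
    SELECT raw.requestid, raw.bucket, raw.key
    FROM raw_sal AS raw
    LEFT JOIN sal_parquet AS dst ON raw.requestid = dst.requestid
    WHERE dst.requestid IS NULL  -- anti-join: only rows not yet inserted
"""

con.execute(INSERT_MISSING)   # first run: inserts all 3 rows
con.execute(INSERT_MISSING)   # re-run: the anti-join filters everything out
count = con.execute("SELECT COUNT(*) FROM sal_parquet").fetchone()[0]
print(count)  # → 3, not 6: the job is safe to retry
```

The same anti-join shape works in Athena SQL; the LEFT JOIN plus IS NULL filter is what makes a crashed-and-retried compaction run converge instead of duplicating.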

Glue / Athena schema

  • Partitioning: partition projection via 'storage.location.template'='s3://<dest>/<account>/<region>/${bucket_name}/${timestamp}'. bucket_name uses enum type (enumerated, kept in sync by the Lambda). timestamp uses yyyy/MM/dd granularity — "a day's worth of logs" is the common slice, finer granularity would cause over-partitioning.
  • Regex: input.regex — the AWS-documented SAL regex, with Yelp's modification: the user-controlled tail fields are wrapped in an optional non-capturing group (?:<rest>)? so the first seven non-user-controlled fields always parse.
  • Cross-account: a single querying AWS account registers Glue Data Catalogs from every source account via ListDataCatalogs. Table / database names embed account + region — "occasionally backfires, when a query produces 0 results and you realize that the bucket exists in a different account or region."
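The regex hardening above can be illustrated with a much-simplified stand-in (the real AWS-documented SAL regex is far longer): wrapping the tail in an optional non-capturing group means a truncated line, or one whose user-controlled fields break the pattern, still yields the trusted leading fields instead of failing to parse at all.

```python
import re

# Simplified stand-in for the SAL regex: a few leading fields, then the
# user-controlled tail wrapped in an optional non-capturing group so a
# malformed tail never costs us the fields we trust.
SAL_RE = re.compile(
    r"(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\]"
    r"(?: (?P<rest>.*))?$"
)

good = ('79a5 my-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
        'arn:aws:iam::123456789012:user/x REQID REST.GET.OBJECT "key with spaces"')
truncated = '79a5 my-bucket [06/Feb/2019:00:00:38 +0000]'

m1 = SAL_RE.match(good)
m2 = SAL_RE.match(truncated)
print(m1.group("bucket"), m2.group("bucket"))  # both lines parse: my-bucket my-bucket
print(m2.group("rest"))  # → None: the optional tail simply did not match
```

Without the `(?: …)?` wrapper, the truncated line would fail the whole match and drop the record.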

Scale and outcomes

  • TiBs of SAL per day (fleet-wide).
  • 85% storage reduction post-compaction.
  • 99.99% object-count reduction post-compaction.
  • < 0.001% of SAL arrives > 2 days late (measured straggler tail; acceptable because retention windows are much longer and deletion is at prefix granularity).

Comparison surfaces

  • vs CloudTrail Data Events — $1 per million events, "orders of magnitude higher". Yelp's compaction-of-native-SAL is a cost-efficient substitute for object-level tracking.
  • vs managed-partition Hive metastore — Glue partition projection avoids MSCK REPAIR / ALTER TABLE refresh churn and metastore-lookup query-planning latency. (patterns/projection-partitioning-over-managed-partitions)
  • vs per-object DELETE API — object-tagging + lifecycle policy is the only scalable per-object-deletion primitive at Yelp's object-count scale (patterns/object-tagging-for-lifecycle-expiration).

Known limits / caveats

  • Straggler SAL logs are ignored after the source objects expire; could be re-inserted with extra work.
  • enum-type partition projection has a 1M partition cap if WHERE doesn't constrain it — Yelp uses injected type for access-based tables where the caller specifies bucket + date.
  • Backup buckets (CDC targets for unchanged tables) can look "unused" to access-based retention; flagged as future exemption axis.
  • Athena is shared; TooManyRequestsException retry is a normal part of the compaction job's operation loop (concepts/athena-shared-resource-contention).
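The retry discipline in the last caveat can be sketched as a generic jittered-backoff wrapper. The exception class and delays here are illustrative stand-ins; the real job would catch Athena's TooManyRequestsException via its AWS client:

```python
import random
import time

class TooManyRequestsException(Exception):
    """Stand-in for the throttling error the AWS client raises."""

def run_with_backoff(query_fn, max_attempts=5, base_delay=1.0):
    """Retry a throttled Athena call with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except TooManyRequestsException:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the throttle to the caller
            # Sleep up to base * 2^attempt, with jitter so retries don't stampede.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Hypothetical flaky query: throttled twice, then succeeds.
calls = {"n": 0}
def flaky_insert():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequestsException()
    return "INSERT OK"

result = run_with_backoff(flaky_insert, base_delay=0.01)
print(result)  # → INSERT OK, after two throttled attempts
```

Treating the throttle as a normal loop outcome rather than a failure is what lets the daily compaction job share Athena with everyone else.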

Future work

  • Forward SAL to Splunk for engineer-facing troubleshooting (requires volume reduction + shorter retention).

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale — full architectural disclosure of the pipeline, including Parquet-compaction numbers, Glue partition-projection choice, idempotent insertion shape, object-tagging deletion discipline, SAL regex hazards (user-controlled fields; URL-encoding idiosyncrasy on BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT operations), and the access-based retention join over inventory + SAL.