Yelp S3 SAL pipeline
Yelp's internal pipeline for operationalising S3 Server Access Logs (SAL) at fleet scale. The pipeline ingests "TiBs of S3 server access logs per day", compacts the raw-text objects into columnar Parquet on a daily cadence, exposes the result via Athena for debugging + cost attribution + incident response, and runs a weekly access-based retention job that joins S3 Inventory with a week of SAL to tag unused objects for lifecycle expiration.
Role
Canonical Yelp system for S3-native object-access logging at scale. Preferred over AWS's per-event-priced CloudTrail Data Events ("orders of magnitude higher!" per the post) on cost grounds. Consumed by:
- Permission-debugging queries — bucket + timestamp + key slices to confirm identity / access results.
- Cost attribution — a `regexp_extract(requester, '.*assumed-role/([^/]+)', 1)` group-by to find the IAM role generating the call volume.
- Incident response — slice by IP / user-agent / requester role to size a compromise.
- Data retention — access-based deletion of unused prefixes.
(Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
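The cost-attribution slice can be sketched in plain Python: the same extraction Athena's `regexp_extract` performs, followed by a group-by over requester ARNs. The role names and ARNs below are illustrative, not Yelp's.

```python
import re
from collections import Counter

# Same pattern as the Athena query:
# regexp_extract(requester, '.*assumed-role/([^/]+)', 1)
ROLE_RE = re.compile(r'.*assumed-role/([^/]+)')

# Illustrative requester ARNs as they appear in the SAL `requester` field.
requesters = [
    "arn:aws:sts::123456789012:assumed-role/batch-writer/i-0abc",
    "arn:aws:sts::123456789012:assumed-role/batch-writer/i-0def",
    "arn:aws:sts::123456789012:assumed-role/web-reader/i-0123",
]

counts = Counter()
for arn in requesters:
    m = ROLE_RE.match(arn)
    if m:  # requesters that are not assumed roles fall through
        counts[m.group(1)] += 1

print(counts.most_common())  # [('batch-writer', 2), ('web-reader', 1)]
```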
Architecture

- Log generation — each monitored bucket has a destination bucket for SAL, same-account + same-region (an AWS constraint; also eliminates cross-region data-transfer cost). Format: `PartitionedPrefix` with `EventTime` delivery — Yelp migrated off the default `SimplePrefix` after the accumulated flat namespace became Athena-unqueryable.
- Daily compaction on Tron — converts the previous day's raw SAL objects to Parquet via Athena INSERT queries. Idempotent via self-LEFT-JOIN on `requestid` (patterns/idempotent-athena-insertion-via-left-join).
- Lambda enumerator — keeps the Glue enum-type partition list for `bucket_name` up to date. Reads from an SQS queue populated by periodic EventBridge rules; the queue is necessary "to avoid concurrent reads and writes" on the partition list.
- Object-tagging lifecycle — after successful insertion, the source raw SAL objects are tagged (patterns/object-tagging-for-lifecycle-expiration); each destination bucket has a tag-based lifecycle policy that deletes tagged objects. High-volume buckets use S3 Batch Operations `PutObjectTagging`; low-volume buckets bypass Batch Ops (flat $0.25 / bucket / job) and tag directly.
- Weekly access-based table — joins S3 Inventory with a week of SAL at prefix granularity (equality-join on the extracted prefix, not `LIKE` — "~70,000 rows… from over 5 minutes to just 2 seconds"). Drives access-based retention of unused objects.
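The idempotent-insert shape of the daily compaction job, inserting only rows whose `requestid` is not already in the destination via a LEFT-JOIN anti-join, can be sketched with SQLite standing in for Athena. Table and column names here are assumptions; the real pipeline runs this as an Athena INSERT over raw SAL and a Parquet table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_sal     (requestid TEXT, bucket TEXT, key TEXT);
CREATE TABLE sal_parquet (requestid TEXT, bucket TEXT, key TEXT);
""")
conn.executemany("INSERT INTO raw_sal VALUES (?, ?, ?)",
                 [("r1", "b", "k1"), ("r2", "b", "k2")])

# Self-LEFT-JOIN on requestid: rows already present in the destination
# join successfully and are filtered out, so re-runs insert nothing.
INSERT_SQL = """
INSERT INTO sal_parquet
SELECT raw.requestid, raw.bucket, raw.key
FROM raw_sal raw
LEFT JOIN sal_parquet dst ON raw.requestid = dst.requestid
WHERE dst.requestid IS NULL
"""
conn.execute(INSERT_SQL)
conn.execute(INSERT_SQL)  # second run is a no-op
count = conn.execute("SELECT COUNT(*) FROM sal_parquet").fetchone()[0]
print(count)  # 2
```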
Glue / Athena schema

- Partitioning — partition projection via `'storage.location.template'='s3://<dest>/<account>/<region>/${bucket_name}/${timestamp}'`. `bucket_name` uses the `enum` type (enumerated, kept in sync by the Lambda). `timestamp` uses `yyyy/MM/dd` granularity — "a day's worth of logs" is the common slice; finer granularity would cause over-partitioning.
- Regex — `input.regex` is the AWS-documented SAL regex, with Yelp's modification: the user-controlled tail fields are wrapped in an optional non-capturing group `(?:<rest>)?` so the first seven non-user-controlled fields always parse.
- Cross-account — a single querying AWS account registers Glue Data Catalogs from every source account via `ListDataCatalogs`. Table / database names embed account + region — "occasionally backfires, when a query produces 0 results and you realize that the bucket exists in a different account or region."
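The optional-tail-group trick can be illustrated with a toy regex (this is not the full AWS SAL regex, and the field layout is simplified): the leading non-user-controlled fields get unconditional capture groups, and everything after them sits inside one optional non-capturing group, so a line whose tail is malformed or absent still yields the leading fields.

```python
import re

# Simplified stand-in for the SAL input.regex: seven leading fields,
# then the user-controlled remainder in an optional non-capturing group.
SAL_RE = re.compile(
    r'^(\S+) (\S+) \[([^\]]+)\] (\S+) (\S+) (\S+) (\S+)'  # leading fields
    r'(?: (.*))?$'                                         # optional tail
)

# Illustrative log line truncated after the operation field.
line = ('79a5 mybucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
        'arn:aws:iam::123456789012:user/alice 3E57427F3EXAMPLE REST.GET.OBJECT')
m = SAL_RE.match(line)
print(m.group(2), m.group(7))  # mybucket REST.GET.OBJECT

# With the tail present, group 8 captures the remainder.
m2 = SAL_RE.match(line + ' /photos/a.jpg "GET /photos/a.jpg HTTP/1.1" 200')
print(m2.group(8))
```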
Scale and outcomes
- TiBs of SAL per day (fleet-wide).
- 85% storage reduction post-compaction.
- 99.99% object-count reduction post-compaction.
- < 0.001% of SAL arrives > 2 days late (measured straggler tail; acceptable because retention windows are much longer and deletion is at prefix granularity).
Comparison surfaces

- vs CloudTrail Data Events — $1 per million events, "orders of magnitude higher". Yelp's compaction of native SAL is a cost-efficient substitute for object-level tracking.
- vs managed-partition Hive metastore — Glue partition projection avoids `MSCK REPAIR` / `ALTER TABLE` refresh churn and metastore-lookup query-planning latency (patterns/projection-partitioning-over-managed-partitions).
- vs per-object `DELETE` API — object-tagging + lifecycle policy is the only scalable per-object-deletion primitive at Yelp's object-count scale (patterns/object-tagging-for-lifecycle-expiration).
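The projection-over-managed-partitions choice boils down to table properties along these lines (a sketch using Athena's documented partition-projection property names; the table name, bucket list, and date range are placeholders, not Yelp's values):

```sql
ALTER TABLE sal SET TBLPROPERTIES (
  'projection.enabled'            = 'true',
  'projection.bucket_name.type'   = 'enum',
  'projection.bucket_name.values' = 'bucket-a,bucket-b',  -- kept in sync by the Lambda enumerator
  'projection.timestamp.type'     = 'date',
  'projection.timestamp.format'   = 'yyyy/MM/dd',
  'projection.timestamp.range'    = '2020/01/01,NOW',
  'storage.location.template'     = 's3://<dest>/<account>/<region>/${bucket_name}/${timestamp}'
);
```

With projection enabled, partitions are computed from the template at query-planning time; no `MSCK REPAIR` or `ALTER TABLE ADD PARTITION` refresh is ever run.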
Known limits / caveats
- Straggler SAL logs are ignored after the source objects expire; could be re-inserted with extra work.
- `enum`-type partition projection has a 1M partition cap if `WHERE` doesn't constrain it — Yelp uses the `injected` type for access-based tables, where the caller specifies bucket + date.
- Backup buckets (CDC targets for unchanged tables) can look "unused" to access-based retention; flagged as a future exemption axis.
- Athena is shared; `TooManyRequestsException` retry is a normal part of the compaction job's operation loop (concepts/athena-shared-resource-contention).
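That operation loop can be sketched generically. The exception class, backoff values, and the stand-in for Athena's query-submission call are all illustrative, not Yelp's code.

```python
import time

class TooManyRequestsException(Exception):
    """Stand-in for the throttling error Athena raises under contention."""

def submit_with_backoff(start_query, max_attempts=5, base_delay=0.01):
    """Retry a query submission with exponential backoff on throttling."""
    for attempt in range(max_attempts):
        try:
            return start_query()
        except TooManyRequestsException:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...

# Stand-in for the Athena submission call: throttled twice, then succeeds.
calls = {"n": 0}
def flaky_start_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequestsException()
    return "query-execution-id"

result = submit_with_backoff(flaky_start_query)
print(result)  # query-execution-id
```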
Future work
- Forward SAL to Splunk for engineer-facing troubleshooting (requires volume reduction + shorter retention).
Seen in
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale — full architectural disclosure of the pipeline, including Parquet-compaction numbers, Glue partition-projection choice, idempotent insertion shape, object-tagging deletion discipline, SAL regex hazards (user-controlled fields; URL-encoding idiosyncrasy on `BATCH.DELETE.OBJECT` / `S3.EXPIRE.OBJECT` operations), and the access-based retention join over inventory + SAL.
Related
- systems/tron — batch orchestrator
- systems/aws-s3, systems/amazon-athena, systems/aws-glue, systems/apache-parquet, systems/s3-batch-operations, systems/s3-inventory
- companies/yelp
- concepts/s3-server-access-logs, concepts/partition-projection, concepts/best-effort-log-delivery, concepts/athena-shared-resource-contention
- patterns/raw-to-columnar-log-compaction, patterns/object-tagging-for-lifecycle-expiration, patterns/idempotent-athena-insertion-via-left-join, patterns/s3-access-based-retention, patterns/projection-partitioning-over-managed-partitions