Yelp — S3 server access logs at scale¶
Summary¶
Yelp Engineering post (2025-09-26) by the SRE / Storage team on
operationalising S3 Server Access Logs (SAL) at Yelp's fleet
scale — "TiBs of S3 server access logs per day" across many
buckets and multiple AWS accounts. Core architectural move: a
daily raw-to-Parquet compaction job run on
Tron (Yelp's in-house batch processing system)
that converts the continuous stream of tiny raw-text SAL objects
into a small number of columnar Parquet
objects queried via Athena. Reported
compaction outcomes: 85% storage reduction and 99.99%
object-count reduction. Once the data is in Parquet, a second-
tier access-based table joins S3 Inventory with a week of SAL
to determine unused prefixes, driving
access-based object deletion
at prefix granularity. Load-bearing subsystem disclosures
span log-format partitioning (S3
PartitionedPrefix + EventTime delivery vs the default flat
SimplePrefix), Glue partition
projection with enum type (vs managed partitions' MSCK
REPAIR overhead), idempotent
Athena insertions via LEFT JOIN ... target.requestid IS NULL,
object-tagging-
driven lifecycle expiration (the only scalable per-object delete
primitive when per-object DELETE API calls don't scale),
Athena's shared-
resource TooManyRequestsException contention model,
user-controlled log fields
as a regex hazard (double-quote injection, shell injection, and
SQLi payloads in request_uri / user_agent / referrer break
naive regexes), and the
S3 key URL-encoding
idiosyncrasy (most operations encode keys twice; lifecycle
operations encode once; naive url_decode(url_decode(key)) can
over-decode user-controlled percent characters in keys). Canonical
first-party retrospective on running AWS-native S3 object-access
logging at production scale — every paragraph names a concrete
operational problem and the fix.
Key takeaways¶
-
Raw SAL at scale is unqueryable. Continuous stream of tiny text objects at the destination bucket → S3 API rate-limits on Athena reads → "impossible to query in Athena due to S3 API rate limits. As there's no way to slow down reads from Athena, data partitioning is one of the ways to solve the issue." Compaction to Parquet yields 85% storage reduction + 99.99% object-count reduction. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
CloudTrail Data Events are the other option — and orders of magnitude more expensive. "AWS provides a ready solution to log object-level events as CloudTrail Data Events, but you are charged per data event so it's substantially more expensive: $1 per million data events — that could be orders of magnitude higher!" SAL-to-Parquet compaction is Yelp's cost-efficient alternative to CloudTrail for object-level access logging. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
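As a back-of-the-envelope illustration (the daily event count below is a hypothetical assumption, not a number from the post), the per-event CloudTrail charge compounds quickly at fleet scale:

```python
# Hypothetical illustration: CloudTrail Data Events bill $1 per million
# events, so a fleet serving one billion object-level events per day
# would pay $1,000/day in data-event charges alone -- before storage,
# delivery, or query costs.
events_per_day = 1_000_000_000   # assumed volume (not from the post)
cost_per_million = 1.00          # $1 per million data events
daily_cost = events_per_day / 1_000_000 * cost_per_million
```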
-
PartitionedPrefix + EventTime delivery replaces SimplePrefix. The default SAL TargetObjectKeyFormat is SimplePrefix ([TargetPrefix][YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString]), a flat namespace that accumulates to Athena-unqueryable volumes. PartitionedPrefix produces [TargetPrefix][SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/..., partitioned for Athena pruning. Yelp made it the Terraform module default and migrated existing targets. Delivery option: EventTime ("attributes the log to the event time") over the alternative log-arrival-time option. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Glue partition projection with enum type beats managed partitions. Managed partitions require MSCK REPAIR or ALTER TABLE to refresh and suffer query-planning latency as the partition count grows. Partition projection derives partition names from the known prefix template ('storage.location.template'='s3://<dest>/0123.../${bucket_name}/${timestamp}'). Yelp picked enum for bucket_name (enumerated; the list of all buckets is kept in Glue by a Lambda that reads an SQS queue populated by EventBridge rules) plus day granularity for timestamp ("encompasses an entire day, that is yyyy/MM/dd, which accelerates query pruning time because we typically query a day's worth of logs" — finer granularity would cause over-partitioning). enum cap: 1,000,000 partitions over long time windows unless constrained in WHERE. Access-based tables use the injected type instead (the user must name the value in WHERE). (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Regex parsing of SAL is hostile-input territory. The first seven fields (file_bucket, remoteip, requester, requestid, operation, key, size — roughly) are not user-controlled; the later fields (request_uri, referrer, user_agent) are arbitrary bytes, unencoded. "Any regex pattern for parsing S3 server access logs can be broken by a counter example that includes delimiters." Concrete hostile cases disclosed: a curl -H 'Referer: "a b" "c"' produces log fields with unmatched quote nesting; SQLi probes and shell injection ("() { ignored; }; echo Content-Type: ...") in user_agent break naive regexes. Yelp's workaround: wrap the user-controlled fields in an optional non-capturing group (?:<rest>)? and always accept the first seven fields; the regex ends in .*$ to tolerate future SAL schema additions. An empty row in the parsed output is the failure signal — "if a row is empty, then the regex has failed to parse that row — so don't ignore those!" (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Idempotent Athena insertion via self-
LEFT JOIN. Yelp's canonical insertion shape:
INSERT INTO "catalog_target"."database"."table_region"
SELECT source.*
FROM "catalog_source"."database"."table_region" source
LEFT JOIN "catalog_target"."database"."table_region" target
ON target.requestid = source.requestid
AND target.bucket_name = 'foo-bucket'
AND target.timestamp = '2024/01/01'
WHERE source.bucket_name = 'foo-bucket'
AND source.timestamp = '2024/01/01'
AND target.requestid IS NULL
Duplicating the bucket_name / timestamp filters in both
ON and WHERE is "prudent because queries start taking
longer to run after each insertion." For tables without a
unique column, Yelp checksums row values. Net property: the
compaction job is retry-safe — any failure is recovered by
re-running over a day's data with no duplicate writes.
Canonicalised as patterns/idempotent-athena-insertion-via-left-join.
(Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
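The retry-safety of this shape can be illustrated with a toy in-memory model (a Python sketch, not the Athena SQL; the requestid dict key mirrors the table's unique column):

```python
# Toy model of the idempotent insertion: the anti-join (LEFT JOIN ...
# WHERE target.requestid IS NULL) means a re-run over the same day's
# source rows inserts nothing new, so any failed run can be retried
# with no duplicate writes.
def idempotent_insert(target, source_rows):
    existing = {row["requestid"] for row in target}  # target side of the join
    new_rows = [r for r in source_rows if r["requestid"] not in existing]
    target.extend(new_rows)
    return len(new_rows)
```

Re-running idempotent_insert over the same source rows returns 0 and leaves the target unchanged, which is exactly the property that makes the compaction job safe to re-run over a day's data.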
-
Object tagging is the only scalable per-object deletion primitive. "That's the only scalable way to delete per object without needing to modify lifecycle policy each time or issuing delete API calls." Shape: after a successful Athena insertion, tag every source SAL object for expiration; the bucket's lifecycle policy deletes all tagged objects. Per-object DELETE API requests don't scale even with batch delete. S3 Batch Operations doesn't support Delete as an action — but it does support PutObjectTagging, so Yelp uses Batch Operations to issue the tagging at scale. Caveat: S3 Batch Operations carries a flat $0.25 per job per bucket fee — the biggest cost contributor for buckets with low SAL volume. Yelp's policy: "we directly tag objects for most buckets that generate low order access logs" (bypassing Batch Ops) and only use Batch Ops on the high-volume buckets. Canonicalised as patterns/object-tagging-for-lifecycle-expiration. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Straggler SAL logs are ignored. SAL delivery is best-effort (see concepts/best-effort-log-delivery) and Yelp measured: "less than 0.001% of SAL logs arrive more than a couple of days later. For example, we witnessed some logs arriving 9 days later." Design choice: ignore stragglers so the compaction job delivers business value in a timely fashion. They can be inserted later, after the tagged source objects have been expired. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Access-based retention via inventory ⋈ SAL. Weekly access-based tables compute prefix access by joining S3 Inventory with a week of SAL. The prefix is defined as "only immediate keys under it, segmented by slash (/), and removing trailing slash" — array_join(slice(split(rtrim(key, '/'), '/'), 1, cardinality(split(rtrim(key, '/'), '/')) - 1), '/'). Equality joins beat LIKE joins by a dramatic margin — "in a query over ~70,000 rows, simply switching from a LIKE operator to an equality (=) operator reduced execution time from over 5 minutes to just 2 seconds—a dramatic improvement" — because LIKE forces Athena's distributed planner into a broadcast cross-join. Yelp's comfort with best-effort SAL delivery is predicated on deletion at prefix granularity: "deletions are based on prefixes—so missing all logs for a given prefix would only occur for truly inactive data." Canonicalised as patterns/s3-access-based-retention. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Athena is a shared resource — retry on TooManyRequestsException. "Athena is a shared resource so a query may be killed any time due to the cluster being overloaded, or occasionally hitting S3 API limits. Given our scale, we would encounter such errors on a regular basis." Yelp addressed this by reducing concurrent queries + requesting per-account / per-region DML query limit increases. Queue processing is modelled as "asynchronous functions" to exploit the per-bucket isolation of insertions. Canonicalised as concepts/athena-shared-resource-contention. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
S3 Batch Operations manifests: header row + URL-encoding gotchas. Athena query results include a header row that S3 Batch Operations interprets as a bucket name "causing job failures" — Yelp recreates manifest files in memory without headers. Object keys must be URL-encoded to
quote_plus(key, safe="/") equivalence before passing to Batch Ops. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
S3 key URL-encoding idiosyncrasy. Most operations double-url-encode the key field in SAL logs; "for some operations, such as BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT, the key is url-encoded once" — Yelp's working theory is that lifecycle operations come from an internal service. A naive fallback to single decoding is unsafe: a logged key foo-%2525Y could be the double-encoded form of foo-%Y or the single-encoded form of foo-%25Y, so only the operation type disambiguates. Canonicalised as concepts/url-encoding-idiosyncrasy-s3-keys. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
-
Glue Data Catalog cross-account pattern. A single "querying" AWS account registers Glue Data Catalogs from every source account (via
ListDataCatalogs), letting one IAM role run queries FROM "catalog"."database"."table_region" across the fleet without role-pivoting. Requires cross-account Glue permissions + S3 read on each source bucket. Yelp embeds account and region into table / database names "to be explicit; that occasionally backfires, when a query produces 0 results and you realize that the bucket exists in a different account or region!" (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Operational numbers canonicalised¶
- TiBs of S3 server access logs generated per day (fleet-wide, scale-setter for the compaction problem).
- 85% storage reduction + 99.99% object-count reduction via raw-text-to-Parquet compaction (headline outcome).
- < 0.001% of SAL logs arrive more than a couple of days late (measured straggler tail; max observed ~9 days).
- 1M partition cap on enum-type partition projection if not constrained in WHERE.
- $0.25 flat per-bucket-per-job S3 Batch Operations fee — the biggest cost contributor for low-SAL-volume buckets (drives the direct-tag vs batch-ops dispatch rule).
- $1 per million data events for CloudTrail Data Events (the rejected alternative, cited verbatim as "orders of magnitude higher!").
- ~70,000 rows / 5 min → 2 sec — equality-join vs LIKE-join speedup disclosed for the inventory ⋈ SAL access-based-table computation.
- 11 Nov 2023 — AWS shipped date-based partitioning for SAL (the unblocker Yelp cites for why object-level logging became tractable).
Systems¶
- systems/aws-s3 — the storage layer being monitored + the destination of the logs + the substrate of both raw and compacted data. Same-region same-account destination bucket constraint surfaced ("must exist in the same account, as well as region to eliminate cross-region data charges") and S3 Batch Operations + S3 Inventory added as named S3-family systems.
- systems/amazon-athena — the query engine over Parquet-compacted SAL; the canonical shared-resource contention surface of this post; $path pseudo-column for extracting S3 object locations; GetQueryRuntimeStatistics API for post-query row counts.
- systems/aws-glue — Data Catalog + table metadata store; partition projection (enum / injected); Terraform-managed cross-account catalog registration.
- systems/apache-parquet — the compacted-output format; "It includes metadata that allows skipping row groups or pages based on filter criteria which reduces data scanned."
- systems/s3-batch-operations — the per-bucket batch job primitive used for PutObjectTagging fanout. Flat $0.25 per job per bucket; doesn't support Delete.
- systems/s3-inventory — the daily object-listing used in the access-based retention join.
- systems/yelp-s3-sal-pipeline — the named Yelp system (this post's subject): daily Tron job that compacts SAL to Parquet, tags source objects for lifecycle expiration, and maintains access-based tables for prefix retention.
- systems/tron — Yelp's in-house batch processing system (open-source: github.com/Yelp/Tron); orchestrates the daily compaction and the weekly access-based-table build.
Concepts¶
- concepts/s3-server-access-logs — the AWS primitive ("best-effort" delivery; log line format with 25+ fields; destination-bucket-in-same-region-and-account constraint; SimplePrefix vs PartitionedPrefix vs EventTime delivery).
- concepts/partition-projection — Glue/Athena partitioning primitive that avoids managed-partition MSCK REPAIR / metastore-lookup overhead; enum vs injected types; 1M partition cap on enum if unconstrained.
- concepts/best-effort-log-delivery — the reliability model Yelp accepts (measured < 0.001% stragglers at the 2-day threshold) plus the design knobs that make it acceptable (retention windows longer than the measured tail; deletion at prefix granularity).
- concepts/athena-shared-resource-contention — Athena's shared-cluster model; TooManyRequestsException; per-account / per-region DML concurrency quota.
- concepts/user-controlled-log-fields — the regex-hazard shape: when arbitrary user input (HTTP headers, URIs) lands unescaped in log fields, every regex has a breaking counter-example.
- concepts/url-encoding-idiosyncrasy-s3-keys — most S3 operations double-url-encode keys in SAL; lifecycle operations single-encode; naive url_decode(url_decode(key)) can over-decode user-controlled percent characters.
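The decode ambiguity can be demonstrated with Python's urllib (a sketch using the post's own example key):

```python
from urllib.parse import unquote

# The logged key 'foo-%2525Y' is ambiguous: it is the double-encoded
# form of 'foo-%Y' (most operations) but also the single-encoded form
# of 'foo-%25Y' (lifecycle operations) -- only the operation type in
# the log line tells you how many times to decode.
logged = "foo-%2525Y"
single = unquote(logged)   # actual key if a lifecycle op wrote the line
double = unquote(single)   # actual key if a doubly-encoding op wrote it
```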
Patterns¶
- patterns/raw-to-columnar-log-compaction — convert a continuous stream of tiny raw-text log objects into a small number of columnar objects (Parquet) in a daily batch. Compression + columnar predicate-pushdown + object-count collapse drive both cost and query-performance wins. Canonical Yelp instance: 85% storage + 99.99% object-count reduction.
- patterns/object-tagging-for-lifecycle-expiration — to delete millions of objects without modifying the lifecycle policy per batch or issuing per-object DELETE, tag each object and let the bucket's tag-based lifecycle policy expire them. Composes with S3 Batch Operations' PutObjectTagging for high-volume buckets and direct per-object tagging for low-volume buckets (to avoid the flat $0.25 per-job fee).
- patterns/idempotent-athena-insertion-via-left-join — make an INSERT ... SELECT retry-safe by self-LEFT-JOINing on the target's unique column (requestid here) and filtering target.requestid IS NULL; duplicate partition-column filters in both ON and WHERE to keep the join planner pruning correctly as the target grows.
- patterns/projection-partitioning-over-managed-partitions — choose Glue partition projection over managed partitions when you have a known prefix template; avoids MSCK REPAIR / ALTER TABLE refresh churn and metastore-lookup query-planning latency as partition count grows.
- patterns/s3-access-based-retention — join S3 Inventory with SAL at prefix granularity to identify unused prefixes; delete at prefix granularity so best-effort SAL delivery doesn't over-delete active data. Cheaper + safer than blanket TTL-based lifecycle on multi-tenant buckets.
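A minimal sketch of the tag-and-expire shape (the tag key/value and rule ID below are illustrative assumptions, not Yelp's actual names; the dict shapes follow the S3 PutObjectTagging and bucket-lifecycle request bodies):

```python
# Assumed tag used to mark compacted source objects for expiration.
EXPIRE_TAG = {"Key": "expire-after-compaction", "Value": "true"}

def tagging_payload():
    # Applied per source object, either directly or fanned out via
    # S3 Batch Operations' PutObjectTagging for high-volume buckets.
    return {"TagSet": [EXPIRE_TAG]}

def lifecycle_rule(days=1):
    # One static bucket lifecycle rule expires every tagged object --
    # no per-batch policy edits, no per-object DELETE calls.
    return {
        "ID": "expire-compacted-sal",
        "Status": "Enabled",
        "Filter": {"Tag": EXPIRE_TAG},
        "Expiration": {"Days": days},
    }
```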
Architecture highlights (verbatim content references)¶
Compaction architecture¶
"Tron, Yelp's in-house batch processing system, runs daily and converts the previous day's logs to parquet format via Athena insertion queries. A Lambda is used to enumerate all possible bucket names because we opted to use a projection partition using
enumtype."
Queue processing (compaction)¶
Steps per bucket, per day:
- Location query — retrieve source-object S3 paths for later tagging: SELECT split_part(substr("$path", 6), '/', 1), substr("$path", 7 + length(...)) FROM ... GROUP BY "$path". Locations are gathered first "because it saves us from expiring uninserted logs."
- Idempotent insertion (LEFT JOIN shape above).
- Count verification — compare inserted rows against the source row count via GetQueryRuntimeStatistics (for initial insertions) or a distinct count on the target (for re-insertions, because duplicates won't enter).
- Tag for lifecycle expiration — build an S3 Batch Operations manifest (header-stripped, quote_plus(key, safe="/") URL-encoded) OR directly tag objects (for low-volume buckets, to avoid the $0.25 per-job fee).
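The manifest preparation in the tagging step might look roughly like this (a sketch; the (bucket, key) row shape and helper name are assumptions):

```python
from urllib.parse import quote_plus

# Rebuild a Batch Operations CSV manifest in memory: drop the Athena
# header row (Batch Ops would read it as a bucket name and fail the
# job) and URL-encode each key, keeping '/' literal.
def build_manifest(athena_rows):
    data_rows = athena_rows[1:]  # strip the header row
    return [f"{bucket},{quote_plus(key, safe='/')}" for bucket, key in data_rows]
```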
Access-based retention flow¶
Weekly:
- Join S3 Inventory with a week of SAL at prefix granularity (equality join on the extracted prefix; not LIKE).
- Build an access-based table keyed on (bucket, prefix, last_access).
- Join back to S3 Inventory to translate the unused prefixes to full S3 object keys for the Batch Ops manifest.
- Tag-and-expire unused objects via the same patterns/object-tagging-for-lifecycle-expiration pipeline.
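The prefix extraction used in the join (the quoted Presto array_join / slice expression) has a direct Python equivalent, sketched here:

```python
# Immediate-parent prefix of an S3 key: strip any trailing slash,
# split on '/', drop the final segment, re-join -- the Python twin of
# array_join(slice(split(rtrim(key, '/'), '/'), 1,
#            cardinality(split(rtrim(key, '/'), '/')) - 1), '/')
def prefix_of(key):
    parts = key.rstrip("/").split("/")
    return "/".join(parts[:-1])
```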
Caveats disclosed¶
- Straggler SAL is ignored. Design choice — re-insertion after tagged objects have expired would require re-fetching. Acceptable because deletion is at prefix granularity.
- No manual intervention target — the post emphasises the robustness work: "Most of the effort was spent on making the process robust so that no manual intervention is needed for them."
- Regex breakage is inevitable. "Essentially, any regex pattern for parsing S3 server access logs can be broken by a counter example that includes delimiters." Empty parsed rows are the failure signal.
- Backup buckets confound access-based deletion. "Backup buckets that store incremental changes (CDC) to a table that may appear to contain objects that haven't been accessed, when in fact it indicates that the data has not changed for the table." Future exemption axis.
- Future Splunk forwarding. Yelp wants to forward SAL to Splunk for engineer-facing troubleshooting, "reducing data volume to meet the target daily ingestion limit and applying shorter retention periods." Not shipped.
- $0.25 per-bucket-per-job Batch Ops fee dominates for low-SAL-volume buckets; hence the direct-tag-vs-batch-ops dispatch rule.
- Cross-account catalog-name embedding occasionally causes 0-result queries on wrong-account/region; Yelp accepts it as explicitness-over-brevity cost.
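The tolerant-parsing workaround for the regex caveat (accept the fixed non-user-controlled prefix, swallow the user-controlled tail in an optional non-capturing group, treat empty output as failure) can be sketched in Python — the seven-field prefix below is a simplified stand-in for the real 25+-field SAL line:

```python
import re

# Capture only the non-user-controlled leading fields; everything after
# them (request_uri, referrer, user_agent, future schema additions) is
# swallowed by an optional non-capturing tail, so hostile delimiters in
# user input cannot break the match.
SAL_RE = re.compile(
    r"^(?P<bucket>\S+) (?P<remoteip>\S+) (?P<requester>\S+) "
    r"(?P<requestid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<size>\S+)"
    r"(?: (?P<rest>.*))?$"
)

def parse_sal(line):
    m = SAL_RE.match(line)
    # A failed match is the signal that the line is malformed --
    # surface it instead of silently dropping the row.
    return m.groupdict() if m else None
```

Unmatched quote nesting in the tail (e.g. a referrer logged as "a b" "c") still parses, because only the whitespace-delimited fields before the tail constrain the match.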
Scope-decision rationale¶
Tier-3 decisively on-scope. First-party production retrospective on AWS-native S3 object-access logging at fleet scale (TiBs/day). Load-bearing architectural disclosures across: (1) storage / data-engineering (raw-to-Parquet compaction, 85% + 99.99% reductions), (2) query engine (Athena partition projection, shared-resource contention, idempotent self-LEFT-JOIN insertion), (3) lifecycle management (object-tagging for scalable per-object deletion, access-based retention via inventory ⋈ SAL), (4) log-format parsing hazards (user-controlled fields, URL-encoding idiosyncrasy). Zero product-marketing content; concrete code snippets and numbers throughout. Opens Yelp's storage / data-engineering axis on the wiki, distinct from the 2025-02-04 LLM / search-serving-infra axis, the 2025-02-19 financial-systems / data-platform axis, and the 2025-04-15 corporate-security / networking axis.
Source¶
- Original: https://engineeringblog.yelp.com/2025/09/s3-server-access-logs-at-scale.html
- Raw markdown:
raw/yelp/2025-09-26-s3-server-access-logs-at-scale-2cb05adb.md
Related¶
- companies/yelp
- systems/aws-s3
- systems/amazon-athena
- systems/aws-glue
- systems/apache-parquet
- systems/s3-batch-operations
- systems/s3-inventory
- systems/yelp-s3-sal-pipeline
- systems/tron
- concepts/s3-server-access-logs
- concepts/partition-projection
- concepts/best-effort-log-delivery
- concepts/athena-shared-resource-contention
- concepts/user-controlled-log-fields
- concepts/url-encoding-idiosyncrasy-s3-keys
- patterns/raw-to-columnar-log-compaction
- patterns/object-tagging-for-lifecycle-expiration
- patterns/idempotent-athena-insertion-via-left-join
- patterns/projection-partitioning-over-managed-partitions
- patterns/s3-access-based-retention