CONCEPT Cited by 1 source

S3 server access logs¶

S3 Server Access Logs (SAL) are the per-bucket access-log primitive Amazon S3 offers: every request against the bucket and its objects (GETs, PUTs, lifecycle expirations, multipart operations, website operations, etc.) generates a log line delivered to a configurable destination bucket. SAL is the cheapest AWS-native way to do object-level access tracing; the pricier alternative is CloudTrail Data Events at "$1 per million data events".

Definition¶

"S3 server access logs contain API operations performed on a bucket, as well as its objects. Logging is enabled per S3 bucket by providing a storage destination; another S3 bucket is recommended due to circular logging. Once the resource policy allows putting objects, logs will start arriving at the configured destination." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Line format¶

Each log line is space-separated with 25+ positional fields. The first seven are AWS-generated and not user-controlled:

file_bucket, remoteip, requester (IAM identity or -), requestid, operation (e.g. WEBSITE.GET.OBJECT, REST.GET.OBJECT, BATCH.DELETE.OBJECT, S3.EXPIRE.OBJECT), key, http_status.

The remaining fields include request_uri, error_code, bytes_sent, object_size, total_time, turn_around_time, referrer, user_agent, version_id, and host header / TLS details, among others. AWS publishes a reference parsing regex in its documentation.

The schema is extensible: "The regex ends with .*$: it accounts for the possibility of additional columns being added at any time."

Delivery semantics¶

SAL is best-effort — "meaning a log may occasionally be missed, arrive late, or have duplicates." Best-effort log delivery is the load-bearing model you must design around. Yelp's measurement at fleet scale: < 0.001% of SAL arrives > 2 days late; they observed instances as late as 9 days. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
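One way to design around best-effort delivery, sketched in Python under stated assumptions: drop exact duplicate lines, and have each daily run revisit a trailing window of day partitions so that late arrivals are picked up. The names `dedupe_lines`, `partitions_to_revisit`, and `lookback_days` are hypothetical; they are not part of any AWS API and are not taken from Yelp's pipeline. Treating duplicates as byte-identical lines is also an assumption.

```python
from datetime import date, timedelta
from typing import Iterable


def dedupe_lines(lines: Iterable[str]) -> list[str]:
    """Keep the first occurrence of each line, preserving order.

    Assumption: a duplicated delivery repeats the log line byte for byte.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def partitions_to_revisit(run_day: date, lookback_days: int) -> list[date]:
    """Day partitions a run on `run_day` should (re)process, newest first.

    `lookback_days` is a hypothetical tuning knob: a larger window catches
    more late-arriving logs at the price of more reprocessing.
    """
    return [run_day - timedelta(days=n) for n in range(1, lookback_days + 1)]
```

The window size is a trade-off: going by the measurement quoted above, a 2-day window would miss under 0.001% of logs, while covering the latest observed arrivals would need a window of 9 days or more.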

Target-object key formats¶

Two shapes, controlled by TargetObjectKeyFormat:

SimplePrefix (historical default): [TargetPrefix][YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString]. Flat namespace. At scale this layout becomes impractical to query with Athena because of S3 API rate limits on prefix scans.
PartitionedPrefix (recommended for query workloads): [TargetPrefix][SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/.... Gives Athena a natural partition boundary; Yelp migrated fleet-wide to this format. The PartitionDateSource value EventTime (vs DeliveryTime) "gives the benefit of attributing the log to the event time."
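As a rough illustration of how the two shapes are selected, the sketch below builds the `BucketLoggingStatus` payload accepted by boto3's `put_bucket_logging`. In the S3 API the partition date source is `EventTime` or `DeliveryTime`. The helper names `build_logging_status` and `enable_logging`, and any bucket names passed to them, are hypothetical; this is not Yelp's Terraform module.

```python
def build_logging_status(target_bucket: str, target_prefix: str,
                         partitioned: bool = True,
                         date_source: str = "EventTime") -> dict:
    """Build the BucketLoggingStatus dict for S3 PutBucketLogging."""
    if partitioned:
        if date_source not in ("EventTime", "DeliveryTime"):
            raise ValueError(f"unsupported PartitionDateSource: {date_source}")
        key_format = {"PartitionedPrefix": {"PartitionDateSource": date_source}}
    else:
        key_format = {"SimplePrefix": {}}
    return {
        "LoggingEnabled": {
            "TargetBucket": target_bucket,
            "TargetPrefix": target_prefix,
            "TargetObjectKeyFormat": key_format,
        }
    }


def enable_logging(source_bucket: str, status: dict) -> None:
    """Apply the configuration. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported lazily so the builder runs without it

    boto3.client("s3").put_bucket_logging(
        Bucket=source_bucket, BucketLoggingStatus=status
    )
```

For example, `enable_logging("my-source-bucket", build_logging_status("my-log-bucket", "sal/"))` would request PartitionedPrefix with EventTime for a hypothetical pair of buckets.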

AWS added date-based partitioning for SAL in November 2023 — the unblocker that made object-level logging tractable via Athena querying.

Destination-bucket constraints¶

Same account as the source bucket (AWS restriction).
Same region as the source bucket (AWS restriction + cost: "eliminate cross-region data charges").
Resource policy on the destination must allow the logging service to PutObject.
Using the same bucket as destination causes circular logging — another bucket is strongly recommended.
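The constraints above can be sketched as a pre-flight check plus a destination bucket policy. The policy grants `s3:PutObject` to the S3 logging service principal (`logging.s3.amazonaws.com`), scoped to one source bucket and account. `BucketRef`, `check_destination`, and `destination_policy` are hypothetical helpers for illustration, and the `Sid` string is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketRef:
    name: str
    account_id: str
    region: str


def check_destination(source: BucketRef, dest: BucketRef) -> list[str]:
    """Return the list of violated constraints (empty when the pair is fine)."""
    problems = []
    if dest.account_id != source.account_id:
        problems.append("destination must be in the same account as the source")
    if dest.region != source.region:
        problems.append("destination must be in the same region as the source")
    if dest.name == source.name:
        problems.append("same bucket as destination causes circular logging")
    return problems


def destination_policy(source: BucketRef, dest: BucketRef, prefix: str) -> dict:
    """Bucket policy letting the S3 logging service write under `prefix`."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowServerAccessLogDelivery",
            "Effect": "Allow",
            "Principal": {"Service": "logging.s3.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{dest.name}/{prefix}*",
            "Condition": {
                "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{source.name}"},
                "StringEquals": {"aws:SourceAccount": source.account_id},
            },
        }],
    }
```

Serializing the returned dict with `json.dumps` gives a document that could be attached to the destination bucket; the condition keys keep other buckets or accounts from delivering logs into the same prefix.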

User-controlled field hazards¶

Three fields are arbitrary user-input, written unescaped:

request_uri
referrer (HTTP Referer header)
user_agent

These break any naive space-or-quote-delimited regex. See concepts/user-controlled-log-fields for the general hazard and patterns/optional-non-capturing-tail-regex for the common workaround (wrap the user-controlled tail in (?:<rest>)? so the first seven non-user-controlled fields still parse).
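A minimal Python sketch of the optional-tail idea follows. It is deliberately trimmed and is not the regex AWS publishes: the leading columns (bucket owner, bucket, bracketed timestamp, remote IP, requester, request ID, operation, key) are assumed from the sample lines in the AWS log-format documentation, the group names are illustrative, and only the first few tail fields are captured. The two sample lines are made up.

```python
import re
from typing import Optional

# Leading fields: none of these carry raw user input.
HEAD = (
    r'(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+)'
)
# Tail starts at the quoted, user-controlled request URI.
TAIL = (
    r' "(?P<request_uri>[^"]*)" (?P<http_status>\d{3}|-) '
    r'(?P<error_code>\S+) (?P<bytes_sent>\S+)'
)
# The tail sits in an optional non-capturing group; `.*$` absorbs whatever
# is left, including columns added to the format later.
LINE_RE = re.compile(rf'^{HEAD}(?:{TAIL})?.*$')


def parse_line(line: str) -> Optional[dict]:
    """Return named fields; tail fields are None when the tail is malformed."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


GOOD = ('owner123 my-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
        'arn:aws:iam::111122223333:user/alice 3E57427F3EXAMPLE '
        'REST.GET.OBJECT photos/cat.jpg '
        '"GET /my-bucket/photos/cat.jpg HTTP/1.1" 200 - 113 113 7 - "-" "curl/8.0" -')
# Same request, but the URI carries a stray double quote.
BAD = GOOD.replace('photos/cat.jpg HTTP', 'photos/c"at.jpg HTTP')
```

With these samples, `parse_line(GOOD)` fills both head and tail fields, while `parse_line(BAD)` still returns the head fields and leaves `request_uri`, `http_status`, and the rest of the tail as `None` instead of rejecting the whole line.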

Key encoding idiosyncrasy¶

Most SAL operations double-URL-encode the key field; some operations, notably BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT, URL-encode it only once. See concepts/url-encoding-idiosyncrasy-s3-keys for the full discussion and why a naive url_decode(url_decode(key)) is unsafe.
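To make the hazard concrete, here is a small Python sketch that picks the number of decode rounds from the operation. The set of single-encoded operations contains only the two named above and is assumed to be incomplete; `decode_key` and `SINGLE_ENCODED_OPERATIONS` are hypothetical names.

```python
from urllib.parse import unquote

# Operations whose logged key is URL-encoded once (assumed incomplete).
SINGLE_ENCODED_OPERATIONS = frozenset({"BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"})


def decode_key(operation: str, logged_key: str) -> str:
    """Undo one or two rounds of URL-encoding depending on the operation."""
    rounds = 1 if operation in SINGLE_ENCODED_OPERATIONS else 2
    key = logged_key
    for _ in range(rounds):
        key = unquote(key)
    return key


# An object whose real key contains a literal percent sequence: "a%2Fb".
# Logged once-encoded it reads "a%252Fb"; twice-encoded, "a%25252Fb".
assert decode_key("S3.EXPIRE.OBJECT", "a%252Fb") == "a%2Fb"
assert decode_key("REST.GET.OBJECT", "a%25252Fb") == "a%2Fb"
# Decoding the once-encoded form twice yields a different key.
assert unquote(unquote("a%252Fb")) == "a/b"
```

The last assertion is the failure mode: blanket double-decoding turns the key `a%2Fb` into `a/b`, which names a different object.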

Comparison to CloudTrail Data Events¶

Axis	S3 Server Access Logs	CloudTrail Data Events
Cost	storage of emitted logs (best-effort delivery, compactable)	$1 per million data events
Delivery	best-effort; <0.001% arrive more than 2 days late (Yelp measured)	reliable delivery
Partitioning	raw text, needs compaction for scale	delivered to S3 / CloudWatch, queryable
Granularity	bucket-level enablement	per-trail, fine-grained

At fleet scale the cost axis dominates — per the 2025-09-26 source, "$1 per million data events — that could be orders of magnitude higher!"
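The CloudTrail side of that comparison is simple arithmetic. The sketch below uses the quoted "$1 per million data events" rate; the daily request volume in the example is a made-up figure, not a number from the source, and `cloudtrail_data_event_cost` is a hypothetical helper.

```python
def cloudtrail_data_event_cost(events: int, usd_per_million: float = 1.0) -> float:
    """Cost in USD of recording `events` data events at a flat per-million rate."""
    return events / 1_000_000 * usd_per_million


# Hypothetical fleet: 5 billion object-level requests per day.
daily_usd = cloudtrail_data_event_cost(5_000_000_000)
assert daily_usd == 5000.0
```

At that assumed volume the charge would be $5,000 per day before any storage or query cost, which is the scale at which paying only for stored log objects becomes attractive.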

Seen in¶

sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki source. Full definition + line format + delivery semantics + destination-bucket constraints + hazards. Yelp's operational choice: PartitionedPrefix + EventTime default via Terraform module; daily Parquet compaction pipeline over the destination bucket (patterns/raw-to-columnar-log-compaction).

systems/aws-s3 — producer + typical destination.
systems/amazon-athena — the query engine over SAL.
systems/aws-cloudtrail — the pricier alternative for object-level access tracing.
concepts/best-effort-log-delivery
concepts/user-controlled-log-fields
concepts/url-encoding-idiosyncrasy-s3-keys
concepts/partition-projection
patterns/raw-to-columnar-log-compaction