
CONCEPT Cited by 1 source

S3 server access logs

S3 Server Access Logs (SAL) are the per-bucket access-log primitive Amazon S3 offers: every request against the bucket and its objects (GETs, PUTs, lifecycle expirations, multipart operations, website operations, etc.) generates a log line delivered to a configurable destination bucket. SAL is the cheapest AWS-native way to do object-level access tracing; the pricier alternative is CloudTrail Data Events at "$1 per million data events".

Definition

"S3 server access logs contain API operations performed on a bucket, as well as its objects. Logging is enabled per S3 bucket by providing a storage destination; another S3 bucket is recommended due to circular logging. Once the resource policy allows putting objects, logs will start arriving at the configured destination." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Line format

Each log line is space-separated with 25+ positional fields. The first seven are AWS-generated and not user-controlled:

  • file_bucket
  • remoteip
  • requester (IAM identity or -)
  • requestid
  • operation (e.g. WEBSITE.GET.OBJECT, REST.GET.OBJECT, BATCH.DELETE.OBJECT, S3.EXPIRE.OBJECT)
  • key
  • http_status

The remaining fields include request_uri, error_code, bytes_sent, object_size, total_time, turn_around_time, referrer, user_agent, version_id, the host header, TLS info, etc. The regex specified by AWS for parsing lives in the AWS docs.

The schema is extensible: "The regex ends with .*$: it accounts for the possibility of additional columns being added at any time."

Delivery semantics

SAL is best-effort, "meaning a log may occasionally be missed, arrive late, or have duplicates." Any consumer has to be designed around this delivery model. Yelp's measurement at fleet scale: < 0.001% of SAL arrives more than 2 days late; they observed instances as late as 9 days. (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
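
Because duplicates are expected, an ingest step generally has to be idempotent. The following is a minimal sketch, assuming that a duplicate delivery is a byte-identical log line (an illustrative assumption, not something the source states); `dedupe_lines` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct log line once, preserving first-seen order.

    Assumption: a duplicate delivery is byte-identical to the original line.
    """
    seen = set()
    for line in lines:
        record = line.rstrip("\n")
        # Skip blank lines and anything already emitted.
        if record and record not in seen:
            seen.add(record)
            yield record
```

Re-running a step like this over a trailing window of several days is one way to pick up late lines without double counting; the window length is a design choice informed by the lateness figures above.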

Target-object key formats

Two shapes, controlled by TargetObjectKeyFormat:

  • SimplePrefix (historical default) — [TargetPrefix][YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]-[UniqueString]. Flat namespace. At scale this becomes Athena-unqueryable because of S3 API rate limits on prefix scans.
  • PartitionedPrefix (recommended for query workloads) — [TargetPrefix][SourceAccountId]/[SourceRegion]/[SourceBucket]/[YYYY]/[MM]/[DD]/.... Gives Athena a natural partition boundary; Yelp migrated fleet-wide to this format. Delivery option EventTime (vs LogArrivalTime) "gives the benefit of attributing the log to the event time."

AWS added date-based partitioning for SAL in November 2023 — the unblocker that made object-level logging tractable via Athena querying.
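
As a rough sketch of how the key format is selected, the helper below builds the `BucketLoggingStatus` payload that the S3 `PutBucketLogging` API accepts. The helper name and the example bucket names are hypothetical. Note that the API names the non-event date source `DeliveryTime`.

```python
from typing import Any, Dict


def build_logging_status(target_bucket: str, target_prefix: str,
                         partitioned: bool = True,
                         date_source: str = "EventTime") -> Dict[str, Any]:
    """Build the BucketLoggingStatus payload for S3 PutBucketLogging."""
    if partitioned:
        if date_source not in ("EventTime", "DeliveryTime"):
            raise ValueError("date_source must be 'EventTime' or 'DeliveryTime'")
        key_format = {"PartitionedPrefix": {"PartitionDateSource": date_source}}
    else:
        # SimplePrefix takes no options.
        key_format = {"SimplePrefix": {}}
    return {
        "LoggingEnabled": {
            "TargetBucket": target_bucket,
            "TargetPrefix": target_prefix,
            "TargetObjectKeyFormat": key_format,
        }
    }


# Applying it needs boto3 and AWS credentials, so the call is shown but not run:
#   import boto3
#   boto3.client("s3").put_bucket_logging(
#       Bucket="example-source-bucket",
#       BucketLoggingStatus=build_logging_status("example-log-bucket", "sal/"),
#   )
```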

Destination-bucket constraints

  • Same account as the source bucket (AWS restriction).
  • Same region as the source bucket (AWS restriction + cost: "eliminate cross-region data charges").
  • Resource policy on the destination must allow the logging service to PutObject.
  • Using the same bucket as destination causes circular logging — another bucket is strongly recommended.
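
The resource-policy constraint above can be sketched as a policy document granting the S3 logging service principal (`logging.s3.amazonaws.com`) permission to write under the log prefix. The bucket names, prefix, and account ID are placeholders, and the source-ARN and source-account conditions are a common hardening step rather than something the cited source prescribes.

```python
import json
from typing import Any, Dict


def build_destination_policy(dest_bucket: str, prefix: str,
                             source_bucket: str, account_id: str) -> Dict[str, Any]:
    """Bucket policy letting the S3 logging service write logs under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "S3ServerAccessLogsPolicy",
            "Effect": "Allow",
            "Principal": {"Service": "logging.s3.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{dest_bucket}/{prefix}*",
            # Restrict writes to logs originating from one source bucket/account.
            "Condition": {
                "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{source_bucket}"},
                "StringEquals": {"aws:SourceAccount": account_id},
            },
        }],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    policy = build_destination_policy(
        "example-log-bucket", "sal/", "example-source-bucket", "111122223333")
    print(json.dumps(policy, indent=2))
```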

User-controlled field hazards

Three fields are arbitrary user-input, written unescaped:

  • request_uri
  • referrer (HTTP Referer header)
  • user_agent

These break any naive space-or-quote-delimited regex. See concepts/user-controlled-log-fields for the general hazard and patterns/optional-non-capturing-tail-regex for the common workaround (wrap the user-controlled tail in (?:<rest>)? so the first seven non-user-controlled fields still parse).
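
A minimal sketch of the optional-tail idea, assuming the field order in the AWS documentation (bucket owner, bucket, time, remote IP, requester, request ID, operation, key, then the quoted request URI). The group names and sample lines are made up for illustration, and this is not the AWS-published regex: only the start of the tail is modelled.

```python
import re
from typing import Dict, Optional

SAL_LINE = re.compile(
    # Leading fields that precede the user-controlled request URI.
    r'^(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+)'
    # Optional tail: skipped as a whole if user input breaks the quoting.
    r'(?: "(?P<request_uri>[^"]*)" (?P<http_status>\d{3}|-) (?P<error_code>\S+))?'
    # Tolerate anything else, including columns added in the future.
    r'.*$'
)


def parse_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return named fields; tail fields are None when the tail did not parse."""
    match = SAL_LINE.match(line)
    return match.groupdict() if match else None


HEAD = ("owner123 example-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 "
        "arn:aws:iam::111122223333:user/alice REQ1234 REST.GET.OBJECT a.txt")

good = parse_line(HEAD + ' "GET /a.txt HTTP/1.1" 200 - 113 113 7 - "-" "curl/8.0" -')
# A stray double quote inside the request URI defeats the quoted-field pattern.
hostile = parse_line(HEAD + ' "GET /a.txt?x="y z" HTTP/1.1" 200 - 113 113 7 - "-" "x" -')
```

In the hostile case the leading fields (operation, key, and so on) are still recovered while `request_uri` and `http_status` come back as `None`, which lets a pipeline keep the row instead of dropping it.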

Key encoding idiosyncrasy

Most SAL operations double-url-encode key; for some operations — notably BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT — the key is url-encoded only once. See concepts/url-encoding-idiosyncrasy-s3-keys for the full discussion and why naive url_decode(url_decode(key)) is unsafe.
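
One way to handle this is to choose the number of decoding rounds from the operation. The sketch below assumes the single-encoded set contains only the two operations named above; "notably" suggests that list may not be exhaustive, so it should be checked against real logs. `decode_key` is a hypothetical helper.

```python
from urllib.parse import unquote

# Assumption: only the operations named in the text are single-encoded.
SINGLE_ENCODED_OPERATIONS = frozenset({"BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"})


def decode_key(operation: str, raw_key: str) -> str:
    """URL-decode a logged key once or twice depending on the operation."""
    rounds = 1 if operation in SINGLE_ENCODED_OPERATIONS else 2
    key = raw_key
    for _ in range(rounds):
        key = unquote(key)
    return key


# An object whose real name contains a literal "%20" is logged as "a%2520b"
# when encoded once. Decoding it twice would wrongly turn it into "a b".
expired = decode_key("S3.EXPIRE.OBJECT", "a%2520b")
fetched = decode_key("REST.GET.OBJECT", "a%2520b")
```

The same logged string therefore maps to two different object names depending on the operation, which is why an unconditional double decode can corrupt keys.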

Comparison to CloudTrail Data Events

Axis | S3 Server Access Logs | CloudTrail Data Events
Cost | storage of emitted logs (best-effort delivery, compactable) | $1 per million data events
Delivery | best-effort; < 0.001% arrives > 2 days late (Yelp measured) | reliable delivery
Partitioning | raw text, needs compaction for scale | delivered to S3 / CloudWatch, queryable
Granularity | bucket-level enablement | per-trail, fine-grained

At fleet scale the cost axis dominates — per the 2025-09-26 source, "$1 per million data events — that could be orders of magnitude higher!"
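
To make the cost axis concrete, here is a small worked calculation using the quoted "$1 per million data events" rate. The request volume is a made-up figure for illustration, not a number from the source.

```python
# Rate quoted in the text: "$1 per million data events".
CLOUDTRAIL_USD_PER_MILLION_EVENTS = 1.0


def cloudtrail_data_event_cost_usd(events: int) -> float:
    """Cost in USD of recording `events` CloudTrail data events."""
    return events / 1_000_000 * CLOUDTRAIL_USD_PER_MILLION_EVENTS


# Hypothetical volume: 10 billion S3 requests per day.
daily_cost = cloudtrail_data_event_cost_usd(10_000_000_000)
yearly_cost = daily_cost * 365
```

Under that assumed volume the data-event charge comes to 10,000 USD per day, or 3.65 million USD per year, before any storage costs.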

Seen in
