
URL-encoding idiosyncrasy in S3 keys

The hazard

S3 server access logs (SAL) write the key field URL-encoded, but the encoding is not uniform: most request types double-URL-encode the key, while at least two operation types (BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT) single-URL-encode it.

"Initially, we were under the impression that the key is always double url-encoded based on data encountered. So, using url_decode(url_decode(key)) worked until we saw that for some operations, such as BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT, the key is url-encoded once. Therefore, decoding twice will fail on keys that contain percent (%) symbols as a character." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Yelp's working theory: lifecycle operations "likely come from an internal service" that follows a different encoding convention than customer-facing API paths.

Why a single-decode fallback doesn't rescue you

Naively attempting a double-decode and falling back to a single-decode doesn't work either: on keys containing user-controlled % characters, both decodes succeed and yield plausible but different object names, so the fallback has nothing to trigger on.

Worked example from the 2025-09-26 post:

SAL key (url-encoded):   foo-%2525Y
Single url-decode:       foo-%25Y       ← could be actual object name
Double url-decode:       foo-%Y         ← could also be actual object name

Both decodings succeed; neither can be proven correct from the log line alone. The actual object key depends on which operation wrote the SAL line (double-encoded if WEBSITE.GET.OBJECT / REST.GET.OBJECT / most others; single-encoded if BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT).
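The ambiguity is easy to reproduce with Python's urllib.parse.unquote (standing in for the url_decode of the quote above). Note that unquote leaves malformed percent sequences untouched rather than raising, so "decoding twice" never throws an error; it silently yields the wrong key:

```python
from urllib.parse import unquote

sal_key = "foo-%2525Y"  # key field exactly as written in the SAL line

single = unquote(sal_key)  # 'foo-%25Y' — a valid object name
double = unquote(single)   # 'foo-%Y'   — also a valid object name
```

Nothing in the log line itself distinguishes the two; only the operation field does.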

Practical handling

The operation field tells you which encoding was used:

# Python sketch, with urllib.parse.unquote as url_decode;
# `row` is assumed to be a parsed SAL record
from urllib.parse import unquote

if row.operation in ("BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"):
    key = unquote(row.key)           # lifecycle ops: single-encoded
else:
    key = unquote(unquote(row.key))  # everything else: double-encoded

Any code that needs to round-trip a key from a SAL line into a live S3 API call (e.g. for an S3 Batch Operations manifest — systems/s3-batch-operations) must handle this operation-conditional decode correctly.

Why this ambiguity exists at all

S3 keys are arbitrary UTF-8 byte sequences (up to 1024 bytes). SAL represents them as text — requiring an escape mechanism for non-printable bytes, spaces, % signs, and the space-delimited field boundaries of the log format. Percent-encoding is the obvious choice. The double-encoding convention for external-facing operations is likely a belt-and-suspenders guarantee that the log line parses even through any intermediate stringification that's URL-encoding-aware; the internal-service single-encoding is presumably a simpler pipeline that AWS never needed to harmonise with the external one.
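The escaping need is easy to see with Python's urllib.parse.quote (an illustration of the convention, not AWS's actual encoder):

```python
from urllib.parse import quote

key = "my report.txt"  # a raw space would split the space-delimited SAL field

once = quote(key, safe="")   # 'my%20report.txt' — already safe to log
twice = quote(once, safe="") # 'my%2520report.txt' — survives one extra decode pass
```

The doubly-encoded form can pass through one URL-decoding-aware intermediary and still arrive parseable.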

Downstream consequences

  • Round-tripping SAL → Batch Operations manifest (Yelp's object-tagging flow for access-based retention) requires the conditional decode. Manifest entries need quote_plus(key, safe="/") form per the AWS manifest docs.
  • Prefix extraction for access-based retention must decode first, then split on /. Failing to do so mixes encoded and decoded slashes.
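Both consequences can be sketched together in Python; function names here are illustrative, not Yelp's actual code, and urllib.parse stands in for url_decode / quote_plus:

```python
from urllib.parse import unquote, quote_plus

SINGLE_ENCODED_OPS = {"BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"}

def decode_sal_key(operation: str, raw_key: str) -> str:
    """Undo the operation-dependent SAL encoding to recover the true key."""
    key = unquote(raw_key)
    return key if operation in SINGLE_ENCODED_OPS else unquote(key)

def manifest_entry(operation: str, raw_key: str) -> str:
    # Re-encode for a Batch Operations manifest: '+' for spaces, '/' literal
    return quote_plus(decode_sal_key(operation, raw_key), safe="/")

def retention_prefix(operation: str, raw_key: str) -> str:
    # Decode first, then split — splitting the raw key would treat an
    # encoded slash inside the object name as a real delimiter
    return decode_sal_key(operation, raw_key).split("/", 1)[0]
```

Keeping the decode in one place ensures the manifest and prefix paths can never disagree about which encoding a given log line used.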

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki disclosure of the idiosyncrasy. Names the two lifecycle operations that single-encode, discusses the foo-%2525Y counter-example where naive fallback-to-single-decode produces ambiguous results, and calls out the downstream impact on manifest construction for S3 Batch Operations.