CONCEPT Cited by 1 source
URL-encoding idiosyncrasy in S3 keys
The hazard
S3 server access logs (SAL) write the key field URL-encoded, but
the encoding is not uniform: for most operation types the key is
double-URL-encoded; for at least two operation types
(BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT) it is
single-URL-encoded.
"Initially, we were under the impression that the key is always double url-encoded based on data encountered. So, using url_decode(url_decode(key)) worked until we saw that for some operations, such as BATCH.DELETE.OBJECT and S3.EXPIRE.OBJECT, the key is url-encoded once. Therefore, decoding twice will fail on keys that contain percent (%) symbols as a character." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Yelp's working theory: lifecycle operations "likely come from an internal service" that follows a different encoding convention than customer-facing API paths.
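To see what the two conventions look like on the wire, here is a minimal sketch using Python's urllib.parse.quote as a stand-in for S3's encoder (the sample key and the exact escape set are assumptions; what matters is how % and space fare under one pass versus two):

```python
from urllib.parse import quote

key = "logs/2024-01-01 report%1.csv"  # hypothetical object key
single = quote(key, safe="/")         # one encoding pass
double = quote(single, safe="/")      # second pass re-escapes the % signs

print(single)  # logs/2024-01-01%20report%251.csv
print(double)  # logs/2024-01-01%2520report%25251.csv
```

Decoding twice recovers the original key from the double-encoded form, but applied to the single-encoded form it would mangle any literal % in the key.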
Why a single-decode fallback doesn't rescue you
Naively falling back to a single decode when the double decode
produces a wrong-looking key doesn't work either, because the two
encodings yield ambiguous results on keys with user-controlled %
characters.
Worked example from the 2025-09-26 post:
SAL key (url-encoded): foo-%2525Y
Single url-decode: foo-%25Y ← could be actual object name
Double url-decode: foo-%Y ← could also be actual object name
Both decodings succeed; neither can be proven correct from the
log line alone. The actual object key depends on which
operation wrote the SAL line (double-encoded if
WEBSITE.GET.OBJECT / REST.GET.OBJECT / most others;
single-encoded if BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT).
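Both branches of the ambiguity can be reproduced directly with Python's urllib.parse.unquote (note that unquote never raises, so there is no error to fall back on):

```python
from urllib.parse import unquote

sal_key = "foo-%2525Y"   # key as written in the SAL line
once = unquote(sal_key)  # foo-%25Y  (correct if single-encoded)
twice = unquote(once)    # foo-%Y    (correct if double-encoded)

print(once, twice)
```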
Practical handling
The operation field tells you which encoding was used:

from urllib.parse import unquote

def decode_sal_key(operation: str, key: str) -> str:
    # BATCH.DELETE.OBJECT / S3.EXPIRE.OBJECT single-encode; everything else double-encodes.
    if operation in ("BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"):
        return unquote(key)
    return unquote(unquote(key))
Any code that needs to round-trip a key from a SAL line into a live S3 API call (e.g. for an S3 Batch Operations manifest — systems/s3-batch-operations) must handle this operation-conditional decode correctly.
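A sketch of that round trip, using Python's urllib.parse and a hypothetical helper name; the quote_plus(key, safe="/") manifest form is the one the AWS manifest docs call for, per the bullet below:

```python
from urllib.parse import quote_plus, unquote

def sal_key_to_manifest_key(operation: str, sal_key: str) -> str:
    # Decode per the operation that wrote the SAL line...
    if operation in ("BATCH.DELETE.OBJECT", "S3.EXPIRE.OBJECT"):
        decoded = unquote(sal_key)           # single-encoded operations
    else:
        decoded = unquote(unquote(sal_key))  # everything else double-encodes
    # ...then re-encode exactly once for the Batch Operations manifest.
    return quote_plus(decoded, safe="/")

# The same object key "a b/c%.txt" seen via a GET and via a lifecycle expiry:
print(sal_key_to_manifest_key("REST.GET.OBJECT", "a%2520b/c%2525.txt"))  # a+b/c%25.txt
print(sal_key_to_manifest_key("S3.EXPIRE.OBJECT", "a%20b/c%25.txt"))     # a+b/c%25.txt
```

Both log lines converge on the same manifest entry only because the decode is conditional on the operation.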
Why this ambiguity exists at all
S3 keys are arbitrary UTF-8 byte sequences (up to 1024 bytes).
SAL represents them as text — requiring an escape mechanism for
non-printable bytes, spaces, % signs, and the space-delimited
field boundaries of the log format. Percent-encoding is the
obvious choice. The double-encoding convention for
external-facing operations is likely a belt-and-suspenders
guarantee that the log line parses even through any intermediate
stringification that's URL-encoding-aware; the internal-service
single-encoding is presumably a simpler pipeline that AWS never
needed to harmonise with the external one.
Downstream consequences
- Round-tripping SAL → Batch Operations manifest (Yelp's
object-tagging flow for access-based retention) requires the
conditional decode. Manifest entries need
quote_plus(key, safe="/") form per the AWS manifest docs.
- Prefix extraction for access-based retention must decode
first, then split on /. Failing to do so mixes encoded and
decoded slashes.
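For the second bullet, a sketch of why decode-then-split matters, assuming (purely as an illustration) a SAL line in which the slashes themselves arrived percent-encoded:

```python
from urllib.parse import unquote

# Hypothetical double-encoded SAL key for the object "biz/12345/photo 1.jpg"
sal_key = "biz%252F12345%252Fphoto%25201.jpg"

prefix_wrong = sal_key.split("/")[0]  # no literal '/': returns the whole string
prefix_right = unquote(unquote(sal_key)).split("/")[0]

print(prefix_right)  # biz
```

Splitting the raw field only works when every slash happens to be literal; decoding first makes the split independent of how the line was encoded.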
Seen in
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale —
canonical wiki disclosure of the idiosyncrasy. Names the two
lifecycle operations that single-encode, discusses the
foo-%2525Y counter-example where naive fallback-to-single-decode produces ambiguous results, and calls out the downstream impact on manifest construction for S3 Batch Operations.
Related
- concepts/s3-server-access-logs — the primitive that exposes the idiosyncrasy.
- systems/aws-s3, systems/s3-batch-operations — the round-trip consumers.
- concepts/escaping-at-the-edge — the general discipline this idiosyncrasy violates.