CONCEPT
User-controlled log fields¶
Some fields in access logs / request logs carry unescaped
user-supplied bytes: HTTP request URI, Referer, User-Agent,
request body fragments. Because the HTTP request framing allows
arbitrary characters in those fields (including quotes, spaces,
control characters, shell metacharacters), any regex that tries
to parse them positionally can be broken by a counter-example
that includes a delimiter — either accidentally (legitimate
header values) or deliberately (injection probes, malware scans).
The canonical claim¶
"Essentially, any regex pattern for parsing S3 server access logs can be broken by a counter example that includes delimiters." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Concrete breaking examples (from Yelp's 2025-09-26 post)¶
Quote nesting¶
$ curl https://s3-us-west-2.amazonaws.com/foo-bucket/foo.txt \
-H 'Referer: "a b" "c' -H 'User-Agent: "d e" "f'
Produces a SAL line with ""a b" "c" in the referrer field
and ""d e" "f" in user_agent. A naive "[^"]*" field pattern
misparses these: it stops at the first inner quote and captures
an empty string, because SAL neither escapes nor balances quotes
inside user-supplied fields.
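A minimal reproduction of the misparse (assuming a naive "[^"]*" field pattern, not Yelp's production regex):

```python
import re

# The broken referrer field as it appears in the resulting SAL line.
field = '""a b" "c"'

# A naive quoted-field pattern stops at the first inner quote,
# capturing an empty string instead of the real value.
m = re.match(r'"([^"]*)"', field)
print(repr(m.group(1)))  # captures '' (empty), not '"a b" "c'
```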
SQLi / injection probes¶
...WEBSITE.GET.OBJECT public/wp-admin/admin-ajax.php
"GET /public/wp-admin/admin-ajax.php?action=ajax_get&
route_name=get_doctor_details&clinic_id=%7B"id":"1"%7D
&props_doctor_id=1,2)+AND+(SELECT+42+FROM+(SELECT(SLEEP(6)))b HTTP/1.1"
Shellshock probes¶
The post also shows Shellshock probes, which place function-definition
payloads (the well-known () { :;}; ... signature) in the User-Agent
header; the parentheses, braces, semicolons, and spaces break
positional parsing the same way as the examples above.
Why this matters¶
Log parsers that silently drop broken rows lose exactly the log lines that indicate something interesting:
- Malicious traffic (the bad actor is the one sending adversarial characters).
- Legitimate traffic from buggy or unconventional user-agents.
- Edge cases worth investigating.
Treating parse failure as a signal, not as noise, is load-bearing:
"An important takeaway is that if a row is empty, then the regex has failed to parse that row— so don't ignore those!" (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)
Design responses¶
1. Partition-aware regex (Yelp's fix)¶
Wrap user-controlled fields in an optional non-capturing
group (?:<rest>)? so the first non-user-controlled fields
always parse. For SAL: the first seven fields (file_bucket,
remoteip, requester, requestid, operation, key,
http_status) are AWS-generated and safe; the rest are
wrapped. Canonicalised as
patterns/optional-non-capturing-tail-regex.
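A sketch of the idea in Python. The field layout is simplified and the names are illustrative, not Yelp's production pattern: the AWS-generated prefix is matched strictly, and everything after it sits in an optional non-capturing tail.

```python
import re

# Match the safe, AWS-generated prefix strictly; wrap the
# user-controlled remainder in an optional non-capturing group so a
# hostile Referer or User-Agent can never prevent the prefix parsing.
SAL = re.compile(
    r'^(?P<file_bucket>\S+) '
    r'(?P<remoteip>\S+) '
    r'(?P<requester>\S+) '
    r'(?P<requestid>\S+) '
    r'(?P<operation>\S+) '
    r'(?P<key>\S+) '
    r'(?P<http_status>\d{3})'
    r'(?: (?P<rest>.*))?$'  # user-controlled tail: optional
)

line = ('foo-bucket 1.2.3.4 requester ABC123 '
        'WEBSITE.GET.OBJECT foo.txt 200 "GET /foo HTTP/1.1" ""a b" "c"')
m = SAL.match(line)
# Even with the hostile referrer in the tail, the safe prefix parses.
print(m.group('file_bucket'), m.group('http_status'))
```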
2. Escape at the edge¶
The upstream service can escape problematic characters at log-write time: replace raw quotes / spaces with escape sequences, encode everything in a known encoding (hex, base64, quoted-printable). Costs log readability; wins parseability.
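One hedged sketch of edge-escaping, using percent-encoding (the function name is illustrative; any reversible encoding with a delimiter-free output alphabet works):

```python
import urllib.parse

def escape_field(value: str) -> str:
    # Percent-encode everything outside a conservative safe set, so the
    # emitted log field can never contain a quote, space, or control
    # character that would break positional parsing downstream.
    return urllib.parse.quote(value, safe='')

print(escape_field('"a b" "c'))  # -> %22a%20b%22%20%22c
```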
3. Switch to a structured format¶
Log as JSON (or CBOR, MessagePack) instead of space-separated. Pushes the escaping to the serialiser. AWS's CloudTrail takes this approach; SAL does not.
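With a structured format, the hostile bytes from the quote-nesting example above are just data; the serialiser does the escaping. A minimal sketch:

```python
import json

record = {
    "remoteip": "1.2.3.4",
    "referer": '"a b" "c',   # hostile bytes are just field values
    "user_agent": '"d e" "f',
}
line = json.dumps(record)    # the serialiser handles all escaping
print(line)
# Round-trips losslessly, no positional regex required.
assert json.loads(line)["referer"] == '"a b" "c'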
4. Treat failed rows as first-class¶
- Don't discard them.
- Route them to a distinct "parse-failures" sink.
- Alert if the failed-row rate spikes.
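A sketch of this discipline as a batch-parsing loop (function and pattern are illustrative, not from the post):

```python
import re

def split_parsed(lines, pattern):
    """Partition log lines into parsed rows and raw failures.

    Failed rows are kept, not dropped: they go to their own sink and
    their rate can be monitored, since they are often the interesting
    lines (hostile traffic, odd clients, edge cases).
    """
    parsed, failed = [], []
    for line in lines:
        m = pattern.match(line)
        if m:
            parsed.append(m.groupdict())
        else:
            failed.append(line)  # route to the "parse-failures" sink
    return parsed, failed

pat = re.compile(r'^(?P<ip>\S+) (?P<status>\d{3})$')
ok, bad = split_parsed(['1.2.3.4 200', 'garbage "a b" "c'], pat)
```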
Axis of inheritance¶
Any log source that carries verbatim external input has this hazard:
- Web access logs (Apache combined, nginx, ALB, S3 SAL).
- Mail server logs (envelope-from, headers).
- SSH auth logs (presented usernames, client versions).
- SMTP DATA segments in received-mail logs.
Seen in¶
- sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki source for the hazard. Enumerates quote-nesting, SQLi, shellshock breaking examples; prescribes the optional-non-capturing-tail-regex fix; surfaces the parse-failure-is-a-signal discipline.
Related¶
- concepts/s3-server-access-logs — the canonical host for the hazard in the 2025-09-26 post.
- concepts/escaping-at-the-edge — the prevention strategy.
- patterns/optional-non-capturing-tail-regex — Yelp's fix.