
User-controlled log fields

Some fields in access logs / request logs carry unescaped user-supplied bytes: HTTP request URI, Referer, User-Agent, request body fragments. Because the HTTP request framing allows arbitrary characters in those fields (including quotes, spaces, control characters, shell metacharacters), any regex that tries to parse them positionally can be broken by a counter-example that includes a delimiter — either accidentally (legitimate header values) or deliberately (injection probes, malware scans).

The canonical claim

"Essentially, any regex pattern for parsing S3 server access logs can be broken by a counter example that includes delimiters." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Concrete breaking examples (from Yelp's 2025-09-26 post)

Quote nesting

$ curl https://s3-us-west-2.amazonaws.com/foo-bucket/foo.txt \
       -H 'Referer: "a b" "c' -H 'User-Agent: "d e" "f'

This produces a SAL line with ""a b" "c" in the referrer field and ""d e" "f" in user_agent — the naive "[^"]*" pattern fails because the inner quotes are neither escaped nor balanced.
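A minimal Python sketch, using the field values from the curl example above, shows both failure modes of the naive pattern: anchored matching fails outright, and unanchored searching silently assigns the wrong values to the fields.

```python
import re

# The referrer/user_agent portion of the resulting SAL line, verbatim:
line = '""a b" "c" ""d e" "f"'

# Naive positional pattern: two quoted fields separated by a space.
naive = re.compile(r'"([^"]*)" "([^"]*)"')

print(naive.match(line))            # None: anchored parse fails outright
print(naive.search(line).groups())  # ('a b', 'c'): quietly wrong fields
```

The second result is arguably worse than the first: the parse "succeeds" but truncates the referrer and swallows the user-agent entirely.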

SQLi / injection probes

...WEBSITE.GET.OBJECT public/wp-admin/admin-ajax.php
"GET /public/wp-admin/admin-ajax.php?action=ajax_get&
route_name=get_doctor_details&clinic_id=%7B"id":"1"%7D
&props_doctor_id=1,2)+AND+(SELECT+42+FROM+(SELECT(SLEEP(6)))b HTTP/1.1"

Shellshock probes

..."() { ignored; }; echo Content-Type: text/html;
echo ; /bin/cat /etc/passwd" "Mozilla/5.0..."

Why this matters

Log parsers that silently drop broken rows lose exactly the log lines that indicate something interesting:

  • Malicious traffic (the bad actor is the one sending adversarial characters).
  • Legitimate traffic from buggy or unconventional user-agents.
  • Edge cases worth investigating.

Treating parse failure as a signal, not as noise, is load-bearing:

"An important takeaway is that if a row is empty, then the regex has failed to parse that row— so don't ignore those!" (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Design responses

1. Partition-aware regex (Yelp's fix)

Wrap user-controlled fields in an optional non-capturing group (?:<rest>)? so the first non-user-controlled fields always parse. For SAL: the first seven fields (file_bucket, remoteip, requester, requestid, operation, key, http_status) are AWS-generated and safe; the rest are wrapped. Canonicalised as patterns/optional-non-capturing-tail-regex.
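A minimal sketch of the idea in Python — field names, order, and the tail pattern are simplified for illustration, not the full SAL grammar:

```python
import re

# AWS-generated head fields: matched unconditionally, always safe.
SAFE_HEAD = (
    r'(?P<bucket>\S+) (?P<remoteip>\S+) (?P<requester>\S+) '
    r'(?P<requestid>\S+) (?P<operation>\S+) (?P<key>\S+)'
)
# User-controlled tail: wrapped in an optional non-capturing group, so a
# hostile value here cannot sink the whole row.
RISKY_TAIL = r' "(?P<request_uri>[^"]*)" (?P<http_status>\d{3})'
PATTERN = re.compile(SAFE_HEAD + r'(?:' + RISKY_TAIL + r')?')

good = ('bkt 1.2.3.4 arn:aws:iam::42:user/x REQ1 REST.GET.OBJECT foo.txt '
        '"GET /foo.txt HTTP/1.1" 200')
bad = ('bkt 1.2.3.4 arn:aws:iam::42:user/x REQ1 REST.GET.OBJECT foo.txt '
       '"GET /"a b" "c HTTP/1.1" 200')

m_good = PATTERN.match(good)  # head and tail both parse
m_bad = PATTERN.match(bad)    # tail fails, but the head still parses:
                              # m_bad.group('key') == 'foo.txt',
                              # m_bad.group('request_uri') is None
```

The row with adversarial quotes still yields its trustworthy AWS-generated fields, and the None tail doubles as the parse-failure signal discussed above.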

2. Escape at the edge

The upstream service can escape problematic characters at log-write time: replace raw quotes / spaces with escape sequences, encode everything in a known encoding (hex, base64, quoted-printable). Costs log readability; wins parseability.
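A toy sketch of write-time escaping via percent-encoding — the choice of urllib's quote and of which characters stay literal is illustrative, not anything SAL actually does:

```python
from urllib.parse import quote, unquote

def escape_field(value: str) -> str:
    # Percent-encode everything outside the unreserved set (plus '/'),
    # so quotes and spaces can no longer collide with log delimiters.
    return quote(value, safe='/')

hostile = '"a b" "c'
print(escape_field(hostile))  # %22a%20b%22%20%22c — delimiter-free
assert unquote(escape_field(hostile)) == hostile  # lossless round-trip
```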

3. Switch to a structured format

Log as JSON (or CBOR, MessagePack) instead of space-separated. Pushes the escaping to the serialiser. AWS's CloudTrail takes this approach; SAL does not.
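A sketch of the same hostile values emitted as one JSON object per line (NDJSON); json.dumps does the escaping, so embedded quotes round-trip cleanly:

```python
import json

record = {
    "remoteip": "1.2.3.4",
    "operation": "REST.GET.OBJECT",
    "referrer": '"a b" "c',      # hostile values from the curl example
    "user_agent": '"d e" "f',
}
line = json.dumps(record)        # inner quotes come out as \" — unambiguous
assert json.loads(line)["referrer"] == '"a b" "c'
```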

4. Treat failed rows as first-class

  • Don't discard them.
  • Route them to a distinct "parse-failures" sink.
  • Alert if the failed-row rate spikes.
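A minimal sketch of that routing, using a toy two-field pattern (the names and pattern are illustrative):

```python
import re

# Toy two-field pattern (illustrative, not the SAL grammar).
PATTERN = re.compile(r'(?P<ip>\S+) "(?P<ua>[^"]*)"$')

def route(lines):
    """Partition lines into (parsed, failures) rather than dropping failures."""
    parsed, failures = [], []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            parsed.append(m.groupdict())
        else:
            failures.append(line)   # first-class: route to a distinct sink
    return parsed, failures

ok, bad = route(['1.2.3.4 "Mozilla/5.0"', '5.6.7.8 ""d e" "f"'])
failure_rate = len(bad) / (len(ok) + len(bad))  # alert if this spikes
```

Note which line lands in the failure sink: the one with adversarial quotes, i.e. exactly the traffic worth investigating.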

Axis of inheritance

Any log source that carries verbatim external input has this hazard:

  • Web access logs (Apache combined, nginx, ALB, S3 SAL).
  • Mail server logs (envelope-from, headers).
  • SSH auth logs (presented usernames, client versions).
  • SMTP DATA segments in received-mail logs.
