
PATTERN Cited by 1 source

Optional non-capturing tail regex

Problem

A log line has a fixed non-user-controlled prefix followed by user-controlled tail fields with unescaped, adversarial, or otherwise parse-breaking content (spaces, quotes, control characters, injection payloads — see concepts/user-controlled-log-fields). A single all-or-nothing regex will silently drop any row whose tail breaks it, losing exactly the rows worth analysing (malicious probes, buggy clients, unusual traffic).

Pattern

Partition the regex at the boundary between reliably-parseable prefix fields and hostile-input tail fields, and wrap the tail in an optional non-capturing group: (?:<tail-pattern>)?. The prefix fields always match; the tail fields match when they can, and the whole row still parses (with NULL tail fields) when they can't.
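A minimal sketch of the split, using Python's re module on a hypothetical toy format (timestamp, request id, status, then two quoted client-supplied fields). The field names and sample lines are assumptions for illustration, not Yelp's regex or the SAL layout. It contrasts an all-or-nothing regex with one whose tail is wrapped in (?:...)?.

```python
import re

# Hypothetical toy format (not the S3 SAL layout):
#   <timestamp> <request_id> <status> "<user_agent>" "<referrer>"

# All-or-nothing: every field must match or the whole row is lost.
STRICT_RE = re.compile(r'^(\S+) (\S+) (\d{3}) "([^"]*)" "([^"]*)"$')

# Prefix is mandatory; the client-supplied tail is optional and non-capturing.
TOLERANT_RE = re.compile(
    r'^(?P<ts>\S+) (?P<request_id>\S+) (?P<status>\d{3})'   # trusted prefix
    r'(?: "(?P<user_agent>[^"]*)" "(?P<referrer>[^"]*)")?'  # optional tail
    r'.*$'                                                   # whatever remains
)

clean = '2025-09-26T12:00:00Z req-1 200 "curl/8.0" "-"'
hostile = '2025-09-26T12:00:01Z req-2 403 "evil"agent" "-"'  # stray quote

strict_hit = STRICT_RE.match(hostile)                  # row is dropped
tolerant_hit = TOLERANT_RE.match(hostile).groupdict()  # prefix survives
clean_hit = TOLERANT_RE.match(clean).groupdict()       # everything populated

print(strict_hit)
print(tolerant_hit)
print(clean_hit)
```

On the hostile line the strict regex returns no match at all, while the tolerant one still yields the request id and status with the tail groups set to None, which is what surfaces as NULL in a SQL engine.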

Canonical shape (Yelp's 2025-09-26 SAL regex)

"The first seven fields are not user controlled; with the key being url-encoded twice, for most operations … Since the rest of the fields were less critical to us than the requestid and key, non-encoded fields are wrapped in an optional non-capturing group, i.e. (?:<rest-of-fields>)?. Thus, we ensure that the first seven fields are present even if the rest of the fields fail regex matching." (Source: sources/2025-09-26-yelp-s3-server-access-logs-at-scale)

Yelp's regex (simplified):

^(field1) (field2) (\[field3\]) (field4) (field5) (field6) (field7)
(?: (user_field1) (user_field2) (...))?.*$

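The .*$ at the end absorbs any future fields AWS adds to SAL's format (which is extensible per AWS docs). It also consumes the unparsed remainder when the optional group is skipped, so the end-of-line anchor still matches and the prefix captures survive.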
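To illustrate what the trailing .*$ buys, here is a sketch of the simplified shape in Python. The sub-patterns are placeholders (each fieldN is assumed to be a run of non-space characters, field3 is bracketed, and the two tail fields are invented for the example); the sample line is made up and only loosely resembles an access log.

```python
import re

PREFIX = r'^(\S+) (\S+) (\[[^\]]*\]) (\S+) (\S+) (\S+) (\S+)'  # seven prefix fields
TAIL = r'(?: "([^"]*)" (\d{3}))?'                               # illustrative tail

WITH_ABSORB = re.compile(PREFIX + TAIL + r'.*$')  # shape from the text above
WITHOUT_ABSORB = re.compile(PREFIX + TAIL + r'$')  # same, minus the trailing .*

today = 'own bkt [26/Sep/2025:12:00:00 +0000] 10.0.0.1 - REQ1 GET "GET /k HTTP/1.1" 200'
tomorrow = today + ' new-field-a new-field-b'  # format grows two fields

m_today = WITH_ABSORB.match(today)
m_tomorrow = WITH_ABSORB.match(tomorrow)
m_broken = WITHOUT_ABSORB.match(tomorrow)

print(m_today.group(6), m_today.group(8), m_today.group(9))
print(m_tomorrow.group(6), m_tomorrow.group(8), m_tomorrow.group(9))
print(m_broken)
```

With the trailing .* both lines yield the same captures; without it, the extended line fails to match at all, because nothing can consume the new fields before the anchor.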

Diagnostic discipline

The pattern is only useful if the caller distinguishes:

  • Row parses; tail fields populated: normal case.
  • Row parses; tail fields NULL: the regex failed on the tail, so the log line probably contains interesting content. Don't discard it; log and investigate.
  • Row doesn't parse at all: the prefix is wrong (hostile input in a non-user-controlled field, log format bug, wrong source).
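The three cases above can be told apart mechanically. The following is a sketch under stated assumptions: the two-field prefix, the tail fields, the Outcome names, and the sample lines are all hypothetical, and a real pipeline would likely do this in its query engine rather than in Python.

```python
import re
from collections import Counter
from enum import Enum


class Outcome(Enum):
    FULL = "full"                # prefix and tail parsed
    TAIL_FAILED = "tail_failed"  # prefix parsed, tail groups are None
    UNPARSED = "unparsed"        # even the prefix did not match


# Hypothetical format: <request_id> <operation> "<request_uri>" <status>
LINE_RE = re.compile(
    r'^(?P<request_id>\S+) (?P<operation>\S+)'
    r'(?: "(?P<request_uri>[^"]*)" (?P<status>\d{3}))?'
    r'.*$'
)


def classify(line):
    """Return (Outcome, captured fields) for one log line."""
    m = LINE_RE.match(line)
    if m is None:
        return Outcome.UNPARSED, {}
    fields = m.groupdict()
    if fields["request_uri"] is None:
        return Outcome.TAIL_FAILED, fields
    return Outcome.FULL, fields


lines = [
    'REQ1 GET "GET /k HTTP/1.1" 200',
    'REQ2 GET "GET /k"; DROP TABLE HTTP/1.1" 200',  # stray quote breaks the tail
    'garbage',                                      # not this format at all
]
results = [classify(line) for line in lines]
counts = Counter(outcome for outcome, _ in results)
print(counts)
```

Counting the TAIL_FAILED and UNPARSED outcomes separately, and keeping the raw line for both, is what turns silent NULLs into something that can be alerted on and investigated.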

Yelp's verbatim rule:

"An important takeaway is that if a row is empty, then the regex has failed to parse that row— so don't ignore those!"

When this pattern doesn't fit

  • Tail fields are load-bearing. If you can't live with a NULL request_uri, you have to escape at the edge, not in the parser.
  • No clean prefix/tail split. If every field can contain user bytes (e.g. a bespoke log format that puts user input before host metadata), the regex approach fundamentally doesn't work — switch to a self-describing format (JSON, ndjson, Protobuf, CBOR).
  • Requirement is complete rows. Use a delimiter-robust parser (TSV with escaped tabs, CSV with RFC-4180 escaping) instead of regex.
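For contrast, a small sketch of the self-describing alternative mentioned above, assuming you control the writer and can emit one JSON object per line. The record keys and the hostile string are made up for illustration.

```python
import json

# A value containing quotes, a newline, and a shellshock-style payload.
hostile = 'Mozilla" "; DROP TABLE logs; --\n() { :; }; echo pwned'
record = {"request_id": "REQ1", "user_agent": hostile}

# The writer escapes at the edge: quotes and control characters are encoded.
line = json.dumps(record)

# The reader needs no regex and recovers every field intact.
parsed = json.loads(line)
print(line)
```

Because the encoder escapes the embedded newline and quotes, the record stays on one physical line and round-trips exactly, so no field has to be treated as optional.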

Seen in

  • sources/2025-09-26-yelp-s3-server-access-logs-at-scale — canonical wiki instance. Yelp wraps S3 SAL's user-controlled tail fields (request_uri, referrer, user_agent, …) in (?:<rest>)? so the first seven non-user-controlled fields always parse even when a SQLi / shellshock probe / unescaped quote breaks the tail.