

Aho-Corasick snippet extraction

Intent

Shrink long retrieved documents (reviews, website pages, menus, community-Q&A threads) to just the short spans around keyword hits before passing them to the answer-generation LLM. Use Aho-Corasick multi-pattern matching to find all keyword occurrences in one linear pass, then extract a sliding window around each hit.

The result: the LLM sees only the text likely to be relevant, cutting input-prompt tokens substantially without a material recall hit — as long as the keyword-generation step upstream produced good terms.

When to apply

  • You're doing keyword-first retrieval (see patterns/split-source-selection-from-keyword-generation) — the keywords are already computed and available.
  • Retrieved documents are long relative to the final prompt budget. Full-review + full-website-page + full-menu context easily blows past 10K tokens; the answer LLM doesn't need every byte.
  • You can tolerate keyword-miss blind spots — content relevant to the question but not matching any generated keyword won't be surfaced. Accept this as a tuned-keywords problem, not a snippet-extraction problem.
  • You're optimising input tokens (prompt size) — most of the LLM cost for a RAG system lives here.

Mechanism

Step 1 — Build Aho-Corasick automaton from keywords

The upstream keyword generator (see patterns/split-source-selection-from-keyword-generation) emits a compact list of search terms. Compile them into a single Aho-Corasick automaton — a finite-state machine that can scan a document in O(n + m), where n = document length and m = total number of keyword matches.

Aho-Corasick beats naive per-keyword substring scanning when you have more than a handful of keywords, because it processes all patterns in a single pass.
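
A minimal sketch of this step in Python, assuming the pyahocorasick library; the source does not name a particular implementation, and the example keywords are illustrative.

```python
import ahocorasick

def build_automaton(keywords):
    """Compile the generated keywords into one Aho-Corasick automaton."""
    automaton = ahocorasick.Automaton()
    for keyword in keywords:
        # Store the keyword itself as the payload so each hit can be attributed later.
        automaton.add_word(keyword.lower(), keyword)
    automaton.make_automaton()  # build failure links; the automaton is now scan-ready
    return automaton

automaton = build_automaton(["patio", "dog friendly", "happy hour"])
```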

Step 2 — Scan the retrieved document

Run the automaton over the document. Record the offset of every keyword hit.
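
A sketch of the scan, assuming the pyahocorasick automaton from the Step 1 sketch; hits are recorded as half-open character offsets so they can be expanded in Step 3.

```python
def find_hits(automaton, document):
    """Return (start, end) character offsets for every keyword occurrence."""
    hits = []
    text = document.lower()
    for end_index, keyword in automaton.iter(text):
        start_index = end_index - len(keyword) + 1
        hits.append((start_index, end_index + 1))  # half-open [start, end)
    return hits
```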

Step 3 — Sliding-window extraction around each hit

For each hit, extract a window of ±N characters (or ±N sentences) around the hit. Merge overlapping windows. Deduplicate.

The output: a compact set of snippets — typically a small fraction of the original document length.
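
A sketch of the extraction step; the ±200-character window is an illustrative default rather than a value from the source. Overlapping windows are merged before deduplication.

```python
def extract_snippets(document, hits, window=200):
    """Expand each hit by +/- `window` characters, merge overlaps, deduplicate."""
    spans = sorted((max(0, start - window), min(len(document), end + window))
                   for start, end in hits)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Window overlaps (or touches) the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    seen, snippets = set(), []
    for start, end in merged:
        text = document[start:end]
        if text not in seen:  # drop identical snippet text
            seen.add(text)
            snippets.append(text)
    return snippets
```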

Step 4 — Ship snippets to the answer LLM

The answer-generation prompt now carries the snippets (with source attribution) rather than full documents. The LLM synthesises the final answer + citations from this reduced context.
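
A sketch of how the reduced context might be assembled; the prompt wording, the per-source grouping, and the field names are assumptions for illustration, not Yelp's actual schema.

```python
def build_answer_prompt(question, snippets_by_source):
    """Assemble the answer-generation prompt from snippets keyed by source id."""
    lines = [f"Question: {question}", "", "Context snippets:"]
    for source, snippets in snippets_by_source.items():
        for snippet in snippets:
            lines.append(f"- [{source}] {snippet}")
    lines += ["", "Answer using only the snippets above and cite the sources you used."]
    return "\n".join(lines)
```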

Canonical wiki instance — Yelp BAA (2026-03-27)

Source: sources/2026-03-27-yelp-building-biz-ask-anything-from-prototype-to-product

Yelp names Aho-Corasick + sliding-window as one of the three load-bearing cost-optimisation levers in the Biz Ask Anything cost pass:

"Reducing Input Tokens: Keyword Snippet Extraction — A combination of Ahocorasick + sliding window was used for snippet extraction. This algorithm uses keywords generated during the keyword generation process, which helped with reducing the size of the input user prompt."

Complementary levers called out in the same section:

  • Biz content cleanup — pre-processing website/menu data to remove boilerplate ("websites and reviews to be the largest contributors to the user prompt size").
  • Model-tier migration — answer-generation model moved to a newer OpenAI model "with less cost and negligible effort" after the cost reduction was validated via the quality-grader fleet.

The cumulative outcome: per-question cost reduced to ~25% of the prototype's cost.

Why Aho-Corasick specifically

The extraction step has to handle many keywords at once (typical keyword lists from Yelp's vertical-specific generator have 10-20 terms — and sometimes zero). Running each keyword as an independent regex scan would be O(n × k) per document. Aho-Corasick folds all k keywords into one automaton for O(n + m) scanning — which matters at Yelp's p75 <3 s latency budget when the review corpus per business can be dozens of KB.
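
For contrast, a sketch of the naive per-keyword scan described above: k independent passes over the document rather than a single automaton pass.

```python
def naive_find_hits(keywords, document):
    """O(n * k) baseline: scan the document once per keyword."""
    text = document.lower()
    hits = []
    for keyword in keywords:  # one full pass over the document per keyword
        needle = keyword.lower()
        position = text.find(needle)
        while position != -1:
            hits.append((position, position + len(needle)))
            position = text.find(needle, position + 1)
    return hits
```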

Why it works

  • Long documents are sparse in signal — most text in a review is off-topic for any given question. Extraction exploits that sparsity.
  • String matching is cheap relative to LLM inference — a few milliseconds of CPU to save thousands of prompt tokens is an overwhelmingly favourable trade.
  • The upstream keyword generator has already done the "what's relevant" work; snippet extraction is just the physical materialisation of that decision into fewer tokens.

Failure modes

  • Keyword quality cap. If generated keywords are too vague (match everything) or too specific (match nothing), the snippet set is bad regardless of the matcher quality. Yelp explicitly names this trade-off.
  • Generic questions. For prompts like "what should I know about this place?" the keyword generator emits no keywords (by design). Aho-Corasick can't extract from an empty automaton — the path switches to "fetch recent content without term constraints" at the retrieval layer, bypassing snippet extraction entirely.
  • Paraphrase / semantic gaps. "quiet" vs. "not loud" — string matching misses semantic paraphrases. Mitigation: embedding-based retrieval as a complement (Yelp explicitly names this as a followup).
  • Window-boundary artefacts. A snippet that cuts off mid-sentence confuses the LLM. Snap windows to sentence boundaries when possible.
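
A minimal sketch of the sentence-boundary snapping suggested in the last bullet, using naive punctuation-based boundaries; a production system might use a proper sentence segmenter.

```python
import re

def snap_to_sentences(document, start, end):
    """Grow a (start, end) window outward to the nearest sentence boundaries."""
    left = document.rfind(". ", 0, start)
    start = 0 if left == -1 else left + 2
    match = re.search(r"[.!?](?:\s|$)", document[end:])
    end = len(document) if match is None else end + match.end()
    return document[start:end]
```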

Relation to sibling patterns

Seen in
