PATTERN Cited by 1 source

Whole-article retrieval via metadata segments

Pattern shape

For a RAG system over a corpus of short, well-titled, header-structured documents (support articles, FAQ entries, knowledge-base pages), embed several metadata-derived segments per document into a vectorstore, typically the title, the summary, each top-level header, and optional historical-intent strings. At retrieval time, return the whole document that a matching segment belongs to, not just the segment that matched.

The pattern resolves a common RAG trade-off:

  • Whole-article embedding dilutes the signal (concepts/embedding-signal-dilution).
  • Paragraph chunking produces too many false candidates.
  • Metadata-only-embedding + whole-article-retrieval sits between the two extremes.

Structure — five steps

                        ┌─────────────────────────┐
1. Document ingest:     │ Article (title, summary,│
                        │  body, headers, links)  │
                        └──────────┬──────────────┘
                  Extract: title / summary / each header /
                  historical intent strings (optional)
                        ┌─────────────────────────┐
2. Embed each segment:  │ Title   → vec_t         │
                        │ Summary → vec_s         │
                        │ Header1 → vec_h1        │
                        │ Header2 → vec_h2        │
                        │ ...                     │
                        └──────────┬──────────────┘
                        ┌─────────────────────────┐
                        │ Vectorstore (FAISS)     │
                        │ → article_id ↔ vec map  │
                        └──────────┬──────────────┘

3. Query path:                     │
                        ┌──────────▼──────────────┐
                        │ Embed query (vec_q)     │
                        └──────────┬──────────────┘
4. Over-fetch + threshold:         │
                        ┌──────────▼──────────────┐
                        │ Find k nearest vectors  │
                        │ k = max_segments × 5    │
                        │ Drop if sim < threshold │
                        └──────────┬──────────────┘
5. Dedupe to whole articles:       │
                        ┌──────────▼──────────────┐
                        │ Group by article_id     │
                        │ Take top-5 unique       │
                        │ Pass FULL ARTICLE bodies│
                        │ to LLM as RAG context   │
                        └─────────────────────────┘
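
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not Yelp's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the brute-force sort stands in for a FAISS nearest-neighbour search, and the sample articles, the 0.3 threshold, and all function names are assumptions made for the example. It also assumes the factor of 5 in the over-fetch corresponds to the top-5 target.

```python
import math
import re
from collections import Counter

# Hypothetical sample corpus; field names are assumptions for this sketch.
ARTICLES = {
    "a1": {"title": "Reset your password",
           "summary": "Steps to recover account access",
           "headers": ["Forgot password", "Locked account"],
           "body": "To reset your password, open Settings and choose Reset."},
    "a2": {"title": "Update business hours",
           "summary": "Change opening times on your page",
           "headers": ["Holiday hours"],
           "body": "Business hours can be edited from the business dashboard."},
}

def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def build_index(articles):
    """Steps 1-2: embed title, summary, and each header. The body is not embedded."""
    index = []  # list of (article_id, segment_vector)
    for article_id, art in articles.items():
        for segment in [art["title"], art["summary"], *art["headers"]]:
            index.append((article_id, toy_embed(segment)))
    return index

def retrieve(query, articles, index, top_articles=5, threshold=0.3):
    """Steps 3-5: embed query, over-fetch, threshold, dedupe, return full bodies."""
    max_segments = max(Counter(aid for aid, _ in index).values())
    k = max_segments * top_articles          # over-fetch: k = max_segments x 5
    q = toy_embed(query)
    nearest = sorted(((cosine(q, vec), aid) for aid, vec in index), reverse=True)[:k]
    seen, results = set(), []
    for score, aid in nearest:
        if score < threshold or aid in seen:  # drop weak matches and repeat articles
            continue
        seen.add(aid)
        results.append((aid, score, articles[aid]["body"]))  # whole article, not the segment
        if len(results) == top_articles:
            break
    return results

index = build_index(ARTICLES)
results = retrieve("forgot my password", ARTICLES, index)
```

Here two segments of article `a1` match the query, but deduplication returns the article once, and `a2` is dropped by the threshold because none of its segments overlap with the query.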

Canonical instance — Yelp CS Chatbot (2026-05-27)

"Each of these components (the title, the summary, and each distinct top header etc.) is treated as a separate text input and individually embedded into the vectorstore. This method allows us to capture the distinct semantic signal from these concise segments, which was found to be more effective than embedding larger texts." (Source: sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot)

Operational substrate:

Property            Value
Corpus size         ~370 Yelp Support Center articles
Embedding model     OpenAI text-embedding-ada-002
Embedding dim       1,536
Total vectorstore   ~8 MB after FAISS quantization
Over-fetch factor   k = max_items_per_article × 5
Filter              Empirically-determined similarity threshold
Dedupe target       Top-5 unique articles
Retrieval quality   ~94% recall@5

Three structural pieces

  1. Per-article metadata extraction. Each article contributes (title, summary, each top header, optional historical-intent strings from legacy systems) as independent embedding segments. The article body is not embedded.
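  2. Multi-segment-to-unique-article aggregation. A vectorstore query returns segment vectors, and several of them can belong to the same article; deduplication by article_id collapses them to unique articles. Because no article contributes more than max_items_per_article vectors, over-fetching k = max_items_per_article × 5 vectors guarantees at least 5 distinct articles among the candidates (provided the index holds at least k vectors), before the threshold filter is applied.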
  3. Similarity threshold filter. Articles whose best-segment match still scores below an empirical threshold are discarded. Prevents low-confidence garbage from polluting the LLM context.
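
As a rough illustration of pieces 1 and 2, the sketch below shows one way to turn an article into embedding segments and to derive the over-fetch size. The `Article` dataclass, its field names, and the helper functions are all hypothetical; the source does not describe Yelp's data model.

```python
from dataclasses import dataclass, field
from math import ceil

@dataclass
class Article:
    article_id: str
    title: str
    summary: str
    headers: list[str]
    body: str  # kept for the LLM context, never embedded
    historical_intents: list[str] = field(default_factory=list)  # optional

def segments(article):
    """Return (article_id, kind, text) triples to embed; the body is excluded."""
    out = [(article.article_id, "title", article.title),
           (article.article_id, "summary", article.summary)]
    out += [(article.article_id, "header", h) for h in article.headers]
    out += [(article.article_id, "intent", s) for s in article.historical_intents]
    return out

def over_fetch_k(articles, top_k=5):
    """k = (largest segment count of any article) x top_k."""
    max_items = max(len(segments(a)) for a in articles)
    return max_items * top_k

def min_unique_articles(k, max_items):
    """Pigeonhole bound: k vectors with at most max_items per article."""
    return ceil(k / max_items)

# Hypothetical two-article corpus.
corpus = [
    Article("a1", "Reset your password", "Recover account access",
            ["Forgot password", "Locked account"],
            "To reset your password, open Settings and choose Reset.",
            historical_intents=["cannot log in"]),
    Article("a2", "Update business hours", "Change opening times",
            ["Holiday hours"],
            "Business hours can be edited from the business dashboard."),
]

k = over_fetch_k(corpus)  # a1 has 5 segments, so k = 5 * 5 = 25
```

The bound in `min_unique_articles` holds before thresholding only; the similarity filter can still reduce the final set below five articles.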

When to apply

Use when:

  • The corpus is small (hundreds to low-thousands of documents) — vectorstore size remains manageable for in-memory loading.
  • Each document has rich, structured metadata — well-titled, summary-bearing, header-structured. Support-center articles, FAQ entries, product knowledge bases.
  • Documents are short enough (a few hundred to ~2,000 words) that five whole articles fit comfortably in the LLM context budget.
  • User queries are short and narrow (Q&A-style) — the embedding-similarity match against narrow metadata segments is what makes this work.

Don't use when:

  • Corpus is large (10K+ documents) — metadata-only embedding may not provide enough discriminative signal.
  • Documents are long (10K+ words) — whole-article retrieval blows the LLM context budget.
  • Documents lack meaningful titles / summaries / headers — prose articles, news stories, internal SEO-driven content may not have query-relevant metadata.
  • Query distribution is heterogeneous — some queries need body-text matching that metadata-only embedding can't cover.

Trade-offs

  • Storage ↓↓ — Yelp's ~8 MB vs the ~50× larger paragraph-chunked equivalent.
  • Retrieval latency ↓ — fewer vectors to search; in-memory vectorstore feasible.
  • Recall vs paragraph chunking depends on metadata quality. Yelp reports ~94% recall@5 — the residual ~6% likely contains body-only queries that metadata can't cover.
  • LLM context size vs paragraph chunking ↑ — whole-article inclusion uses more context tokens but provides coherent answer surface area.
  • Hyperlink coherence ↑ — whole-article retrieval makes the link allowlist trivially the union of in-article links. Chunk retrieval fragments the allowlist.
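
The hyperlink point can be made concrete with a short sketch: because whole article bodies are passed to the LLM, the allowlist is just the set of URLs found in those bodies. The regex, the function names, and the example.com URLs below are assumptions for illustration; the source does not say how Yelp validates links.

```python
import re

# Simple URL matcher for this sketch; real link extraction may need an HTML parser.
URL_RE = re.compile(r"https?://[^\s)\]>]+")

def find_urls(text):
    """Extract URLs and trim trailing sentence punctuation."""
    return [u.rstrip(".,;:!?") for u in URL_RE.findall(text)]

def link_allowlist(retrieved_bodies):
    """Union of every URL found in the retrieved whole-article bodies."""
    allowed = set()
    for body in retrieved_bodies:
        allowed.update(find_urls(body))
    return allowed

def disallowed_links(answer, allowed):
    """URLs in the LLM answer that do not appear in any retrieved article."""
    return [u for u in find_urls(answer) if u not in allowed]

# Hypothetical retrieved bodies and LLM answer.
bodies = [
    "To reset, visit https://example.com/reset.",
    "See https://example.com/hours and https://example.com/holidays",
]
allowed = link_allowlist(bodies)
answer = "Go to https://example.com/reset, or try https://example.com/unknown."
bad = disallowed_links(answer, allowed)
```

With chunk retrieval the same check would need to track which chunks carried which links, since a link may sit in a chunk that was not retrieved.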

Risks

  • Body-content coverage gap. Queries whose answer lives only in mid-paragraph body text miss. Mitigation: add a second-stage body-text re-search over the top-K metadata-matched articles for low-confidence first-stage results.
  • Metadata-rich-article dominance. Articles with many headers contribute many segments, so without over-fetching, the k nearest vectors can all belong to the same article and deduplication leaves fewer than five unique articles. Mitigation: the × 5 over-fetch factor.
  • Threshold tuning. The empirical similarity threshold needs per-corpus calibration; too high → low recall, too low → noise leaks in. Yelp doesn't disclose how the threshold is re-calibrated when the corpus changes.
  • Stale historical-intent strings. Yelp seeds the vectorstore with "a subset of historical intents/responses from the legacy chatbot" — these may become stale as the product surface evolves. No disclosed re-mining cadence.
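
For the threshold-tuning risk, one plausible calibration approach is to sweep candidate thresholds over a small labeled query set and watch recall@5 against the number of articles that reach the LLM. This is a sketch under stated assumptions, not a disclosed Yelp procedure: the function names are hypothetical, the scores are made-up sample data, and each query's candidate list is assumed to be already deduplicated and sorted by best-segment score.

```python
def recall_at_k(ranked_by_query, expected, threshold, k=5):
    """Fraction of queries whose expected article survives the threshold in the top k."""
    hits = 0
    for query, ranked in ranked_by_query.items():
        kept = [aid for aid, score in ranked if score >= threshold][:k]
        hits += expected[query] in kept
    return hits / len(ranked_by_query)

def avg_context_articles(ranked_by_query, threshold, k=5):
    """Mean number of articles passed to the LLM per query (a rough noise proxy)."""
    total = sum(len([aid for aid, score in ranked if score >= threshold][:k])
                for ranked in ranked_by_query.values())
    return total / len(ranked_by_query)

def sweep(ranked_by_query, expected, thresholds, k=5):
    """Return (threshold, recall@k, avg articles in context) for each candidate."""
    return [(t,
             recall_at_k(ranked_by_query, expected, t, k),
             avg_context_articles(ranked_by_query, t, k))
            for t in thresholds]

# Hypothetical labeled data: query -> [(article_id, best_segment_score), ...]
ranked = {
    "forgot password": [("a1", 0.82), ("a3", 0.35), ("a2", 0.10)],
    "holiday hours":   [("a2", 0.71), ("a1", 0.20)],
    "refund policy":   [("a4", 0.45), ("a3", 0.40)],
}
expected = {"forgot password": "a1", "holiday hours": "a2", "refund policy": "a4"}

table = sweep(ranked, expected, thresholds=[0.0, 0.3, 0.5, 0.75])
```

On this sample, raising the threshold from 0.3 to 0.5 trims the context but loses the "refund policy" query, which is the recall-versus-noise tension described above. Re-running the sweep after corpus changes is one way to re-calibrate.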
