CONCEPT Cited by 5 sources
Retrieval-Augmented Generation (RAG)
Definition
Retrieval-Augmented Generation (RAG) is the inference-time architectural pattern where an LLM's context is augmented with documents retrieved from an external knowledge base at query time, before the model generates its response. RAG is the canonical mechanism that lets a batch-trained frontier model reason over data not in its training corpus — including fresh / real-time / private data — without retraining.
The Corless 2026-01-13 Redpanda post names RAG and MCP as the two inference-time mechanisms for real-time data access:
"they can increasingly access and reason upon data presented in real time, such as scouring social media video and the latest posts and newsfeeds, or accessing a database in a RAG or MCP architecture, this is at inference time." (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls)
Canonical RAG flow
- Index corpus — typically documents chunked + embedded into a vector database. See concepts/embedding-dimension-diminishing-returns for the dimensionality-vs-quality trade-off, and concepts/hybrid-retrieval-bm25-vectors for the common dense+sparse retrieval shape.
- Embed the query, retrieve top-k relevant chunks.
- Inject retrieved chunks into the LLM's prompt context ("retrieved-then-generated").
- Generate the response conditioned on retrieved context.
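The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a reference implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and `generate` is a placeholder for the LLM call. All function names and the sample corpus are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Step 1: index the corpus (chunks are assumed to be pre-split).
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    # Step 2: embed the query and keep the top-k most similar chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # Step 3: inject the retrieved chunks into the prompt context.
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    # Step 4: placeholder for the LLM call; a real system would send `prompt` to a model.
    return f"[model response conditioned on {len(prompt)} prompt characters]"

# Hypothetical corpus, invented for illustration.
corpus = [
    "Refunds are issued within five business days.",
    "The billing cycle starts on the first of each month.",
    "Password resets require a verified email address.",
]
index = build_index(corpus)
question = "How long do refunds take?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
answer = generate(prompt)
```

The model's weights never change in this flow; only the prompt does, which is why the corpus can be refreshed independently of training.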
RAG as the iterative-pipeline axis
The 2025-06-24 Redpanda "streaming as backbone" essay canonicalised a concrete streaming-infrastructure benefit for RAG pipelines: because long-lived tiered-storage streams are replayable, teams can re-run historical data through different embedding models or chunking strategies without re-extracting it from source. See concepts/stream-replayability-for-iterative-pipelines (Source: sources/2025-06-24-redpanda-why-streaming-is-the-backbone-for-ai-native-data-platforms).
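A rough sketch of the replay idea, assuming the retained stream can be modelled as an append-only list of raw records. The `retained_log`, the `replay` helper, the word-count chunkers, and both toy embedders are hypothetical stand-ins; no real streaming client or embedding model is involved.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for a retained stream: an append-only list of raw records.
retained_log = [
    {"offset": 0, "text": "alpha beta gamma delta"},
    {"offset": 1, "text": "epsilon zeta eta theta iota"},
]

def replay(log: list[dict], from_offset: int = 0) -> Iterator[dict]:
    """Yield records from the retained log, starting at the given offset."""
    for record in log:
        if record["offset"] >= from_offset:
            yield record

def chunk_by_words(size: int) -> Callable[[str], list[str]]:
    """Return a chunker that splits text into runs of `size` words."""
    def chunker(text: str) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return chunker

# Two toy "embedding models" with different output dimensionality.
def toy_embed_v1(chunk: str) -> list[float]:
    return [float(len(chunk))]

def toy_embed_v2(chunk: str) -> list[float]:
    return [float(len(chunk)), float(len(chunk.split()))]

def build_index(records: Iterable[dict], chunker, embedder) -> list[tuple[int, str, list[float]]]:
    """Chunk and embed every replayed record; the source system is never re-queried."""
    index = []
    for record in records:
        for chunk in chunker(record["text"]):
            index.append((record["offset"], chunk, embedder(chunk)))
    return index

# Same retained data, two different pipeline configurations.
index_a = build_index(replay(retained_log), chunk_by_words(2), toy_embed_v1)
index_b = build_index(replay(retained_log), chunk_by_words(3), toy_embed_v2)
```

Both indexes are derived from the same retained records, so comparing chunking or embedding choices only costs a replay, not a fresh extraction.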
Caveats
- Stub. This page is a minimal canonical anchor; deeper RAG architecture (chunking strategies, reranking, query rewriting, HyDE, self-consistency) is not walked here.
- RAG ≠ training on real-time data. RAG exposes fresh data to a frozen model at inference time. The batch-training boundary is unchanged: the model's weights don't learn from retrieved chunks.
- RAG hallucinations. RAG mitigates but doesn't eliminate hallucination; the model can still confabulate despite having correct retrieved context in its prompt.
Seen in
- 2026-01-13 Redpanda — The convergence of AI and data streaming, Part 1 (sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls) — named as one of the two inference-time real-time-data access mechanisms (alongside MCP) that do not cross the batch-training boundary.
- 2025-06-24 Redpanda — Why streaming is the backbone for AI-native data platforms (sources/2025-06-24-redpanda-why-streaming-is-the-backbone-for-ai-native-data-platforms) — stream-replayability as the iterative-RAG-pipeline unlock.
- 2025-02-04 Yelp — Search query understanding with LLMs: from ideation to production (sources/2025-02-04-yelp-search-query-understanding-with-llms) — side-input RAG variant at structured-extraction altitude. Yelp augments LLM prompts with structured signals from in-house ML systems ("names of businesses that have been viewed for that query" for segmentation; "most relevant business categories" for review-highlight expansion) rather than retrieving document chunks. See patterns/rag-side-input-for-structured-extraction for the pattern.
- 2026-04-23 AWS — Modernizing KYC with AWS serverless solutions and agentic AI (sources/2026-04-23-aws-modernizing-kyc-with-aws-serverless-solutions-and-agentic-ai) — context-aware-retrieval variant over a regulatory corpus. S3 stores regulations (BSA, USA PATRIOT Act, AMLD, MAS, FATF), compliance rules, and vendor docs; OpenSearch Serverless indexes Bedrock-generated embeddings; queries are enriched with jurisdiction / document-type / risk-level metadata before vector search so the returned chunks are regulatorily-relevant, not just semantically close. Canonical agent-consumption pattern — the five KYC sub-agents consume the retrieved chunks as grounding for explainable compliance decisions. See concepts/context-aware-retrieval.
- 2026-05-27 Yelp — Beyond the Menu Tree: How Yelp Built a Smarter Customer Success Chatbot with AI (sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot) — customer-support / chatbot RAG variant with two load-bearing structural primitives. (1) Metadata-only embedding + whole-article retrieval resolves the chunk-size dilemma: embedding the article body dilutes signal (concepts/embedding-signal-dilution), while embedding paragraph chunks produces too many false candidates. Yelp embeds (title, summary, each top header) as separate segments, keeping the embeddings narrow, but resolves each hit to the whole article via dedupe-by-article-id. Disclosed retrieval quality: ~94% recall@5 on Yelp's evaluation dataset. The pattern requires metadata-rich source documents (well-titled, summary-bearing, header-structured); the canonical fit is support-center articles. (2) In-container in-memory vectorstore (patterns/in-memory-vectorstore-loaded-at-container-start): ~370 articles × ~5 segments × 1,536-dim ada-002 vectors → ~8 MB FAISS-quantized vectorstore loaded directly into container memory at health-check time. No remote vector DB. The refresh substrate is a daily S3 batch CSV pipeline (data-not-index in S3; FAISS index rebuilt every container start). The QA workflow is one of five workflows (RAG runs only for QA; Cancel + Review return templates; Billing returns deterministic UI; Refund guides through a form); this LLM-as-router shape minimises LLM generative surface area. A three-axis output gate (trust & safety / valid URL / character limit) checks LLM-generated content before delivery; the URL check is the canonical wiki disclosure of LLM hyperlink hallucination mitigation via per-response allowlist validation. A/B-test outcome: doubled chatbot resolution rate vs the legacy menu-tree + fixed-phrase chatbot.
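The metadata-only embedding with whole-article retrieval described in the Yelp chatbot entry can be sketched roughly as follows. This is an illustration under stated assumptions, not Yelp's implementation: the articles, their ids, and the bag-of-words `embed` function are invented stand-ins (the source describes ada-002 vectors in a FAISS index), and `retrieve_articles` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

# Hypothetical support articles; ids, titles, and bodies are invented for illustration.
articles = {
    "a1": {"title": "Update billing details", "summary": "Change the credit card on file",
           "headers": ["Edit payment method", "Billing history"],
           "body": "Full text of the billing article."},
    "a2": {"title": "Claim your business page", "summary": "Verify ownership of a listing",
           "headers": ["Verification by phone"],
           "body": "Full text of the claiming article."},
    "a3": {"title": "Respond to a review", "summary": "Reply publicly or privately to customers",
           "headers": ["Public replies"],
           "body": "Full text of the review article."},
}

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_segments(arts: dict) -> list[tuple[str, Counter]]:
    """Embed title, summary, and each top header as separate segments; the body is never embedded."""
    segments = []
    for article_id, art in arts.items():
        for text in [art["title"], art["summary"], *art["headers"]]:
            segments.append((article_id, embed(text)))
    return segments

def retrieve_articles(query: str, segments, arts: dict, k: int = 2) -> list[tuple[str, str]]:
    """Rank segments, then dedupe by article id and return whole article bodies."""
    q = embed(query)
    ranked = sorted(segments, key=lambda s: cosine(q, s[1]), reverse=True)
    seen, results = set(), []
    for article_id, _ in ranked:
        if article_id not in seen:  # several segments may point at the same article
            seen.add(article_id)
            results.append((article_id, arts[article_id]["body"]))
        if len(results) == k:
            break
    return results

segments = build_segments(articles)
hits = retrieve_articles("change the card for my business listing", segments, articles, k=2)
```

Matching happens on the narrow metadata segments, but the prompt receives the full article, so the model sees complete context rather than a fragment.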
Related
- concepts/frontier-model-batch-training-boundary — the structural boundary RAG operates at inference side of.
- concepts/hybrid-retrieval-bm25-vectors — common retrieval shape.
- concepts/rag-as-a-judge — RAG-adjacent evaluation pattern.
- concepts/embedding-dimension-diminishing-returns — the dimensionality trade-off for the embedding step.
- concepts/stream-replayability-for-iterative-pipelines — the streaming-infrastructure unlock for iterative RAG.
- systems/model-context-protocol — the sibling inference-time-integration shape.
- companies/redpanda — the company whose blog canonicalises this framing.