CONCEPT Cited by 2 sources

Retrieval-Augmented Generation (RAG)

Definition

Retrieval-Augmented Generation (RAG) is the inference-time architectural pattern where an LLM's context is augmented with documents retrieved from an external knowledge base at query time, before the model generates its response. RAG is the canonical mechanism that lets a batch-trained frontier model reason over data not in its training corpus — including fresh / real-time / private data — without retraining.

The Corless 2026-01-13 Redpanda post names RAG alongside MCP as the two inference-time mechanisms it cites for reasoning over real-time data:

"they can increasingly access and reason upon data presented in real time, such as scouring social media video and the latest posts and newsfeeds, or accessing a database in a RAG or MCP architecture, this is at inference time." (Source: sources/2026-01-13-redpanda-the-convergence-of-ai-and-data-streaming-part-1-the-coming-brick-walls)

Canonical RAG flow

  1. Index corpus — typically documents chunked + embedded into a vector database. See concepts/embedding-dimension-diminishing-returns for the dimensionality-vs-quality trade-off, and concepts/hybrid-retrieval-bm25-vectors for the common dense+sparse retrieval shape.
  2. Embed the query, retrieve top-k relevant chunks.
  3. Inject the retrieved chunks into the LLM's prompt context ("retrieve-then-generate").
  4. Generate the response conditioned on retrieved context.
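The four steps above can be sketched end to end. This is a minimal, self-contained illustration: the bag-of-words "embedding" and in-memory index are stand-ins for a real embedding model and vector database, and all names here are hypothetical.

```python
import math
from collections import Counter

# Toy corpus standing in for chunked documents in a vector database.
CORPUS = [
    "Redpanda is a streaming data platform compatible with Kafka APIs.",
    "RAG injects retrieved documents into an LLM prompt at inference time.",
    "Tiered storage lets streams be replayed from historical offsets.",
]

def embed(text):
    # Stand-in embedding: lowercase bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: index the corpus (chunk + embed each document).
INDEX = [(doc, embed(doc)) for doc in CORPUS]

def retrieve(query, k=2):
    # Step 2: embed the query and take the top-k most similar chunks.
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    # Step 3: inject the retrieved chunks into the prompt context.
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Step 4 would pass this prompt to the frozen LLM for generation.
print(build_prompt("How does RAG work at inference time?"))
```

The model's weights never change in this loop; only the prompt does, which is the whole point of the pattern.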

RAG as the iterative-pipeline axis

The 2025-06-24 Redpanda "streaming as backbone" essay canonicalised a concrete streaming-infrastructure benefit of RAG: replayability of long-lived tiered-storage streams lets teams re-run historical data through different embedding models or chunking strategies without re-extracting from source. See concepts/stream-replayability-for-iterative-pipelines (Source: sources/2025-06-24-redpanda-why-streaming-is-the-backbone-for-ai-native-data-platforms).
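The replayability benefit can be made concrete with a sketch. Here a Python list stands in for a replayable tiered-storage topic of raw documents, and the chunking functions are illustrative, not any real API: swapping the chunking (or embedding) strategy means replaying the log from offset 0, never re-extracting from source systems.

```python
# Replayable log of raw documents (stand-in for a tiered-storage topic).
RAW_DOC_LOG = [
    "Streaming platforms retain raw events. Tiered storage keeps history cheap.",
    "Replaying the log rebuilds pipelines. No re-extraction from source is needed.",
]

def chunk_v1(doc):
    # First-pass strategy: the whole document as a single chunk.
    return [doc]

def chunk_v2(doc):
    # Revised strategy: one chunk per sentence-ish fragment.
    return [part.strip() for part in doc.split(".") if part.strip()]

def rebuild_index(chunker):
    # Replay the log from offset 0 and re-chunk everything; the source
    # systems the log was extracted from are never touched again.
    index = []
    for offset, doc in enumerate(RAW_DOC_LOG):
        for chunk in chunker(doc):
            index.append({"offset": offset, "chunk": chunk})
    return index

index_v1 = rebuild_index(chunk_v1)  # initial pipeline: 2 chunks
index_v2 = rebuild_index(chunk_v2)  # same data, new strategy: 4 chunks
print(len(index_v1), len(index_v2))
```

The same replay would feed a different embedding model: only the function applied per record changes, the retained stream does not.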

Caveats

  • Stub. This page is a minimal canonical anchor; deeper RAG architecture (chunking strategies, reranking, query rewriting, HyDE, self-consistency) is not walked here.
  • RAG ≠ training on real-time data. RAG exposes fresh data to a frozen model at inference time. The batch-training boundary is unchanged — the model's weights don't learn from retrieved chunks.
  • RAG hallucinations. RAG mitigates but doesn't eliminate hallucination; the model can still confabulate despite having correct retrieved context in its prompt.
