CONCEPT Cited by 1 source

Federated vs indexed retrieval

Federated retrieval and indexed retrieval are the two high-level architectural choices for building a cross-source agentic retrieval surface. Dash's Josh Clemm frames the split as "very classic software engineering": process everything on the fly (federated) vs pre-process it at ingestion time (indexed) (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash).

Each has structural pros and cons that drive the choice — usually "start federated, move toward indexed as you scale."

Federated retrieval

Mechanism. At query time, the agent fans out to third-party APIs / MCP servers / connectors in parallel, each of which runs its own retrieval against its own data. Results are merged + re-ranked client-side.
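The fan-out described above can be sketched with `asyncio`. This is a minimal illustration, not Dash's implementation: `search_source` is a hypothetical stand-in for a real connector or MCP call, the latencies are simulated, and the merge step is a plain concatenation with no re-ranking.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Hit:
    source: str
    title: str
    score: float  # source-local score; not comparable across sources


async def search_source(name: str, delay: float, query: str) -> list[Hit]:
    """Hypothetical connector: stands in for a call to a source API or MCP server."""
    await asyncio.sleep(delay)  # simulated network + source-side retrieval latency
    return [Hit(name, f"{name} result for {query!r}", score=1.0)]


async def federated_search(query: str, sources: dict[str, float], timeout: float) -> list[Hit]:
    """Query every source in parallel; drop sources that miss the deadline."""

    async def guarded(name: str, delay: float) -> list[Hit]:
        try:
            return await asyncio.wait_for(search_source(name, delay, query), timeout)
        except asyncio.TimeoutError:
            return []  # a slow source contributes nothing rather than stalling the query

    per_source = await asyncio.gather(*(guarded(n, d) for n, d in sources.items()))
    # Merge step: plain concatenation here; real re-ranking would happen after this.
    return [hit for hits in per_source for hit in hits]


if __name__ == "__main__":
    # Simulated latencies in seconds; "wiki" exceeds the timeout and is dropped.
    simulated = {"docs": 0.01, "chat": 0.02, "wiki": 0.5}
    hits = asyncio.run(federated_search("q3 roadmap", simulated, timeout=0.1))
    print([h.source for h in hits])
```

The per-source timeout is one way to keep a single slow provider from setting the latency of the whole query, at the cost of silently losing that provider's results.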

Pros

  • Easy to get up and running: no index to build.
  • No storage cost: data stays with its owner.
  • Freshness is mostly free: you get whatever the source system has right now.
  • Adding new sources = adding new MCP servers / connectors; linear integration work.

Cons

  • At the mercy of source APIs. Speed, quality, and ranking vary unpredictably across providers.
  • Can't access company-wide content. You can query your own data in each source system, but not content shared across the whole org (unless you have admin creds and a company-wide connector, at which point you're moving toward indexed).
  • Post-processing is heavy. Results from N sources must be merged + re-ranked with no shared scoring baseline.
  • Context window explodes. Every MCP tool definition lives in the agent's context window (concepts/agent-context-window) every turn; every source's result shape is different; token counts grow with source count.
  • Slow. Dash reports simple agentic MCP queries taking up to ~45 seconds end-to-end vs "within seconds" on the raw index (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash).
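To make the "no shared scoring baseline" problem concrete: one common way to merge rankings whose raw scores are not comparable is reciprocal rank fusion (RRF), which uses only each item's rank position. The source does not say which merge method Dash uses; the sketch below is an illustrative technique swapped in for the example, and the document IDs and the constant `k = 60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse per-source rankings using only rank positions, not raw scores."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; tie-break on doc_id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


if __name__ == "__main__":
    # Hypothetical rankings returned by three source systems.
    docs_api = ["roadmap", "budget", "notes"]
    chat_api = ["budget", "roadmap"]
    wiki_api = ["budget", "faq"]
    print(reciprocal_rank_fusion([docs_api, chat_api, wiki_api]))
    # prints ['budget', 'roadmap', 'faq', 'notes']
```

Items that appear in several sources accumulate score and rise; when the sources return disjoint items, RRF simply interleaves them by rank, which shows how little signal a client-side merge has to work with.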

Indexed retrieval

Mechanism. A pre-built unified index spans all sources. A single retrieval tool (patterns/unified-retrieval-tool) queries the index; ranking happens offline or at query time against pre-computed signals (knowledge graph + per-user context + semantic + BM25).
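As a rough illustration of this mechanism, the sketch below builds a toy inverted index at ingestion time and exposes one `search` entry point over all sources. It covers only the BM25 signal; the knowledge-graph, per-user, and semantic signals mentioned above are out of scope. The class name, method names, and sample documents are hypothetical, not Dash's API.

```python
import math
from collections import Counter, defaultdict


class UnifiedIndex:
    """Toy inverted index over documents from many sources (ingestion-time work)."""

    def __init__(self) -> None:
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)  # term -> {doc_id: tf}
        self.doc_len: dict[str, int] = {}
        self.doc_source: dict[str, str] = {}

    def ingest(self, doc_id: str, source: str, text: str) -> None:
        """Tokenize on whitespace and record term frequencies for one document."""
        terms = text.lower().split()
        self.doc_len[doc_id] = len(terms)
        self.doc_source[doc_id] = source
        for term, tf in Counter(terms).items():
            self.postings[term][doc_id] = tf

    def search(self, query: str, top_k: int = 3, k1: float = 1.2, b: float = 0.75) -> list[tuple[str, str]]:
        """Single retrieval entry point: BM25 over the shared index, one ranked list out."""
        if not self.doc_len:
            return []
        n_docs = len(self.doc_len)
        avg_len = sum(self.doc_len.values()) / n_docs
        scores: dict[str, float] = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = tf + k1 * (1 - b + b * self.doc_len[doc_id] / avg_len)
                scores[doc_id] += idf * tf * (k1 + 1) / norm
        ranked = sorted(scores, key=lambda d: (-scores[d], d))[:top_k]
        return [(doc_id, self.doc_source[doc_id]) for doc_id in ranked]


if __name__ == "__main__":
    index = UnifiedIndex()
    index.ingest("d1", "docs", "q3 roadmap planning draft")
    index.ingest("c1", "chat", "lunch order thread")
    index.ingest("w1", "wiki", "roadmap process overview and roadmap template")
    print(index.search("roadmap"))
```

Because every document is scored by the same ranker against the same corpus statistics, the results arrive already comparable, which is the property the federated merge step lacks.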

Pros

  • Company-wide content is queryable because ingestion has admin-level access to each source.
  • Offline enrichment. You can run content-understanding / embedding / knowledge-graph building offline; runtime retrieval is cheap.
  • Offline ranking experiments. Iterate on recall / precision without paying runtime cost.
  • Fast. Retrieval is a read from a prepared structure, not a cross-API fan-out.
  • Lean context window. One retrieval tool; one ranked result list; no per-source tool schemas.

Cons

  • A ton of custom work. "This is not for the faint of heart. You have to write a lot of custom connectors."
  • Freshness is limited by rate limits. Ingestion is bounded by per-source API rate limits; stale data is a product concern.
  • Storage cost. At scale, indexed retrieval is expensive to host.
  • Architecture choice. Classic vector-only RAG vs BM25 vs hybrid vs full graph-RAG is a load-bearing decision. Dash went graph-RAG (concepts/knowledge-graph + concepts/hybrid-retrieval-bm25-vectors).

Why the split is axis-defining for agents

For traditional keyword search, federated vs indexed is a performance / storage tradeoff. For agents, it becomes a context-budget tradeoff too:

  • Federated + MCP per source → N tool descriptions resident in context; per-query tokens scale with result fan-out.
  • Indexed + one super-tool → 1 tool description; per-query tokens are a pre-ranked top-K from one ranker.
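The two bullets above amount to simple arithmetic, sketched below. Every number in the example is a made-up placeholder chosen to show the shape of the scaling; none of them is a measurement from Dash, and the function and parameter names are hypothetical.

```python
def federated_context_tokens(
    n_sources: int, tool_desc_tokens: int, results_per_source: int, tokens_per_result: int
) -> int:
    """N tool descriptions stay resident, plus results fanned in from every source."""
    resident = n_sources * tool_desc_tokens
    results = n_sources * results_per_source * tokens_per_result
    return resident + results


def indexed_context_tokens(tool_desc_tokens: int, top_k: int, tokens_per_result: int) -> int:
    """One tool description, plus a single pre-ranked top-K list."""
    return tool_desc_tokens + top_k * tokens_per_result


if __name__ == "__main__":
    # Illustrative assumptions only: 10 sources, 400-token tool descriptions,
    # 5 results per source, 200 tokens per result, top-10 from the index.
    print(federated_context_tokens(10, 400, 5, 200))  # prints 14000
    print(indexed_context_tokens(400, 10, 200))       # prints 2400
```

The federated total grows linearly with the number of sources in both terms, while the indexed total is independent of how many sources feed the index.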

The jump from traditional-search framing to agent framing is why Dash describes the pattern evolution as: "if you're just getting started, absolutely invest in those MCP tools and everything on the real-time side. And then, over time, as you start to see what your customers are doing and you start to get some more scale, look for opportunities to optimize overall."

Where Dash lands

Dash chose indexed retrieval (graph-RAG) as the primary path, but still exposes its index over MCP via the Dash MCP Server — so external agents consuming Dash see an MCP-compatible surface backed by the indexed substrate. Federated-style MCP clients get the indexed-side advantages without giving up the MCP integration shape.

Seen in

Last updated · 200 distilled / 1,178 read