DROPBOX 2026-01-28 Tier 2

Dropbox: VP Josh Clemm on how we use knowledge graphs, MCP, and DSPy in Dash

Summary

Edited + condensed version of a talk Josh Clemm (VP of Engineering for Dropbox Dash) gave as a guest speaker in Jason Liu's online RAG course on Maven. Mini deep-dives on five areas that together describe the Dash tech stack: (1) the end-to-end context engine — connectors → content understanding → knowledge-graph modeling → hybrid index → ranking; (2) the architectural choice of index-based vs federated retrieval and the tradeoffs each entails; (3) making MCP work at Dash scale — four concrete fixes for token/latency degradation; (4) the pragmatic shape of Dash's knowledge graph (notably: not stored in a graph database, built asynchronously, flattened into "knowledge bundles" re-ingested via the same index pipeline); and (5) the LLM-as-judge + DSPy evaluation flywheel that drives relevance quality improvement. Closes with "make it work, then make it better" advice: start federated + MCP, move toward indexed + knowledge-graph as usage scales.

Key takeaways

  1. Context engine is a five-stage pipeline. Connectors (custom crawlers honoring each app's rate limits, API quirks, ACLs) → content understanding + enrichment (normalize to markdown, extract titles/metadata/links, generate embeddings; specialized paths for images/PDFs/audio/video) → cross-app knowledge-graph modeling → secure stores (BM25 lexical index + vector store) → multiple ranking passes (personalized, ACL'd). "Once you have that, you can introduce APIs on top of it and build entire products like Dash." (Source: sources/2026-01-28-dropbox-knowledge-graphs-mcp-dspy-dash)

  2. BM25 carries more weight than conventional-wisdom RAG suggests. Dash ended up with hybrid retrieval (BM25 lexical + dense vectors) but explicitly notes: "we found BM25 was very effective on its own with some relevant signals. It's an amazing workhorse for building out an index." Framing is practical, not ideological — BM25 is not a legacy fallback, it's load-bearing. (concepts/hybrid-retrieval-bm25-vectors)

  3. Federated vs indexed retrieval is an explicit trade. Federated is fast to build + fresh + avoids storage cost but is at the mercy of third-party API quality/speed/ranking, cannot access company-wide content, requires heavy post-processing merge + re-ranking, and pays MCP tool-tokens every turn. Indexed unlocks company-wide connectors, offline enrichment, offline ranking experiments, and runtime speed — at the cost of a lot of custom connector code, freshness-via-rate-limit management, storage cost, and an architecture decision (vector / BM25 / hybrid / graph-RAG — Dash chose graph-RAG). Dash's stated advice: "start federated + MCP; move toward indexed as you scale." (concepts/federated-vs-indexed-retrieval)

  4. MCP at scale hits four concrete walls; Dash named each fix. (a) Tool definitions fill the context window → collapse many retrieval tools into one super-tool backed by the index (patterns/unified-retrieval-tool). (b) Tool results balloon context → modeling data in knowledge graphs "significantly cut our token usage" (the graph bundle is a pre-ranked digest, not a raw dump). (c) Tool results are heavy → store them locally, not in the LLM context window (concepts/agent-context-window). (d) Complex queries sprawl → delegate to sub-agents with a classifier picking the sub-agent and a much narrower tool set (patterns/specialized-agent-decomposition). Quantified pain: "even a simple query can take up to 45 seconds" via MCP vs "within seconds" against the raw index.

  5. Knowledge graphs are not in a graph DB. Dash experimented with graph databases and found latency + query pattern + hybrid retrieval integration each a challenge. Instead, Dash builds relationships asynchronously and flattens them into "knowledge bundles" — summary embeddings, essentially — that are then pushed through the same index pipeline (chunked + embedded for both lexical and semantic retrieval). The graph is real; the runtime representation is not a graph. (concepts/knowledge-graph)

  6. Canonical entity IDs across apps are the load-bearing graph primitive. Every connected app has its own notion of "who someone is." Resolving to one canonical ID per person lets the graph answer "Jason's past context engineering talks" without per-app fan-out; Dash scores retrievals with NDCG and saw "really nice wins" from the people-based representation alone. (patterns/canonical-entity-id)

  7. LLM-as-judge iteration is a flywheel — and the judge sometimes needs its own retrieval. Dash's quality-improvement arc: first-prompt judge disagreed with humans on ~8% of labels → prompt refinement ("provide explanations") lowered it → upgrading to OpenAI's o3 reasoning model lowered it further → RAG as a judge (letting the judge fetch work-context like acronyms it wasn't trained on — "What is RAG?" in Dash's domain) lowered it further → adding DSPy lowered it further. The judge pipeline itself becomes a measured sub-system. (concepts/llm-as-judge, concepts/rag-as-a-judge)

  8. DSPy plus "disagreements as bullet points" is an emergent optimization pattern. Instead of feeding DSPy raw prompts, Dash feeds it structured bullets of judge-vs-human disagreements and lets DSPy minimize the disagreement set. The benefits Dash calls out: prompt optimization for LLM-as-judge specifically (clear rubrics make DSPy especially effective), prompt management at scale (Dash has ~30 prompts across ingest / judge / offline evals / online agentic platform; 5–15 engineers tweaking at any time; programmatic generation beats hand-edited strings in a repo), and model switching (plug new model in → DSPy re-optimizes the prompt; critical for agentic systems with a planning LLM plus many narrow sub-agents each on a specialized model). (patterns/prompt-optimizer-flywheel)

  9. "Make it work, then make it better" is Dash's explicit advice for others. Start with MCP + real-time federated retrieval; invest in indexed + knowledge graph + offline enrichment + LLM judges + prompt optimizers as you see scale. The post explicitly says the Dash techniques described took "the last few years with a big engineering team working on this day-in and day-out."
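
Takeaway 2 leaves the fusion mechanics unstated (see also the caveat below about composition mechanics). One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF) — this is a generic sketch, not Dash's disclosed method, and the doc IDs are invented:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge multiple ranked lists of doc IDs.

    Each doc scores the sum over lists of 1 / (k + rank); k=60 is the
    constant from the original RRF paper and damps top-rank dominance.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # lexical ranking
vector_hits = ["doc1", "doc5", "doc3"]  # semantic ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

RRF needs no score normalization across the two retrievers, which is why it is a popular default for hybrid setups; Dash's actual ranking passes (personalized, ACL'd) are richer than this.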
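
Takeaway 4(c) — keep heavy tool results out of the context window — can be sketched as a local store that hands the model only a short preview plus a handle it can pass back to a fetch tool. Class and field names here are invented for illustration:

```python
import hashlib
import json


class ToolResultStore:
    """Keep heavy tool outputs out of the LLM context window.

    The full payload stays local; the model sees a short digest plus a
    handle it can return to a fetch tool if it needs the rest.
    """

    def __init__(self):
        self._store = {}

    def put(self, payload: dict, preview_chars: int = 200) -> dict:
        blob = json.dumps(payload, sort_keys=True)
        handle = hashlib.sha256(blob.encode()).hexdigest()[:12]
        self._store[handle] = payload
        # Only this small dict goes into the agent's context.
        return {"handle": handle, "preview": blob[:preview_chars]}

    def fetch(self, handle: str) -> dict:
        return self._store[handle]


store = ToolResultStore()
ref = store.put({"rows": list(range(500))})  # heavy payload stays local
```

The context cost becomes constant per tool call regardless of payload size, which is the lever the post describes.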
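
The "knowledge bundle" flattening in takeaway 5 might look roughly like this: a graph node plus its edges rendered into a plain-text record that the existing chunk-and-embed pipeline can ingest, so runtime retrieval never touches a graph database. The record shape is assumed, not Dropbox's schema:

```python
def flatten_to_bundle(entity, edges):
    """Flatten one graph node and its outgoing edges into a 'knowledge
    bundle': a plain-text record the normal chunk-and-embed index
    pipeline can ingest like any other document."""
    lines = [f"Entity: {entity['name']} ({entity['type']})"]
    for relation, target in edges:
        lines.append(f"- {relation}: {target}")
    return {"id": f"bundle:{entity['id']}", "text": "\n".join(lines)}


bundle = flatten_to_bundle(
    {"id": "person:jason", "name": "Jason", "type": "person"},
    [("gave_talk", "Context engineering talk, 2025"),
     ("member_of", "Maven RAG course")],
)
```

Because the bundle is plain text, both BM25 and the embedding path index it for free — the graph is real at build time, but the runtime representation is just another document.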
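
Takeaway 6's canonical-ID primitive reduces to an alias table: every (app, app-specific ID) pair maps to one canonical entity, so edges from different connectors attach to the same node. A minimal sketch with invented IDs:

```python
class EntityResolver:
    """Map per-app identities (a Slack user ID in one app, an email in
    another) onto one canonical ID per person, so the knowledge graph
    has a coherent node model independent of source system."""

    def __init__(self):
        self._alias_to_canonical = {}

    def register(self, canonical_id, aliases):
        for app, app_id in aliases:
            self._alias_to_canonical[(app, app_id)] = canonical_id

    def resolve(self, app, app_id):
        return self._alias_to_canonical.get((app, app_id))


resolver = EntityResolver()
resolver.register("person:jason", [("slack", "U123"),
                                   ("gdrive", "jason@example.com")])
```

With this in place, "Jason's past talks" is one node lookup instead of a per-app fan-out — the property the post credits for the NDCG wins.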
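
Takeaway 8's "disagreements as bullet points" input can be illustrated without DSPy itself (whose optimizers require a configured LM): label with the judge, diff against human labels, render the disagreements as structured bullets, hand those to the optimizer, and loop while the rate drops. A library-free sketch with invented example data:

```python
def disagreement_rate(judge_labels, human_labels):
    """Fraction of examples where the LLM judge and the human disagree."""
    pairs = list(zip(judge_labels, human_labels))
    return sum(j != h for j, h in pairs) / len(pairs)


def disagreement_bullets(examples, judge_labels, human_labels):
    """Render judge-vs-human disagreements as structured bullets — the
    optimizer's input, instead of a raw prompt string."""
    bullets = []
    for ex, j, h in zip(examples, judge_labels, human_labels):
        if j != h:
            bullets.append(f"- query={ex!r}: judge said {j}, human said {h}")
    return bullets


examples = ["q1", "q2", "q3"]
judge = ["relevant", "irrelevant", "relevant"]
human = ["relevant", "relevant", "relevant"]
bullets = disagreement_bullets(examples, judge, human)
rate = disagreement_rate(judge, human)
```

In the real flywheel, `bullets` would be fed to DSPy to propose a revised judge prompt, and the loop repeats until `rate` stops improving.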

Extracted systems

  • systems/dropbox-dash — the product; this post adds the explicit five-stage context-engine framing + multimodal content-understanding detail + hybrid-BM25+vectors-vs-graph-RAG choice + "super tool" framing for the unified retrieval tool.
  • systems/dash-search-index — extended by: BM25 + dense-vectors hybrid layout; BM25 "very effective on its own with some relevant signals"; multiple ranking passes applied to retrieved results; ranking passes are "personalized and ACL'd to you"; knowledge-bundle records re-ingested through the same pipeline.
  • systems/dash-mcp-server — confirmed as the external realization of the "one super-tool" discipline; no new architectural detail in this post beyond the 2025-11-17 source.
  • systems/model-context-protocol — post explicitly names the four MCP-at-scale pain points (context window from tool defs, result size, latency, query sprawl) and the corresponding fixes.
  • systems/dspy — Dropbox production usage: LLM-as-judge prompts, emerging "bullet-of-disagreements" input pattern, model switching across planning + sub-agents, ~30 prompts at Dropbox Dash.
  • systems/bm25 — load-bearing lexical retrieval in Dash; "an amazing workhorse for building out an index."

Extracted concepts

  • concepts/knowledge-graph — updated with: canonical-ID insight, async graph building, not stored in a graph DB, flattened into "knowledge bundles" re-ingested via the same index pipeline.
  • concepts/llm-as-judge — updated with: four-step Dash disagreement-reduction arc (baseline 8% → prompt refinement → o3 reasoning model → RAG-as-judge → DSPy), prompt-as-judge rubric clarity as a reason DSPy works especially well here.
  • concepts/context-engineering — reinforced by a second Dropbox source: explicit MCP-pain pattern inventory, explicit "store tool results locally, not in context" advice.
  • concepts/context-rot — reinforced: "you're immediately going to fill up that context window. It's going to be very problematic." MCP-related context growth is the forcing function.
  • concepts/agent-context-window — reinforced with a new lever: store tool results locally rather than inline them into the window.
  • concepts/hybrid-retrieval-bm25-vectors — new concept page. BM25 lexical index + dense vectors in a vector store; hybrid retrieval possible but BM25 alone "very effective with some relevant signals."
  • concepts/federated-vs-indexed-retrieval — new concept page with the explicit pro/con list from the talk.
  • concepts/rag-as-a-judge — new concept. The judge fetches context it wasn't trained on (acronyms, work-specific terms) before scoring.
  • concepts/ndcg — new concept (Normalized Discounted Cumulative Gain); Dash's named scoring metric for retrieval quality.
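
For reference, NDCG — the metric Dash names but reports no values for — is DCG divided by ideal DCG, with each graded relevance discounted by the log of its rank. A standard implementation:

```python
import math


def dcg(relevances):
    """Discounted Cumulative Gain for a ranked list of graded relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """NDCG = DCG of the actual ranking / DCG of the ideal (sorted) ranking,
    so 1.0 means the ranking put the most relevant results first."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Burying a highly relevant result costs score: `ndcg([3, 2, 1])` is 1.0, while `ndcg([0, 0, 3])` is 0.5.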

Extracted patterns

  • patterns/unified-retrieval-tool — reinforced; Dash explicitly calls this the "super tool" and frames it as a context-window hygiene play for MCP.
  • patterns/precomputed-relevance-graph — reinforced; the "knowledge bundles" are the graph precompute materialized for runtime retrieval.
  • patterns/tool-surface-minimization — reinforced; Dash's four MCP-scale fixes are all instances of this discipline applied at different layers.
  • patterns/specialized-agent-decomposition — extended with a new sub-mechanism: classifier picks the sub-agent (complex agentic queries route via a classifier to the sub-agent whose narrow toolset matches).
  • patterns/prompt-optimizer-flywheel — new pattern. Judge disagreements → bullet-pointed structured input → DSPy → reduced disagreements → loop. Emergent observation from Dash's usage.
  • patterns/multimodal-content-understanding — new pattern. Documents (markdown + text extract), images (CLIP-class + true multimodal for complex), PDFs (text + figures), audio (transcription), video (multimodal scene extraction e.g. the "Jurassic Park dinosaur scene with no dialogue" example) — each a different ingestion path under the same context-engine umbrella.
  • patterns/canonical-entity-id — new pattern. Cross-app entity resolution to a single canonical ID per person / doc / project so the graph has a coherent node model independent of source system.
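
The classifier-routed dispatch in patterns/specialized-agent-decomposition reduces to: classify the query, then hand it to the sub-agent whose narrow toolset matches. The talk does not describe Dash's classifier (presumably a model, and the caveats below note the fallback is undisclosed); this toy keyword version with invented agent names only shows the routing shape:

```python
def route_query(query, sub_agents):
    """Toy stand-in for Dash's undisclosed classifier: pick the first
    sub-agent whose trigger terms appear in the query; otherwise fall
    back to a broad default agent."""
    q = query.lower()
    for agent in sub_agents:
        if any(term in q for term in agent["triggers"]):
            return agent["name"]
    return "general"  # assumed fallback; the post doesn't specify one


SUB_AGENTS = [
    {"name": "calendar-agent", "triggers": ["meeting", "schedule", "invite"]},
    {"name": "people-agent", "triggers": ["who", "reports to", "team"]},
]
```

The payoff the post describes is that each sub-agent carries far fewer tool definitions than a single do-everything agent, which is tool-surface minimization applied at the routing layer.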

Operational numbers

  • ~45 seconds — end-to-end latency of even a simple query when going via MCP at scale. "With the raw index, you're getting all the content coming back very quickly, within seconds."
  • ~100,000 tokens — Dropbox's target context-window cap for Dash agents. Tool definitions alone can eat significant budget.
  • ~8% — first-pass judge / human disagreement rate (the starting point of Dash's iteration arc). Subsequent prompt tweaks + o3 + RAG-as-judge + DSPy each reduced it further (no specific post-iteration numbers given).
  • ~30 — total prompts in the Dash stack today across ingest, LLM-as-judge, offline evals, and the online agentic platform.
  • 5–15 — engineers concurrently tweaking prompts at any given time (prompt management at scale is a first-class problem).
  • 50 tabs + 50 SaaS accounts — opening framing of the problem the context engine exists to solve; not a Dash metric but Clemm's own usage.

Caveats

  • Talk-edit, not architecture post. The piece is an edited transcript of a Maven guest lecture, not a canonical Dash architecture post. Details are impressionistic in places (e.g. "we use NDCG a lot to score results" — but no NDCG values disclosed).
  • No quantified quality/latency deltas. The LLM-as-judge iteration arc is told as "disagreements went down" at each step — only the starting 8% is stated.
  • Knowledge-graph implementation detail sparse. The "asynchronous graph building → knowledge bundles → same index pipeline" story is clear but no numbers (graph scale, refresh rate, bundle size, number of entity types).
  • No latency numbers for the raw index vs MCP claim. "Within seconds" for raw-index and "up to 45 seconds" for MCP is a talking-point comparison, not a rigorous measurement.
  • Classifier sub-agent routing is mentioned but not described architecturally (what's the classifier's model? What's the fallback?).
  • BM25 + vectors + graph-RAG is stated as Dash's choice, but the post doesn't explain the composition mechanics of the three in ranking — how graph signals combine with lexical/vector scores is not disclosed.
  • DSPy integration surface — "across our entire stack" is stated aspirationally; actual deployment is LLM-as-judge primarily, with other areas being rolled out.
  • Editorial note: this is one of two Dropbox posts in the wiki introducing Dash as an agentic system — pairs with the 2025-11-17 context-engineering post, which is the deeper architectural treatment. This post adds the five-stage pipeline view, the federated-vs-indexed comparison, multimodal content understanding, and the LLM-as-judge / DSPy arc — axes the 2025-11-17 post doesn't cover.