CONCEPT Cited by 1 source

Heuristic retrieval¶

Definition¶

Heuristic retrieval is a stage-1 retrieval primitive that uses domain-specific, non-ML rules (ownership metadata, graph traversal, time windows, string matching, environment filters) to narrow a large candidate population (thousands or more) to a rank-tractable set (hundreds). The narrowing has to be cheap and must not lose significant recall, so that a more expensive downstream stage (an LLM ranker, a human reviewer, a cross-encoder) can operate on the result.

It is distinguished from:

Lexical retrieval (BM25, inverted index) — scores based on term statistics, not domain knowledge.
Vector retrieval (embedding + ANN) — scores based on learned semantic similarity.
Hybrid retrieval (concepts/hybrid-retrieval-bm25-vectors) — combined lexical + vector.

Heuristic retrieval can be combined with any of the above, but its defining property is that the rules are authored by engineers and encode domain structure (e.g. "only show changes in directories owned by teams whose systems show anomalies").

Canonical wiki reference¶

Meta's web-monorepo RCA system (2024-06; sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response) uses heuristic retrieval as its stage-1 filter. Meta's description verbatim:

"a novel heuristics-based retriever that is capable of reducing the search space from thousands of changes to a few hundred without significant reduction in accuracy using, for example, code and directory ownership or exploring the runtime code graph of impacted systems."

The quote names two rule families:

Code + directory ownership. Meta's monorepo has structural ownership metadata; the retriever restricts to changes in directories owned by teams whose systems are affected by the incident under investigation.
Runtime code-graph exploration. The retriever walks the graph of which code paths were exercised by the impacted systems during the incident window, restricting to changes that touched code reachable from anomalous systems.
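
The two rule families above can be sketched as a pair of filters. This is a minimal illustration, not Meta's implementation: the `Change` type, the `owner_of`, `reachable`, and `heuristic_retrieve` helpers, the paths, and the team names are all hypothetical. It also assumes the two rules are combined with OR (a change survives if either rule matches), which favors recall; the source does not say how the rules are combined.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical stand-in for a code change (diff) in the monorepo."""
    change_id: str
    paths: Tuple[str, ...]  # files touched by the change


def owner_of(path: str, dir_owners: Dict[str, str]) -> Optional[str]:
    """Return the team owning the longest matching directory prefix, if any."""
    best_dir, best_team = "", None
    for directory, team in dir_owners.items():
        prefix = directory.rstrip("/") + "/"
        if path.startswith(prefix) and len(prefix) > len(best_dir):
            best_dir, best_team = prefix, team
    return best_team


def reachable(graph: Dict[str, Set[str]], roots: Iterable[str]) -> Set[str]:
    """Breadth-first walk of a code graph from the impacted systems' entry points."""
    seen = set(roots)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def heuristic_retrieve(
    changes: Iterable[Change],
    dir_owners: Dict[str, str],
    impacted_teams: Set[str],
    graph: Dict[str, Set[str]],
    entry_points: Set[str],
) -> List[Change]:
    """Keep a change if the ownership rule OR the code-graph rule matches (assumption)."""
    hot_files = reachable(graph, entry_points)
    kept = []
    for change in changes:
        by_owner = any(owner_of(p, dir_owners) in impacted_teams for p in change.paths)
        by_graph = any(p in hot_files for p in change.paths)
        if by_owner or by_graph:
            kept.append(change)
    return kept


# Toy data, invented for illustration only.
DIR_OWNERS = {
    "www/feed": "feed-team",
    "www/feed/ranking": "ranking-team",
    "www/ads": "ads-team",
}
GRAPH = {
    "www/feed/main.php": {"www/shared/cache.php"},
    "www/shared/cache.php": {"www/shared/kv.php"},
}
CHANGES = [
    Change("D1", ("www/feed/story.php",)),          # owned by the impacted team
    Change("D2", ("www/ads/auction.php",)),         # unrelated owner, not reachable
    Change("D3", ("www/shared/kv.php",)),           # unowned, but reachable in the graph
    Change("D4", ("www/feed/ranking/model.php",)),  # more specific owner wins
]

kept = heuristic_retrieve(CHANGES, DIR_OWNERS, {"feed-team"}, GRAPH, {"www/feed/main.php"})
print([c.change_id for c in kept])
```

Both rules are plain lookups and traversals over data the organization already maintains, which is the property the rest of this page relies on. Switching the combination from OR to AND would shrink the candidate set further at the cost of recall.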

Why heuristic retrieval, not pure ML retrieval?¶

Three structural reasons at Meta's scale:

Latency floor. Heuristic rules are lookups against existing ownership and code-graph databases and can run in milliseconds; an ML retriever would need to embed every code change and run ANN search at scale.
Interpretability. "We include change X because its directory is owned by an impacted system's team" is a reproducible, audit-friendly filter. Closed feedback loops require this: when a suggestion is wrong, engineers can trace which rule admitted or excluded a change.
Domain structure is free signal. Monorepos with ownership metadata and runtime code graphs already encode high-precision features. Re-learning them via an ML retriever is redundant.

Meta's claim ("without significant reduction in accuracy") implicitly says the heuristics don't discard the correct answer often enough to bottleneck the end-to-end pipeline. Retriever recall is the load-bearing metric. The post doesn't disclose a number, but because the ranker can only surface changes the retriever kept, the 42% top-5 end-to-end accuracy bounds the retriever's recall from below: recall is at least 42%, and probably higher.
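
The lower bound follows from the fact that every top-5 hit is also a retriever hit. A minimal sketch of how both metrics could be measured on a labeled set of past incidents; the function names, incident IDs, and change IDs are invented for illustration, and the numbers are toy values rather than Meta's:

```python
from typing import Dict, List, Set


def retriever_recall(retrieved: Dict[str, Set[str]], truth: Dict[str, str]) -> float:
    """Fraction of incidents whose true root-cause change survived stage 1."""
    hits = sum(1 for inc, cause in truth.items() if cause in retrieved.get(inc, set()))
    return hits / len(truth)


def top_k_accuracy(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int = 5) -> float:
    """Fraction of incidents whose true root-cause change is in the ranker's top k."""
    hits = sum(1 for inc, cause in truth.items() if cause in ranked.get(inc, [])[:k])
    return hits / len(truth)


# Toy labeled set: incident -> change that actually caused it.
truth = {"i1": "D10", "i2": "D20", "i3": "D30", "i4": "D40", "i5": "D50"}

# Stage-1 output per incident. The retriever drops the true cause for i3.
retrieved = {
    "i1": {"D10", "D11", "D12"},
    "i2": {"D20", "D21"},
    "i3": {"D31", "D32"},
    "i4": {"D40", "D41", "D42", "D43", "D44", "D45"},
    "i5": {"D50", "D51"},
}

# Stage-2 output: an ordering of exactly the retrieved changes.
ranked = {
    "i1": ["D10", "D11", "D12"],
    "i2": ["D21", "D20"],
    "i3": ["D31", "D32"],                              # cause never reached the ranker
    "i4": ["D41", "D42", "D43", "D44", "D45", "D40"],  # cause ranked 6th, outside top 5
    "i5": ["D51", "D50"],
}

recall = retriever_recall(retrieved, truth)
accuracy = top_k_accuracy(ranked, truth, k=5)
print(f"retriever recall = {recall:.2f}, top-5 accuracy = {accuracy:.2f}")
```

In this toy set the retriever loses one incident and the ranker loses another, so end-to-end accuracy sits strictly below recall. The gap between the two numbers is what the ranker costs; the gap between recall and 1.0 is what the heuristics cost.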

The retrieve-then-rank architecture¶

Heuristic retrieval is stage 1 of the retrieve-then-rank-LLM pattern:

candidates (~thousands)
    ↓  heuristic retrieval  (ownership + code graph)
candidates (~few hundred)
    ↓  LLM ranker  (ranking-via-election, logprobs)
top-K (5)

The cost structure is asymmetric: cheap heuristics run over every candidate change, while the LLM runs only on the narrowed set. That asymmetry is what makes the end-to-end system affordable at Meta's investigation volume.
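
The two-stage composition can be sketched as follows. This is an illustrative skeleton under stated assumptions, not Meta's system: `retrieve_then_rank` and `CountingRanker` are hypothetical names, the retriever is a toy predicate standing in for the ownership and code-graph rules, and the ranker is a stub scoring function rather than ranking-via-election over LLM logprobs. The counter only makes visible how many items reach the expensive stage.

```python
from typing import Callable, List, Sequence


class CountingRanker:
    """Stub for the expensive stage; counts how many items it was asked to score."""

    def __init__(self, score: Callable[[int], float]) -> None:
        self.score = score
        self.items_scored = 0

    def __call__(self, items: Sequence[int]) -> List[int]:
        self.items_scored += len(items)
        return sorted(items, key=self.score, reverse=True)


def retrieve_then_rank(
    candidates: Sequence[int],
    retriever: Callable[[Sequence[int]], List[int]],
    ranker: Callable[[Sequence[int]], List[int]],
    k: int = 5,
) -> List[int]:
    """Stage 1 narrows cheaply; stage 2 orders only what survived; return the top k."""
    narrowed = retriever(candidates)
    return ranker(narrowed)[:k]


# Toy setup: 2,000 candidate change IDs; the "heuristic" keeps one in ten.
candidates = list(range(2000))


def toy_retriever(cs: Sequence[int]) -> List[int]:
    # Placeholder rule standing in for ownership and code-graph filters.
    return [c for c in cs if c % 10 == 0]


# Placeholder relevance score: closer to change 1000 is "more suspicious".
ranker = CountingRanker(score=lambda c: -abs(c - 1000))

top5 = retrieve_then_rank(candidates, toy_retriever, ranker, k=5)
print(f"candidates={len(candidates)} scored_by_ranker={ranker.items_scored} top5={top5}")
```

Because the ranker is passed in as a callable, the same skeleton covers the other downstream stages named in the definition, such as a cross-encoder or a human review queue.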

Limits¶

Encoded knowledge, not learned knowledge. Heuristics reflect the rules the engineers who wrote them know about; emerging failure modes that don't pattern-match existing rules can slip through.
Retriever recall bounds end-to-end accuracy. If the correct change isn't in the retriever's output, the ranker cannot recover it. Retriever recall is the hard ceiling on downstream accuracy.
Rule maintenance cost. Ownership metadata, code-graph derivation, and runtime-signal integrations require ongoing investment as the codebase evolves.
Monorepo affordance. Directory-ownership rules work because Meta's web monorepo makes ownership structured and queryable. In a multi-repo world, the rules would need an explicit cross-repo ownership service.

Seen in¶

sources/2024-08-23-meta-leveraging-ai-for-efficient-incident-response — canonical use in incident RCA.

Related¶

concepts/llm-based-ranker — the downstream stage.
concepts/hybrid-retrieval-bm25-vectors — the ML retrieval alternative.
concepts/automated-root-cause-analysis — the capability class this retriever serves.
concepts/monorepo — the substrate that makes ownership-based retrieval tractable.
patterns/retrieve-then-rank-llm — the end-to-end pattern.
systems/meta-rca-system — canonical instance.