
CONCEPT Cited by 1 source

Inverted-index deduplication

A scaling technique for deduplicating AI-generated findings at fleet scale: deterministic code builds inverted indexes over structured data and produces a short candidate list for agent reasoning, which keeps the model off the O(N²) critical path.

Problem

Comparing every finding against every other with an LLM requires O(N²) comparisons, which breaks down at fleet scale (13,841+ findings across 145 repos). Simple string matching or file-path checks are not a substitute: they fail for complex logic flaws where two findings share the same root cause but manifest differently.
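To put a number on the quadratic cost, the short calculation below counts the unordered pairs an all-against-all pass would have to examine for the 13,841 findings quoted above, and compares that with a pass in which each finding meets a bounded shortlist. The cap of 10 candidates per finding is an assumption used for illustration, taken as an upper bound on the "typically <10" figure given under Solution.

```python
from math import comb

n_findings = 13_841            # lower bound quoted above ("13,841+")
max_candidates = 10            # assumed cap per finding, for illustration only

# Unordered pairs an all-against-all LLM pass would compare.
all_pairs = comb(n_findings, 2)

# Comparisons when each finding only meets its shortlist.
bounded = n_findings * max_candidates

print(f"all pairs: {all_pairs:,}")   # all pairs: 95,779,720
print(f"bounded:   {bounded:,}")     # bounded:   138,410
```

Under these assumptions the shortlist approach needs roughly 700 times fewer model comparisons, and the gap widens as the number of findings grows.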

Solution

  1. Deterministic pre-filter: plain code builds inverted indexes over structured data — touched files, affected functions, trust boundaries, rare tokens.
  2. Short candidate list: each new finding is compared only against its deterministic-match candidates (typically <10).
  3. Agent reasoning: only then does an LLM examine the short list to determine if a single fix would close several findings.
  4. Stable cross-run keys: re-found bugs reopen existing records rather than spawning new entries.
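Steps 1 and 2 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the source's implementation: the `Finding` fields, the `CandidateIndex` class, the overlap-count ranking, and the cap of 10 are all hypothetical. Trust boundaries would simply be one more key kind alongside files, functions, and tokens.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; the field names are illustrative."""
    id: str
    files: frozenset[str]
    functions: frozenset[str]
    tokens: frozenset[str] = frozenset()

    def index_keys(self):
        """Yield (kind, value) pairs used as inverted-index keys."""
        for path in self.files:
            yield ("file", path)
        for name in self.functions:
            yield ("function", name)
        for token in self.tokens:
            yield ("token", token)


class CandidateIndex:
    """Inverted index: key -> ids of findings that carry that key."""

    def __init__(self, max_candidates=10):
        self._postings = defaultdict(set)
        self._max = max_candidates  # assumed cap on shortlist length

    def add(self, finding):
        for key in finding.index_keys():
            self._postings[key].add(finding.id)

    def candidates(self, finding):
        """Return ids of indexed findings sharing keys, most overlap first."""
        overlap = defaultdict(int)
        for key in finding.index_keys():
            for other_id in self._postings.get(key, ()):
                if other_id != finding.id:
                    overlap[other_id] += 1
        ranked = sorted(overlap, key=lambda i: (-overlap[i], i))
        return ranked[: self._max]


index = CandidateIndex()
a = Finding("F-1", frozenset({"auth/session.py"}), frozenset({"refresh_token"}))
b = Finding("F-2", frozenset({"api/users.py"}), frozenset({"list_users"}))
c = Finding("F-3", frozenset({"auth/session.py", "auth/login.py"}),
            frozenset({"refresh_token"}))
for known in (a, b):
    index.add(known)

shortlist = index.candidates(c)  # only these ids would be handed to the LLM
print(shortlist)                 # ['F-1']
```

The lookup touches only the posting lists for the new finding's own keys, so its cost depends on how many findings share those keys rather than on the total number of findings. Only the ids in `shortlist` would go on to the agent-reasoning step.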

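Because each finding is compared against a bounded candidate list rather than against every other finding, the number of LLM comparisons grows roughly linearly with the number of findings, O(N), instead of O(N²).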
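Step 4 in the list above depends on a key that stays the same from one run to the next. The source does not say how those keys are composed, so the sketch below is an assumption: it hashes fields that should survive unrelated edits (repository, file, function, finding category) and deliberately leaves out line numbers. The names `stable_key` and `record_finding` and the dict-based store are hypothetical.

```python
import hashlib


def stable_key(repo, file_path, function, category):
    """Hash location-independent fields so the key survives line shifts."""
    raw = "\x1f".join([repo, file_path, function, category.lower()])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def record_finding(store, key, run_id):
    """Reopen the existing record for a known key, or create a new one."""
    record = store.get(key)
    if record is None:
        store[key] = {"status": "open", "runs": [run_id]}
        return "created"
    record["status"] = "open"
    record["runs"].append(run_id)
    return "reopened"


store = {}
key = stable_key("svc-a", "auth/session.py", "refresh_token", "Missing-Expiry-Check")

first = record_finding(store, key, "run-1")    # "created"
store[key]["status"] = "fixed"                 # someone closes the record
second = record_finding(store, key, "run-2")   # "reopened", no new entry
```

After the second run the store still holds a single record whose history lists both runs, which is the behaviour described in step 4.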

