CONCEPT Cited by 1 source
Inverted-index deduplication¶
A scaling technique for deduplicating AI-generated findings at fleet scale by using deterministic code to build inverted indexes over structured data, generating a short candidate list for agent reasoning — keeping the model off the critical O(N²) path.
Problem¶
Comparing every finding against every other using an LLM scales O(N²) — falls apart completely at scale (13,841+ findings across 145 repos). Simple string matching or file-path checks fail for complex logic flaws where two findings share the same root cause but manifest differently.
Solution¶
- Deterministic pre-filter: plain code builds inverted indexes over structured data — touched files, affected functions, trust boundaries, rare tokens.
- Short candidate list: each new finding is compared only against its deterministic-match candidates (typically <10).
- Agent reasoning: only then does an LLM examine the short list to determine if a single fix would close several findings.
- Stable cross-run keys: re-found bugs reopen existing records rather than spawning new entries.
This keeps deduplication at O(N) amortised complexity.
Seen in¶
- systems/cloudflare-vulnerability-validation-system — dedup stage folded away 5,442 findings from a pool of 13,841
Related¶
- concepts/producer-consumer-loop — dedup runs continuously as new findings arrive
- patterns/adversarial-cross-model-validation — different model validates the dedup agent's decisions