CONCEPT Cited by 1 source

Inverted-index deduplication¶

A scaling technique for deduplicating AI-generated findings at fleet scale by using deterministic code to build inverted indexes over structured data, generating a short candidate list for agent reasoning — keeping the model off the critical O(N²) path.

Problem¶

Comparing every finding against every other using an LLM scales O(N²) — falls apart completely at scale (13,841+ findings across 145 repos). Simple string matching or file-path checks fail for complex logic flaws where two findings share the same root cause but manifest differently.

Solution¶

Deterministic pre-filter: plain code builds inverted indexes over structured data — touched files, affected functions, trust boundaries, rare tokens.
Short candidate list: each new finding is compared only against its deterministic-match candidates (typically <10).
Agent reasoning: only then does an LLM examine the short list to determine if a single fix would close several findings.
Stable cross-run keys: re-found bugs reopen existing records rather than spawning new entries.

This keeps deduplication at O(N) amortised complexity.

Seen in¶

systems/cloudflare-vulnerability-validation-system — dedup stage folded away 5,442 findings from a pool of 13,841

concepts/producer-consumer-loop — dedup runs continuously as new findings arrive
patterns/adversarial-cross-model-validation — different model validates the dedup agent's decisions

Inverted-index deduplication¶

Problem¶

Solution¶

Seen in¶

Related¶