Build your own vulnerability harness¶
Summary¶
Cloudflare publishes a detailed practical guide to building a model-agnostic, fleet-wide vulnerability scanning harness: the architecture behind their Vulnerability Discovery Harness (VDH) and the newly disclosed Vulnerability Validation System (VVS). The post moves beyond the earlier Project Glasswing conceptual disclosure to provide operational details: how to externalise state, manage multi-model pipelines, deduplicate in O(N) rather than O(N²), trace cross-repo dependencies, and measure effectiveness. VDH covers 128 repos; VVS holds 13,841 findings across 145 repos and has produced 7,245 actionable findings for engineering teams.
Key takeaways¶
- Models are volatile; orchestration is durable. The harness treats LLMs as stateless, interchangeable compute engines. VDH uses one model and VVS uses a completely different one, forcing adversarial cross-checking by distinct sets of logical weights (Source: body §"A two-stage vulnerability research workflow").
- Context exhaustion is the #1 engineering wall. Once the context window fills, the model cannibalises its own memory. Solution: externalise state entirely into a SQLite database keyed by (run_id, repo, stage), so that each agent stays hyper-focused and below 25% of the total window (Source: body §"Stage 1: VDH").
- Persistence before parallelism. Every stage writes to one SQLite DB. Any stage can resume, retry, or get pulled into a later run without redoing work. Findings are streamed and saved as they happen, so a crash costs only the task in flight (Source: body §"Stage 1: VDH").
- Deduplication requires dedicated agents at scale. Simple string matching or file-path checks fail for complex logic flaws. Deterministic code builds inverted indexes over structured data (files, functions, trust boundaries, rare tokens) to generate a short candidate list; only then does an agent reason over that short list. This scales as O(N) rather than O(N²) (Source: body §"Stage 2: VVS").
- Sibling forking for scope deviation. When a hunter trips over an interesting but out-of-scope code path, it forks a sibling agent with a precise structural seed rather than wandering. Fleet-wide, forks account for 9–20% of tasks, depending on the model (Source: body §"Micro-forks and the wishlist").
- The Wishlist as agent-to-human communication. When an agent needs a tool it doesn't have (FreeBSD VM, specific build environment, prod config files), it writes to a central wishlist, which holds 25,472 entries across 128 repos. Some entries are self-healing via a generic coding harness that monitors the logs (Source: body §"Micro-forks and the wishlist").
- Trust requires threat-model-first, PoC-second, patch-third. A hunter must state the threat model before filing. Every confirmed finding ships with a PoC that runs against untouched source, which prevents the agent from editing code to force an exploit. Every finding also ships a proposed patch. The validator cannot log findings; its sole job is to disprove them (Source: body §"Making findings you can trust").
- The Fixer requires a fail→pass flip gate. The automated patch plus its regression test must produce a clean fail→pass transition on the target test. Failing post-patch tests block the commit. The Fixer never merges on its own; human review is the non-negotiable gate (Source: body §"Stage 2: VVS").
- Per-repo budgeting, not per-run. Cost varies wildly by repo. A strict task cap per repo plus a worker pool of 50–200 workers lets you spend on the repos that are actually finding things. Full scans are periodic backlog sweeps, not per-PR checks; the worst run took more than 14 hours (Source: body §"Stage 2: VVS").
- Gapfill is the cost-to-coverage lever. Each additional gapfill pass costs roughly half as much as the initial hunt. Coverage is measured by dividing repos into (area × attack-class) cells and running gapfill iteratively until it stops producing findings (Source: body §"Stage 2: VVS").
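The persistence takeaways above can be illustrated with a minimal sketch. The source only states that a single SQLite database is keyed by (run_id, repo, stage) and that findings are saved as they stream in; the table names, column names, and the `run_stage` helper below are hypothetical, chosen for illustration only.

```python
import json
import sqlite3

# Hypothetical schema: only the (run_id, repo, stage) key comes from the
# source; every table and column name here is an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_state (
    run_id TEXT NOT NULL,
    repo   TEXT NOT NULL,
    stage  TEXT NOT NULL,
    status TEXT NOT NULL,           -- 'running' or 'done'
    PRIMARY KEY (run_id, repo, stage)
);
CREATE TABLE IF NOT EXISTS findings (
    id      INTEGER PRIMARY KEY,
    run_id  TEXT NOT NULL,
    repo    TEXT NOT NULL,
    stage   TEXT NOT NULL,
    payload TEXT NOT NULL           -- JSON description of the finding
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def is_done(conn, run_id, repo, stage):
    row = conn.execute(
        "SELECT status FROM stage_state WHERE run_id=? AND repo=? AND stage=?",
        (run_id, repo, stage),
    ).fetchone()
    return row is not None and row[0] == "done"


def run_stage(conn, run_id, repo, stage, work):
    """Run `work()` (which yields findings) unless the stage already finished.

    Returns True if the stage ran, False if it was skipped on resume.
    """
    if is_done(conn, run_id, repo, stage):
        return False
    with conn:  # each `with conn` block commits on success
        conn.execute(
            "INSERT OR REPLACE INTO stage_state (run_id, repo, stage, status) "
            "VALUES (?, ?, ?, 'running')",
            (run_id, repo, stage),
        )
    for finding in work():
        # Commit per finding, so a crash loses only the task in flight.
        with conn:
            conn.execute(
                "INSERT INTO findings (run_id, repo, stage, payload) "
                "VALUES (?, ?, ?, ?)",
                (run_id, repo, stage, json.dumps(finding)),
            )
    with conn:
        conn.execute(
            "UPDATE stage_state SET status='done' "
            "WHERE run_id=? AND repo=? AND stage=?",
            (run_id, repo, stage),
        )
    return True
```

In this sketch a stage left in the 'running' state is simply re-run on the next attempt, so findings saved before a crash may be written twice. That is tolerable only if a later deduplication step exists, which is an assumption of the sketch rather than something the source specifies at this level.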
Operational numbers¶
- 128 repos scanned by VDH fleet
- 145 repos with findings in VVS (including other harness feeds)
- 20,799 raw candidates generated by VDH
- 12,057 survived VDH validation (initial rejection rate dropped from 40% to 11%)
- 13,841 total bugs in VVS after cross-harness merge
- 5,442 deduplicated away
- 1,154 wrong-repo / low-risk / recycled
- 7,245 actionable findings sent to teams
- 25,472 wishlist entries across 128 repos
- 50–200 concurrent workers per fleet scan
- ~14 hours worst-case single-repo scan duration
- 3–4 hours standard repo full run (~30K LoC → 100 initial findings)
- 5 min/bug average Fixer processing rate
- ~14 hours end-to-end discover → validate → deduplicate → open PRs for a standard repo
- 5 days mean-time-to-resolve for critical/exploitable (avg 10 of 80)
- 15–20 days incremental hardening window for remaining bugs
- 58% high-integrity finding rate (up from 35%)
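As a quick consistency check on the funnel figures listed above, the arithmetic can be reproduced directly. The variable names are illustrative. Reading the 58% figure as survivors over raw candidates is an assumption; the numbers happen to agree with it.

```python
# Figures copied from the list above; variable names are illustrative.
total_in_vvs = 13_841
deduplicated_away = 5_442
wrong_repo_low_risk_recycled = 1_154

actionable = total_in_vvs - deduplicated_away - wrong_repo_low_risk_recycled
print(actionable)  # 7245, matching the actionable findings sent to teams

raw_candidates = 20_799
survived_validation = 12_057

# Assumption: the high-integrity rate is survivors / raw candidates.
survival_pct = round(100 * survived_validation / raw_candidates)
print(survival_pct)  # 58
```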
Systems and concepts extracted¶
Systems¶
- systems/cloudflare-vulnerability-discovery-harness — 8-stage pipeline (Recon → Hunt → Validate → Gapfill → Dedup → Trace → Feedback → Report)
- systems/cloudflare-vulnerability-validation-system — 3-stage triage engine (Dedup → Judgment → Fixing) on a different model from VDH
Concepts¶
- concepts/model-agnostic-orchestration — treat models as interchangeable components; vary across pipeline stages and cross-test
- concepts/context-exhaustion — agent context window fills and model cannibalises its own memory; broken by externalising state
- concepts/stateless-agent-compute — treat the LLM as a stateless compute engine; all state lives in the database
- concepts/producer-consumer-loop — Gapfill + Feedback + Trace produce new tasks while Dedup + Validate + Report consume; continuous loop within a single run
- concepts/inverted-index-deduplication — deterministic code builds inverted indexes over files/functions/trust-boundaries/tokens to generate short candidate lists for agent reasoning
- concepts/coverage-cell — (area × attack-class) matrix cell; unit of measurement for gapfill completeness
- concepts/shallow-run-detection — flag any hunter that finishes with zero findings as "shallow" and requeue it; distinguishes crashed dependencies from genuinely clean codebases
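The inverted-index deduplication concept listed above can be sketched as follows. This is a minimal illustration, not Cloudflare's implementation: the field names, the `min_shared` threshold, and the `limit` cutoff are all assumptions. The point is that deterministic code narrows each new finding to a short candidate list, and only that list would be handed to an agent for reasoning.

```python
from collections import defaultdict

# Hypothetical structured fields attached to each finding.
FIELDS = ("files", "functions", "trust_boundaries", "rare_tokens")


def build_index(findings):
    """Map each (field, value) pair to the ids of findings that carry it."""
    index = defaultdict(set)
    for fid, finding in findings.items():
        for field in FIELDS:
            for value in finding.get(field, ()):
                index[(field, value)].add(fid)
    return index


def candidates(new_finding, index, min_shared=2, limit=5):
    """Return ids of existing findings sharing at least `min_shared` keys.

    Only postings for the new finding's own keys are visited, so no
    pairwise comparison against every stored finding is needed.
    """
    shared = defaultdict(int)
    for field in FIELDS:
        for value in new_finding.get(field, ()):
            for fid in index.get((field, value), ()):
                shared[fid] += 1
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return [fid for fid, count in ranked if count >= min_shared][:limit]
```

Per-finding cost here depends on the number of keys and the length of their posting lists, not on the total number of findings, which is where the linear scaling would come from provided posting lists stay short. Very common keys (a widely shared file, for example) would need capping or down-weighting in practice.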
Patterns¶
- patterns/model-as-interchangeable-component — use one model for discovery and a different model for validation; cross-check by distinct logical weights
- patterns/sqlite-keyed-stage-persistence — single SQLite DB keyed by (run_id, repo, stage); any stage can resume/retry; crash costs only the in-flight task
- patterns/per-repo-budget-cap — budget per repo not per run; strict task cap + worker pool sizing (50-200) to allocate spend to productive repos
- patterns/sibling-fork-for-scope-deviation — hunter forks a sibling agent with a precise structural seed when it encounters out-of-scope interesting paths; 9-20% of fleet tasks
- patterns/wishlist-tool-for-agent-dependency — agents write dependency requests to a central wishlist; some self-heal via monitoring; 25,472 entries across 128 repos
- patterns/adversarial-cross-model-validation — force Model B (different provider) to judge Model A's output as an unbiased adversarial third party
- patterns/fail-pass-flip-gate — automated patch + targeted test must produce clean fail→pass flip; failing post-patch tests auto-block the commit
- patterns/tiered-remediation-rollout — critical (avg 10/80) fast-tracked for 5-day resolution; remaining latent risks rolled into prod over 15-20 days
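The fail-pass flip gate pattern listed above can be expressed as a small decision function. This is a sketch under stated assumptions: `SuiteResult` and `flip_gate` are hypothetical names, and how tests are actually executed is left out. Only the gating rule comes from the source: the target test must fail before the patch and pass after it, and any failing post-patch test blocks the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Outcome of one test run (hypothetical shape)."""
    target_passed: bool           # did the targeted regression test pass?
    failing_tests: tuple = ()     # names of any other failing tests


def flip_gate(before, after):
    """Return (allowed, reason) for a proposed patch.

    `before` is the run on unpatched source, `after` the run with the patch.
    """
    if before.target_passed:
        return False, "target test passes without the patch; it does not reproduce the bug"
    if not after.target_passed:
        return False, "target test still fails after the patch"
    if after.failing_tests:
        return False, "post-patch failures: " + ", ".join(after.failing_tests)
    return True, "clean fail->pass flip; hand off to human review"
```

Passing the gate only makes a patch eligible for review. Per the source, the Fixer never merges on its own.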
Caveats¶
- The post is operational guidance, not a rigorous evaluation — no labeled ground-truth set exists, so recall claims are explicitly disclaimed.
- Static analysis (Semgrep) was wired in but hunters invoked it zero times in a month of runs — they preferred reading and running code.
- The harness is not public yet (only the initial ~450-line security-audit skill is released at github.com/cloudflare/security-audit-skill).
- Numbers are "completely out of date by the time you're reading this" per the post.
Source¶
- Original: https://blog.cloudflare.com/build-your-own-vulnerability-harness/
- Raw markdown: raw/cloudflare/2026-06-18-build-your-own-vulnerability-harness-157fecee.md
Related¶
- sources/2026-05-18-cloudflare-project-glasswing-what-mythos-showed-us — earlier conceptual disclosure of the harness
- sources/2026-06-09-cloudflare-defend-against-frontier-cyber-models — Cloudflare's defensive architecture that the harness findings feed into
- systems/cloudflare-ai-code-review — same coordinator/sub-reviewer shape applied to MR-time review