CLOUDFLARE

Build your own vulnerability harness

Summary

Cloudflare publishes a detailed practical guide to building a model-agnostic, fleet-wide vulnerability scanning harness: the architecture behind their Vulnerability Discovery Harness (VDH) and the newly disclosed Vulnerability Validation System (VVS). The post moves beyond the earlier Project Glasswing conceptual disclosure to provide operational details: how to externalise state, manage multi-model pipelines, deduplicate at O(N) instead of O(N²), trace cross-repo dependencies, and measure effectiveness. VDH scans 128 repos, VVS holds 13,841 findings across 145 repos, and the pipeline has produced 7,245 actionable findings for engineering teams.

Key takeaways

Models are volatile; orchestration is durable. The harness treats LLMs as stateless, interchangeable compute engines. VDH uses one model and VVS uses a completely different one, which forces adversarial cross-checking by distinct sets of logical weights (Source: body §"A two-stage vulnerability research workflow").
Context exhaustion is the #1 engineering wall. Once the context window fills, the model cannibalises its own memory. The fix is to externalise state entirely into a SQLite database keyed by (run_id, repo, stage), so that each agent stays hyper-focused and below 25% of the total window (Source: body §"Stage 1: VDH").
Persistence before parallelism. Every stage writes to one SQLite DB. Any stage can resume, retry, or get pulled into a later run without redoing work. Findings are streamed and saved as they happen — a crash costs only the task in flight (Source: body §"Stage 1: VDH").
Deduplication requires dedicated agents at scale. Simple string matching or file-path checks fail for complex logic flaws. Deterministic code builds inverted indexes over structured data (files, functions, trust boundaries, rare tokens) to generate a short candidate list; only then does an agent reason over that list. This scales as O(N) rather than O(N²) (Source: body §"Stage 2: VVS").
Sibling forking for scope deviation. When a hunter trips over an interesting but out-of-scope code path, it forks a sibling agent with a precise structural seed rather than wandering. Fleet-wide, forks account for 9-20% of tasks depending on model (Source: body §"Micro-forks and the wishlist").
The Wishlist as agent-to-human communication. When an agent needs a tool it doesn't have (FreeBSD VM, specific build environment, prod config files), it writes to a central wishlist — 25,472 entries across 128 repos. Some are self-healing via a generic coding harness monitoring logs (Source: body §"Micro-forks and the wishlist").
Trust requires threat-model-first, PoC-second, patch-third. A hunter must state the threat model before filing. Every confirmed finding ships with a PoC that runs against untouched source (prevents the agent from editing code to force exploits). Every finding also ships a proposed patch. The validator cannot log findings — its sole job is to disprove (Source: body §"Making findings you can trust").
The Fixer requires a fail→pass flip gate. Automated patch + regression test must produce a clean fail→pass transition on the target test. Failing post-patch tests block the commit. The fixer never merges on its own — human review is the non-negotiable gate (Source: body §"Stage 2: VVS").
Per-repo budgeting, not per-run. Cost varies wildly by repo. A strict task cap per repo plus a worker pool of 50–200 workers lets you spend on the repos that are actually finding things. Full scans are periodic backlog sweeps, not per-PR checks: the worst run took more than 14 hours (Source: body §"Stage 2: VVS").
Gapfill is the cost-to-coverage lever. Each additional gapfill pass costs roughly half as much as the initial hunt. Coverage is measured by dividing repos into (area × attack-class) cells and running gapfill iteratively until it stops producing findings (Source: body §"Stage 2: VVS").
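
The persistence takeaways above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration and not Cloudflare's schema: the table name stage_state, its status and payload columns, and the helpers open_db, save_stage, load_stage and run_stage are all assumptions. Only the (run_id, repo, stage) key comes from the post.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open the state database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS stage_state (
            run_id  TEXT NOT NULL,
            repo    TEXT NOT NULL,
            stage   TEXT NOT NULL,
            status  TEXT NOT NULL,
            payload TEXT NOT NULL,
            PRIMARY KEY (run_id, repo, stage)
        )
        """
    )
    return conn


def save_stage(conn, run_id, repo, stage, status, payload):
    """Upsert one stage result so a crash loses only the task in flight."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, ?, ?)",
            (run_id, repo, stage, status, json.dumps(payload)),
        )


def load_stage(conn, run_id, repo, stage):
    """Return (status, payload) for a stage, or None if it never ran."""
    row = conn.execute(
        "SELECT status, payload FROM stage_state "
        "WHERE run_id = ? AND repo = ? AND stage = ?",
        (run_id, repo, stage),
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))


def run_stage(conn, run_id, repo, stage, work):
    """Reuse a completed stage's result; otherwise run `work` and persist it."""
    saved = load_stage(conn, run_id, repo, stage)
    if saved is not None and saved[0] == "done":
        return saved[1]
    result = work()
    save_stage(conn, run_id, repo, stage, "done", result)
    return result
```

Calling run_stage twice with the same key runs the work only once; the second call reads the stored payload, which is the resume-without-redoing-work behaviour the takeaway describes.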

Operational numbers

128 repos scanned by VDH fleet
145 repos with findings in VVS (including other harness feeds)
20,799 raw candidates generated by VDH
12,057 survived VDH validation (initial rejection rate dropped from 40% to 11%)
13,841 total bugs in VVS after cross-harness merge
5,442 deduplicated away
1,154 wrong-repo / low-risk / recycled
7,245 actionable findings sent to teams
25,472 wishlist entries across 128 repos
50–200 concurrent workers per fleet scan
~14 hours worst-case single-repo scan duration
3–4 hours standard repo full run (~30K LoC → 100 initial findings)
5 min/bug average Fixer processing rate
~14 hours end-to-end discover → validate → deduplicate → open PRs for a standard repo
5 days mean time to resolve for critical/exploitable bugs (on average 10 of 80)
15–20 days incremental hardening window for remaining bugs
58% high-integrity finding rate (up from 35%)
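
The VVS funnel figures in the list above are internally consistent, which a few lines of arithmetic confirm. The numbers are copied from the list; the variable names are illustrative only.

```python
# Figures copied from the list above; the variable names are illustrative.
total_in_vvs = 13_841   # bugs in VVS after cross-harness merge
deduplicated = 5_442    # removed as duplicates
filtered_out = 1_154    # wrong-repo / low-risk / recycled

actionable = total_in_vvs - deduplicated - filtered_out
actionable_share = actionable / total_in_vvs

print(actionable)                 # 7245
print(f"{actionable_share:.1%}")  # 52.3%
```

The subtraction reproduces the 7,245 actionable findings, so a little over half of everything held in VVS reached engineering teams.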

Systems and concepts extracted

Systems

systems/cloudflare-vulnerability-discovery-harness — 8-stage pipeline (Recon → Hunt → Validate → Gapfill → Dedup → Trace → Feedback → Report)
systems/cloudflare-vulnerability-validation-system — 3-stage triage engine (Dedup → Judgment → Fixing) on a different model from VDH

Concepts

concepts/model-agnostic-orchestration — treat models as interchangeable components; vary across pipeline stages and cross-test
concepts/context-exhaustion — agent context window fills and model cannibalises its own memory; broken by externalising state
concepts/stateless-agent-compute — treat the LLM as a stateless compute engine; all state lives in the database
concepts/producer-consumer-loop — Gapfill + Feedback + Trace produce new tasks while Dedup + Validate + Report consume; continuous loop within a single run
concepts/inverted-index-deduplication — deterministic code builds inverted indexes over files/functions/trust-boundaries/tokens to generate short candidate lists for agent reasoning
concepts/coverage-cell — (area × attack-class) matrix cell; unit of measurement for gapfill completeness
concepts/shallow-run-detection — flag any hunter that finishes with zero findings as "shallow" and requeue it; distinguishes crashed dependencies from genuinely clean codebases
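
The inverted-index deduplication concept above can be sketched as follows. This is an illustrative sketch and not Cloudflare's code: the Finding fields, the key format, and the min_shared threshold are assumptions. Deterministic code proposes candidate pairs, and an agent would then reason only over those pairs.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Finding:
    # Hypothetical structured fields extracted from a finding report.
    id: str
    files: frozenset = frozenset()
    functions: frozenset = frozenset()
    trust_boundaries: frozenset = frozenset()
    rare_tokens: frozenset = frozenset()


def index_keys(finding):
    """Yield the structured keys under which a finding is indexed."""
    for path in finding.files:
        yield ("file", path)
    for fn in finding.functions:
        yield ("function", fn)
    for boundary in finding.trust_boundaries:
        yield ("boundary", boundary)
    for token in finding.rare_tokens:
        yield ("token", token)


def candidate_pairs(findings, min_shared=2):
    """Return sorted pairs of finding ids that share at least `min_shared` keys."""
    # Inverted index: key -> ids of the findings that mention it.
    index = defaultdict(set)
    for finding in findings:
        for key in index_keys(finding):
            index[key].add(finding.id)
    # Count shared keys, comparing only findings that meet in a posting list.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)
```

In this sketch the pairwise work happens only inside each posting list, so the cost stays near-linear only while the indexed keys are selective; a very common key would need to be dropped or down-weighted.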

Patterns

patterns/model-as-interchangeable-component — use one model for discovery and a different model for validation; cross-check by distinct logical weights
patterns/sqlite-keyed-stage-persistence — single SQLite DB keyed by (run_id, repo, stage); any stage can resume/retry; crash costs only the in-flight task
patterns/per-repo-budget-cap — budget per repo not per run; strict task cap + worker pool sizing (50-200) to allocate spend to productive repos
patterns/sibling-fork-for-scope-deviation — hunter forks a sibling agent with a precise structural seed when it encounters out-of-scope interesting paths; 9-20% of fleet tasks
patterns/wishlist-tool-for-agent-dependency — agents write dependency requests to a central wishlist; some self-heal via monitoring; 25,472 entries across 128 repos
patterns/adversarial-cross-model-validation — force Model B (different provider) to judge Model A's output as an unbiased adversarial third party
patterns/fail-pass-flip-gate — automated patch + targeted test must produce clean fail→pass flip; failing post-patch tests auto-block the commit
patterns/tiered-remediation-rollout — critical (avg 10/80) fast-tracked for 5-day resolution; remaining latent risks rolled into prod over 15-20 days
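
The fail→pass flip gate pattern above can be expressed as a small predicate. This is a minimal sketch under assumptions: test outcomes are represented as dicts mapping test name to a pass flag, and flip_gate and its arguments are hypothetical names. How the tests are actually executed is out of scope here.

```python
def flip_gate(target_test, before, after):
    """Decide whether a proposed patch may advance to human review.

    `before` and `after` map test names to True (pass) or False (fail),
    recorded without and with the patch applied.
    Returns (allowed, reason).
    """
    # The regression test must fail on unpatched code, proving it detects the bug.
    if before.get(target_test) is not False:
        return False, "target test did not fail before the patch"
    # The same test must pass once the patch is applied.
    if after.get(target_test) is not True:
        return False, "target test did not pass after the patch"
    # Any failing post-patch test blocks the commit.
    failing = sorted(name for name, ok in after.items() if not ok)
    if failing:
        return False, "post-patch failures: " + ", ".join(failing)
    return True, "clean fail-to-pass flip; ready for human review"
```

A True result only moves the patch forward to a reviewer; consistent with the post, nothing in this gate merges on its own.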

Caveats

The post is operational guidance, not a rigorous evaluation — no labeled ground-truth set exists, so recall claims are explicitly disclaimed.
Static analysis (Semgrep) was wired in but hunters invoked it zero times in a month of runs — they preferred reading and running code.
The harness is not public yet (only the initial ~450-line security-audit skill is released at github.com/cloudflare/security-audit-skill).
Numbers are "completely out of date by the time you're reading this" per the post.

Source

Original: https://blog.cloudflare.com/build-your-own-vulnerability-harness/
Raw markdown: raw/cloudflare/2026-06-18-build-your-own-vulnerability-harness-157fecee.md

sources/2026-05-18-cloudflare-project-glasswing-what-mythos-showed-us — earlier conceptual disclosure of the harness
sources/2026-06-09-cloudflare-defend-against-frontier-cyber-models — Cloudflare's defensive architecture that the harness findings feed into
systems/cloudflare-ai-code-review — same coordinator/sub-reviewer shape applied to MR-time review