CLOUDFLARE

Build your own vulnerability harness

Summary

Cloudflare publishes a detailed practical guide to building a model-agnostic, fleet-wide vulnerability scanning harness: the architecture behind their Vulnerability Discovery Harness (VDH) and the newly disclosed Vulnerability Validation System (VVS). The post moves beyond the earlier Project Glasswing conceptual disclosure to provide operational details: how to externalise state, manage multi-model pipelines, deduplicate at O(N) instead of O(N²), trace cross-repo dependencies, and measure effectiveness. The harness covers 128 repos, holds 13,841 findings in VVS across 145 repos, and has produced 7,245 actionable findings for engineering teams.

Key takeaways

  1. Models are volatile; orchestration is durable. The harness treats LLMs as stateless, interchangeable compute engines. VDH uses one model and VVS uses a completely different one, which forces adversarial cross-checking by distinct sets of logical weights (Source: body §"A two-stage vulnerability research workflow").

  2. Context exhaustion is the #1 engineering wall. Once the context window fills, the model cannibalises its own memory. The solution is to externalise state entirely into a SQLite database keyed by (run_id, repo, stage), so that each agent stays hyper-focused and below 25% of the total window (Source: body §"Stage 1: VDH").

  3. Persistence before parallelism. Every stage writes to one SQLite DB. Any stage can resume, retry, or get pulled into a later run without redoing work. Findings are streamed and saved as they happen — a crash costs only the task in flight (Source: body §"Stage 1: VDH").

  4. Deduplication requires dedicated agents at scale. Simple string matching or file-path checks fail for complex logic flaws. Deterministic code builds inverted indexes over structured data (files, functions, trust boundaries, rare tokens) to generate a short candidate list; only then does an agent reason over that short list. This scales as O(N) rather than O(N²) (Source: body §"Stage 2: VVS").

  5. Sibling forking for scope deviation. When a hunter trips over an interesting but out-of-scope code path, it forks a sibling agent with a precise structural seed rather than wandering. Fleet-wide, forks account for 9–20% of tasks depending on model (Source: body §"Micro-forks and the wishlist").

  6. The Wishlist as agent-to-human communication. When an agent needs a tool it doesn't have (FreeBSD VM, specific build environment, prod config files), it writes to a central wishlist — 25,472 entries across 128 repos. Some are self-healing via a generic coding harness monitoring logs (Source: body §"Micro-forks and the wishlist").

  7. Trust requires threat-model-first, PoC-second, patch-third. A hunter must state the threat model before filing. Every confirmed finding ships with a PoC that runs against untouched source (prevents the agent from editing code to force exploits). Every finding also ships a proposed patch. The validator cannot log findings — its sole job is to disprove (Source: body §"Making findings you can trust").

  8. The Fixer requires a fail→pass flip gate. Automated patch + regression test must produce a clean fail→pass transition on the target test. Failing post-patch tests block the commit. The fixer never merges on its own — human review is the non-negotiable gate (Source: body §"Stage 2: VVS").

  9. Per-repo budgeting, not per-run. Cost varies wildly by repo. A strict task cap per repo plus a pool of 50–200 workers lets you spend on the repos that are actually producing findings. Full scans are periodic backlog sweeps, not per-PR checks; the worst run took more than 14 hours (Source: body §"Stage 2: VVS").

  10. Gapfill is the cost-to-coverage lever. Each additional gapfill pass costs roughly half as much as the initial hunt. Coverage is measured by dividing repos into (area × attack-class) cells and running gapfill iteratively until it stops producing findings (Source: body §"Stage 2: VVS").
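
Takeaways 2 and 3 describe state held in one SQLite database keyed by (run_id, repo, stage). The sketch below illustrates that idea with Python's built-in sqlite3 module. The table name stage_state, its columns, and every helper function are hypothetical; the post does not publish its schema, so treat this as one possible shape rather than the harness's actual design.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the shared database and create the (hypothetical) stage table."""
    conn = sqlite3.connect(path)
    # status is 'running' or 'done'; payload is a JSON list of findings so far.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS stage_state (
            run_id  TEXT NOT NULL,
            repo    TEXT NOT NULL,
            stage   TEXT NOT NULL,
            status  TEXT NOT NULL,
            payload TEXT NOT NULL,
            PRIMARY KEY (run_id, repo, stage)
        )
        """
    )
    return conn


def load_findings(conn, run_id, repo, stage):
    """Return the findings saved so far for one (run_id, repo, stage) key."""
    row = conn.execute(
        "SELECT payload FROM stage_state WHERE run_id=? AND repo=? AND stage=?",
        (run_id, repo, stage),
    ).fetchone()
    return json.loads(row[0]) if row else []


def save_finding(conn, run_id, repo, stage, finding):
    """Append one finding and commit at once, so a crash loses only in-flight work."""
    findings = load_findings(conn, run_id, repo, stage)
    findings.append(finding)
    conn.execute(
        "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, 'running', ?)",
        (run_id, repo, stage, json.dumps(findings)),
    )
    conn.commit()


def mark_done(conn, run_id, repo, stage):
    """Record that a stage finished, so later runs can skip it."""
    conn.execute(
        "UPDATE stage_state SET status='done' WHERE run_id=? AND repo=? AND stage=?",
        (run_id, repo, stage),
    )
    conn.commit()


def needs_work(conn, run_id, repo, stage):
    """True unless this key is already marked done (the resume check)."""
    row = conn.execute(
        "SELECT status FROM stage_state WHERE run_id=? AND repo=? AND stage=?",
        (run_id, repo, stage),
    ).fetchone()
    return row is None or row[0] != "done"
```

A worker would call needs_work before starting a stage, save_finding as each result arrives, and mark_done at the end. Because the agent reloads only the rows for its own key, nothing has to be carried in the model's context between tasks.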

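The candidate-generation step in takeaway 4 can be sketched as an inverted index from structured features to finding IDs. This is an illustration under assumptions: the field names, the min_shared threshold, and the max_candidates cap are hypothetical, and the post does not describe its scoring. Only the short list returned by candidates would be handed to a deduplication agent.

```python
from collections import defaultdict

# Structured fields named in the summary; the dictionary keys are assumed.
FIELDS = ("files", "functions", "trust_boundaries", "rare_tokens")


def feature_keys(finding):
    """Flatten a finding's structured fields into hashable (field, value) keys."""
    return {(field, value) for field in FIELDS for value in finding.get(field, ())}


class CandidateIndex:
    """Inverted index: feature key -> IDs of stored findings that carry it."""

    def __init__(self, min_shared=2, max_candidates=5):
        self.postings = defaultdict(set)
        self.min_shared = min_shared          # hypothetical overlap threshold
        self.max_candidates = max_candidates  # hypothetical short-list size

    def candidates(self, finding):
        """Return IDs of stored findings sharing enough features, best first."""
        overlap = defaultdict(int)
        for key in feature_keys(finding):
            for other_id in self.postings.get(key, ()):
                overlap[other_id] += 1
        ranked = sorted(
            (fid for fid, shared in overlap.items() if shared >= self.min_shared),
            key=lambda fid: (-overlap[fid], fid),
        )
        return ranked[: self.max_candidates]

    def add(self, finding_id, finding):
        """Store a finding so that later findings can be matched against it."""
        for key in feature_keys(finding):
            self.postings[key].add(finding_id)
```

Each new finding is compared only against entries that share at least one feature with it, not against every stored finding. That keeps total work close to linear in the number of findings as long as posting lists stay short, which is the property the O(N) claim relies on.
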
Operational numbers

  • 128 repos scanned by VDH fleet
  • 145 repos with findings in VVS (including other harness feeds)
  • 20,799 raw candidates generated by VDH
  • 12,057 survived VDH validation (initial rejection rate dropped from 40% to 11%)
  • 13,841 total bugs in VVS after cross-harness merge
  • 5,442 deduplicated away
  • 1,154 wrong-repo / low-risk / recycled
  • 7,245 actionable findings sent to teams
  • 25,472 wishlist entries across 128 repos
  • 50–200 concurrent workers per fleet scan
  • ~14 hours worst-case single-repo scan duration
  • 3–4 hours standard repo full run (~30K LoC → 100 initial findings)
  • 5 min/bug average Fixer processing rate
  • ~14 hours end-to-end discover → validate → deduplicate → open PRs for a standard repo
  • 5 days mean time to resolve for critical/exploitable findings (on average 10 of 80)
  • 15–20 days incremental hardening window for remaining bugs
  • 58% high-integrity finding rate (up from 35%)
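
The funnel figures in this list can be cross-checked with simple arithmetic: the VVS total minus the deduplicated and discarded counts should equal the actionable count. The short script below redoes that subtraction and derives two ratios. The function and key names are illustrative, and the ratios are computed here for convenience; they are not figures quoted by the post.

```python
def funnel_check():
    """Recompute the VVS funnel from the figures listed in this summary."""
    raw_candidates = 20_799   # raw candidates generated by VDH
    survived_vdh = 12_057     # survived VDH validation
    total_in_vvs = 13_841     # total bugs in VVS after cross-harness merge
    deduplicated = 5_442      # deduplicated away
    discarded = 1_154         # wrong-repo / low-risk / recycled
    actionable = 7_245        # actionable findings sent to teams

    remaining = total_in_vvs - deduplicated - discarded
    return {
        "remaining": remaining,
        "matches_actionable": remaining == actionable,
        "actionable_share": actionable / total_in_vvs,
        "vdh_survival": survived_vdh / raw_candidates,
    }


if __name__ == "__main__":
    for name, value in funnel_check().items():
        print(name, value)
```

The subtraction reproduces the 7,245 actionable findings exactly, so the three VVS counts are internally consistent.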

Caveats

  • The post is operational guidance, not a rigorous evaluation — no labeled ground-truth set exists, so recall claims are explicitly disclaimed.
  • Static analysis (Semgrep) was wired in but hunters invoked it zero times in a month of runs — they preferred reading and running code.
  • The harness is not public yet (only the initial ~450-line security-audit skill is released at github.com/cloudflare/security-audit-skill).
  • Numbers are "completely out of date by the time you're reading this" per the post.
