SYSTEM Cited by 2 sources

Cloudflare Vulnerability Discovery Harness

Cloudflare's vulnerability discovery harness is an 8-stage multi-agent pipeline built around Mythos Preview for AI-driven vulnerability research with broad coverage. It runs ~50 hunter agents concurrently, each fanning out to "a handful" of exploration sub-agents; each hunter can compile and run PoC code in a per-task scratch directory to demonstrate exploitability. Cloudflare disclosed the harness publicly in its 2026-05-18 Project Glasswing writeup after several months of internal use.

The harness was used to scan "runtime, edge data path, protocol stack, control plane, and the open-source projects we depend on", covering "more than fifty of our own repositories".

Why a harness, not a chat interface

Cloudflare's first attempt was the obvious one: point a generic coding agent at a repository and ask for vulnerabilities. The post enumerates two structural failures of that approach (canonical wiki instances of concepts/single-agent-coverage-failure-on-large-repos and the context-rot failure mode applied to AI vuln research):

  • Context shape mismatch. "Coding agents are tuned for one focused stream of work … they ingest a lot of source code, hold a single hypothesis at a time, and iterate against it. That's exactly the wrong shape for vulnerability research, which is narrow and parallel by nature."
  • Coverage failure on large repos. "A single agent session (even with subagents) against a hundred-thousand-line repository can cover maybe a tenth of a percent of the surface in a useful way before the model's context window fills up and compaction kicks in — potentially discarding earlier findings that would have mattered."

Once that shape mismatch was internalised, "we stopped trying to make Mythos Preview do the wrong job and started building the harness around it instead." The harness was then bootstrapped by Mythos Preview itself: "We used Mythos Preview to build on, tailor, and improve our original harnesses to suit its strengths."

The 8 stages

(Canonical wiki instance of patterns/multi-stage-vulnerability-discovery-harness.)

Stage | What it does | Why it matters
Recon | An agent reads the repo top-down, fans out to subagents per subsystem, and produces an architecture document covering build commands, trust boundaries, entry points, and likely attack surface; generates the initial task queue. | "Gives every downstream agent shared context. Cuts the wander problem."
Hunt | Each task is one attack class + scope hint. ~50 hunters run concurrently, each fanning out to a handful of exploration sub-agents. Each hunter has tools that compile and run PoC code in a per-task scratch directory. | "This is where most of the work happens. Many narrow tasks in parallel, not one exhaustive agent."
Validate | An independent agent re-reads the code and tries to disprove the original finding. Different prompt, no ability to emit new findings. | "Catches a meaningful fraction of the noise the hunter wouldn't catch when reviewing its own work."
Gapfill | Hunters flag areas they touched but didn't cover thoroughly; those areas get re-queued. | "Counteracts the model's tendency to drift toward attack classes it has already had success with."
Dedupe | Findings sharing the same root cause collapse into a single record. | "Variant analysis is a feature, not a way to inflate the queue with duplicates."
Trace | For each confirmed finding in a shared library, a tracer agent fans out (one instance per consumer repo), uses a cross-repo symbol index, and decides whether attacker-controlled input actually reaches the bug from outside. | "Turns 'there is a flaw' into 'there is a reachable vulnerability.' This is the stage that matters most."
Feedback | Reachable traces become new hunt tasks in the consumer repos where the bug is exposed. | "Closes the loop. The pipeline gets better as it runs."
Report | A reporting agent writes against a predefined schema, fixes its own validation errors against that schema, and submits to an ingest API. | "Output is queryable data, not free-form prose."
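
The Dedupe row can be illustrated with a minimal sketch that groups findings by a root-cause key. The post does not say how a root cause is identified, so the `Finding` fields and the `root_cause` string below are hypothetical stand-ins, not Cloudflare's schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; the real reporting schema is not published.
    repo: str
    location: str
    root_cause: str  # assumed normalised identifier for the underlying flaw


@dataclass
class Record:
    root_cause: str
    variants: list[Finding] = field(default_factory=list)


def dedupe(findings: list[Finding]) -> list[Record]:
    """Collapse findings that share a root cause into one record each."""
    grouped: dict[str, Record] = {}
    for finding in findings:
        record = grouped.setdefault(finding.root_cause, Record(finding.root_cause))
        record.variants.append(finding)  # variants are kept, not discarded
    return list(grouped.values())
```

Keeping the variants on the record, instead of dropping them, matches the table's point that variant analysis is a feature while the queue itself holds one entry per root cause.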

Four design lessons embedded in the harness

The post extracts four lessons, noting that "each one pointed to the need for a harness that manages the overall execution". Each is crystallised as a wiki pattern:

  • patterns/narrow-scoped-agent-task: one attack class + scope hint + architecture document + prior coverage of this area. "Telling the model 'Find vulnerabilities in this repository' makes it wander."
  • patterns/adversarial-review-subagent: "Adding a second agent between the initial finding and the queue — one with a different prompt, a different model, and no ability to generate its own findings." This is the vulnerability-research instance of the wiki's existing adversarial review persona thread.
  • patterns/split-bug-and-reachability-questions: "Asking 'Is this code buggy?' and 'Can an attacker actually reach this bug from outside the system?' are two different questions, and the model is better at each one when you ask them separately."
  • patterns/parallel-narrow-agents-over-exhaustive: "Coverage improves when many agents work on tightly scoped questions and we deduplicate the results afterward, rather than asking one agent to be exhaustive."
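
As a rough illustration of the narrow-scoped task pattern, the sketch below bundles the four inputs named in the first bullet into one task object and renders a prompt from it. The `HuntTask` type, its field names, and the prompt layout are assumptions made for this example; the post does not publish its prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuntTask:
    # Hypothetical container for the four inputs the post lists.
    attack_class: str
    scope_hint: str
    architecture_doc: str
    prior_coverage: tuple[str, ...] = ()


def render_prompt(task: HuntTask) -> str:
    """Build a tightly scoped prompt instead of 'find vulnerabilities here'."""
    covered = "\n".join(f"- {item}" for item in task.prior_coverage) or "- (none yet)"
    return (
        f"Attack class: {task.attack_class}\n"
        f"Scope: {task.scope_hint}\n"
        f"Already covered:\n{covered}\n\n"
        f"Architecture notes:\n{task.architecture_doc}\n"
    )


task = HuntTask(
    attack_class="integer overflow",
    scope_hint="src/parser/",
    architecture_doc="The parser reads untrusted network input.",
)
print(render_prompt(task))
```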

Architectural relationships

Numbers disclosed

Datum | Value
Concurrent hunters | "typically around fifty at once"
Per-hunter sub-agents | "a handful" of exploration sub-agents
Per-task isolation | per-task scratch directory for PoC compile/run
Repos scanned | "more than fifty"
Surfaces scanned | runtime, edge data path, protocol stack, control plane, OSS deps

Per-stage noise-reduction percentages, per-MR-equivalent token spend, and quantitative bug counts are not disclosed.

Open / not disclosed

  • Reporting schema: described as "a predefined schema", but the schema itself is not published.
  • Cross-repo symbol index implementation: named as a capability of the Trace stage; the underlying tool is not named.
  • Whether hunters all run on Mythos Preview or on a mix of models: Cloudflare states that "the validator runs on a different model" but does not enumerate the per-stage model assignment.
  • Public CVE flow: Cloudflare states that everything surfaced is run through "Cloudflare's formal vulnerability management process" but does not disclose a separate flow to public CVE channels for findings in upstream OSS.

Seen in

Operational details (2026-06-18 disclosure)

The "Build your own vulnerability harness" post (Source: sources/2026-06-18-cloudflare-build-your-own-vulnerability-harness) disclosed production-scale operational details not present in the earlier Glasswing writeup:

State management

All state is externalised to a single SQLite database keyed by (run_id, repo, stage). Any stage can resume, retry, or be pulled into a later run without redoing work. Findings are streamed and saved as they happen — a crash costs only the task in flight (patterns/sqlite-keyed-stage-persistence).
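
A minimal sketch of this persistence style, using Python's standard `sqlite3` module, might look like the following. The table names, columns, and helper functions are assumptions for illustration; only the `(run_id, repo, stage)` key and the save-as-you-go behaviour come from the post.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) tables if missing."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS stage_state (
            run_id TEXT NOT NULL,
            repo   TEXT NOT NULL,
            stage  TEXT NOT NULL,
            status TEXT NOT NULL,
            PRIMARY KEY (run_id, repo, stage)
        );
        CREATE TABLE IF NOT EXISTS findings (
            id      INTEGER PRIMARY KEY,
            run_id  TEXT NOT NULL,
            repo    TEXT NOT NULL,
            stage   TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        """
    )
    return conn


def mark_stage(conn: sqlite3.Connection, run_id: str, repo: str, stage: str, status: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, ?)",
            (run_id, repo, stage, status),
        )


def is_done(conn: sqlite3.Connection, run_id: str, repo: str, stage: str) -> bool:
    row = conn.execute(
        "SELECT status FROM stage_state WHERE run_id = ? AND repo = ? AND stage = ?",
        (run_id, repo, stage),
    ).fetchone()
    return row is not None and row[0] == "done"


def save_finding(conn: sqlite3.Connection, run_id: str, repo: str, stage: str, payload: str) -> None:
    # One commit per finding, so a crash loses only the task in flight.
    with conn:
        conn.execute(
            "INSERT INTO findings (run_id, repo, stage, payload) VALUES (?, ?, ?, ?)",
            (run_id, repo, stage, payload),
        )
```

A resuming run would call `is_done` before starting a stage and skip it when the earlier run already finished that `(run_id, repo, stage)` combination.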

Each agent's context usage is kept below 25% of the total window — a "naive read-all-files approach will blow past this limit every single time."
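
One way to picture the 25% budget is a greedy file selector that admits files only while the running total stays below the cap. The post does not describe how the limit is enforced, so the function, the token estimates, and the 200,000-token window below are all hypothetical.

```python
def select_files(
    file_tokens: dict[str, int],
    window_tokens: int,
    max_fraction: float = 0.25,
) -> list[str]:
    """Admit files in the given order while usage stays below the budget."""
    budget = window_tokens * max_fraction
    chosen: list[str] = []
    used = 0
    for path, tokens in file_tokens.items():
        if used + tokens < budget:  # strictly below the cap
            chosen.append(path)
            used += tokens
        # Files that do not fit are skipped, not truncated.
    return chosen


# Hypothetical token estimates for three files and a 200,000-token window.
estimates = {"a.c": 20_000, "b.c": 40_000, "c.c": 5_000}
print(select_files(estimates, window_tokens=200_000))  # ['a.c', 'c.c']
```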

Model agnosticism

Models are treated as interchangeable compute engines (concepts/model-agnostic-orchestration). The VDH (this harness) uses one model; the downstream VVS uses a completely different model, so every finding is cross-checked by a model with different weights. The harness absorbs downstream model-provider volatility (temperature, caching, inference-effort changes) without breaking.
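
The idea of interchangeable models can be sketched with a small interface that any provider wrapper could satisfy. Everything here is an assumption for illustration: the `Model` protocol, the `StubModel` stand-in, and the "confirmed" verdict convention are not from the post.

```python
from typing import Protocol


class Model(Protocol):
    """Anything with a name and a text-completion call (hypothetical interface)."""

    name: str

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in for a real provider client, returning a canned reply."""

    def __init__(self, name: str, reply: str) -> None:
        self.name = name
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


def cross_check(finding: str, hunter: Model, validator: Model) -> bool:
    """Ask a different model to try to disprove a finding."""
    if hunter.name == validator.name:
        raise ValueError("validator must be a different model from the hunter")
    verdict = validator.complete(f"Try to disprove this finding:\n{finding}")
    return verdict.strip().lower() == "confirmed"
```

Because the orchestration code depends only on the protocol, swapping a provider means replacing one wrapper class and leaving the pipeline untouched.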

Producer-consumer loop

Stages 4–8 (Gapfill, Dedupe, Trace, Feedback, Report) run as a continuous producer-consumer loop. As the initial hunt progresses, Gapfill, Feedback, and Trace generate new tasks; Dedupe folds overlapping findings; the rest of the loop keeps consuming the queue. A vulnerability discovered late in the cycle is still validated and traced within the same run.
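
The loop can be sketched as a work queue whose consumers may also produce new tasks. This single-threaded version is a simplification (the real system runs many workers concurrently), and the task shape and `example_handle` routing are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable

Task = tuple[str, str]  # (stage, subject); a hypothetical task shape


def run_loop(initial: Iterable[Task], handle: Callable[[Task], Iterable[Task]]) -> list[Task]:
    """Consume tasks until the queue drains; handlers may enqueue more."""
    queue = deque(initial)
    seen = set(queue)
    processed: list[Task] = []
    while queue:
        task = queue.popleft()
        processed.append(task)
        for new_task in handle(task):
            if new_task not in seen:  # avoid re-queuing identical work
                seen.add(new_task)
                queue.append(new_task)
    return processed


def example_handle(task: Task) -> list[Task]:
    stage, subject = task
    if stage == "hunt" and subject == "libfoo":
        return [("trace", "libfoo")]       # a confirmed finding gets traced
    if stage == "trace":
        return [("hunt", "consumer-app")]  # a reachable trace feeds a new hunt
    return []


print(run_loop([("hunt", "libfoo")], example_handle))
```

The `seen` set is a design choice of this sketch: it keeps the feedback edge from re-enqueuing work that has already been scheduled, which is one simple way to guarantee the loop terminates.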

Sibling forking and the Wishlist

Sibling forking (patterns/sibling-fork-for-scope-deviation): when a hunter encounters an interesting but out-of-scope code path, it forks a sibling agent with a precise structural seed. Fleet-wide, forking accounts for 9–20% of tasks, depending on the model.

The Wishlist (patterns/wishlist-tool-for-agent-dependency): when an agent needs a tool it doesn't have (FreeBSD VM, specific build env, prod config), it writes to a central wishlist. The wishlist holds 25,472 entries across 128 repos. Some entries self-heal via a generic coding harness that monitors the logs.

Fleet-wide scale

  • 128 distinct repos scanned
  • 50–200 concurrent workers per fleet scan
  • 14+ hours worst-case single-repo run
  • 3–4 hours standard repo (~30K LoC, 100 initial findings)
  • Initial validation rejection rate dropped from 40% to 11%
  • High-integrity finding share climbed from 35% to 58% (~12,057 lifetime findings)
  • 20,799 raw candidates generated; 12,057 survived validation
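
As a quick arithmetic check on the last two figures, the ratio of surviving findings to raw candidates works out to about 58%. Whether this ratio is the same measure as the "high-integrity finding share" above is an assumption; the numbers merely line up.

```python
raw_candidates = 20_799
survived = 12_057

survival_rate = survived / raw_candidates
rejected = raw_candidates - survived

print(f"survived: {survival_rate:.1%}, rejected: {rejected}")
# survived: 58.0%, rejected: 8742
```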

Coverage measurement

Repos are divided into (area × attack-class) cells (concepts/coverage-cell). Gapfill runs iteratively until it stops producing findings. When prompts are updated, they are tested against a held-out repository to confirm the coverage-cell count actually moves.
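
A coverage grid of this kind can be sketched as the Cartesian product of areas and attack classes, with Gapfill looping until a round comes back empty. The function names, the `max_rounds` safety cap, and the example areas are assumptions for illustration.

```python
from itertools import product
from typing import Callable, Iterable

Cell = tuple[str, str]  # (area, attack_class)


def uncovered_cells(
    areas: Iterable[str],
    attack_classes: Iterable[str],
    covered: Iterable[Cell],
) -> list[Cell]:
    """Return the (area, attack-class) cells no hunter has covered yet."""
    all_cells = set(product(areas, attack_classes))
    return sorted(all_cells - set(covered))


def gapfill(run_round: Callable[[], list[str]], max_rounds: int = 10) -> list[str]:
    """Repeat rounds until one produces no findings (or the assumed cap hits)."""
    findings: list[str] = []
    for _ in range(max_rounds):
        new = run_round()
        if not new:
            break
        findings.extend(new)
    return findings


gaps = uncovered_cells(
    areas=["parser", "auth"],
    attack_classes=["overflow", "injection"],
    covered=[("parser", "overflow"), ("auth", "injection")],
)
print(gaps)  # [('auth', 'overflow'), ('parser', 'injection')]
```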

Health signals

Any hunter that finishes with zero findings is flagged as "shallow" (concepts/shallow-run-detection) and immediately requeued; a zero-finding run usually indicates a crashed dependency rather than a clean codebase.
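
The shallow-run check reduces to a filter over per-task finding counts. The sketch below assumes a simple mapping from task id to count and a list-based queue; both are hypothetical shapes chosen for the example.

```python
def find_shallow(findings_per_task: dict[str, int]) -> list[str]:
    """Return ids of tasks whose hunter finished with zero findings."""
    return [task_id for task_id, count in findings_per_task.items() if count == 0]


def requeue_shallow(queue: list[str], findings_per_task: dict[str, int]) -> list[str]:
    """Append shallow tasks to the queue, skipping any already waiting."""
    extra = [t for t in find_shallow(findings_per_task) if t not in queue]
    return queue + extra


results = {"task-1": 3, "task-2": 0, "task-3": 0}
print(requeue_shallow(["task-3"], results))  # ['task-3', 'task-2']
```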

Static analysis not adopted

Semgrep was wired all the way through, but hunters invoked it zero times in a month of runs. They preferred reading and running code. The Wishlist was the single most-used tool in the system.
