SYSTEM Cited by 2 sources

Cloudflare Vulnerability Discovery Harness¶

Cloudflare's vulnerability discovery harness is an 8-stage multi-agent pipeline built around Mythos Preview for AI-driven vulnerability research at scale. It runs ~50 hunter agents concurrently, each fanning out to "a handful" of exploration sub-agents, and each hunter can compile and run PoC code in a per-task scratch directory to materialise proofs of exploitability. It was disclosed publicly in Cloudflare's 2026-05-18 Project Glasswing writeup after several months of internal use.

The harness was used to scan "runtime, edge data path, protocol stack, control plane, and the open-source projects we depend on" — "more than fifty of our own repositories".

Why a harness, not a chat interface¶

Cloudflare's first attempt was the obvious one: point a generic coding agent at a repository and ask for vulnerabilities. The post enumerates two structural failures of that approach (canonical wiki instances of concepts/single-agent-coverage-failure-on-large-repos and the context-rot failure mode applied to AI vuln research):

Context shape mismatch. "Coding agents are tuned for one focused stream of work … they ingest a lot of source code, hold a single hypothesis at a time, and iterate against it. That's exactly the wrong shape for vulnerability research, which is narrow and parallel by nature."
Coverage failure on large repos. "A single agent session (even with subagents) against a hundred-thousand-line repository can cover maybe a tenth of a percent of the surface in a useful way before the model's context window fills up and compaction kicks in — potentially discarding earlier findings that would have mattered."

Once that shape mismatch was internalised, "we stopped trying to make Mythos Preview do the wrong job and started building the harness around it instead." The harness was then bootstrapped by Mythos Preview itself: "We used Mythos Preview to build on, tailor, and improve our original harnesses to suit its strengths."

The 8 stages¶

(Canonical wiki instance of patterns/multi-stage-vulnerability-discovery-harness.)

Stage	What it does	Why it matters
Recon	An agent reads the repo top-down, fans out to subagents per subsystem, and produces an architecture document covering build commands, trust boundaries, entry points, and likely attack surface; generates the initial task queue.	"Gives every downstream agent shared context. Cuts the wander problem."
Hunt	Each task is one attack class + scope hint. ~50 hunters run concurrently, each fanning out to a handful of exploration sub-agents. Each hunter has tools that compile and run PoC code in a per-task scratch directory.	"This is where most of the work happens. Many narrow tasks in parallel, not one exhaustive agent."
Validate	An independent agent re-reads the code and tries to disprove the original finding. Different prompt, no ability to emit new findings.	"Catches a meaningful fraction of the noise the hunter wouldn't catch when reviewing its own work."
Gapfill	Hunters flag areas they touched but didn't cover thoroughly; those areas get re-queued.	"Counteracts the model's tendency to drift toward attack classes it has already had success with."
Dedupe	Findings sharing the same root cause collapse into a single record.	"Variant analysis is a feature, not a way to inflate the queue with duplicates."
Trace	For each confirmed finding in a shared library, a tracer agent fans out (one instance per consumer repo), uses a cross-repo symbol index, and decides whether attacker-controlled input actually reaches the bug from outside.	"Turns 'there is a flaw' into 'there is a reachable vulnerability.' This is the stage that matters most."
Feedback	Reachable traces become new hunt tasks in the consumer repos where the bug is exposed.	"Closes the loop. The pipeline gets better as it runs."
Report	A reporting agent writes against a predefined schema, fixes its own validation errors against that schema, and submits to an ingest API.	"Output is queryable data, not free-form prose."
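The Dedupe row can be illustrated with a minimal sketch. It assumes each finding already carries a `root_cause` key; how the harness actually derives root causes is not disclosed, so the `Finding` and `DedupedRecord` types and their fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    root_cause: str  # hypothetical key, e.g. "libfoo:parse_header:unchecked-length"
    location: str    # where this variant of the bug was observed
    summary: str


@dataclass
class DedupedRecord:
    root_cause: str
    summary: str
    variants: list[str] = field(default_factory=list)


def dedupe(findings: list[Finding]) -> list[DedupedRecord]:
    """Collapse findings that share a root cause into one record each."""
    records: dict[str, DedupedRecord] = {}
    for finding in findings:
        record = records.get(finding.root_cause)
        if record is None:
            # The first finding for a root cause supplies the summary.
            record = DedupedRecord(finding.root_cause, finding.summary)
            records[finding.root_cause] = record
        # Later findings are kept as variants instead of new queue entries.
        record.variants.append(finding.location)
    return list(records.values())
```

Keeping the variant locations on the collapsed record preserves the variant analysis while the queue sees a single entry per root cause.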

Four design lessons embedded in the harness¶

The post extracts four lessons, noting that "each one pointed to the need for a harness that manages the overall execution". Each is crystallised as a wiki pattern:

patterns/narrow-scoped-agent-task — one attack class + scope hint + architecture document + prior coverage of this area. "Telling the model 'Find vulnerabilities in this repository' makes it wander."
patterns/adversarial-review-subagent — "Adding a second agent between the initial finding and the queue — one with a different prompt, a different model, and no ability to generate its own findings." Vulnerability-research instance of the wiki's existing adversarial review persona thread.
patterns/split-bug-and-reachability-questions — "Asking 'Is this code buggy?' and 'Can an attacker actually reach this bug from outside the system?' are two different questions, and the model is better at each one when you ask them separately."
patterns/parallel-narrow-agents-over-exhaustive — "Coverage improves when many agents work on tightly scoped questions and we deduplicate the results afterward, rather than asking one agent to be exhaustive."
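A narrow-scoped task, as described in the first lesson, could be represented roughly as follows. The four fields mirror the ingredients listed above; the `HuntTask` type, the `render_prompt` helper, and the prompt wording are illustrative assumptions, not Cloudflare's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuntTask:
    attack_class: str       # exactly one attack class per task
    scope_hint: str         # e.g. a directory or subsystem name
    architecture_doc: str   # shared context produced by the Recon stage
    prior_coverage: tuple[str, ...] = ()  # what earlier tasks already examined


def render_prompt(task: HuntTask) -> str:
    """Build a tightly scoped prompt instead of 'find vulnerabilities here'."""
    covered = "\n".join(f"- {item}" for item in task.prior_coverage)
    covered = covered or "- (nothing covered yet)"
    return (
        f"Attack class: {task.attack_class}\n"
        f"Scope: {task.scope_hint}\n\n"
        f"Architecture notes:\n{task.architecture_doc}\n\n"
        f"Already covered in this area:\n{covered}\n\n"
        "Report only findings of the attack class above, inside the scope above."
    )
```

Because each task names a single attack class and scope, many such tasks can run in parallel and their results can be deduplicated afterwards.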

Architectural relationships¶

Hunt stage uses patterns/proof-by-compile-and-run — per-task scratch directory + compile + run + read-failure + adjust-hypothesis loop, the second core capability of Mythos Preview.
Trace stage uses patterns/cross-repo-tracer-fan-out — one tracer instance per consumer repository, querying a cross-repo symbol index to decide reachability.
Gapfill stage uses patterns/gapfill-requeue-for-coverage — hunters self-report under-covered areas → those areas are re-queued as new hunt tasks.
Report stage uses patterns/report-agent-self-validates-schema — agent writes against a predefined schema and fixes its own validation errors before submitting.
Sibling Cloudflare harness: Cloudflare AI Code Review — same coordinator/sub-reviewer shape via patterns/coordinator-sub-reviewer-orchestration in a different domain (CI-native code review at MR time vs vulnerability discovery at fleet scale). The 2026-05-18 Glasswing post is the first wiki disclosure that the same architectural pattern shape spans Cloudflare's offensive-security and code-quality pipelines.
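The Report-stage behaviour (write against a schema, fix validation errors, then submit) can be sketched as a bounded repair loop. The reporting schema is not published, so the fields below are invented for illustration, and `repair` is a hypothetical callable standing in for the reporting agent.

```python
# Hypothetical schema: the real reporting schema is not published.
REQUIRED_FIELDS = {"title": str, "severity": str, "repo": str, "reachable": bool}
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def validate(report: dict) -> list[str]:
    """Return human-readable schema errors; an empty list means valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in report:
            errors.append(f"missing field: {name}")
        elif not isinstance(report[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    severity = report.get("severity")
    if isinstance(severity, str) and severity not in ALLOWED_SEVERITIES:
        errors.append(f"severity: {severity!r} is not an allowed value")
    return errors


def submit_with_self_repair(draft: dict, repair, max_attempts: int = 3) -> dict:
    """Let the agent (`repair`) fix its own validation errors before submitting."""
    report = dict(draft)
    for _ in range(max_attempts):
        errors = validate(report)
        if not errors:
            return report  # at this point a real system would call the ingest API
        report = repair(report, errors)
    errors = validate(report)
    if errors:
        raise ValueError(f"still invalid after {max_attempts} repairs: {errors}")
    return report
```

Feeding the validator's error list back to the agent keeps the output machine-queryable without a human correcting malformed reports.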

Numbers disclosed¶

Datum	Value
Concurrent hunters	"typically around fifty at once"
Per-hunter sub-agents	"a handful" of exploration sub-agents
Per-task isolation	per-task scratch directory for PoC compile/run
Repos scanned	"more than fifty"
Surfaces scanned	runtime, edge data path, protocol stack, control plane, OSS deps

Per-stage noise-reduction percentages, per-MR-equivalent token spend, and quantitative bug counts are not disclosed.

Open / not disclosed¶

Reporting schema — "a predefined schema" but the schema itself is not published.
Cross-repo symbol index implementation — named as a capability of the Trace stage; the underlying tool is not named.
Whether hunters all run on Mythos Preview or on a mix of models — Cloudflare states that "the validator runs on a different model" but does not enumerate the per-stage model assignment.
Public CVE flow — Cloudflare states that everything surfaced is run through "Cloudflare's formal vulnerability management process" but does not disclose a separate flow to public CVE channels for findings in upstream OSS.

Seen in¶

sources/2026-05-18-cloudflare-project-glasswing-what-mythos-showed-us — first and canonical wiki disclosure: 8-stage pipeline, ~50 concurrent hunters, per-task scratch dirs, cross-repo tracer, schema-validated reporting, dogfood-build via Mythos Preview.

systems/mythos-preview — the engine behind the hunters.
systems/anthropic-project-glasswing — the partner program under which the harness was tuned.
systems/cloudflare-ai-code-review — sibling Cloudflare multi-agent pipeline (CI-time MR review).
patterns/multi-stage-vulnerability-discovery-harness — the canonical pattern shape.
patterns/coordinator-sub-reviewer-orchestration — the generalised orchestration pattern.
concepts/signal-to-noise-in-ai-vulnerability-triage — the failure mode the harness exists to control.
companies/cloudflare — the operator.

Operational details (2026-06-18 disclosure)¶

The "Build your own vulnerability harness" post (Source: sources/2026-06-18-cloudflare-build-your-own-vulnerability-harness) disclosed production-scale operational details not present in the earlier Glasswing writeup:

State management¶

All state is externalised to a single SQLite database keyed by (run_id, repo, stage). Any stage can resume, retry, or be pulled into a later run without redoing work. Findings are streamed and saved as they happen — a crash costs only the task in flight (patterns/sqlite-keyed-stage-persistence).
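A minimal sketch of this persistence scheme, using Python's standard `sqlite3` module, is shown below. Only the `(run_id, repo, stage)` key comes from the post; the table names, columns, and helper functions are assumptions made for illustration.

```python
import json
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the state database with hypothetical table layouts."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS stage_state (
               run_id TEXT NOT NULL, repo TEXT NOT NULL, stage TEXT NOT NULL,
               status TEXT NOT NULL, payload TEXT NOT NULL,
               PRIMARY KEY (run_id, repo, stage))"""
    )
    db.execute(
        """CREATE TABLE IF NOT EXISTS findings (
               id INTEGER PRIMARY KEY, run_id TEXT NOT NULL, repo TEXT NOT NULL,
               stage TEXT NOT NULL, body TEXT NOT NULL)"""
    )
    return db


def save_stage(db, run_id, repo, stage, status, payload) -> None:
    """Record a stage's status; rerunning the stage overwrites the same row."""
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT OR REPLACE INTO stage_state VALUES (?, ?, ?, ?, ?)",
            (run_id, repo, stage, status, json.dumps(payload)),
        )


def record_finding(db, run_id, repo, stage, finding) -> None:
    """Persist one finding immediately, so a crash loses only work in flight."""
    with db:
        db.execute(
            "INSERT INTO findings (run_id, repo, stage, body) VALUES (?, ?, ?, ?)",
            (run_id, repo, stage, json.dumps(finding)),
        )


def is_done(db, run_id, repo, stage) -> bool:
    """Let a resumed run skip stages that already completed."""
    row = db.execute(
        "SELECT status FROM stage_state WHERE run_id = ? AND repo = ? AND stage = ?",
        (run_id, repo, stage),
    ).fetchone()
    return row is not None and row[0] == "done"
```

On resume, a scheduler could call `is_done` for each stage and skip completed ones, which is one way to get the "resume without redoing work" property described above.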

Each agent's context usage is kept below 25% of the total window — a "naive read-all-files approach will blow past this limit every single time."
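One way to respect such a budget is to admit files into an agent's context only while the running total stays under a quarter of the window. The sketch below assumes a crude four-characters-per-token estimate and a hypothetical `select_within_budget` helper; a real harness would presumably use the model's own tokenizer.

```python
CONTEXT_BUDGET_FRACTION = 0.25  # the 25% ceiling described above


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return len(text) // 4 + 1


def select_within_budget(files, window_tokens: int):
    """Pick (path, text) pairs in order, skipping any that would exceed the budget."""
    budget = int(window_tokens * CONTEXT_BUDGET_FRACTION)
    chosen, used = [], 0
    for path, text in files:
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large for the remaining budget; leave it to a sub-agent
        chosen.append(path)
        used += cost
    return chosen, used
```

A read-all-files strategy corresponds to ignoring `budget` entirely, which is the failure the quoted passage warns about.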

Model agnosticism¶

Models are treated as interchangeable compute engines (concepts/model-agnostic-orchestration). VDH (the vulnerability discovery harness) uses one model; the downstream VVS uses a completely different model, so findings are cross-checked by a distinct set of model weights. The harness absorbs downstream model-provider volatility (temperature, caching, inference-effort changes) without breaking.

Producer-consumer loop¶

Stages 4–8 (Gapfill, Dedupe, Trace, Feedback, Report) run as a continuous producer-consumer loop. As the initial hunt progresses, Gapfill, Feedback, and Trace generate new tasks; Dedupe folds overlapping findings; the rest of the loop keeps consuming the queue. A vulnerability discovered late in the cycle is still validated and traced within the same run.
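The loop can be sketched as a single work queue in which handling one task may enqueue follow-up tasks. This is a single-threaded simplification under stated assumptions: `run_loop`, the `handle` callback, and the `max_tasks` safety cap are hypothetical, and the real system runs many workers concurrently.

```python
from collections import deque


def run_loop(initial_tasks, handle, max_tasks: int = 10_000):
    """Drain a queue where each handled task may produce findings and new tasks.

    `handle(task)` must return (findings, follow_up_tasks).
    """
    queue = deque(initial_tasks)
    seen = set(initial_tasks)  # avoid re-queueing identical tasks
    findings = []
    processed = 0
    while queue and processed < max_tasks:
        task = queue.popleft()
        processed += 1
        new_findings, follow_ups = handle(task)
        findings.extend(new_findings)
        for follow_up in follow_ups:
            if follow_up not in seen:
                seen.add(follow_up)
                queue.append(follow_up)  # producers feed the same queue they consume
    return findings, processed
```

Because follow-up tasks land on the same queue, a finding that appears late still flows through the remaining steps before the run ends.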

Sibling forking and the Wishlist¶

Sibling forking (patterns/sibling-fork-for-scope-deviation): when a hunter encounters an interesting but out-of-scope code path, it forks a sibling agent with a precise structural seed. Fleet-wide, this accounts for 9–20% of tasks, depending on the model.

The Wishlist (patterns/wishlist-tool-for-agent-dependency): when an agent needs a tool it doesn't have (FreeBSD VM, specific build env, prod config), it writes to a central wishlist. 25,472 entries across 128 repos. Some self-heal via a generic coding harness monitoring logs.
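A wishlist of this kind can be as simple as an append-only log that operators aggregate to see which missing capabilities block the most tasks. The sketch below is an assumption about shape only; the `Wishlist` class, its fields, and the normalisation step are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WishlistEntry:
    repo: str
    task_id: str
    need: str  # free-text description of the missing tool or environment


class Wishlist:
    """Central, append-only record of capabilities agents asked for."""

    def __init__(self) -> None:
        self._entries: list[WishlistEntry] = []

    def request(self, repo: str, task_id: str, need: str) -> None:
        # Normalise lightly so identical requests aggregate together.
        self._entries.append(WishlistEntry(repo, task_id, need.strip().lower()))

    def top_needs(self, n: int = 5) -> list[tuple[str, int]]:
        """Most frequently requested capabilities, most common first."""
        return Counter(entry.need for entry in self._entries).most_common(n)
```

Ranking requests by frequency gives a simple priority order for which environments or tools to provision next.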

Fleet-wide scale¶

128 distinct repos scanned
50–200 concurrent workers per fleet scan
14+ hours worst-case single-repo run
3–4 hours standard repo (~30K LoC, 100 initial findings)
Initial validation rejection rate dropped from 40% to 11%
High-integrity finding share climbed from 35% to 58% (12,057 lifetime findings)
20,799 raw candidates generated; 12,057 survived validation
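As a quick arithmetic check on the last two figures, the share of raw candidates that survived validation works out to roughly 58%. The post does not state that this ratio is how the high-integrity share is defined, so treat the match as an observation rather than a definition.

```python
RAW_CANDIDATES = 20_799
SURVIVED_VALIDATION = 12_057

rejected = RAW_CANDIDATES - SURVIVED_VALIDATION       # 8,742 candidates rejected
survival_share = SURVIVED_VALIDATION / RAW_CANDIDATES
rejection_share = rejected / RAW_CANDIDATES

print(f"survived: {survival_share:.1%}")   # survived: 58.0%
print(f"rejected: {rejection_share:.1%}")  # rejected: 42.0%
```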

Coverage measurement¶

Repos are divided into (area × attack-class) cells (concepts/coverage-cell). Gapfill runs iteratively until it stops producing findings. When prompts are updated, they are tested against a held-out repository to confirm the coverage-cell count actually moves.
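The cell model lends itself to simple set arithmetic: the full grid is the Cartesian product of areas and attack classes, and the gaps are whatever has not been covered yet. The function names and example values below are hypothetical.

```python
from itertools import product


def all_cells(areas, attack_classes):
    """Every (area, attack_class) pair that should eventually be examined."""
    return set(product(areas, attack_classes))


def gaps(areas, attack_classes, covered):
    """Cells not yet covered, sorted for stable re-queueing."""
    return sorted(all_cells(areas, attack_classes) - set(covered))


def coverage_ratio(areas, attack_classes, covered) -> float:
    """Fraction of cells covered; an empty grid counts as fully covered."""
    cells = all_cells(areas, attack_classes)
    if not cells:
        return 1.0
    return len(cells & set(covered)) / len(cells)
```

For example, with areas `["parser", "auth"]`, attack classes `["overflow", "injection"]`, and only `("parser", "overflow")` covered, `gaps` returns the three remaining cells and `coverage_ratio` is 0.25. Comparing that ratio before and after a prompt change is one way to check that the coverage-cell count actually moves.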

Health signals¶

Any hunter that finishes with zero findings is flagged as "shallow" (concepts/shallow-run-detection) and immediately requeued; a zero-finding run usually indicates a crashed dependency rather than a clean codebase.

Static analysis not adopted¶

Semgrep was wired all the way through, but hunters invoked it zero times in a month of runs. They preferred reading and running code. The Wishlist was the single most-used tool in the system.