
CONCEPT Cited by 1 source

Single-agent coverage failure on large repos

Definition

Single-agent coverage failure is the empirical observation that a single LLM-coding-agent session, no matter how capable the model, cannot achieve useful coverage on a real-world hundred-thousand-line repository when the task is hypothesis-parallel by nature (vulnerability research, performance hotspot search, architectural-debt audit, etc.). The failure has two distinct mechanisms that compound.

Cloudflare's canonical articulation

From the 2026-05-18 Project Glasswing writeup arguing for a multi-agent harness over a generic-coding-agent approach:

"Coding agents are tuned for one focused stream of work: building a feature, fixing a bug, writing a refactor. They ingest a lot of source code, hold a single hypothesis at a time, and iterate against it. That's exactly the wrong shape for vulnerability research, which is narrow and parallel by nature. A human researcher picks one specific thing to look at and investigates it thoroughly. That one thing might be a single complex feature, transitions across security boundaries, or a specific vulnerability class … Then they do it again, for a different feature, security boundary, or vulnerability class, several thousand times across the codebase."

"A single agent session (even with subagents) against a hundred-thousand-line repository can cover maybe a tenth of a percent of the surface in a useful way before the model's context window fills up and compaction kicks in — potentially discarding earlier findings that would have mattered."

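The two quotes map onto the two mechanisms: the first describes the shape mismatch, the second the coverage loss caused by a filling context window.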

Two compounding failure mechanisms

1. Shape mismatch — narrow & parallel vs single & sequential

Vulnerability research (and analogous classes such as performance hotspot search, dead-code audit, and architectural-debt scan) is inherently parallel: thousands of small, independent investigations, each with its own scope. A single-agent session serialises that work behind one context window. Even if the model handled each individual investigation perfectly, coverage per session is bounded by the number of investigations the agent can fit between the prompt and the context-window ceiling.

Cloudflare's first-person datum: "single-stream agent does one thing at a time, but real codebases need many hypotheses against many components at once."
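The bound described above can be sketched as simple arithmetic. This is a minimal illustration, not Cloudflare's model: the function name and all three token figures are hypothetical, chosen only to show how the ceiling behaves.

```python
def investigations_per_session(context_window_tokens: int,
                               prompt_overhead_tokens: int,
                               tokens_per_investigation: int) -> int:
    """Upper bound on investigations one session can hold before the window fills."""
    if tokens_per_investigation <= 0:
        raise ValueError("tokens_per_investigation must be positive")
    # Whatever the prompt and orientation consume is unavailable for investigation.
    usable = max(context_window_tokens - prompt_overhead_tokens, 0)
    return usable // tokens_per_investigation


# Hypothetical figures (assumptions, not taken from the source).
bound = investigations_per_session(
    context_window_tokens=200_000,
    prompt_overhead_tokens=20_000,
    tokens_per_investigation=15_000,
)
print(bound)  # 12
```

Under these assumed figures a session tops out at a dozen investigations, while the task calls for thousands. A more capable model does not move the bound; only a larger window or cheaper investigations would.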

2. Context-rot-driven coverage loss

Even within the bound that the shape mismatch sets, the coverage is further reduced by context rot: as accumulated tool output, intermediate code reads, and prior-finding context fill the window, the model's accuracy degrades and compaction discards earlier context — "potentially discarding earlier findings that would have mattered".

The compaction effect is load-bearing for vuln research specifically because findings accumulate across the session. A coding agent fixing one bug rolls forward in a single hypothesis trajectory; a vuln-research agent maintains a list of "things found, by location, with hedge" that compaction can drop.
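A toy sketch of why accumulated findings are exposed to compaction. The source does not disclose the compaction algorithm, so `naive_compact` below is a stand-in that simply keeps the most recent entries. The `Finding` fields and the example location are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    location: str  # where the issue was seen (hypothetical format)
    note: str      # what was found
    hedge: str     # how confident the agent is


def naive_compact(context: list, keep_last: int) -> list:
    """Stand-in compaction (an assumption): keep only the most recent entries."""
    return context[-keep_last:] if keep_last > 0 else []


# An early finding, followed by later tool output that fills the window.
context = [Finding("parser.c:120", "possible unchecked length", "speculative")]
context += [f"tool output {i}" for i in range(5)]

compacted = naive_compact(context, keep_last=3)
surviving = [entry for entry in compacted if isinstance(entry, Finding)]
print(len(surviving))  # 0
```

Under this recency-only policy the earliest finding is the first thing lost. Real compaction is presumably smarter, but the exposure is the same: anything the session is meant to carry to the end has to survive every compaction pass.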

The "tenth of a percent" datum

Cloudflare's specific quantification, "maybe a tenth of a percent of the surface in a useful way", sets the scale. On a 100,000-LoC repository, that is roughly 100 lines of useful coverage per single-agent session. Even with hundreds of single-agent sessions per repo, the coverage deficit is structural: each session burns context-window budget on repository orientation and architecture reconstruction before it can investigate anything specific.
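The arithmetic behind that scale, as a short sketch. It takes the quoted "tenth of a percent" literally as one part per thousand. It also assumes, optimistically and beyond what the source states, that sessions never overlap in what they cover. The function names are illustrative.

```python
import math


def useful_lines_per_session(repo_lines: int, useful_per_mille: int = 1) -> int:
    """Lines usefully covered by one session, with coverage in parts per thousand."""
    return repo_lines * useful_per_mille // 1000


def sessions_for_one_pass(repo_lines: int, useful_per_mille: int = 1) -> int:
    """Idealised session count for one full pass (assumes zero overlap)."""
    per_session = useful_lines_per_session(repo_lines, useful_per_mille)
    if per_session == 0:
        raise ValueError("repository too small for this coverage rate")
    return math.ceil(repo_lines / per_session)


repo_lines = 100_000
print(useful_lines_per_session(repo_lines))  # 100
print(sessions_for_one_pass(repo_lines))     # 1000

# Even 300 perfectly non-overlapping sessions reach only 30% of the repo.
print(300 * useful_lines_per_session(repo_lines) / repo_lines)  # 0.3
```

Even in this idealised case, a few hundred sessions fall well short of a single pass over the repository.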

Why subagents-of-a-single-session don't fix it

Cloudflare's parenthetical is direct: "even with subagents". A single agent session that fans out to subagents still has the parent context window as a scarce resource: orchestrating subagents and aggregating their results consumes the same budget that would otherwise go to investigation. Subagent fan-out from a single parent is an intra-session optimisation; coverage requires inter-session parallelism, meaning many independent agent sessions, each with its own fresh context window, delegated and queued externally.

This is the core architectural lesson driving patterns/parallel-narrow-agents-over-exhaustive and patterns/multi-stage-vulnerability-discovery-harness.

Sibling failure mode comparison

| Concept | Failure axis | Where it bites |
| --- | --- | --- |
| Single-agent coverage failure | Coverage / breadth | Many-hypothesis tasks on large surfaces |
| concepts/agent-hyperfixation-failure-mode | Reasoning-path commitment | Single-hypothesis tasks where the agent commits to the wrong angle |
| concepts/context-rot | Accuracy over time | Long sessions in general |
| concepts/agent-context-window | Token budget constraint | Mechanism that drives the others |

The wiki's pre-existing thread on context-window failure modes was framed for coding tasks (Vercel's Turborepo agent experiment, Dropbox Dash). Cloudflare's 2026-05-18 post is the first wiki articulation of the same family applied to vulnerability research, where the hypothesis-parallel structure makes the single-agent ceiling far more painful than in feature-build tasks.

Architectural lever: external delegation, not internal subagents

The fix is structural: split the work into independent sessions, each with its own context, queued by an external orchestrator (the Recon stage of Cloudflare's harness produces "the initial queue of tasks for the next stage"). In Cloudflare's words:

"Once we accepted that, we stopped trying to make Mythos Preview do the wrong job and started building the harness around it instead."

This is also why both patterns/coordinator-sub-reviewer-orchestration and patterns/multi-stage-vulnerability-discovery-harness explicitly spawn multiple fresh-context agents rather than running one large agent with internal subagents.
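A minimal sketch of external delegation, assuming a thread pool as the orchestrator and a stub in place of a real agent session. None of the names (`Task`, `SessionResult`, `run_fresh_session`, `orchestrate`) come from Cloudflare's harness, and the task targets and hypotheses are made up. The point is only the structure: every task gets its own context, created from scratch and discarded afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    target: str      # hypothetical: component to inspect
    hypothesis: str  # hypothetical: vulnerability class to test


@dataclass
class SessionResult:
    task: Task
    context_size_at_start: int
    findings: list = field(default_factory=list)


def run_fresh_session(task: Task) -> SessionResult:
    """Stand-in for one independent agent session (no model is called here)."""
    context: list = []  # created per call, so nothing carries over between tasks
    size_at_start = len(context)
    context.append(f"investigate {task.hypothesis} in {task.target}")
    return SessionResult(task=task, context_size_at_start=size_at_start)


def orchestrate(tasks: list, max_parallel: int = 4) -> list:
    """External queue: each task runs in its own session, results come back in order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_fresh_session, tasks))


tasks = [Task(t, h) for t in ("auth", "parser") for h in ("injection", "overflow")]
results = orchestrate(tasks)
print(len(results))                                        # 4
print(all(r.context_size_at_start == 0 for r in results))  # True
```

Aggregation happens in ordinary code outside any model context, so collecting results costs no context-window budget. That is the difference from a parent agent collecting subagent output.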

Open / not disclosed

  • What model or context-window size makes this better? Cloudflare doesn't benchmark coverage as a function of context-window length; the failure mode is described as structural, not scale-dependent.
  • Compaction strategies that preserve findings. "context window fills up and compaction kicks in" names the mechanism but not the specific compaction algorithm or what could be preserved.

Seen in
