CONCEPT Cited by 1 source

Model bias toward finding something¶

Definition¶

Model bias toward finding something is the empirically observed tendency of LLMs, when asked to perform an exploratory task ("find bugs in this code", "find issues with this design", "look for problems with this plan") to emit findings whether or not the input contains them — and to hedge those findings with "possibly" / "potentially" / "could in theory" qualifiers that do not reduce per-finding triage cost.

Cloudflare's verbatim canonical articulation (Source: sources/2026-05-18-cloudflare-project-glasswing-what-mythos-showed-us):

"A good human researcher tells you what they found and how confident they are. Models don't. Ask a model to find bugs, and it will find them, whether the code has any or not. Findings come back hedged with 'possibly,' 'potentially,' 'could in theory,' and the hedged findings vastly outnumber the solid ones."

The wiki canonicalises this in the AI-vulnerability-research context, but the failure mode generalises to any exploratory-prompt + LLM combination.

Why this isn't quite hallucination¶

The wiki's existing LLM hallucination concept covers fabricated facts. Model bias toward finding-something is subtly different: the model isn't fabricating a fact about the world; it's performing the task it was asked to do even on inputs where the right answer is "nothing here".

Property	Hallucination	Find-something bias
What's wrong	Fact is false	Finding is unwarranted
Trigger	Out-of-distribution query	Exploratory-task framing
The model "thinks" it's	Right	Doing its job
Mitigation	Grounding, retrieval	Calibration, hedge filtering

The two can compound — a hallucinated fact embedded inside a hedged finding is the worst case — but they are operationally distinct failure modes.

Where the bias comes from (hypotheses)¶

Cloudflare's framing is not mechanistic; the "reasonable bias for an exploratory tool" description suggests the behaviour is selected for during training because it produces helpful-looking output in the median case. Plausible training-time pressures:

RLHF reward shape. A response of "I don't see anything" on a "find bugs" prompt looks lazy/unhelpful to a human rater, biasing reward toward the model that produces a hedged finding instead.
Over-generalisation from positive examples. Training data heavily features "find bugs in code → here's a bug" trajectories; "find bugs in code → there are none" is rare.
Policy gradient toward verbose outputs. Longer outputs are sometimes higher-rated; an empty finding list is pathologically short.

Why the hedges don't actually reduce cost¶

Cloudflare's verbatim cost framing:

"That's a reasonable bias for an exploratory tool. It's a ruinous one for a triage queue, where every speculative finding spends human attention and tokens to dismiss, and that cost compounds across thousands of findings."

A hedge like "this could potentially be a buffer overrun" does not let the human skip reading the code. The human still has to verify. The hedge only reduces the probability the finding is real, not the cost of dismissing it. At volume, the integrated cost of dismissing thousands of hedged findings dominates the cost of triaging the few real ones.

This is why the signal-to-noise problem cannot be fixed by "prompt the model to be more careful" — the model produces hedged findings because it's being careful. The fix has to be external to the finder: PoC-attached findings, adversarial validators, dedup.

Architectural levers against the bias¶

Demand a proof of exploitability: shift the verification cost from human reading to model-driven runtime reproduction. A hedged finding without a PoC is dropped.
Adversarial validator with no finding-emission ability: a second agent's job is to refute, not augment. Without the no-emit constraint the validator multiplies the bias rather than checking it.
Calibration prompts (less effective): "if you're not sure, say so explicitly" — observed to help marginally but not collapse the bias.
Two-stage filtering: "are these findings real?" as a separate prompt to a fresh-context agent — overlaps with the adversarial-validator pattern.

Sibling failure modes on the wiki¶

concepts/agent-hyperfixation-failure-mode — the commitment failure (model commits to first hypothesis and refuses to back off). Find-something bias is the emission failure (model emits findings even when there are none). Both share the underlying issue that the model's internal uncertainty is not reflected in its output behaviour.
concepts/model-organic-refusal-inconsistency — a sibling on the opposite side: where find-something bias over-emits, refusal inconsistency over-blocks (or inconsistently blocks). Both demonstrate the model's decisions over its own behaviour are unreliable in reproducible ways.
concepts/adversarial-review-persona — names the cooperative dual: "please review this PR" prompts bias toward agreement; the adversarial-persona prompt inverts it. Different surface (review vs find) but the same underlying calibration-deficit mechanism.

Where this concept generalises beyond vuln research¶

The same bias shows up wherever a model is asked an exploratory question:

Code review: the "please review this PR" default produces validation; "red-team this PR" produces challenge. (See concepts/adversarial-review-persona.)
Architecture design review: "any concerns with this design?" produces hedged concerns; "if this design fails, how would it fail?" produces concrete failure modes.
Testing: "are there test cases I missed?" tends to generate test cases regardless of coverage status.
Postmortems: open-ended "what went wrong?" prompts generate plausible causal stories even on incidents with thin evidence (see concepts/surface-attribution-error).

Seen in¶

sources/2026-05-18-cloudflare-project-glasswing-what-mythos-showed-us — canonical wiki articulation, in the AI-vulnerability- research context.

concepts/llm-hallucination — adjacent failure mode.
concepts/signal-to-noise-in-ai-vulnerability-triage — the operational consequence at fleet scale.
concepts/proof-of-exploitability — the in-finding lever.
concepts/false-positive-management — the detection-system sibling.
concepts/agent-hyperfixation-failure-mode — sibling calibration failure.
concepts/adversarial-review-persona — cooperation-bias dual.
patterns/adversarial-review-subagent — the architectural pattern that filters the bias's output.