CONCEPT Cited by 1 source

Model organic refusal inconsistency

Definition

Model organic refusal inconsistency is the failure mode where a frontier LLM's emergent (unprogrammed) refusals of sensitive requests are real but not reproducible: the same logical request, framed differently or executed in a different context, can produce opposite outcomes.

The canonical articulation comes from Cloudflare's 2026-05-18 Project Glasswing writeup on Mythos Preview running without GA safeguards:

"the model organically pushes back on certain requests — much like the cyber capabilities that made it useful for vulnerability hunting, the model has its own emergent guardrails that sometimes cause it to push back on legitimate security research requests. But as we found, these organic refusals aren't consistent — the same task, framed differently or presented in a different context, could produce completely different outcomes."

Concrete examples observed

Cloudflare describes three repeated failure-mode shapes:

Context-flip refusal. "The model initially refused to do vulnerability research on a project, then agreed to perform the same research on the same code after an unrelated change to the project's environment. Nothing about the code being analyzed had changed."
Phase-flip refusal. "The model found and confirmed several serious memory bugs in a codebase, and then refused to write a demonstration exploit." The model accepted the find phase but rejected the exploit phase of the same pipeline, on the same artefacts.
Probabilistic non-determinism. "the same request can produce different outcomes across runs due to the probabilistic nature of the model. Semantically equivalent tasks can produce opposite outcomes depending on how and when they're presented to the model."

Why this is organic and not policy-driven

Cloudflare's framing distinguishes policy-driven refusals (safeguards explicitly trained or system-prompted) from organic refusals. The latter are "emergent guardrails" that the model develops without explicit programming. Cloudflare likens them to the "cyber capabilities that made it useful for vulnerability hunting": both emerge from broad training rather than from explicit rule sets.

The two are related but operationally distinct:

Property	Policy refusal	Organic refusal
Source	Explicit training / system prompt	Emergent from broader training
Reproducibility	High (same prompt → same refusal)	Low (framing-sensitive, run-to-run variable)
Coverage	Bounded by policy	Unbounded but inconsistent
Bypass	Documented (jailbreak research)	Often accidental (rephrasing flips it)

Why this matters operationally

Cloudflare's stated position — load-bearing for the "cyber frontier model" class:

"the model's organic refusals/guardrails are real, they aren't consistent enough to serve as a complete safety boundary on their own. That's precisely why any capable cyber frontier model made generally available in the future must include additional safeguards on top of this baseline behavior."

The conclusion drives a defense-in-depth posture toward LLM safeguards: organic refusal + trained-policy refusal + operational guardrails outside the model are stacked because none alone is reliable.
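
The stacking argument can be illustrated with a small back-of-the-envelope calculation. The sketch below is not from the Cloudflare post: the layer names mirror the three layers above, but the miss rates are hypothetical numbers, and the calculation assumes the layers fail independently, which is a strong and probably optimistic assumption.

```python
from math import prod


def combined_miss_rate(miss_rates):
    """Chance that every layer lets a request through.

    ASSUMPTION: layers fail independently, so the miss rates multiply.
    """
    rates = list(miss_rates)
    if not rates:
        raise ValueError("at least one layer is required")
    if any(not 0.0 <= r <= 1.0 for r in rates):
        raise ValueError("each miss rate must lie in [0, 1]")
    return prod(rates)


# Hypothetical per-layer miss rates, chosen only for illustration.
LAYERS = {
    "organic_refusal": 0.30,
    "trained_policy_refusal": 0.05,
    "operational_guardrail": 0.02,
}

overall = combined_miss_rate(LAYERS.values())
print(f"combined miss rate: {overall:.4%}")
```

If failures are correlated (for example, a framing that slips past both in-model layers at once), the real combined miss rate is higher than this product, which is one reason to keep at least one layer outside the model.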

Why inconsistency is structural, not a bug

The Cloudflare post names the structural cause directly: "the probabilistic nature of the model". Three properties compound:

Sampling stochasticity — temperature > 0 makes outputs vary run-to-run on the same input.
Framing sensitivity — semantically equivalent prompts occupy different points in the model's input distribution, and the model's learned refusal behaviour differs across them.
Context dependence — preceding tokens shift the model's state; the same final query inside different conversations resolves differently.

Sufficient training on refusal examples can shrink the inconsistency band but cannot collapse it to zero while the model remains probabilistic.
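
The first two properties can be illustrated with a toy model. The sketch below is a deliberate simplification and not how any real LLM decides to refuse: it collapses the whole response into a single two-way choice between "refuse" and "comply", and the logit values for the two hypothetical framings are made up. It only shows that with temperature > 0 the same input yields both outcomes, and that a small logit shift from reframing changes how often each occurs.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 for sampling")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_outcome(refuse_logit, comply_logit, temperature, rng):
    """Sample 'refuse' or 'comply' from the toy two-outcome distribution."""
    p_refuse = softmax([refuse_logit, comply_logit], temperature)[0]
    return "refuse" if rng.random() < p_refuse else "comply"


# Hypothetical logits: two framings of the "same" request.
FRAMINGS = {"framing_a": (1.0, 0.0), "framing_b": (0.0, 1.0)}

rng = random.Random(0)  # fixed seed so the demo is repeatable
RUNS = 1000
rates = {}
for name, (refuse_logit, comply_logit) in FRAMINGS.items():
    outcomes = [sample_outcome(refuse_logit, comply_logit, 1.0, rng) for _ in range(RUNS)]
    rates[name] = outcomes.count("refuse") / RUNS

for name, rate in rates.items():
    print(f"{name}: refused in {rate:.1%} of {RUNS} runs")
```

In this toy setup neither framing is refused always or never, which is the shape of the inconsistency described above; lowering the temperature sharpens the split but only the limit of zero temperature removes the run-to-run variation.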

Implications for AI-system design

Don't rely on organic refusal as the only safety layer. Even if the model refuses in the typical case, the inconsistent outliers mean that a determined operator (or, more practically, a careless one) will eventually hit a phrasing that is not refused.
Stack with explicit safeguards. Cloudflare's own post is exhibit A: the program (Project Glasswing) operates with the model in "a controlled research context", so the operational boundary substitutes for the unreliable in-model boundary. Future GA cyber frontier models "must include additional safeguards", which is the architectural precondition for moving these models into broader use.
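Distinguish refusal from non-determinism in evaluation. When evaluating whether a model is "safe enough", measure the refusal rate over N runs of the same prompt rather than trusting a single run. For a request that should be refused, any rate below 100 % is an inconsistency datum, and any rate strictly between 0 % and 100 % shows run-to-run variation on that prompt.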
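
A repeated-run measurement of this kind could be sketched as follows. The names `run_trials`, `refusal_stats`, `fake_model` and `naive_is_refusal` are all hypothetical and introduced only for illustration; in practice `query_model` would call a real model endpoint and `is_refusal` would need a far more careful judgement than a string prefix check.

```python
import itertools


def run_trials(query_model, is_refusal, prompt, n):
    """Send the same prompt n times and record whether each reply is a refusal."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [bool(is_refusal(query_model(prompt))) for _ in range(n)]


def refusal_stats(refused_flags):
    """Summarise a list of booleans (True = refused) for one prompt."""
    runs = len(refused_flags)
    if runs == 0:
        raise ValueError("no runs to summarise")
    refused = sum(1 for flag in refused_flags if flag)
    return {
        "runs": runs,
        "refused": refused,
        "rate": refused / runs,
        # Mixed outcomes on an identical prompt are the inconsistency signal.
        "inconsistent": 0 < refused < runs,
    }


# Stand-ins for a real model and a real refusal judge (assumptions for the demo).
_canned_replies = itertools.cycle(["I can't help with that.", "Sure, here is the analysis."])


def fake_model(prompt):
    return next(_canned_replies)


def naive_is_refusal(reply):
    return reply.startswith("I can't")


stats = refusal_stats(run_trials(fake_model, naive_is_refusal, "same prompt", 10))
print(stats)
```

Reporting the raw counts alongside the rate keeps the n-of-N figure visible, which is the quantity the source post does not disclose.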

Sibling concepts

concepts/prompt-injection — adversarial control of the model's input. Refusal inconsistency is the defender-side dual: even without an adversarial input, framing variation alone can flip the refusal.
concepts/ai-agent-guardrails — the explicit guardrail layer, of which organic refusals are an emergent precursor.
concepts/defense-in-depth — the architectural posture refusal inconsistency motivates for LLM safeguards.

Open / not disclosed

Refusal-rate quantification — "sometimes", "different outcomes across runs"; no n-of-N numbers given.
Whether GA-tier Anthropic models (Opus 4.7) have lower inconsistency — the post compares the presence of additional safeguards but not the consistency of the underlying organic refusals.
Anthropic's own characterisation of this phenomenon — the wiki disclosure is via Cloudflare's external use-case retrospective, not Anthropic's own research output.

Seen in

sources/2026-05-18-cloudflare-project-glasswing-what-mythos-showed-us — first canonical wiki articulation; the load-bearing caveat justifying "controlled research context" distribution for cyber frontier models.

systems/mythos-preview — the model where the behaviour was observed.
systems/anthropic-project-glasswing — the controlled-context distribution model that compensates for the inconsistency.
concepts/cyber-frontier-model — the model class for which this caveat is a class invariant.
concepts/defense-in-depth — the posture stacking organic + policy + operational refusals.
concepts/ai-agent-guardrails — the explicit-guardrail layer.
concepts/prompt-injection — the adversarial-input sibling failure mode.