PATTERN Cited by 1 source

Auto-escalation on quality failure

Pattern

Run the cheapest viable model first. Check the output against a quality bar. If the output fails the bar, escalate the same task to a stronger (and more expensive) model — automatically, without human intervention.

The pattern makes cheapest-capable model routing production-shaped rather than hopeful. Without the auto-escalation step, cheapest-capable routing degenerates to "hand work to the cheap model and hope it went well." With it, the cheap model is the default first attempt; the strong model is the fallback when the bar isn't met; the routing decision is self-correcting at task granularity.
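The control flow described above can be sketched in a few lines. This is a minimal illustration, not OmniNode's implementation: the `Model` and `QualityCheck` callable shapes, the `Result` record, and the function name `run_with_escalation` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical shapes: a model maps a prompt to text; a check returns a
# failure reason, or None when the output meets the quality bar.
Model = Callable[[str], str]
QualityCheck = Callable[[str], Optional[str]]


@dataclass
class Result:
    output: str
    model_used: str                      # "cheap" or "strong"
    escalated: bool
    escalation_reason: Optional[str] = None


def run_with_escalation(
    prompt: str, cheap: Model, strong: Model, check: QualityCheck
) -> Result:
    """Try the cheap model first; rerun on the strong model if the bar is not met."""
    first_attempt = cheap(prompt)
    reason = check(first_attempt)
    if reason is None:
        return Result(first_attempt, "cheap", escalated=False)
    # Bar not met: escalate the same task, keeping the reason for the receipt.
    return Result(strong(prompt), "strong", escalated=True, escalation_reason=reason)
```

The check returns a reason rather than a boolean so that the escalation cause can be recorded alongside the result.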

"When a local model can't meet the bar (output is too short, missing citations, or hallucinated identifiers), the task automatically escalates to a stronger model. Last week, that happened on 1.3% of delegations, and the receipt shows exactly why each one escalated." — Source: sources/2026-06-02-redpanda-how-omninode-uses-redpanda-to-scale-ai-agent-workflows

Disclosed quality-bar criteria (OmniNode)

Three concrete failure modes that trigger escalation:

  1. Output is too short — the model didn't produce enough text to plausibly contain the answer. Mechanically detectable (length threshold per task class).
  2. Missing citations — the model didn't cite required sources for grounding-required tasks. Mechanically detectable (regex or grammar match against the expected citation format).
  3. Hallucinated identifiers — the model produced an identifier (function name, file path, ticket ID, person name) that doesn't exist in the system's known set. Mechanically detectable (lookup against the system's actual identifiers).

The shared property: all three are mechanically checkable without invoking another model. The auto-escalation gate is deterministic; it doesn't introduce a new LLM-judge dependency.
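As a rough sketch of what such deterministic checks could look like, the code below implements one check per failure mode. The citation format (`[src:...]`), the convention that identifiers appear in backticks, and every function name are assumptions made for illustration; the source does not disclose OmniNode's actual rules or thresholds.

```python
import re
from typing import Optional, Set

# Hypothetical formats: citations look like [src:some-id]; identifiers are
# wrapped in backticks. Neither convention comes from the source.
CITATION_RE = re.compile(r"\[src:[\w\-/]+\]")
IDENTIFIER_RE = re.compile(r"`([^`\n]+)`")


def check_length(output: str, min_chars: int) -> Optional[str]:
    """Fail when the output is shorter than the per-task-class threshold."""
    length = len(output.strip())
    if length < min_chars:
        return f"too_short: {length} < {min_chars} chars"
    return None


def check_citations(output: str, min_citations: int = 1) -> Optional[str]:
    """Fail when fewer citations than required match the expected format."""
    found = len(CITATION_RE.findall(output))
    if found < min_citations:
        return f"missing_citations: found {found}, need {min_citations}"
    return None


def check_identifiers(output: str, known: Set[str]) -> Optional[str]:
    """Fail when a quoted identifier is absent from the system's known set."""
    unknown = sorted({m for m in IDENTIFIER_RE.findall(output) if m not in known})
    if unknown:
        return "hallucinated_identifiers: " + ", ".join(unknown)
    return None


def quality_gate(output: str, min_chars: int, known: Set[str]) -> Optional[str]:
    """Run all checks in order; return the first failure reason, or None on pass."""
    checks = (
        check_length(output, min_chars),
        check_citations(output),
        check_identifiers(output, known),
    )
    return next((reason for reason in checks if reason is not None), None)
```

No model is invoked anywhere in the gate: each check is a length comparison, a regex match, or a set lookup.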

Why mechanical checks are the load-bearing choice

Many quality gates use an LLM-as-judge to evaluate the cheap model's output (see concepts/llm-as-judge). That works but introduces a circular cost problem: the judge invocation costs money, and the savings from cheapest-capable routing get partially eaten by the judge's cost.

Mechanical checks (length, regex, identifier-existence) are:

  • Free at runtime (no model invocation).
  • Deterministic (no judge-variance to calibrate).
  • Auditable (the rule is in code, not in a prompt).
  • Fast (no extra round-trip latency).

OmniNode's framing of "output is too short, missing citations, or hallucinated identifiers" as concrete check classes reads as a deliberate choice for mechanical-check gates over LLM-judge gates at the auto-escalation altitude.
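The circular cost problem can be made concrete with a small expected-cost calculation. Only the 1.3% escalation rate comes from the source; the unit prices below are hypothetical placeholders chosen to show the shape of the arithmetic, and the function name is an assumption.

```python
def expected_cost_per_task(
    cheap_cost: float,
    strong_cost: float,
    escalation_rate: float,
    gate_cost: float = 0.0,
) -> float:
    """Every task pays the cheap model plus the gate; escalated tasks also pay the strong model."""
    return cheap_cost + gate_cost + escalation_rate * strong_cost


# Hypothetical unit prices (arbitrary cost units), NOT disclosed by the source.
CHEAP, STRONG, JUDGE = 1.0, 20.0, 2.0
RATE = 0.013  # the disclosed 1.3% escalation rate

mechanical = expected_cost_per_task(CHEAP, STRONG, RATE)                   # gate is free
judged = expected_cost_per_task(CHEAP, STRONG, RATE, gate_cost=JUDGE)      # gate costs a judge call

# Savings relative to sending every task to the strong model.
savings_mechanical = STRONG - mechanical
savings_judged = STRONG - judged
```

Because the gate runs on every task while escalation happens on only a small fraction, the judge's per-call cost is paid in full each time and comes straight out of the savings.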

The decision-evidence-recovery pairing

Auto-escalation completes the OmniNode three-step:

| Step | Mechanism | What's recorded |
| --- | --- | --- |
| Decision | Cheapest-capable router | Which model was chosen |
| Evidence | Routing receipt | Tokens / cost / pass-or-fail |
| Recovery | Auto-escalation | Why escalation triggered |

When auto-escalation fires, the receipt records the escalation reason — "the receipt shows exactly why each one escalated" — and that reason can feed back into the classifier. If a task class escalates 30% of the time, the classifier should learn to send it to the strong model upfront.
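One way the feedback loop might work is to aggregate receipts per task class and flag classes whose escalation rate crosses a threshold. The `Receipt` fields, the `min_samples` guard, and the function names are assumptions for illustration; the 30% threshold mirrors the example figure above and is not a disclosed OmniNode setting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass
class Receipt:
    # Hypothetical receipt shape; the source does not disclose the schema.
    task_class: str
    escalated: bool
    escalation_reason: Optional[str] = None


def escalation_rates(receipts: Iterable[Receipt]) -> Dict[str, float]:
    """Fraction of tasks in each class that escalated to the strong model."""
    totals: Counter = Counter()
    escalations: Counter = Counter()
    for r in receipts:
        totals[r.task_class] += 1
        if r.escalated:
            escalations[r.task_class] += 1
    return {cls: escalations[cls] / totals[cls] for cls in totals}


def classes_to_route_strong(
    receipts: Iterable[Receipt], threshold: float = 0.30, min_samples: int = 20
) -> Set[str]:
    """Task classes that escalate often enough to go to the strong model upfront."""
    receipts = list(receipts)
    counts = Counter(r.task_class for r in receipts)
    rates = escalation_rates(receipts)
    # min_samples avoids promoting a class on the basis of a handful of tasks.
    return {
        cls for cls, rate in rates.items()
        if counts[cls] >= min_samples and rate >= threshold
    }
```

The output is a set of task classes the router can treat as strong-first on its next configuration update.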

Disclosed metrics

OmniNode's escalation rate for the reported week: 1.3% of delegations.

The number is the headline economic argument for the pattern: at this escalation rate, the 98.7% of tasks that completed on the cheap model delivered work without paying frontier-model prices. 75% of all tokens stayed on-prem. "At a larger scale, that ratio is the whole business case."

Sibling patterns on the wiki

  • patterns/model-fallback-hierarchy-with-circuit-breaker — Slack's multi-cloud model fallback for availability failures (rate limit, provider outage). OmniNode's auto-escalation is the quality-failure sibling: same shape, different trigger.
  • patterns/agent-skill-with-fallback-chain — Atlassian's fallback chain for agent skills. Sibling at the task-class altitude rather than the model altitude.
  • patterns/complexity-tiered-model-selection — Vercel's upfront tiered model selection by query complexity. The alternative to auto-escalation: predict capability requirements ahead of time rather than discover them after a quality failure. The two patterns can coexist (predict tier upfront, escalate on failure).
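The coexistence of upfront tier prediction and escalation on failure can be sketched as a single routing function. Everything here is hypothetical: the `strong_first` set stands in for whatever an upfront classifier would produce, and the function name and return shape are assumptions, not an API from any of the cited posts.

```python
from typing import Callable, Optional, Set, Tuple

Model = Callable[[str], str]
QualityCheck = Callable[[str], Optional[str]]


def route(
    prompt: str,
    task_class: str,
    strong_first: Set[str],
    cheap: Model,
    strong: Model,
    check: QualityCheck,
) -> Tuple[str, str, Optional[str]]:
    """Return (output, model_used, escalation_reason)."""
    # Upfront prediction: classes known to need the strong model skip the cheap attempt.
    if task_class in strong_first:
        return strong(prompt), "strong", None
    # Otherwise try cheap first and escalate only on a quality failure.
    output = cheap(prompt)
    reason = check(output)
    if reason is None:
        return output, "cheap", None
    return strong(prompt), "strong", reason
```

A `None` reason paired with `"strong"` marks a task that was predicted to need the strong model, which keeps it distinguishable from one that escalated after failing the bar.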

Trade-offs

Latency tax on escalated tasks: a task that escalates pays both the cheap-model latency and the strong-model latency. Acceptable when escalation rate is low (1.3% in OmniNode's case); expensive when the rate is high. The pattern is not a substitute for a well-calibrated classifier — it's a safety net.

Quality-bar calibration is the hardest part: too lax and bad output passes; too strict and everything escalates. The OmniNode post discloses three example checks but doesn't disclose:

  • Per-task-class threshold tuning.
  • False-pass / false-fail rates.
  • How the bar is updated when new failure modes are discovered.

No second-level escalation disclosed: the OmniNode pattern is binary (cheap → strong). Whether the strong model's output is checked against a higher bar before counting as done isn't disclosed.

Seen in

  • sources/2026-06-02-redpanda-how-omninode-uses-redpanda-to-scale-ai-agent-workflows