CONCEPT Cited by 1 source

Cheapest-capable model routing¶

Definition¶

Cheapest-capable model routing is the per-task discipline of classifying every AI-agent task and routing it to the cheapest model that can actually do the task, rather than always using the most capable model available. The premise is that most agent work doesn't need a frontier model: classification, code generation, refactoring, and summarisation are handled well by smaller local models, and the expensive cloud models are a fallback for the hard cases, not the default.

"What's the cheapest model that can do this task? Most agent work doesn't need a frontier model. Classification, code generation, refactoring, and summarization — a local model running on hardware I already own handles them fine. The expensive cloud models are a fallback for the hard cases, not the default. So every task gets classified and routed to the cheapest model that can actually do it, and the result gets checked before it counts as done." — Source: sources/2026-06-02-redpanda-how-omninode-uses-redpanda-to-scale-ai-agent-workflows

The two-axis cost framing¶

Cheapest-capable routing makes economic sense because the cost gap between local and cloud models has two axes that compound:

Marginal cost per token — local models running on hardware the team already owns have zero marginal cost; cloud models charge per input and output token. The cost gap is large.
Capability gap is task-conditional — for a meaningful fraction of tasks, the local-model output is good enough (passes compliance checks, contains required citations, doesn't hallucinate identifiers). For those tasks, the capability gap is zero at the task's quality bar.

When the capability gap is zero at the bar, paying the cost gap is pure waste. OmniNode's disclosed numbers for one week make the economics explicit:

75% of tokens never left the building (routed to four on-prem hosts at zero marginal cost).
$3.37 in cloud spend was avoided, compared to $2.43 actually spent.
At a larger scale, that ratio is the whole business case.
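
The ratio behind those two dollar figures can be worked through in a few lines. This is a minimal sketch, not OmniNode's accounting: it assumes the avoided figure is what the locally served tokens would have cost at cloud prices, so the two figures sum to a hypothetical all-cloud bill. The variable names are illustrative.

```python
# Figures quoted from the OmniNode post for a single week.
cloud_spend_avoided = 3.37  # USD kept off the cloud bill by on-prem routing
cloud_spend_actual = 2.43   # USD actually paid to cloud providers

# Assumption: avoided spend is the cloud-price cost of the locally served
# tokens, so avoided + actual approximates an all-cloud bill for the week.
all_cloud_bill = cloud_spend_avoided + cloud_spend_actual
savings_fraction = cloud_spend_avoided / all_cloud_bill
avoided_per_dollar_spent = cloud_spend_avoided / cloud_spend_actual

print(f"hypothetical all-cloud bill: ${all_cloud_bill:.2f}")          # $5.80
print(f"share of that bill avoided:  {savings_fraction:.1%}")         # 58.1%
print(f"avoided per dollar spent:    ${avoided_per_dollar_spent:.2f}")  # $1.39
```

Under that assumption, roughly 58% of the week's would-be cloud bill was avoided, or about $1.39 avoided for every dollar actually spent.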

Required co-mechanisms¶

Cheapest-capable routing only works if you have:

A classifier that decides which model class each task needs. The OmniNode post characterises the input as "every task gets classified" but does not disclose the classifier shape.
An on-prem fleet capable of handling the local-class load with acceptable latency. OmniNode's "four on-prem hosts at zero marginal cost" is the disclosed footprint.
A quality bar that's checkable — "the result gets checked before it counts as done." Without a bar, you can't tell whether the cheap model actually delivered.
Auto-escalation — see patterns/auto-escalation-on-quality-failure — when the cheap model fails the bar, the task automatically escalates to a stronger model. OmniNode's disclosed escalation rate: 1.3% of delegations.
A routing receipt — every routing decision produces an audit-grade record of the model chosen, tokens used, cost, and whether the output passed the compliance check. See concepts/routing-receipt.
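
The way these co-mechanisms fit together can be sketched as a single routing loop. This is an illustrative sketch only, not OmniNode's implementation: `ModelTier`, `Receipt`, `route`, the per-1k-token prices, and the word-count tokenizer are all hypothetical, and the task class is assumed to have been assigned by an upstream classifier, since the source does not disclose one.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ModelTier:
    name: str
    usd_per_1k_tokens: float     # 0.0 models on-prem hardware already owned
    run: Callable[[str], str]    # hypothetical inference callable

@dataclass(frozen=True)
class Receipt:
    task_class: str
    model: str
    tokens: int
    cost_usd: float
    passed: bool

def route(task_class, prompt, tiers_by_class, passes_bar, count_tokens):
    """Try capable tiers from cheapest to priciest; emit a receipt per attempt."""
    receipts = []
    tiers = sorted(tiers_by_class[task_class], key=lambda t: t.usd_per_1k_tokens)
    for tier in tiers:
        output = tier.run(prompt)
        tokens = count_tokens(prompt) + count_tokens(output)
        cost = tokens / 1000 * tier.usd_per_1k_tokens
        passed = passes_bar(task_class, output)
        receipts.append(Receipt(task_class, tier.name, tokens, cost, passed))
        if passed:
            return output, receipts   # cheapest tier that met the bar
        # Otherwise fall through: auto-escalate to the next tier.
    return None, receipts             # every tier failed; caller decides

# Usage with stand-in models: the local tier fails the bar, the cloud tier passes.
local = ModelTier("local-small", 0.0, lambda p: "ok")
cloud = ModelTier("cloud-frontier", 0.01, lambda p: "a much longer, careful answer")
tiers = {"summarise": [cloud, local]}
output, receipts = route(
    "summarise",
    "Summarise this incident report",
    tiers,
    passes_bar=lambda task_class, out: len(out) >= 10,  # toy quality bar
    count_tokens=lambda text: len(text.split()),        # toy tokenizer
)
for r in receipts:
    print(r)
```

Each attempt produces a receipt whether it passes or fails, so the escalation itself is part of the audit trail rather than something inferred after the fact.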

Why "cheap routing only works if you can trust it"¶

The OmniNode framing is precise: cheapest-capable routing is useless without a verification step. "Cheap routing only works if you can trust it, and that is where the contracts come back in." The contract is the declared quality bar; the receipt is the evidence that it was met. Without both, cheapest-capable routing degenerates into hopeful cost reduction: "hand work to the cheapest model and hope it went well."

The architectural pairing that makes it production-shaped:

Decision = the routing contract (classification → model class → quality bar).
Evidence = the receipt (which model ran, what it cost, whether the output passed).
Recovery = auto-escalation when the evidence shows the bar wasn't met.

OmniNode's slogan: "The decision is a contract. The receipt is the evidence. Neither lives in someone's head."

Sibling framings on the wiki¶

concepts/multi-llm-sub-agent-routing — Databricks Genie's per-subagent model selection. Routes tasks to different models based on the task type, not the cost. OmniNode's framing folds cost into the routing decision as a primary axis.
concepts/non-uniform-llm-request-cost — Databricks Axon's framing of non-uniform per-request cost as a load-balancing problem. Sibling to OmniNode's per-task cost-axis routing but at a different altitude (capacity / scheduling rather than model selection).
patterns/complexity-tiered-model-selection — Vercel's knowledge-agent framing of tiered model selection by query complexity. Same shape as OmniNode's cheapest-capable framing, applied to a single product workflow rather than a generic agent platform.
patterns/multi-cloud-llm-serving — Slack's multi-cloud framing for LLM serving. Sibling at the provider axis; OmniNode is at the per-task model selection axis. The two compose: cheapest-capable routing inside a multi-cloud routing envelope.

Caveats¶

The classifier itself costs something to run; whether the classifier is a small model or a heuristic isn't disclosed. The classifier-cost tax is most visible for very small tasks, where classification is a larger share of the total work.
The bar's calibration matters more than the routing logic. Set the bar too low and cheap-model output passes when it shouldn't; set it too high and everything escalates and the cost savings disappear. The OmniNode post sketches the bar ("output is too short, missing citations, or hallucinated identifiers") but doesn't disclose the calibration mechanism.
Latency profile is asymmetric: on-prem inference may be slower than the cloud fallback, especially on cold-start. The routing decision is cost-and-quality, not cost-and-latency.
Cost attribution requires the receipt. Without per-decision receipts, cheapest-capable routing produces a mixed-fleet bill that's hard to explain. See concepts/routing-receipt.
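
The three failure conditions the post sketches for the bar can be written as a small checker, which also makes the calibration problem concrete. This is a hedged sketch under stated assumptions, not the OmniNode check: `check_output`, the `min_chars` threshold, the bracketed-number citation pattern, and the convention that backticked tokens are claimed identifiers are all hypothetical choices.

```python
import re

def check_output(output, known_identifiers, min_chars=200,
                 citation_pattern=r"\[\d+\]"):
    """Return the list of bar failures; an empty list means the output passes."""
    failures = []
    # Too short: min_chars is the calibration knob for this condition.
    if len(output) < min_chars:
        failures.append("too_short")
    # Missing citations: assumes citations look like [1], [2], ...
    if not re.search(citation_pattern, output):
        failures.append("missing_citations")
    # Hallucinated identifiers: assumes backticked tokens are identifiers
    # the output claims exist, checked against a known set.
    mentioned = set(re.findall(r"`([^`]+)`", output))
    if mentioned - set(known_identifiers):
        failures.append("hallucinated_identifiers")
    return failures

known = {"route_task", "Receipt"}
good = "The `route_task` function returns a `Receipt` for each attempt [1]."
bad = "Call `route_tasks` to retry."

print(check_output(good, known, min_chars=40))  # []
print(check_output(bad, known, min_chars=40))
# ['too_short', 'missing_citations', 'hallucinated_identifiers']
```

Raising `min_chars` or tightening the citation pattern pushes more outputs into escalation; loosening them lets weaker outputs through, which is the trade-off the calibration caveat describes.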

Seen in¶

sources/2026-06-02-redpanda-how-omninode-uses-redpanda-to-scale-ai-agent-workflows (2026-06-02, OmniNode founder Jonah Gray on Redpanda Blog) — canonical disclosure source. Provides the per-task classify-then-route framing, the frontier-model-as-fallback stance, the decision/receipt/recovery triad, and concrete week-of metrics: 75% on-prem token routing, $3.37 saved vs $2.43 spent, 1.3% escalation rate. Pairs explicitly with concepts/routing-receipt ("the receipt is the evidence") and patterns/auto-escalation-on-quality-failure.

Related¶

concepts/routing-receipt — the evidence layer the routing decision produces.
patterns/auto-escalation-on-quality-failure — the recovery mechanism.
concepts/multi-llm-sub-agent-routing — sibling task-conditioned routing framing.
concepts/non-uniform-llm-request-cost — sibling cost-axis framing at the load-balancer altitude.
patterns/complexity-tiered-model-selection — sibling tiered-selection pattern.
systems/omninode — the canonical wiki adopter.