Multi-LLM sub-agent routing¶
Multi-LLM sub-agent routing is an agent architecture in which different sub-agents inside a single agent system use different LLMs, each chosen for its specific sub-task (planning, search, code generation, judging) based on observed complementary capability profiles. The 2026-05-08 Databricks post on Genie presents it as a named architectural advance, alongside parallel thinking and specialised knowledge search.
The structural property that makes it possible: agent sub-tasks have complementary capability profiles that no single LLM optimises across, and the platform makes it cheap to swap models per sub-agent.
The architectural property¶
              ┌───────────────────────────────────────────┐
              │           User question / query           │
              └─────────────────────┬─────────────────────┘
                                    ▼
                           Planning sub-agent
                      (LLM A: high-level reasoning)
                                    │
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                     ▼
      Search sub-agent     Code-gen sub-agent      Judge sub-agent
        (LLM B: fast         (LLM C: strong         (LLM D: high
      retrieval-tuned)       SQL synthesis)        precision eval)
              ▼                     ▼                     ▼
              └─────────────────────┼─────────────────────┘
                                    ▼
                           Aggregator → Answer
Each labelled node between the user query and the aggregator is a sub-agent, and each independently picks the best LLM (commercial frontier, open-source, or custom-trained) for its slice. The platform property "seamless to try out any of the frontier models" makes per-sub-agent assignment a tractable engineering decision rather than a research project.
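The per-sub-agent assignment described above can be sketched as a plain role-to-model mapping. This is a minimal illustration, not Genie's implementation: `make_stub`, `SubAgent`, `build_agents`, `answer`, and the `llm-a` to `llm-d` names are all hypothetical, and the stub models simply echo their prompt so the example runs without any model API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Any callable from prompt text to completion text can stand in for a model.
LLM = Callable[[str], str]


def make_stub(name: str) -> LLM:
    """Hypothetical stand-in for a real model client; tags output with its name."""
    def call(prompt: str) -> str:
        return f"[{name}] {prompt}"
    return call


@dataclass
class SubAgent:
    role: str
    llm: LLM  # the model assigned to this sub-agent only

    def run(self, task: str) -> str:
        return self.llm(f"{self.role}: {task}")


def build_agents(assignment: Dict[str, LLM]) -> Dict[str, SubAgent]:
    """Turn a role -> model mapping into one sub-agent per role."""
    return {role: SubAgent(role, llm) for role, llm in assignment.items()}


def answer(question: str, agents: Dict[str, SubAgent]) -> Dict[str, str]:
    """Plan first, then fan the plan out to the three downstream sub-agents."""
    plan = agents["planning"].run(question)
    outputs = {role: agents[role].run(plan) for role in ("search", "codegen", "judge")}
    outputs["planning"] = plan
    return outputs  # a real aggregator would merge these into a single answer


# Hypothetical assignment: placeholder names, not real model identifiers.
ASSIGNMENT: Dict[str, LLM] = {
    "planning": make_stub("llm-a"),
    "search": make_stub("llm-b"),
    "codegen": make_stub("llm-c"),
    "judge": make_stub("llm-d"),
}

agents = build_agents(ASSIGNMENT)
result = answer("top customers by revenue", agents)
```

Under this sketch, trying a different model for one sub-agent is a one-entry change to the mapping, for example `build_agents({**ASSIGNMENT, "search": make_stub("llm-x")})`, which is the kind of cheap swap the platform property is meant to enable.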
Why no single LLM is best across all sub-tasks¶
The Databricks post observes: "different LLMs are good at complementary capabilities... different LLMs result in very different latency and cost characteristics." Concretely:
| Sub-agent | What it needs | Some LLMs excel at |
|---|---|---|
| Planning | Multi-step reasoning, decomposition, tool-call orchestration | Frontier reasoning models (Opus, GPT, Gemini) |
| Search | Fast pattern matching, schema understanding | Smaller / faster models with retrieval tuning |
| Code generation | Strong SQL synthesis, dialect awareness, schema grounding | Code-specialised models or larger general models |
| Judging | Calibrated quality assessment, accurate disagreement detection | High-precision evaluator models |
A single-LLM agent forces one model to do all four: either it pays a frontier model's reasoning cost on simple search calls, or it accepts a fast, retrieval-tuned model's weaker reasoning on planning calls.
Three-axis simultaneous improvement¶
The post claims Multi-LLM (combined with parallel thinking and specialised knowledge search) drives:
| Axis | Direction |
|---|---|
| Accuracy | ↑ (32% → >90% vs leading coding agent baseline) |
| Cost | ↓ (significantly reduced) |
| Latency | ↓ (significantly reduced) |
This is unusual. The typical assumption is that more sophisticated agent design trades cost / latency for accuracy (more model calls = more cost). Multi-LLM beats this by:
- Using expensive frontier models only where they pay off (planning, judging) — not across the whole pipeline.
- Using fast, cheap, narrowly-tuned models for the high-volume sub-tasks (search, simple retrieval).
- Combining with GEPA prompt optimisation, which closes accuracy gaps left by smaller or cheaper models on their assigned sub-tasks.
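The cost argument in the list above can be made concrete with a toy calculation. Every number here is an invented placeholder in arbitrary units (the source gives no per-call prices or call counts), and the tier names and `query_cost` helper are hypothetical; the point is only the shape of the comparison.

```python
# Hypothetical per-call cost of each model tier, in arbitrary units.
COST_PER_CALL = {"frontier": 10.0, "code": 4.0, "small": 1.0}

# Hypothetical number of calls each sub-task makes per user query.
CALLS_PER_QUERY = {"planning": 1, "search": 8, "codegen": 2, "judge": 1}


def query_cost(tier_by_subtask: dict) -> float:
    """Total cost of one query given which tier serves each sub-task."""
    return sum(
        calls * COST_PER_CALL[tier_by_subtask[subtask]]
        for subtask, calls in CALLS_PER_QUERY.items()
    )


# Single-LLM agent: the frontier model handles every sub-task.
single = {subtask: "frontier" for subtask in CALLS_PER_QUERY}

# Routed agent: frontier only for planning and judging.
routed = {
    "planning": "frontier",
    "search": "small",
    "codegen": "code",
    "judge": "frontier",
}

single_cost = query_cost(single)  # 12 calls * 10.0 = 120.0
routed_cost = query_cost(routed)  # 10.0 + 8.0 + 8.0 + 10.0 = 36.0
```

With these made-up figures the saving comes almost entirely from the high-volume search calls, which is why the list singles them out for cheap models.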
GEPA's role¶
The post explicitly references GEPA in connection with table-search sub-agents: "the corresponding accuracy and cost can be further optimized using methods like GEPA." GEPA is the prompt-optimisation method that closes the gap between "this LLM is best at this sub-task" and "this LLM with the best prompt is best at this sub-task." Combining (a) per-sub-agent model selection with (b) per-sub-agent prompt optimisation is what delivers the simultaneous improvement on all three axes.
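To show where prompt optimisation slots in per sub-agent, here is a deliberately naive stand-in: it scores a fixed list of candidate prompt templates on a small labelled set and keeps the best one. This is not GEPA and makes no claim about how GEPA works; `toy_llm`, `TEMPLATES`, `EXAMPLES`, `score`, and `best_prompt` are all hypothetical names for illustration.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]
Example = Tuple[str, str]  # (input text, expected output)


def toy_llm(prompt: str) -> str:
    """Fake table-search model: terse only when the prompt asks for it."""
    last_word = prompt.split()[-1]
    if prompt.startswith("Table name only:"):
        return last_word
    return f"I think the table is {last_word}"


def score(llm: LLM, template: str, examples: List[Example]) -> float:
    """Fraction of examples where the model output exactly matches the label."""
    hits = sum(llm(template.format(x=x)).strip() == y for x, y in examples)
    return hits / len(examples)


def best_prompt(llm: LLM, templates: List[str], examples: List[Example]) -> str:
    """Exhaustive search over candidates; ties go to the earliest template."""
    return max(templates, key=lambda t: score(llm, t, examples))


TEMPLATES = ["Find the table for: {x}", "Table name only: {x}"]
EXAMPLES = [("which table has orders", "orders"), ("find table customers", "customers")]

chosen = best_prompt(toy_llm, TEMPLATES, EXAMPLES)
```

Running the same selection separately for each sub-agent's model and evaluation set is the per-sub-agent part; a real optimiser would generate and refine candidates rather than pick from a fixed list.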
Distinguishing Multi-LLM from related shapes¶
| Shape | Distinguishing property |
|---|---|
| Multi-LLM sub-agent routing (this concept) | Different LLMs for different sub-tasks within one agent system; per-sub-agent prompt optimisation |
| concepts/llm-cascade | Same task, escalation chain — try cheap model first, escalate to expensive only on failure |
| concepts/multi-llm-debate | Multiple LLMs argue the same task, either adversarially or to reach consensus |
| Mixture-of-experts (model-internal) | Within a single model, different experts activate per token; not multi-model |
| concepts/objective-abstraction (model-serving) | Routing layer abstraction that lets clients pick a model — not internal sub-agent decomposition |
The distinguishing axis: Multi-LLM sub-agent routing is internal to one agent's design, across its sub-tasks; the others operate at different altitudes (escalation, debate, model-internal, client-facing).
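The difference between an LLM cascade and sub-agent routing is easiest to see side by side. The sketch below uses hypothetical stub models and helper names (`cascade`, `route`, `cheap`, `strong`, `accept`); it illustrates the two control flows only.

```python
from typing import Callable, Dict, List, Optional

LLM = Callable[[str], str]


def cascade(task: str, models: List[LLM], accept: Callable[[str], bool]) -> Optional[str]:
    """LLM cascade: one task, models tried cheapest-first, escalating on failure."""
    output = None
    for llm in models:
        output = llm(task)
        if accept(output):
            break  # stop at the first acceptable answer
    return output  # None only if no models were supplied


def route(subtask: str, task: str, model_for: Dict[str, LLM]) -> str:
    """Sub-agent routing: the sub-task type alone fixes the model; no escalation."""
    return model_for[subtask](task)


# Hypothetical stub models.
def cheap(task: str) -> str:
    return "unsure"


def strong(task: str) -> str:
    return f"answer({task})"


def accept(output: str) -> bool:
    return output != "unsure"


cascaded = cascade("q", [cheap, strong], accept)
routed = route("search", "q", {"search": cheap, "planning": strong})
```

In the cascade the model is chosen at run time by the quality of the output; in routing it is chosen at design time by the role of the sub-agent.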
When this fits / doesn't¶
Fits:
- Agent has clearly separable sub-tasks with different capability profiles.
- Platform makes model swapping cheap (e.g., Databricks' AI Gateway, unified inference plane).
- High-volume queries make the per-call cost optimisation worth the engineering investment.
- Prompt-optimisation tooling available (GEPA or similar).
Doesn't fit:
- Sub-tasks are too tightly coupled to separate cleanly.
- No infrastructure for swapping models — every model swap is a multi-week deployment exercise.
- Low-volume agent — the engineering cost of per-sub-agent tuning exceeds the cost saved.
- Single LLM is dominantly best across all sub-tasks (rare in practice).
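The volume criterion in both lists amounts to a break-even check. As a rough sketch with invented numbers (all costs in the same arbitrary unit, and `break_even_queries` a hypothetical helper), one could estimate how many queries are needed before per-sub-agent tuning pays for itself:

```python
import math
from typing import Optional


def break_even_queries(
    engineering_cost: float,
    single_cost_per_query: float,
    routed_cost_per_query: float,
) -> Optional[int]:
    """Queries needed for routing savings to cover the one-off tuning effort.

    Returns None when routing is not cheaper per query, so it never pays back.
    """
    saving = single_cost_per_query - routed_cost_per_query
    if saving <= 0:
        return None
    return math.ceil(engineering_cost / saving)


# Hypothetical figures for illustration only.
needed = break_even_queries(8400.0, 120.0, 36.0)
```

If expected query volume over the agent's lifetime is well below this number, the "doesn't fit" side applies.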
Relationship to related concepts¶
- concepts/data-agent-unique-challenges is the problem class driving the search for accuracy gains.
- concepts/parallel-thinking-trajectory-sampling introduces cost; Multi-LLM recovers it — they compose.
- systems/gepa-prompt-optimizer is the per-sub-agent prompt optimisation tool referenced.
- patterns/llm-per-subagent-with-optimized-prompts is the pattern that operationalises this concept.
Seen in¶
- sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie — canonical first wiki disclosure of Multi-LLM sub-agent routing as a named architectural advance. Genie uses different LLMs per sub-agent (planning / search / code-gen / judges); platform makes this seamless across Opus / GPT / Gemini / OSS / custom; combined with GEPA prompt optimisation, accuracy + cost + latency improve simultaneously (Figure 1 end-state). Positioned as the architectural response to the no-single-LLM-is-best-across-all-sub-tasks property.