PATTERN Cited by 1 source

LLM per sub-agent with optimised prompts

LLM per sub-agent with optimised prompts is an agent-design pattern in which a single agent system uses different LLMs for different internal sub-agents (planning, search, code generation, judging), each with per-sub-agent prompt optimisation (e.g., via GEPA). The pattern is the structural mechanism through which Multi-LLM sub-agent routing delivers simultaneous improvement on accuracy + cost + latency.

The pattern was canonicalised in the 2026-05-08 Databricks post on Genie as one of the three architectural advances behind Genie's accuracy lead over a "leading coding agent" baseline (32% for the baseline versus over 90% for Genie on Databricks' internal benchmark).

The pattern

                       Agent system
  ┌──────────────────────────────────────────────────────────┐
  │                                                          │
  │   ┌───────────────────────┐                              │
  │   │  Planning sub-agent   │◄── LLM A + prompt P_A        │
  │   └──────────┬────────────┘    (frontier reasoning)      │
  │              │                                           │
  │   ┌──────────▼─────────┐                                 │
  │   │  Search sub-agent  │◄── LLM B + prompt P_B           │
  │   └──────────┬─────────┘    (fast retrieval-tuned)       │
  │              │                                           │
  │   ┌──────────▼──────────┐                                │
  │   │ Code-gen sub-agent  │◄── LLM C + prompt P_C          │
  │   └──────────┬──────────┘    (SQL synthesis)             │
  │              │                                           │
  │   ┌──────────▼──────────┐                                │
  │   │   Judge sub-agent   │◄── LLM D + prompt P_D          │
  │   └─────────────────────┘    (quality evaluation)        │
  │                                                          │
  └──────────────────────────────────────────────────────────┘

  Each prompt P_i is GEPA-optimised for its (LLM, sub-task) pair.

The two distinct moves:

  1. Per-sub-agent LLM assignment — pick best-of-class for each slice of the agent's work.
  2. Per-sub-agent prompt optimisation — for the chosen LLM and its sub-task, optimise the prompt (e.g., with GEPA) so that smaller / cheaper models can recover accuracy that frontier models would have provided with a generic prompt.
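
The two moves can be sketched as a small registry that binds each sub-agent to its own (LLM, prompt) pair. This is a minimal illustration, not Genie's implementation: the model identifiers, prompt strings, and the `call_llm` stub are all hypothetical, and the source does not disclose the actual assignments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubAgentConfig:
    model: str   # hypothetical model identifier (move 1: per-sub-agent LLM)
    prompt: str  # prompt tuned for this (model, sub-task) pair (move 2)


# Hypothetical assignments, mirroring the four sub-agents in the diagram.
REGISTRY = {
    "planning": SubAgentConfig("frontier-reasoning-model", "P_A"),
    "search": SubAgentConfig("fast-retrieval-model", "P_B"),
    "codegen": SubAgentConfig("sql-synthesis-model", "P_C"),
    "judge": SubAgentConfig("evaluation-model", "P_D"),
}


def call_llm(model: str, prompt: str, task_input: str) -> str:
    """Stand-in for a real inference call; returns a traceable string."""
    return f"[{model}] {prompt} | {task_input}"


def run_sub_agent(name: str, task_input: str) -> str:
    """Dispatch a task to the model and prompt registered for this sub-agent."""
    cfg = REGISTRY[name]
    return call_llm(cfg.model, cfg.prompt, task_input)
```

Keeping the model and the prompt in one record makes it hard to swap one without revisiting the other, which is the coupling the pattern relies on.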

Why both moves are necessary

| Move | Without the other | With both |
| --- | --- | --- |
| Multi-LLM only (no prompt opt) | Smaller models on cheap sub-tasks underperform; gain is small | n/a |
| Prompt opt only (single LLM) | Stuck with one model's capability profile across all sub-tasks | n/a |
| Both combined | n/a | Each (LLM, sub-task) pair runs at its optimised operating point; accuracy + cost + latency all gain |

The Databricks post points to this combination as the mechanism, describing "how different LLMs perform on table search tasks and how the corresponding accuracy and cost can be further optimized using methods like GEPA."
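
To make the interaction between the two moves concrete, the sketch below picks a prompt separately for each (LLM, sub-task) pair. It uses plain best-of-candidates selection as a stand-in and is not GEPA. The model names, candidate prompts, and `toy_score` function are invented for illustration; a real scorer would run an evaluation set.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (model, sub_task)


def select_prompts(
    pairs: List[Pair],
    candidates: Dict[str, List[str]],         # sub_task -> candidate prompts
    score: Callable[[str, str, str], float],  # (model, sub_task, prompt) -> score
) -> Dict[Pair, str]:
    """Pick the highest-scoring candidate prompt for every (model, sub_task) pair."""
    best: Dict[Pair, str] = {}
    for model, sub_task in pairs:
        best[(model, sub_task)] = max(
            candidates[sub_task], key=lambda p: score(model, sub_task, p)
        )
    return best


def toy_score(model: str, sub_task: str, prompt: str) -> float:
    # Hypothetical scoring: the small model is assumed to benefit from longer,
    # more explicit prompts, the large model from terse ones.
    words = len(prompt.split())
    return words if model == "small-model" else -words


PAIRS = [("small-model", "search"), ("large-model", "search")]
CANDIDATES = {
    "search": [
        "Find tables.",
        "Find tables; list each candidate with a one-line reason.",
    ]
}

chosen = select_prompts(PAIRS, CANDIDATES, toy_score)
```

Under these made-up scores the two models end up with different prompts for the same sub-task, which is why prompt optimisation is done per pair rather than once per sub-task.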

Sub-agent decomposition

The pattern requires identifying clearly separable sub-tasks with distinct capability profiles. Genie's disclosed decomposition:

| Sub-agent | Capability needed | Volume per query | Cost sensitivity |
| --- | --- | --- | --- |
| Planning | Multi-step reasoning, tool-call orchestration | Low (1 / query) | Low (one expensive call OK) |
| Search | Asset retrieval / matching | High (many calls) | High (per-call cost matters) |
| Code generation | SQL synthesis, schema grounding | Medium | Medium |
| Judges | Calibrated quality evaluation | Medium-low (1 per N trajectories) | Low |

The pattern's value comes from mismatched profiles: if all sub-agents had identical profiles, a single LLM would perform just as well. The complementary-capabilities observation is what makes the pattern worth the engineering investment.
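
A back-of-the-envelope cost model shows why the volume column matters. Every number below is hypothetical: the source gives only qualitative volumes (low / high / medium) and no prices, so the call counts and per-call costs are placeholders in arbitrary units.

```python
from typing import Dict

# Hypothetical calls per query, loosely following the qualitative volumes above.
CALLS_PER_QUERY: Dict[str, float] = {
    "planning": 1,
    "search": 20,
    "codegen": 3,
    "judge": 0.5,  # e.g. one judge call shared across two trajectories
}

# Hypothetical per-call cost by model tier, in arbitrary units.
PRICE_PER_CALL: Dict[str, float] = {"frontier": 10.0, "mid": 4.0, "small": 1.0}


def cost_per_query(assignment: Dict[str, str]) -> float:
    """Sum calls * per-call price over all sub-agents for one query."""
    return sum(
        CALLS_PER_QUERY[agent] * PRICE_PER_CALL[assignment[agent]]
        for agent in CALLS_PER_QUERY
    )


all_frontier = {agent: "frontier" for agent in CALLS_PER_QUERY}
mixed = {
    "planning": "frontier",  # low volume: one expensive call is acceptable
    "search": "small",       # high volume: per-call cost dominates
    "codegen": "mid",
    "judge": "frontier",
}

print(cost_per_query(all_frontier))  # 245.0
print(cost_per_query(mixed))         # 47.0
```

With these placeholder figures, almost all of the saving comes from moving the high-volume search sub-agent off the frontier tier. The model says nothing about accuracy, which is what per-pair prompt optimisation is meant to protect.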

Figure 6 of the source post shows table-search sub-agents running on different LLMs, with GEPA optimising the corresponding prompts. The post describes it as showing "how different LLMs perform on table search tasks and how the corresponding accuracy and cost can be further optimized using methods like GEPA." It does not disclose specific numbers.

Composition with parallel thinking

| Without parallel thinking | With parallel thinking |
| --- | --- |
| One trajectory, multi-LLM per sub-agent | N trajectories, each with multi-LLM per sub-agent |
| Single sample per sub-task | Multiple samples; aggregator picks |

These compose naturally — Genie does both. Multi-LLM per sub-agent + parallel trajectory sampling is double diversity: across trajectory boundaries (sampling) and across sub-agent boundaries (model variety).
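
The composition can be sketched as an outer loop over trajectories wrapped around the per-sub-agent pipeline. This is an assumption-laden toy: the stages are string stubs rather than LLM calls, the judge score is a seeded random number, and the trajectories run sequentially for simplicity. How Genie's aggregator works is not described in the source.

```python
import random
from typing import Dict, List, Tuple

Trajectory = Dict[str, object]


def run_trajectory(question: str, rng: random.Random) -> Trajectory:
    """One pass through the sub-agent pipeline; each stage would use its own LLM."""
    plan = f"plan({question})"    # planning sub-agent (stub)
    tables = f"search({plan})"    # search sub-agent (stub)
    sql = f"codegen({tables})"    # code-gen sub-agent (stub)
    # Placeholder for the judge sub-agent's quality score.
    return {"sql": sql, "judge_score": rng.random()}


def parallel_think(
    question: str, n: int, seed: int = 0
) -> Tuple[Trajectory, List[Trajectory]]:
    """Sample n trajectories and let the aggregator keep the best-judged one."""
    rng = random.Random(seed)
    trajectories = [run_trajectory(question, rng) for _ in range(n)]
    best = max(trajectories, key=lambda t: t["judge_score"])
    return best, trajectories
```

Diversity across trajectories comes from sampling, and diversity within a trajectory comes from each stage using a different model.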

Operationalising: what infrastructure is needed

| Component | Purpose |
| --- | --- |
| Unified inference plane | Allow any LLM to be invoked from any sub-agent without per-model integration cost |
| Prompt versioning + management | Each (LLM, sub-task) pair has its own optimised prompt; manage as code |
| Prompt optimisation tooling | GEPA or equivalent; a feedback loop on prompt quality |
| Per-sub-agent telemetry | Measure accuracy + cost + latency at the sub-agent altitude (not just end-to-end); the engineering decision needs per-slice data |
| Cost guardrails | Frontier models on planning sub-agent can be expensive; need per-call budget controls |

Databricks' platform property "seamless to try out any of the frontier models (including Opus, GPT, and Gemini), open-source models, as well as custom trained models" is what makes the pattern tractable; without that the per-sub-agent assignment is a multi-week-per-swap exercise.
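
Two of the components above, per-sub-agent telemetry and cost guardrails, are small enough to sketch directly. The class and method names here are hypothetical and not part of any Databricks API. The sketch assumes the caller can estimate a call's cost before making it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when an estimated call cost is above the sub-agent's per-call budget."""


@dataclass
class SubAgentStats:
    calls: int = 0
    total_cost: float = 0.0
    total_latency_s: float = 0.0


class SubAgentTelemetry:
    def __init__(self, per_call_budget: Dict[str, float]) -> None:
        self.per_call_budget = per_call_budget
        self.stats: Dict[str, SubAgentStats] = defaultdict(SubAgentStats)

    def check_budget(self, sub_agent: str, estimated_cost: float) -> None:
        """Guardrail: refuse a call whose estimated cost exceeds the budget."""
        budget = self.per_call_budget.get(sub_agent)
        if budget is not None and estimated_cost > budget:
            raise BudgetExceeded(f"{sub_agent}: {estimated_cost} > {budget}")

    def record(self, sub_agent: str, cost: float, latency_s: float) -> None:
        """Accumulate cost and latency per sub-agent, not just end-to-end."""
        s = self.stats[sub_agent]
        s.calls += 1
        s.total_cost += cost
        s.total_latency_s += latency_s
```

A caller would invoke `check_budget` before each model call and `record` after it, so tuning decisions can be made per slice.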

When this fits / doesn't

Fits:

  • Agent has clearly separable sub-tasks with different capability profiles.
  • Inference platform makes per-model swapping cheap.
  • Prompt-optimisation tooling available.
  • High-volume agent — engineering investment amortised.
  • Clear telemetry per sub-agent for tuning.

Doesn't fit:

  • Sub-tasks are too tightly coupled to separate cleanly.
  • No infrastructure for swapping models — every swap is a major deployment.
  • Low-volume agent — engineering cost > savings.
  • Single LLM is dominantly best across all sub-tasks (rare in practice).

Anti-patterns

  • Pick best per sub-agent, no prompt optimisation — leaves significant accuracy on the table; smaller models underperform on generic prompts.
  • Optimise prompts only on flagship LLM — fails to adapt prompts to the smaller / faster model assigned to high-volume sub-tasks.
  • Same prompt across LLMs — different models respond differently to the same prompt; what's optimal for Opus isn't for Gemini.
  • No per-sub-agent telemetry — can't tell where the bottleneck is; blind tuning.
  • Frontier model on every sub-agent — defeats the cost benefit; reserve frontier for sub-tasks that pay off (planning, judging).
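
Two of these anti-patterns can be detected mechanically from the assignment table. The following is a minimal lint sketch under assumed data shapes: `assignments` maps each sub-agent to a hypothetical (model, prompt) tuple, and `frontier_models` is a set the caller maintains.

```python
from typing import Dict, List, Set, Tuple


def lint_assignments(
    assignments: Dict[str, Tuple[str, str]],  # sub_agent -> (model, prompt)
    frontier_models: Set[str],
) -> List[str]:
    """Return warnings for anti-patterns visible in the configuration alone."""
    warnings: List[str] = []

    # Frontier model on every sub-agent: defeats the cost benefit.
    models = {model for model, _ in assignments.values()}
    if len(assignments) > 1 and models <= frontier_models:
        warnings.append("frontier model on every sub-agent")

    # Same prompt across LLMs: one prompt text shared by different models.
    models_by_prompt: Dict[str, Set[str]] = {}
    for model, prompt in assignments.values():
        models_by_prompt.setdefault(prompt, set()).add(model)
    if any(len(ms) > 1 for ms in models_by_prompt.values()):
        warnings.append("same prompt reused across different LLMs")

    return warnings
```

The remaining anti-patterns (missing prompt optimisation, missing telemetry) depend on process rather than configuration, so a check like this cannot catch them.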

Seen in

  • sources/2026-05-08-databricks-pushing-the-frontier-for-data-agents-with-genie: canonical first wiki disclosure of LLM-per-sub-agent + GEPA-optimised-prompts as a named pattern. Genie uses different LLMs for planning / search / code-gen / judge sub-agents; GEPA optimises the corresponding prompts; combined effect is simultaneous improvement on accuracy + cost + latency (Figure 1 end-state). Specific (LLM, sub-task) assignments not disclosed publicly.