
PATTERN Cited by 1 source

Coordinator / sub-reviewer orchestration

Intent

Decompose an AI-driven review / critique / analysis task into:

  • A coordinator agent running on the highest-capability model tier.
  • N parallel sub-reviewer agents, each scoped to a single domain / axis / rubric, running on right-sized cheaper models.
  • A final judge pass inside the coordinator — dedup, re-categorise, drop false positives, read source to verify, emit structured output.

The coordinator is the only agent that sees the aggregate; sub-reviewers never see each other's output.

When to reach for it

  • High-stakes analysis over rich input where a single-prompt review is too shallow and a dozen overlapping prompts too noisy.
  • Separable domains — security vs. performance vs. code quality; legal vs. copy vs. SEO; correctness vs. completeness vs. policy.
  • Cost-per-reasoning-unit differences across model tiers — the coordinator's judgement call is worth the top tier; fan-out rarely is.
  • Structured-output consumers downstream (a CI gate, a ticket system, a dashboard) — the coordinator writes the schema-conforming final artifact.

Shape

          input (MR, doc, artifact, ...)
          ┌───────────────────┐
          │  Coordinator      │  ← top-tier model (Opus 4.7 / GPT-5.4)
          │  (reads full ctx) │
          └─────────┬─────────┘
                    │ spawn_reviewers(domains, shared_context_path)
      ┌─────┬───────┼───────┬─────┬──────┐
      ▼     ▼       ▼       ▼     ▼      ▼
  [sec] [perf] [quality] [docs] [rel] [codex]
   │     │     │          │     │      │
   │     │     │          │     │      │   ← each runs on a cheaper tier
   │     │     │          │     │      │      and reads shared-mr-context.txt
   │     │     │          │     │      │      + per-file patches
   ▼     ▼     ▼          ▼     ▼      ▼
  structured XML findings (critical/warning/suggestion)
          ┌───────────────────┐
          │  Judge pass       │  ← dedup / re-categorise / drop FPs /
          │  (coordinator)    │    verify by reading source
          └─────────┬─────────┘
        final structured output + action
        (approve / approved_with_comments / block)
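The final action at the bottom of the diagram can be keyed off the highest severity present in the consolidated findings. A minimal sketch — the exact severity→action mapping is an assumption, not stated in the source:

```python
from enum import IntEnum

class Severity(IntEnum):
    """Ordered so max() picks the most severe finding."""
    SUGGESTION = 1
    WARNING = 2
    CRITICAL = 3

def decide_action(severities: list[Severity]) -> str:
    """Assumed mapping: any critical blocks the MR; warnings and
    suggestions approve but attach comments; no findings approves."""
    if not severities:
        return "approve"
    if max(severities) is Severity.CRITICAL:
        return "block"
    return "approved_with_comments"
```

Keying off a closed enum rather than free text is what makes the action machine-decidable downstream (mechanism item 5 below).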

Mechanism

  1. Coordinator spawned as child process. Prompt via stdin to avoid ARG_MAX; JSONL on stdout so the orchestrator can process events in real time. See patterns/jsonl-streaming-child-process.
  2. spawn_reviewers as a tool. The coordinator calls a runtime-plugin tool to launch sub-agents; it does not see their tools, prompts, or individual outputs — it only sees their final structured findings.
  3. Specialised prompts per sub-reviewer. Each sub-agent has a tightly scoped prompt with an explicit "What NOT to flag" section — see concepts/what-not-to-flag-prompt.
  4. Shared-context file on disk. Sub-reviewers read shared-mr-context.txt (and per-file patches) via file tool — never embedded in prompt — to exploit provider prompt caching.
  5. Structured severity output. Findings carry a severity enum (critical / warning / suggestion); downstream action is keyed off severity, not free-text.
  6. Judge pass. Coordinator deduplicates (same issue from two reviewers kept once in the best-fit section), re-categorises (perf bug flagged by code-quality moves to perf section), drops speculative items, and may call its tools to read the source to verify an ambiguous finding.
  7. Timeouts at three levels. Per-task (5 min typical; 10 for heavier reviewers), overall (e.g. 25 min), and a retry-budget floor (e.g. 2 min kept in reserve so a failed sub-reviewer can still be retried before the overall deadline).
  8. Resilience. Circuit breaker + failback chains per model tier isolate sub-reviewer provider failures from the coordinator's judgment logic.
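The fan-out with two of the three timeout levels (per-task and overall) can be sketched as follows. `run_reviewer` and `fan_out` are hypothetical names; the real system spawns child processes and applies failback chains where this sketch merely drops a timed-out reviewer:

```python
import concurrent.futures as cf
import time

def run_reviewer(domain: str) -> list[dict]:
    """Hypothetical stand-in for launching one sub-reviewer child
    process and parsing its structured findings."""
    return [{"domain": domain, "severity": "suggestion", "msg": "..."}]

def fan_out(domains, per_task_s=300.0, overall_s=1500.0):
    """Run sub-reviewers in parallel. Each result wait is capped by
    both its per-task budget and the remaining overall deadline."""
    findings = []
    deadline = time.monotonic() + overall_s
    with cf.ThreadPoolExecutor(max_workers=len(domains)) as pool:
        futures = {pool.submit(run_reviewer, d): d for d in domains}
        for fut, domain in futures.items():
            budget = min(per_task_s, deadline - time.monotonic())
            try:
                findings.extend(fut.result(timeout=max(budget, 0.0)))
            except Exception:
                pass  # real system: retry `domain` via failback chain
    return findings
```

A failed or slow reviewer costs the coordinator at most its budget; the judge pass runs over whatever findings arrived.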

Why it works better than a single prompt

  • Small tool inventory per agent → reliable tool selection (patterns/specialized-agent-decomposition).
  • Narrow prompt per agent → better "what NOT to flag" discipline; less drift between domains.
  • Independent parallelism → wall-clock latency is bounded by the slowest sub-reviewer, not the sum of N sequential reviews.
  • Deduplication + re-categorisation happens centrally → downstream consumer sees one clean output, not N overlapping ones.
  • Coordinator judgment cost is amortised against N parallel sub-reviewer calls → top-tier model used only where it matters.
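The central deduplication and re-categorisation step can be sketched as below. The dedup key, field names, and the `perf:` re-categorisation rule are all assumptions for illustration:

```python
def consolidate(findings: list[dict]) -> list[dict]:
    """Judge-pass sketch: collapse duplicates two reviewers raised
    for the same location and issue (keep the higher severity), then
    move each finding to its best-fit section."""
    rank = {"critical": 3, "warning": 2, "suggestion": 1}
    seen: dict[tuple, dict] = {}
    for f in findings:
        key = (f["file"], f["line"], f["issue"])
        prev = seen.get(key)
        if prev is None or rank[f["severity"]] > rank[prev["severity"]]:
            seen[key] = f
    out = []
    for f in seen.values():
        # Hypothetical rule: a perf issue flagged by any reviewer
        # lands in the performance section.
        if f["issue"].startswith("perf:"):
            f = {**f, "section": "performance"}
        out.append(f)
    return out
```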

Tradeoffs

  • Orchestration complexity. Plugin architecture, IPC, timeouts, failback — harder than a single prompt.
  • Cost scales with diff size. Seven frontier-model calls on a 500-file refactor is real money. Pair with patterns/ai-review-risk-tiering to gate expensive tiers.
  • Coordinator is the bottleneck. Its context must fit summarised sub-reviewer output + full MR metadata; Cloudflare warns when coordinator prompt exceeds 50% of estimated context window.
  • Sub-reviewer output must be structured. If any sub-agent returns malformed output, the coordinator's judge pass has to handle it — see concepts/structured-output-reliability.
  • vs. patterns/planner-coder-verifier-router-loop — Google DS-STAR's role-based decomposition sits inside a refinement loop (plan → code → verify → route-add-or-fix → repeat). Coordinator / sub-reviewer is single-pass with fan-out; the "judge pass" is not a refinement round, it's a consolidation.
  • vs. patterns/specialized-agent-decomposition — sub-reviewer is one realisation of the domain-based decomposition variant, with the coordinator explicitly staged as the consolidator.
  • vs. concepts/llm-as-judge — sub-reviewer output consolidation uses judge-pattern discipline inside the coordinator; not every LLM-as-judge application is multi-agent.
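One way to keep the judge pass resilient to malformed sub-reviewer output is to parse leniently and quarantine anything that fails rather than aborting. A sketch, assuming a `<findings><finding severity=...>` shape the source does not spell out:

```python
import xml.etree.ElementTree as ET

VALID_SEVERITIES = {"critical", "warning", "suggestion"}

def parse_findings(raw: str) -> tuple[list[dict], list[str]]:
    """Returns (findings, errors): an unparseable payload or an
    unknown severity degrades to a recorded error the coordinator
    can surface, never an exception in the judge pass."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as e:
        return [], [f"unparseable payload: {e}"]
    findings, errors = [], []
    for node in root.iter("finding"):
        sev = (node.get("severity") or "").lower()
        if sev not in VALID_SEVERITIES:
            errors.append(f"dropped finding with severity {sev!r}")
            continue
        findings.append({"severity": sev, "text": (node.text or "").strip()})
    return findings, errors
```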

Seen in

  • sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — canonical production instance. 7 specialised sub-reviewers (security, performance, code quality, documentation, release, AGENTS.md, engineering-codex) with coordinator on Opus 4.7 / GPT-5.4, sub-reviewers on Sonnet 4.6 / GPT-5.3 / Kimi K2.5. 131,246 review runs across 5,169 repos in first 30 days; median review 3m 39s; 1.2 findings per review; 85.7% prompt-cache hit rate driven partly by the shared-context file.