# Specialized reviewer agents

## Intent
Instead of one LLM reviewing every axis of a code change, run N domain-specific sub-reviewers, each with a narrow prompt, narrow tool surface, and structured severity-tagged output. The coordinator aggregates.
This is the per-domain specialisation variant of patterns/specialized-agent-decomposition, applied to the code review surface. It is the sub-agent structure the patterns/coordinator-sub-reviewer-orchestration pattern composes.
## Why specialisation matters here
A single general-purpose reviewer prompt that tries to cover security, performance, code quality, documentation, release management, and internal compliance at once suffers three compounding failures:
- "What NOT to flag" becomes unmanageable. Each domain has its own exclusion list; combining them all into one prompt either drops exclusions or exceeds context budget.
- Severity calibration drifts. A security-critical finding gets diluted next to a doc-suggestion-level finding; one scale fits all badly.
- Tool inventory interference. Security wants `grep` over secrets patterns; documentation wants markdown rendering; performance wants profile inspection. A combined toolset enables the wrong one on the wrong domain.
Splitting along domain lines gives each agent a tight prompt, a tight tool surface, and a calibrated severity ladder.
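The split above can be sketched as a per-reviewer spec: prompt file, tool allowlist, and severity ladder, all scoped to one domain. The reviewer names, tool names, and field layout below are illustrative assumptions, not Cloudflare's actual inventory:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewerSpec:
    """One domain-specific sub-reviewer: narrow prompt, narrow tools, own ladder."""
    name: str
    prompt_file: str                  # the {reviewer-name}.md with flag / don't-flag lists
    tools: tuple[str, ...]            # only the tools this domain actually needs
    severities: tuple[str, ...] = ("critical", "warning", "suggestion")

# Hypothetical roster entries -- tool names are placeholders for illustration.
SECURITY = ReviewerSpec("security", "security.md", ("grep", "read_file"))
DOCS = ReviewerSpec("documentation", "documentation.md", ("read_file", "render_markdown"))
```

Because each spec is frozen and self-contained, a mis-scoped tool (markdown rendering on the security reviewer, say) is a visible config error rather than a silent prompt-interference bug.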
## Reviewer roster (Cloudflare's instance)
| Reviewer | Role | Model tier | Canonical exclusions |
|---|---|---|---|
| Security | Injection / auth / secrets / crypto / input validation | Standard (Sonnet 4.6 / GPT-5.3) | Theoretical risks needing unlikely preconditions, defense-in-depth when primary defenses adequate, issues in unchanged code, "consider library X" suggestions |
| Performance | Regressions, hot-path costs, algorithmic complexity | Standard | — |
| Code quality | Correctness, maintainability, style-with-substance | Standard | — |
| Documentation | README / inline / doc-string currency | Kimi K2.5 | — |
| Release management | Release-related file changes | Kimi K2.5 | — |
| AGENTS.md | Materiality vs. AI-instruction staleness + anti-pattern penalties | Kimi K2.5 | — |
| Engineering codex | Internal RFC compliance | Standard | — |
## Shape of each reviewer's prompt
Every reviewer prompt is built at runtime by concatenating:
- Agent-specific markdown file — the `{reviewer-name}.md` with positive + negative lists.
- `REVIEWER_SHARED.md` — mandatory rules applicable to every reviewer.
- Pointer to `shared-mr-context.txt` + per-file patches in `diff_directory/`.
The agent-specific file always contains a `## What to Flag` and a `## What NOT to Flag` section. The latter is where prompt-engineering value accrues — see concepts/what-not-to-flag-prompt.
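The runtime concatenation can be sketched as a few lines of Python. The function name and the exact pointer wording are assumptions; the three-layer order (domain file, shared rules, context pointer) is from the list above:

```python
from pathlib import Path

def build_reviewer_prompt(reviewer_name: str, base: Path) -> str:
    """Concatenate the three prompt layers into one runtime prompt."""
    parts = [
        (base / f"{reviewer_name}.md").read_text(),   # domain flag / don't-flag lists
        (base / "REVIEWER_SHARED.md").read_text(),    # rules shared by every reviewer
        # Pointer only -- the agent reads the diff files itself via its tools.
        "Context: shared-mr-context.txt; per-file patches in diff_directory/.",
    ]
    return "\n\n".join(parts)
```

Keeping the shared rules in one file means a calibration fix propagates to all N reviewers on the next run, without touching any domain file.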
## Structured output contract
Every reviewer produces findings in structured XML with severity enum:
- `critical` — will cause an outage or is exploitable.
- `warning` — measurable regression or concrete risk.
- `suggestion` — an improvement worth considering.
"This ensures we are dealing with structured data that drives downstream behavior, rather than parsing advisory text."
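A minimal sketch of that contract, assuming an XML shape like `<findings><finding severity="…">…</finding></findings>` (the element names are an assumption; the source only specifies structured XML with this severity enum):

```python
import xml.etree.ElementTree as ET

SEVERITIES = ("critical", "warning", "suggestion")

def parse_findings(xml_text: str) -> list[dict]:
    """Turn a reviewer's XML report into structured findings; reject unknown severities."""
    root = ET.fromstring(xml_text)
    findings = []
    for node in root.iter("finding"):
        sev = node.get("severity")
        if sev not in SEVERITIES:
            raise ValueError(f"unknown severity: {sev!r}")
        findings.append({"severity": sev, "text": (node.text or "").strip()})
    return findings

report = """<findings>
  <finding severity="critical">Unsanitised SQL in handler</finding>
  <finding severity="suggestion">Rename helper for clarity</finding>
</findings>"""
```

Rejecting out-of-enum severities at parse time is the point of the contract: a reviewer that drifts into free-text severity labels fails loudly instead of silently skewing the rubric.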
Downstream rubric (at the coordinator's judge pass):
| Condition | Decision | GitLab action |
|---|---|---|
| All LGTM or trivial suggestions only | `approved` | `POST /approve` |
| Suggestion-only items | `approved_with_comments` | `POST /approve` |
| Some warnings, no production risk | `approved_with_comments` | `POST /approve` |
| Multiple warnings suggesting a risk pattern | `minor_issues` | `POST /unapprove` |
| Any critical item | `significant_concerns` | `/submit_review` with `requested_changes` (blocks merge) |
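The rubric reduces to a small decision function. This is a sketch: the "multiple warnings" condition is qualitative in the source (a risk *pattern*, judged by the coordinator), so the `>= 2` count threshold here is an assumption standing in for that judgment:

```python
def judge(severities: list[str]) -> tuple[str, str]:
    """Map aggregated finding severities to (decision, GitLab action)."""
    crits = severities.count("critical")
    warns = severities.count("warning")
    if crits:
        return "significant_concerns", "/submit_review requested_changes"
    if warns >= 2:   # stand-in for "multiple warnings suggesting a risk pattern"
        return "minor_issues", "POST /unapprove"
    if warns or severities.count("suggestion"):
        return "approved_with_comments", "POST /approve"
    return "approved", "POST /approve"
```

Note that every decision maps to a concrete GitLab API action, which is what the structured-data quote above is arguing for: the verdict drives behavior directly, with no advisory-text parsing in between.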
## Model-tier heterogeneity inside the pattern
Not every specialist needs the same model tier. Cloudflare's assignment:
- Top tier (Opus 4.7, GPT-5.4) — Coordinator only.
- Standard tier (Sonnet 4.6, GPT-5.3 Codex) — Code Quality, Security, Performance (the heavy-lifting reviewers).
- Kimi K2.5 on Workers AI — Documentation, Release, AGENTS.md (text-heavy, lower reasoning demand).
Rationale: "Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers." — same structural-vs-reasoning split logic as sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory's Llama 4 Scout + Nemotron 3 assignment.
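The tier assignment is just a lookup with a sensible fallback. The dictionary keys and the "default to standard" policy below are illustrative assumptions; the tier-to-reviewer mapping itself is from the list above:

```python
# Tier map per the assignment above; keys and default policy are illustrative.
MODEL_TIER = {
    "coordinator": "top",        # Opus 4.7 / GPT-5.4
    "code_quality": "standard",  # Sonnet 4.6 / GPT-5.3 Codex
    "security": "standard",
    "performance": "standard",
    "documentation": "light",    # Kimi K2.5 on Workers AI
    "release": "light",
    "agents_md": "light",
}

def model_for(reviewer: str) -> str:
    # A new reviewer with no explicit assignment gets the standard tier,
    # addressing the "defaults must exist" tradeoff noted below.
    return MODEL_TIER.get(reviewer, "standard")
```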
## Production observations (first 30 days)
| Reviewer | Total findings | Critical % |
|---|---|---|
| Code Quality | 74,898 | 8.6% |
| Documentation | 26,432 | 0.6% |
| Performance | 14,615 | 0.4% |
| Security | 11,985 | 4.0% |
| Codex (compliance) | 9,654 | 2.3% |
| AGENTS.md | 6,878 | 0.3% |
| Release | 745 | 2.6% |
Security pairs modest volume with a high critical rate (4.0%), while the text-heavy reviewers sit below 1% — exactly the per-domain severity calibration that specialisation enables. Code Quality dominates both volume and critical rate (8.6%, or 6,460 criticals) because its domain surface is wide.
## Tradeoffs
- Prompt maintenance cost. N prompts to keep current as the codebase evolves.
- Consistency risk. Overlapping findings between reviewers need the coordinator's judge pass to dedup.
- Model-tier sprawl. Right-sizing per reviewer is a calibration exercise; defaults must exist.
- Domain boundaries drift. A perf bug might surface in code-quality's output; coordinator re-categorises.
## Seen in
- sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — 7 specialised reviewers in production across 5,169 repos; finding distribution by reviewer disclosed.
## Related
- patterns/coordinator-sub-reviewer-orchestration — the composition pattern this is the sub-agent half of.
- patterns/specialized-agent-decomposition — the broader decomposition family.
- concepts/what-not-to-flag-prompt — the prompt-engineering discipline per reviewer.
- concepts/structured-output-reliability — why structured XML matters for downstream action.
- concepts/llm-as-judge — the coordinator's consolidation discipline.
- systems/cloudflare-ai-code-review — canonical production instance.