
PATTERN

Specialized reviewer agents

Intent

Instead of one LLM reviewing every axis of a code change, run N domain-specific sub-reviewers, each with a narrow prompt, narrow tool surface, and structured severity-tagged output. The coordinator aggregates.

This is the per-domain specialisation variant of patterns/specialized-agent-decomposition, applied to the code-review surface. It is also the sub-agent structure that the patterns/coordinator-sub-reviewer-orchestration pattern composes.

Why specialisation matters here

A single general-purpose reviewer prompt that tries to cover security, performance, code quality, documentation, release management, and internal compliance at once suffers three compounding failures:

  1. "What NOT to flag" becomes unmanageable. Each domain has its own exclusion list; combining them all into one prompt either drops exclusions or exceeds context budget.
  2. Severity calibration drifts. A security-critical finding gets diluted next to a doc-suggestion-level finding; one scale fits all badly.
  3. Tool inventory interference. Security wants grep over secrets patterns; documentation wants markdown rendering; performance wants profile inspection. A combined toolset enables the wrong one on the wrong domain.

Splitting along domain lines gives each agent a tight prompt, a tight tool surface, and a calibrated severity ladder.
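The "tight tool surface" can be made concrete as a per-reviewer allowlist. The sketch below is illustrative only: the tool names (`grep_secrets`, `inspect_profile`, `render_markdown`) are hypothetical, not Cloudflare's actual inventory.

```python
# Hypothetical per-domain tool allowlists; names are illustrative, not Cloudflare's.
REVIEWER_TOOLS = {
    "security": ["read_file", "grep_secrets", "list_dependencies"],
    "performance": ["read_file", "inspect_profile"],
    "documentation": ["read_file", "render_markdown"],
}

def tools_for(reviewer: str) -> list[str]:
    """Return the narrow tool surface for a reviewer.

    Unknown reviewers fall back to read-only access (an assumed default),
    so a mis-routed domain never gains the wrong tools.
    """
    return REVIEWER_TOOLS.get(reviewer, ["read_file"])
```

The point of the lookup is the inverse of a combined toolset: a documentation reviewer simply cannot invoke secrets-grepping, so tool-inventory interference is ruled out structurally rather than by prompt discipline.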

Reviewer roster (Cloudflare's instance)

| Reviewer | Role | Model tier | Canonical exclusions |
| --- | --- | --- | --- |
| Security | Injection / auth / secrets / crypto / input validation | Standard (Sonnet 4.6 / GPT-5.3) | Theoretical risks needing unlikely preconditions; defense-in-depth when primary defenses are adequate; issues in unchanged code; "consider library X" suggestions |
| Performance | Regressions, hot-path costs, algorithmic complexity | Standard | — |
| Code quality | Correctness, maintainability, style-with-substance | Standard | — |
| Documentation | README / inline / doc-string currency | Kimi K2.5 | — |
| Release management | Release-related file changes | Kimi K2.5 | — |
| AGENTS.md | Materiality vs. AI-instruction staleness + anti-pattern penalties | Kimi K2.5 | — |
| Engineering codex | Internal RFC compliance | Standard | — |

Shape of each reviewer's prompt

Every reviewer prompt is built at runtime by concatenating:

  1. Agent-specific markdown file — the {reviewer-name}.md with positive + negative lists.
  2. REVIEWER_SHARED.md — mandatory rules applicable to every reviewer.
  3. Pointer to shared-mr-context.txt + per-file patches in diff_directory/.

The agent-specific file always contains ## What to Flag and ## What NOT to Flag sections. The latter is where the prompt-engineering value accrues; see concepts/what-not-to-flag-prompt.
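The three-part concatenation above can be sketched directly. File names (`{reviewer-name}.md`, `REVIEWER_SHARED.md`, `shared-mr-context.txt`, `diff_directory/`) come from the article; the directory layout and join order are assumptions.

```python
from pathlib import Path

def build_reviewer_prompt(reviewer: str, prompts_dir: Path, diff_dir: Path) -> str:
    """Assemble a reviewer prompt at runtime from its three parts.

    1. The agent-specific markdown (positive + negative lists).
    2. REVIEWER_SHARED.md, mandatory rules for every reviewer.
    3. A pointer to the shared MR context and per-file patches.
    """
    agent_md = (prompts_dir / f"{reviewer}.md").read_text()
    shared_md = (prompts_dir / "REVIEWER_SHARED.md").read_text()
    pointer = (
        "Context: read shared-mr-context.txt and the per-file patches "
        f"in {diff_dir}/ before reviewing."
    )
    return "\n\n".join([agent_md, shared_md, pointer])
```

Because only part 1 varies, adding a new specialist is a one-file change: drop a new `{reviewer-name}.md` next to the others and the shared rules and context wiring come for free.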

Structured output contract

Every reviewer produces findings in structured XML with severity enum:

  • critical — will cause an outage or is exploitable.
  • warning — measurable regression or concrete risk.
  • suggestion — an improvement worth considering.

"This ensures we are dealing with structured data that drives downstream behavior, rather than parsing advisory text."
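A minimal sketch of consuming that contract, assuming a `<findings>` root with `<finding severity="...">` children; the element and attribute names are illustrative, since the article does not publish the exact schema.

```python
import xml.etree.ElementTree as ET

SEVERITIES = {"critical", "warning", "suggestion"}

def parse_findings(xml_text: str) -> list[dict]:
    """Parse severity-tagged findings into structured records.

    Rejecting unknown severities keeps the enum closed, so downstream
    rubric logic never has to interpret free-form advisory text.
    """
    root = ET.fromstring(xml_text)
    findings = []
    for node in root.iter("finding"):
        sev = node.get("severity", "")
        if sev not in SEVERITIES:
            raise ValueError(f"unknown severity: {sev!r}")
        findings.append({"severity": sev, "text": (node.text or "").strip()})
    return findings

sample = """<findings>
  <finding severity="critical">SQL built by string concatenation in handler.py</finding>
  <finding severity="suggestion">Consider a doc-string for parse_config</finding>
</findings>"""
```

Validation at the parse boundary is what makes the quote above operational: a reviewer that emits a malformed severity fails loudly instead of silently biasing the coordinator's rubric.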

Downstream rubric (at the coordinator's judge pass):

| Condition | Decision | GitLab action |
| --- | --- | --- |
| All LGTM or trivial suggestions only | approved | POST /approve |
| Suggestion-only items | approved_with_comments | POST /approve |
| Some warnings, no production risk | approved_with_comments | POST /approve |
| Multiple warnings suggesting a risk pattern | minor_issues | POST /unapprove |
| Any critical item | significant_concerns | /submit_review requested_changes (blocks merge) |
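The rubric can be sketched as a severity-count decision function. Two things here are assumptions: the "risk pattern" threshold (taken as three or more warnings) and the collapsing of the trivial-vs-substantive suggestion distinction into "any suggestion means comments".

```python
def judge(severities: list[str]) -> tuple[str, str]:
    """Map aggregated reviewer severities to a decision and GitLab action.

    Rules are checked strictest-first, so one critical finding
    dominates any number of warnings or suggestions.
    """
    crit = severities.count("critical")
    warn = severities.count("warning")
    sugg = severities.count("suggestion")
    if crit:
        return "significant_concerns", "/submit_review requested_changes"
    if warn >= 3:  # assumed threshold for a "risk pattern"
        return "minor_issues", "POST /unapprove"
    if warn or sugg:
        return "approved_with_comments", "POST /approve"
    return "approved", "POST /approve"
```

Note the asymmetry the table encodes: warnings and suggestions still approve, only a critical finding blocks the merge, and the unapprove path exists for the in-between case where no single finding blocks but the pattern warrants human eyes.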

Model-tier heterogeneity inside the pattern

Not every specialist needs the same model tier. Cloudflare's assignment:

  • Top tier (Opus 4.7, GPT-5.4) — Coordinator only.
  • Standard tier (Sonnet 4.6, GPT-5.3 Codex) — Code Quality, Security, Performance (the heavy-lifting reviewers).
  • Kimi K2.5 on Workers AI — Documentation, Release, AGENTS.md (text-heavy, lower reasoning demand).

Rationale: "Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers." — same structural-vs-reasoning split logic as sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory's Llama 4 Scout + Nemotron 3 assignment.
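The tier assignment above reduces to a routing table. Model identifiers are copied from the article; the tier labels and the standard-tier default for unknown agents are assumptions of this sketch.

```python
# Tier routing per the assignment above; identifiers from the article,
# tier labels and the fallback default are this sketch's assumptions.
MODEL_TIER = {
    "coordinator": "top",          # Opus 4.7 / GPT-5.4
    "code_quality": "standard",    # Sonnet 4.6 / GPT-5.3 Codex
    "security": "standard",
    "performance": "standard",
    "codex": "standard",
    "documentation": "workers_ai", # Kimi K2.5 on Workers AI
    "release": "workers_ai",
    "agents_md": "workers_ai",
}

def model_for(agent: str) -> str:
    """Resolve an agent to its model tier; default to standard."""
    return MODEL_TIER.get(agent, "standard")
```

Keeping the routing in one table makes right-sizing a config change rather than a code change, which matters given the model-tier-sprawl tradeoff noted below.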

Production observations (first 30 days)

| Reviewer | Total findings | Critical % |
| --- | --- | --- |
| Code Quality | 74,898 | 8.6% |
| Documentation | 26,432 | 0.6% |
| Performance | 14,615 | 0.4% |
| Security | 11,985 | 4.0% |
| Codex (compliance) | 9,654 | 2.3% |
| AGENTS.md | 6,878 | 0.3% |
| Release | 745 | 2.6% |

Security has the highest critical rate but not the highest volume, which is exactly the calibration that specialisation enables. Code Quality dominates volume, and its criticals are still meaningful in absolute terms (6,460) because the domain surface is wide.

Tradeoffs

  • Prompt maintenance cost. N prompts to keep current as the codebase evolves.
  • Consistency risk. Overlapping findings between reviewers need the coordinator's judge pass to dedup.
  • Model-tier sprawl. Right-sizing per reviewer is a calibration exercise; defaults must exist.
  • Domain boundaries drift. A perf bug might surface in code-quality's output; coordinator re-categorises.
