
PATTERN

Specialized reviewer agents

Intent

Instead of one LLM reviewing every axis of a code change, run N domain-specific sub-reviewers, each with a narrow prompt, narrow tool surface, and structured severity-tagged output. The coordinator aggregates.

This is the per-domain specialisation variant of patterns/specialized-agent-decomposition, applied to the code-review surface. It is also the sub-agent structure that the patterns/coordinator-sub-reviewer-orchestration pattern composes.

Why specialisation matters here

A single general-purpose reviewer prompt that tries to cover security, performance, code quality, documentation, release management, and internal compliance at once suffers three compounding failures:

  1. "What NOT to flag" becomes unmanageable. Each domain has its own exclusion list; combining them all into one prompt either drops exclusions or exceeds context budget.
  2. Severity calibration drifts. A security-critical finding gets diluted next to a doc-suggestion-level finding; one scale fits all badly.
  3. Tool inventory interference. Security wants grep over secrets patterns; documentation wants markdown rendering; performance wants profile inspection. A combined toolset enables the wrong one on the wrong domain.

Splitting along domain lines gives each agent a tight prompt, a tight tool surface, and a calibrated severity ladder.
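The "tight tool surface" can be made concrete as a per-reviewer allowlist. The sketch below is illustrative only: the tool names (`grep_secrets`, `inspect_profile`, `render_markdown`) are hypothetical, not Cloudflare's actual inventory.

```python
# Hypothetical per-domain tool allowlists; names are illustrative, not Cloudflare's.
REVIEWER_TOOLS = {
    "security": ["read_file", "grep_secrets", "list_dependencies"],
    "performance": ["read_file", "inspect_profile"],
    "documentation": ["read_file", "render_markdown"],
}

def tools_for(reviewer: str) -> list[str]:
    """Return the narrow tool surface for a reviewer.

    Unknown reviewers fall back to read-only access (an assumed default),
    so a mis-routed domain never gains the wrong tools.
    """
    return REVIEWER_TOOLS.get(reviewer, ["read_file"])
```

The point of the lookup is the inverse of a combined toolset: a documentation reviewer simply cannot invoke secrets-grepping, so tool-inventory interference is ruled out structurally rather than by prompt discipline.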

Reviewer roster (Cloudflare's instance)

| Reviewer | Role | Model tier | Canonical exclusions |
| --- | --- | --- | --- |
| Security | Injection / auth / secrets / crypto / input validation | Standard (Sonnet 4.6 / GPT-5.3) | Theoretical risks needing unlikely preconditions; defense-in-depth when primary defenses are adequate; issues in unchanged code; "consider library X" suggestions |
| Performance | Regressions, hot-path costs, algorithmic complexity | Standard | — |
| Code quality | Correctness, maintainability, style-with-substance | Standard | — |
| Documentation | README / inline / doc-string currency | Kimi K2.5 | — |
| Release management | Release-related file changes | Kimi K2.5 | — |
| AGENTS.md | Materiality vs. AI-instruction staleness + anti-pattern penalties | Kimi K2.5 | — |
| Engineering codex | Internal RFC compliance | Standard | — |

Shape of each reviewer's prompt

Every reviewer prompt is built at runtime by concatenating:

  1. Agent-specific markdown file — the {reviewer-name}.md with positive + negative lists.
  2. REVIEWER_SHARED.md — mandatory rules applicable to every reviewer.
  3. Pointer to shared-mr-context.txt + per-file patches in diff_directory/.

The agent-specific file always contains ## What to Flag and ## What NOT to Flag sections. The latter is where the prompt-engineering value accrues; see concepts/what-not-to-flag-prompt.
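The three-part concatenation above can be sketched directly. File names (`{reviewer-name}.md`, `REVIEWER_SHARED.md`, `shared-mr-context.txt`, `diff_directory/`) come from the article; the directory layout and join order are assumptions.

```python
from pathlib import Path

def build_reviewer_prompt(reviewer: str, prompts_dir: Path, diff_dir: Path) -> str:
    """Assemble a reviewer prompt at runtime from its three parts.

    1. The agent-specific markdown (positive + negative lists).
    2. REVIEWER_SHARED.md, mandatory rules for every reviewer.
    3. A pointer to the shared MR context and per-file patches.
    """
    agent_md = (prompts_dir / f"{reviewer}.md").read_text()
    shared_md = (prompts_dir / "REVIEWER_SHARED.md").read_text()
    pointer = (
        "Context: read shared-mr-context.txt and the per-file patches "
        f"in {diff_dir}/ before reviewing."
    )
    return "\n\n".join([agent_md, shared_md, pointer])
```

Because only part 1 varies, adding a new specialist is a one-file change: drop a new `{reviewer-name}.md` next to the others and the shared rules and context wiring come for free.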

Structured output contract

Every reviewer produces findings in structured XML with severity enum:

  • critical — will cause an outage or is exploitable.
  • warning — measurable regression or concrete risk.
  • suggestion — an improvement worth considering.

"This ensures we are dealing with structured data that drives downstream behavior, rather than parsing advisory text."
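A minimal sketch of consuming that contract, assuming a `<findings>` root with `<finding severity="...">` children; the element and attribute names are illustrative, since the article does not publish the exact schema.

```python
import xml.etree.ElementTree as ET

SEVERITIES = {"critical", "warning", "suggestion"}

def parse_findings(xml_text: str) -> list[dict]:
    """Parse severity-tagged findings into structured records.

    Rejecting unknown severities keeps the enum closed, so downstream
    rubric logic never has to interpret free-form advisory text.
    """
    root = ET.fromstring(xml_text)
    findings = []
    for node in root.iter("finding"):
        sev = node.get("severity", "")
        if sev not in SEVERITIES:
            raise ValueError(f"unknown severity: {sev!r}")
        findings.append({"severity": sev, "text": (node.text or "").strip()})
    return findings

sample = """<findings>
  <finding severity="critical">SQL built by string concatenation in handler.py</finding>
  <finding severity="suggestion">Consider a doc-string for parse_config</finding>
</findings>"""
```

Validation at the parse boundary is what makes the quote above operational: a reviewer that emits a malformed severity fails loudly instead of silently biasing the coordinator's rubric.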

Downstream rubric (at the coordinator's judge pass):

| Condition | Decision | GitLab action |
| --- | --- | --- |
| All LGTM or trivial suggestions only | approved | POST /approve |
| Suggestion-only items | approved_with_comments | POST /approve |
| Some warnings, no production risk | approved_with_comments | POST /approve |
| Multiple warnings suggesting a risk pattern | minor_issues | POST /unapprove |
| Any critical item | significant_concerns | /submit_review requested_changes (blocks merge) |
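The rubric can be sketched as a severity-count decision function. Two things here are assumptions: the "risk pattern" threshold (taken as three or more warnings) and the collapsing of the trivial-vs-substantive suggestion distinction into "any suggestion means comments".

```python
def judge(severities: list[str]) -> tuple[str, str]:
    """Map aggregated reviewer severities to a decision and GitLab action.

    Rules are checked strictest-first, so one critical finding
    dominates any number of warnings or suggestions.
    """
    crit = severities.count("critical")
    warn = severities.count("warning")
    sugg = severities.count("suggestion")
    if crit:
        return "significant_concerns", "/submit_review requested_changes"
    if warn >= 3:  # assumed threshold for a "risk pattern"
        return "minor_issues", "POST /unapprove"
    if warn or sugg:
        return "approved_with_comments", "POST /approve"
    return "approved", "POST /approve"
```

Note the asymmetry the table encodes: warnings and suggestions still approve, only a critical finding blocks the merge, and the unapprove path exists for the in-between case where no single finding blocks but the pattern warrants human eyes.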

Model-tier heterogeneity inside the pattern

Not every specialist needs the same model tier. Cloudflare's assignment:

  • Top tier (Opus 4.7, GPT-5.4) — Coordinator only.
  • Standard tier (Sonnet 4.6, GPT-5.3 Codex) — Code Quality, Security, Performance (the heavy-lifting reviewers).
  • Kimi K2.5 on Workers AI — Documentation, Release, AGENTS.md (text-heavy, lower reasoning demand).

Rationale: "Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers." — same structural-vs-reasoning split logic as sources/2026-04-17-cloudflare-agents-that-remember-introducing-agent-memory's Llama 4 Scout + Nemotron 3 assignment.
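The tier assignment above reduces to a routing table. Model identifiers are copied from the article; the tier labels and the standard-tier default for unknown agents are assumptions of this sketch.

```python
# Tier routing per the assignment above; identifiers from the article,
# tier labels and the fallback default are this sketch's assumptions.
MODEL_TIER = {
    "coordinator": "top",          # Opus 4.7 / GPT-5.4
    "code_quality": "standard",    # Sonnet 4.6 / GPT-5.3 Codex
    "security": "standard",
    "performance": "standard",
    "codex": "standard",
    "documentation": "workers_ai", # Kimi K2.5 on Workers AI
    "release": "workers_ai",
    "agents_md": "workers_ai",
}

def model_for(agent: str) -> str:
    """Resolve an agent to its model tier; default to standard."""
    return MODEL_TIER.get(agent, "standard")
```

Keeping the routing in one table makes right-sizing a config change rather than a code change, which matters given the model-tier-sprawl tradeoff noted below.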

Production observations (first 30 days)

| Reviewer | Total findings | Critical % |
| --- | --- | --- |
| Code Quality | 74,898 | 8.6% |
| Documentation | 26,432 | 0.6% |
| Performance | 14,615 | 0.4% |
| Security | 11,985 | 4.0% |
| Codex (compliance) | 9,654 | 2.3% |
| AGENTS.md | 6,878 | 0.3% |
| Release | 745 | 2.6% |

Security has the highest critical rate but not the highest volume, which is exactly the calibration that specialisation enables. Code Quality dominates volume, and its criticals are still meaningful in absolute terms (6,460) because the domain surface is wide.

Tradeoffs

  • Prompt maintenance cost. N prompts to keep current as the codebase evolves.
  • Consistency risk. Overlapping findings between reviewers need the coordinator's judge pass to dedup.
  • Model-tier sprawl. Right-sizing per reviewer is a calibration exercise; defaults must exist.
  • Domain boundaries drift. A perf bug might surface in code-quality's output; coordinator re-categorises.
