Cloudflare AI Code Review¶
Cloudflare AI Code Review is Cloudflare's internal CI-native AI code-review orchestration system, shipped as a GitLab CI component (`$CI_SERVER_FQDN/ci/ai/opencode@~latest`). Every merge request triggers an OpenCode coordinator agent that spawns up to seven specialised sub-reviewers (security, performance, code quality, documentation, release, AGENTS.md, engineering-codex compliance) through a plugin-composition architecture. The coordinator performs a judge pass (dedup, re-categorise, drop false positives, read source to verify) and posts a single structured review comment to GitLab, with an overall verdict that drives `approve` / `approved_with_comments` / `unapprove` / `requested_changes` actions via the MCP comment server.
Announced 2026-04-20 after "about a month" of internal use. Part of Cloudflare's Code Orange: Fail Small engineering-resiliency programme.
Architectural shape¶
- Coordinator process spawned as a `Bun.spawn` child with the prompt piped via stdin (not argv, to avoid `ARG_MAX`/`E2BIG` on large MRs). Runs with `--format json` → JSONL events on stdout, buffered 100 lines / 50 ms before flush.
- Sub-reviewers launched via the coordinator's `spawn_reviewers` tool → OpenCode SDK `session.create` + `session.promptAsync`. Each runs in its own session with its own agent prompt; free to read source, grep, and search the codebase; returns structured XML findings.
- Plugin composition — each plugin implements `ReviewPlugin` with `bootstrap` (concurrent, non-fatal), `configure` (sequential, fatal), and `postConfigure` (async). Plugins contribute to the build via `ConfigureContext` rather than mutating the final config — the core assembler merges contributions into `opencode.json`.
- No cross-plugin coupling. "The GitLab plugin doesn't read Cloudflare AI Gateway configurations, and the Cloudflare plugin doesn't know anything about GitLab API tokens. All VCS-specific coupling is isolated in a single `ci-config.ts` file."
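The plugin lifecycle above can be sketched as follows. This is a minimal illustration under stated assumptions: the post names `ReviewPlugin`, `bootstrap`, `configure`, `postConfigure`, and `ConfigureContext`, but not their signatures, so every type shape here is hypothetical.

```typescript
// Hypothetical shapes — the post names these identifiers but not their signatures.
interface ConfigureContext {
  contributions: Record<string, unknown>[];
  // Plugins contribute config fragments instead of mutating the final config.
  contribute(fragment: Record<string, unknown>): void;
}

interface ReviewPlugin {
  name: string;
  bootstrap?(): Promise<void>;            // concurrent, failures tolerated
  configure(ctx: ConfigureContext): void; // sequential, a throw is fatal
  postConfigure?(): Promise<void>;        // async follow-up work
}

async function assembleConfig(plugins: ReviewPlugin[]): Promise<Record<string, unknown>> {
  // Bootstrap phase: run concurrently; allSettled makes failures non-fatal.
  await Promise.allSettled(plugins.map((p) => p.bootstrap?.()));

  const ctx: ConfigureContext = {
    contributions: [],
    contribute(fragment) {
      this.contributions.push(fragment);
    },
  };
  // Configure phase: strictly sequential; any throw aborts the run.
  for (const p of plugins) p.configure(ctx);

  await Promise.all(plugins.map((p) => p.postConfigure?.()));

  // The core assembler merges all fragments into the final opencode.json shape.
  return Object.assign({}, ...ctx.contributions);
}
```

Because plugins only hand fragments to the assembler, no plugin ever sees another plugin's configuration — which is how the "no cross-plugin coupling" property falls out of the architecture rather than relying on discipline.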
See patterns/coordinator-sub-reviewer-orchestration for the full shape and patterns/specialized-reviewer-agents for the domain decomposition.
Plugin roster¶
| Plugin | Responsibility |
|---|---|
| `@opencode-reviewer/gitlab` | GitLab VCS provider, MR data, MCP comment server |
| `@opencode-reviewer/cloudflare` | AI Gateway config, model tiers, failback chains |
| `@opencode-reviewer/codex` | Internal compliance vs. engineering RFCs |
| `@opencode-reviewer/braintrust` | Distributed tracing + observability |
| `@opencode-reviewer/agents-md` | AGENTS.md staleness / anti-pattern checks |
| `@opencode-reviewer/reviewer-config` | Remote per-reviewer model overrides via KV Worker |
| `@opencode-reviewer/telemetry` | Fire-and-forget review tracking |
| `@opencode-reviewer/local` | `/fullreview` TUI command for local runs |
Risk tiering¶
Every MR is classified before any model runs — see patterns/ai-review-risk-tiering. A post-filter pipeline strips noise files (lock files, minified assets, `.map` files, files with `@generated` headers); database migrations are explicitly exempted from filtering.
| Tier | Criteria | Agents | Notes |
|---|---|---|---|
| Trivial | ≤10 lines, ≤20 files | Coordinator + 1 generalised reviewer | Coordinator downgraded Opus → Sonnet |
| Lite | ≤100 lines, ≤20 files | Coordinator + code quality + documentation + 1 more | |
| Full | >100 lines OR >50 files OR security-sensitive paths | Coordinator + 7+ specialists | Anything touching `auth/` or `crypto/` is always full |
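The tier gate above can be sketched as a pure function. The line/file thresholds and the `auth/` / `crypto/` override come from the table; the path-matching details and the treatment of the 21–50 file gap (resolved here to Lite) are assumptions, not confirmed by the post.

```typescript
type Tier = "trivial" | "lite" | "full";

// Assumed path patterns; the post only names auth/ and crypto/ as always-full.
const SECURITY_SENSITIVE = [/^auth\//, /^crypto\//];

function classifyMr(changedLines: number, changedFiles: string[]): Tier {
  // Security-sensitive paths force a full review regardless of size.
  if (changedFiles.some((f) => SECURITY_SENSITIVE.some((re) => re.test(f)))) {
    return "full";
  }
  // Thresholds from the tier table.
  if (changedLines > 100 || changedFiles.length > 50) return "full";
  if (changedLines <= 10 && changedFiles.length <= 20) return "trivial";
  // The table leaves 21–50 files with ≤100 lines ambiguous; treated as Lite here.
  return "lite";
}
```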
Spend distribution (first 30 days):
| Tier | Reviews | Avg cost |
|---|---|---|
| Trivial | 24,529 | $0.20 |
| Lite | 27,558 | $0.67 |
| Full | 78,611 | $1.68 |
Model tiering¶
Model choice is not monotonic with parameter count — each reviewer gets the model matched to its reasoning demands. All assignments are overridable at runtime via the reviewer-config KV Worker (flip-switch-in-KV → re-route in 5 seconds).
| Tier | Models | Role |
|---|---|---|
| Top | Claude Opus 4.7, GPT-5.4 | Review Coordinator only |
| Standard | Claude Sonnet 4.6, GPT-5.3 Codex | Code Quality, Security, Performance |
| | Kimi K2.5 on Workers AI | Documentation, Release, AGENTS.md |
Resilience¶
Hystrix-style circuit breaker per model tier with per-family failback chains:
```
DEFAULT_FAILBACK_CHAIN = {
    "opus-4-7": "opus-4-6",
    "opus-4-6": null,
    "sonnet-4-6": "sonnet-4-5",
    "sonnet-4-5": null,
}
```
- Opens on failures → 2-minute cooldown → one probe in HALF_OPEN → CLOSED or back to OPEN.
- Same-family only (never crosses Anthropic ↔ OpenAI).
- Error classifier decides `shouldFailback`: retryable API errors → `true`; auth / context-overflow / abort / structured-output errors → `false`.
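A minimal sketch of the per-tier breaker and same-family failback walk. The chain contents, the 2-minute cooldown, and the CLOSED/OPEN/HALF_OPEN states come from the post; the class shape and the injectable clock are assumptions made for illustration.

```typescript
const DEFAULT_FAILBACK_CHAIN: Record<string, string | null> = {
  "opus-4-7": "opus-4-6",
  "opus-4-6": null,
  "sonnet-4-6": "sonnet-4-5",
  "sonnet-4-5": null,
};

type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class ModelBreaker {
  private state: BreakerState = "CLOSED";
  private openedAt = 0;

  constructor(
    private readonly cooldownMs = 2 * 60_000,       // 2-minute cooldown per the post
    private readonly now: () => number = Date.now,  // injectable clock (assumption)
  ) {}

  canAttempt(): boolean {
    if (this.state !== "OPEN") return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // allow exactly one probe request
      return true;
    }
    return false;
  }

  onSuccess(): void { this.state = "CLOSED"; }       // probe succeeded → close
  onFailure(): void {                                 // failure → (re)open
    this.state = "OPEN";
    this.openedAt = this.now();
  }
}

// Same-family only: the chain never crosses Anthropic ↔ OpenAI.
function nextModel(model: string): string | null {
  return DEFAULT_FAILBACK_CHAIN[model] ?? null;
}
```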
Coordinator-level failback is distinct: the orchestrator scans child-process stderr for "overloaded" / "503", rewrites `opencode.json`'s `review_coordinator.model` on disk, and restarts the process.
Three-level timeouts: per-task 5 min (10 for code quality), overall 25 min, retry-budget minimum 2 min. Inactivity (60 s with no output) → kill and mark error. Completion primarily via session.idle events; 3-second polling as fallback.
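The three timeout levels and the inactivity rule compose into a few small predicates. The numbers come from the post; the function names and wiring are illustrative assumptions.

```typescript
// Deadline values from the post; the shape is an assumption.
interface Deadlines {
  perTaskMs: number;        // 5 min default, 10 min for code quality
  overallMs: number;        // 25 min for the whole review
  retryBudgetMinMs: number; // a retry needs at least 2 min remaining
}

function remainingBudget(d: Deadlines, startedAt: number, now: number): number {
  return d.overallMs - (now - startedAt);
}

// Retries are only attempted while the remaining overall budget
// still exceeds the minimum retry budget.
function mayRetry(d: Deadlines, startedAt: number, now: number): boolean {
  return remainingBudget(d, startedAt, now) >= d.retryBudgetMinMs;
}

// 60 s with no child output → kill the task and mark it errored.
function isStalled(lastOutputAt: number, now: number, inactivityMs = 60_000): boolean {
  return now - lastOutputAt >= inactivityMs;
}
```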
Prompt engineering¶
- Agent-specific `.md` + `REVIEWER_SHARED.md` concatenated at runtime. The shared file carries mandatory rules applicable to every reviewer.
- "What NOT to Flag" is load-bearing. See concepts/what-not-to-flag-prompt — telling the model what to skip is where prompt value accrues.
- Structured XML output with severity classification: `critical`/`warning`/`suggestion`. Downstream actions are keyed off severity, not advisory text.
- Prompt-boundary-tag sanitization. Protected tags (`mr_body`, `mr_details`, `changed_files`, `previous_review`, `custom_review_instructions`, etc.) are stripped from user-controlled content — see concepts/prompt-boundary-sanitization.
- Shared-context file + per-file patches. Sub-reviewers read `shared-mr-context.txt` and per-file `diff_directory` files rather than duplicated full context. See concepts/shared-context-fan-out. An 85.7% prompt-cache hit rate in production validates the approach.
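The boundary-tag sanitization step can be sketched as a simple strip pass. The protected tag names come from the post; the regex-based approach and function name are assumptions (the real implementation may escape or re-encode rather than delete).

```typescript
// Tags the coordinator reserves for its own prompt structure; user-controlled
// text (MR descriptions, comments) must not be able to open or close them.
const PROTECTED_TAGS = [
  "mr_body",
  "mr_details",
  "changed_files",
  "previous_review",
  "custom_review_instructions",
];

// Illustrative sanitizer: deletes both opening and closing forms of each
// protected tag, case-insensitively, so user text cannot break prompt boundaries.
function sanitizeUserContent(userText: string): string {
  let out = userText;
  for (const tag of PROTECTED_TAGS) {
    out = out.replace(new RegExp(`</?${tag}\\s*>`, "gi"), "");
  }
  return out;
}
```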
AGENTS.md reviewer¶
A dedicated specialised reviewer scores every MR for staleness risk against the repo's AGENTS.md. High/medium/low materiality tiers drive the severity of the warning. It also penalises anti-patterns in existing AGENTS.md files: generic filler ("write clean code"), >200-line bloat, and tool names without runnable commands.
Operational primitives¶
- AI thinking heartbeat — `"Model is thinking... (Ns since last output)"` every 30 s on stdout. Prevents users from mistaking frontier-model deliberation for a hung job.
- Break glass — a `break glass` comment on an MR forces approval regardless of the AI verdict. Tracked in telemetry; invoked on 288 MRs (0.6%) in the first 30 days, it doubles as a provider-outage / latent-bug signal.
- Incremental re-review — the coordinator receives the last review comment + prior DiffNotes + resolution status; strict rules map fixed/unfixed/user-resolved/user-replied states to re-emit behaviours. Avg 2.7 reviews per MR.
Internal deployment¶
- GitLab CI component — `include: - component: $CI_SERVER_FQDN/ci/ai/opencode@~latest`. The component handles Docker pull, Vault secrets, review execution, and comment posting.
- Per-repo AGENTS.md is where teams drop local review instructions. Teams can also point at an AGENTS.md template URL that gets injected into all agent prompts (org-wide convention propagation without per-repo duplication).
- Local mode: the `@opencode-reviewer/local` plugin provides `/fullreview` inside the OpenCode TUI — same agents + prompts + risk assessment, runs on the working-tree diff, posts inline.
Control plane¶
- `reviewer-config` Cloudflare Worker + KV returns per-reviewer model assignments plus a providers block. A per-provider `enabled` flag filters models pre-selection. Also carries failback-chain overrides. The flip switch re-routes every running CI job within 5 seconds.
- `TrackerClient` — fire-and-forget to a separate Cloudflare Worker; 2-second `AbortSignal.timeout`; prunes pending requests if >50 are queued. Prometheus metrics are batched on the next microtask and flushed pre-exit via Workers Logging.
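The per-provider `enabled` filter can be sketched as a pure resolution step over the Worker's response. The response shape, field names, and fallback behaviour below are assumptions — the post only names the per-reviewer model map and the per-provider `enabled` flag.

```typescript
// Hypothetical shape of the reviewer-config Worker response.
interface ReviewerConfig {
  providers: Record<string, { enabled: boolean }>;
  reviewers: Record<string, { model: string; provider: string }>;
}

// Filter assignments whose provider is disabled before model selection,
// substituting a fallback model (fallback choice is an assumption here).
function resolveModels(cfg: ReviewerConfig, fallback: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, r] of Object.entries(cfg.reviewers)) {
    out[name] = cfg.providers[r.provider]?.enabled ? r.model : fallback;
  }
  return out;
}
```

Because the resolution is a pure function of the KV payload, flipping a provider's `enabled` flag changes the next resolution with no redeploy — consistent with the "re-route within 5 seconds" behaviour described above.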
Production scale (first 30 days, 2026-03-10 → 2026-04-09)¶
- 131,246 review runs across 48,095 MRs in 5,169 repos
- Avg 2.7 reviews per MR; median 3m 39s, P99 10m 21s
- Median cost $0.98, P99 $4.45
- 159,103 findings total, ~1.2 per review (deliberately low)
- ~120 B tokens processed; 85.7% prompt-cache hit rate
- 288 break-glass invocations (0.6%)
- 45+ upstream OpenCode PRs contributed back
See sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale for full percentile breakdowns by tier, reviewer, and model family.
Caveats named in the post¶
- No architectural awareness — reviewers see the diff + surrounding code, not why the system was designed that way.
- No cross-system impact tracking — contract change flagged, but downstream consumers not verified.
- Subtle concurrency bugs hard to catch from static diffs — reviewer can spot missing locks, not deadlock paths.
- Cost scales with diff size; coordinator warns when prompt >50% of estimated context window.
- "Not a replacement for human code review, at least not yet with today's models."
Related¶
- systems/opencode — the open-source coding agent this orchestration is built on.
- systems/cloudflare-ai-gateway — substrate for all LLM calls; failback chains configured here.
- systems/model-context-protocol — the comment-server-over-MCP interface between OpenCode and GitLab.
- patterns/coordinator-sub-reviewer-orchestration — canonical pattern instance.
- patterns/specialized-reviewer-agents — domain-per-reviewer decomposition.
- patterns/ai-review-risk-tiering — trivial / lite / full tier gating.
- patterns/central-proxy-choke-point — AI-Gateway-as-single-ingress posture AI Code Review inherits.
- sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — source post.
- companies/cloudflare