
Cloudflare AI Code Review

Cloudflare AI Code Review is Cloudflare's internal CI-native AI code-review orchestration system, shipped as a GitLab CI component ($CI_SERVER_FQDN/ci/ai/opencode@~latest). Every merge request triggers an OpenCode coordinator agent that spawns up to seven specialised sub-reviewers (security, performance, code quality, documentation, release, AGENTS.md, engineering-codex compliance) through a plugin-composition architecture. The coordinator performs a judge pass (dedup / re-categorise / drop false positives / read source to verify) and posts a single structured review comment to GitLab, with an overall verdict that drives approve / approved_with_comments / unapprove / requested_changes actions via the MCP comment server.

Announced 2026-04-20 after "about a month" of internal use. Part of Cloudflare's Code Orange: Fail Small engineering-resiliency programme.

Architectural shape

  • Coordinator process spawned as a Bun.spawn child with the prompt piped via stdin (not argv, to avoid ARG_MAX / E2BIG on large MRs). Runs with --format json, emitting JSONL events on stdout, buffered 100 lines / 50 ms before flush.
  • Sub-reviewers launched via the coordinator's spawn_reviewers tool → OpenCode SDK session.create + session.promptAsync. Each runs in its own session with its own agent prompt; free to read source, grep, search the codebase; returns structured XML findings.
  • Plugin composition — each plugin implements ReviewPlugin with bootstrap (concurrent, non-fatal), configure (sequential, fatal), postConfigure (async). Contribute to the build via ConfigureContext rather than mutating the final config — the core assembler merges into opencode.json.
  • No cross-plugin coupling. "The GitLab plugin doesn't read Cloudflare AI Gateway configurations, and the Cloudflare plugin doesn't know anything about GitLab API tokens. All VCS-specific coupling is isolated in a single ci-config.ts file."
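
The lifecycle above can be sketched in TypeScript. This is a minimal illustration, not the real implementation: the names ReviewPlugin, ConfigureContext, bootstrap, configure, and postConfigure come from the post, but the field shapes and the assembler details are assumptions.

```typescript
// Hedged sketch of the plugin-composition lifecycle. Field shapes are assumed.
interface ConfigureContext {
  agents: Record<string, { model?: string; prompt?: string }>;
  mcpServers: Record<string, { command: string[] }>;
}

interface ReviewPlugin {
  name: string;
  bootstrap?(): Promise<void>;            // concurrent, failures non-fatal
  configure(ctx: ConfigureContext): void; // sequential, failures fatal
  postConfigure?(ctx: ConfigureContext): Promise<void>; // async follow-up
}

async function assembleConfig(plugins: ReviewPlugin[]): Promise<ConfigureContext> {
  const ctx: ConfigureContext = { agents: {}, mcpServers: {} };

  // bootstrap: run concurrently; a failing plugin is dropped, not fatal
  const ready = await Promise.allSettled(plugins.map((p) => p.bootstrap?.()));
  const live = plugins.filter((_, i) => ready[i].status === "fulfilled");

  // configure: strictly sequential; any throw aborts the whole build
  for (const p of live) p.configure(ctx);

  // postConfigure: runs after every contribution is in place
  await Promise.all(live.map((p) => p.postConfigure?.(ctx)));
  return ctx; // the core assembler would merge this into opencode.json
}
```

Note that plugins contribute through the shared ConfigureContext rather than mutating a final config object, matching the "no cross-plugin coupling" rule.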

See patterns/coordinator-sub-reviewer-orchestration for the full shape and patterns/specialized-reviewer-agents for the domain decomposition.

Plugin roster

Plugin | Responsibility
@opencode-reviewer/gitlab | GitLab VCS provider, MR data, MCP comment server
@opencode-reviewer/cloudflare | AI Gateway config, model tiers, failback chains
@opencode-reviewer/codex | Internal compliance vs. engineering RFCs
@opencode-reviewer/braintrust | Distributed tracing + observability
@opencode-reviewer/agents-md | AGENTS.md staleness / anti-pattern checks
@opencode-reviewer/reviewer-config | Remote per-reviewer model overrides via KV Worker
@opencode-reviewer/telemetry | Fire-and-forget review tracking
@opencode-reviewer/local | /fullreview TUI command for local runs

Risk tiering

Every MR is classified before any model runs — see patterns/ai-review-risk-tiering. A post-filter pipeline strips noise files (lock files, minified assets, .map, @generated headers) — database migrations explicitly exempted.

Tier | Criteria | Agents | Notes
Trivial | ≤10 lines, ≤20 files | Coordinator + 1 generalised reviewer | Coordinator downgraded Opus → Sonnet
Lite | ≤100 lines, ≤20 files | Coordinator + code quality + documentation + 1 more |
Full | >100 lines OR >50 files OR security-sensitive paths | Coordinator + 7+ specialists | Anything touching auth/ or crypto/ always full
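
The noise filter and tier rules above can be sketched as a classifier. The thresholds, the migration exemption, and the auth/ / crypto/ escalation come from the post; the function names, the path regexes, and the exact noise-file list are illustrative assumptions.

```typescript
// Hedged sketch of the risk-tier classifier. Thresholds from the post;
// regexes and the noise-file list are assumptions.
type Tier = "trivial" | "lite" | "full";

const SENSITIVE_PATHS = [/^auth\//, /^crypto\//]; // always escalate to full
const NOISE = /(^|\/)(package-lock\.json|yarn\.lock)$|\.min\.(js|css)$|\.map$/;

function isNoise(path: string): boolean {
  if (/(^|\/)migrations?\//.test(path)) return false; // migrations exempted
  return NOISE.test(path);
}

function classifyTier(changedLines: number, changedFiles: string[]): Tier {
  const files = changedFiles.filter((f) => !isNoise(f)); // post-filter noise
  if (files.some((f) => SENSITIVE_PATHS.some((re) => re.test(f)))) return "full";
  if (changedLines > 100 || files.length > 50) return "full";
  if (changedLines <= 10 && files.length <= 20) return "trivial";
  return "lite";
}
```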

Spend distribution (first 30 days):

Tier | Reviews | Avg cost
Trivial | 24,529 | $0.20
Lite | 27,558 | $0.67
Full | 78,611 | $1.68

Model tiering

Model choice is not monotonic with parameter count — each reviewer gets the model matched to its reasoning demands. All assignments are overridable at runtime via the reviewer-config KV Worker (flip-switch-in-KV → re-route in 5 seconds).

Tier | Models | Role
Top | Claude Opus 4.7, GPT-5.4 | Review Coordinator only
Standard | Claude Sonnet 4.6, GPT-5.3 Codex | Code Quality, Security, Performance
 | Kimi K2.5 on Workers AI | Documentation, Release, AGENTS.md
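
The override mechanism amounts to merging remote assignments over a base map. A minimal sketch, with the caveat that the record shapes are assumed and the overrides would really be fetched from the reviewer-config KV Worker rather than passed in:

```typescript
// Hedged sketch of runtime model-assignment resolution. Base assignments
// echo the tier table; the record shapes are assumptions.
const BASE_MODELS: Record<string, string> = {
  review_coordinator: "opus-4-7",
  code_quality: "sonnet-4-6",
  security: "sonnet-4-6",
  documentation: "kimi-k2-5",
};

// In production the overrides come from the KV-backed reviewer-config
// Worker; taking them as a parameter keeps the merge logic visible.
function resolveModels(
  overrides: Partial<Record<string, string>>,
): Record<string, string> {
  return { ...BASE_MODELS, ...overrides };
}
```

Because every running CI job re-reads this assignment, flipping a key in KV re-routes a reviewer without redeploying anything.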

Resilience

Hystrix-style circuit breaker per model tier with per-family failback chains:

DEFAULT_FAILBACK_CHAIN = {
  "opus-4-7":   "opus-4-6",
  "opus-4-6":   null,
  "sonnet-4-6": "sonnet-4-5",
  "sonnet-4-5": null,
}
  • Opens on failures → 2-minute cooldown → one probe in HALF_OPEN → CLOSED or back to OPEN.
  • Same-family only (never crosses Anthropic ↔ OpenAI).
  • Error classifier decides shouldFailback: retryable API errors → true; auth / context-overflow / abort / structured-output errors → false.
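
A compact sketch of the breaker state machine and the chain lookup follows. The CLOSED → OPEN → HALF_OPEN flow, the 2-minute cooldown, and the chain entries are from the post; the failure threshold and the injectable clock are assumptions added for testability.

```typescript
// Hedged sketch of the per-tier circuit breaker. Threshold and clock
// injection are assumptions; states and cooldown are from the post.
const DEFAULT_FAILBACK_CHAIN: Record<string, string | null> = {
  "opus-4-7": "opus-4-6",
  "opus-4-6": null,
  "sonnet-4-6": "sonnet-4-5",
  "sonnet-4-5": null,
};

type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class ModelBreaker {
  private state: State = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 3,              // assumed failure count
    private cooldownMs = 2 * 60 * 1000, // 2-minute cooldown
    private now: () => number = Date.now,
  ) {}

  currentState(): State {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN"; // cooldown elapsed: allow one probe
    }
    return this.state;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  recordFailure(): void {
    // a failed HALF_OPEN probe reopens immediately
    if (this.state === "HALF_OPEN" || ++this.failures >= this.threshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}

// Same-family only: the chain never crosses Anthropic ↔ OpenAI.
function failbackFor(model: string): string | null {
  return DEFAULT_FAILBACK_CHAIN[model] ?? null;
}
```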

Coordinator-level failback is distinct: orchestrator scans child-process stderr for "overloaded" / "503", rewrites opencode.json's review_coordinator.model on disk, restarts the process.

Three-level timeouts: per-task 5 min (10 for code quality), overall 25 min, retry-budget minimum 2 min. Inactivity (60 s with no output) → kill and mark error. Completion primarily via session.idle events; 3-second polling as fallback.

Prompt engineering

  • Agent-specific.md + REVIEWER_SHARED.md concatenated at runtime. Shared file carries mandatory rules applicable to every reviewer.
  • "What NOT to Flag" is load-bearing. See concepts/what-not-to-flag-prompt — telling the model what to skip is where prompt value accrues.
  • Structured XML output with severity classification: critical / warning / suggestion. Downstream actions are keyed off severity, not advisory text.
  • Prompt-boundary-tag sanitization. Protected tags (mr_body, mr_details, changed_files, previous_review, custom_review_instructions, etc.) stripped from user-controlled content — see concepts/prompt-boundary-sanitization.
  • Shared-context file + per-file patches. Sub-reviewers read shared-mr-context.txt and per-file diff_directory files — not duplicated full context. See concepts/shared-context-fan-out. 85.7% prompt-cache hit rate in production validates the approach.
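
The boundary-tag sanitization in the list above can be sketched as a single pass that strips any user-supplied attempt to open or close a protected tag. The tag names are from the post; the regex strategy is an assumption.

```typescript
// Hedged sketch of prompt-boundary-tag sanitization. Tag list from the
// post (non-exhaustive); the stripping strategy is assumed.
const PROTECTED_TAGS = [
  "mr_body",
  "mr_details",
  "changed_files",
  "previous_review",
  "custom_review_instructions",
];

function sanitizeUserContent(text: string): string {
  // remove both opening and closing forms of any protected boundary tag
  const pattern = new RegExp(`</?(?:${PROTECTED_TAGS.join("|")})\\s*>`, "gi");
  return text.replace(pattern, "");
}
```

This keeps user-controlled content (MR descriptions, comments) from forging the structural boundaries the coordinator prompt relies on.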

AGENTS.md reviewer

Dedicated specialised reviewer that scores every MR for staleness risk against the repo's AGENTS.md. High/medium/low materiality tiers drive the severity of the warning. It also penalises anti-patterns in existing AGENTS.md files: generic filler ("write clean code"), >200-line bloat, and tool names without runnable commands.
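
The three anti-pattern checks can be sketched as a small linter. The checks themselves are from the post; the filler-phrase list, the tool-name list, and the "a line starting with `$` counts as a runnable command" heuristic are all illustrative assumptions.

```typescript
// Hedged sketch of the AGENTS.md anti-pattern checks. Heuristics assumed.
interface AgentsMdFinding { rule: string; detail: string }

const FILLER_PHRASES = [/write clean code/i, /follow best practices/i]; // assumed
const TOOL_NAMES = /\b(eslint|prettier|pytest|cargo)\b/i;               // assumed
const COMMAND_LINE = /^\s*\$ /m; // a "$ ..." line counts as runnable

function lintAgentsMd(content: string): AgentsMdFinding[] {
  const findings: AgentsMdFinding[] = [];
  const lineCount = content.split("\n").length;

  if (lineCount > 200) {
    findings.push({ rule: "bloat", detail: `${lineCount} lines (limit 200)` });
  }
  for (const re of FILLER_PHRASES) {
    if (re.test(content)) findings.push({ rule: "generic-filler", detail: re.source });
  }
  if (TOOL_NAMES.test(content) && !COMMAND_LINE.test(content)) {
    findings.push({ rule: "no-runnable-command", detail: "tool named without a runnable command" });
  }
  return findings;
}
```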

Operational primitives

  • AI thinking heartbeat: "Model is thinking... (Ns since last output)" printed every 30 s on stdout. Prevents users from mistaking frontier-model deliberation for a hung job.
  • Break glass: a "break glass" comment on an MR forces approval regardless of the AI verdict. Tracked in telemetry; invoked on 288 MRs (0.6%) in the first 30 days, and doubles as a provider-outage / latent-bug signal.
  • Incremental re-review — coordinator receives last review comment + prior DiffNotes + resolution status; strict rules map fixed/unfixed/user-resolved/user-replied to re-emit behaviours. Avg 2.7 reviews per MR.
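
The heartbeat primitive is simple enough to sketch whole. The message wording and 30-second cadence are from the post; the timer plumbing and callback shape are assumptions.

```typescript
// Hedged sketch of the thinking heartbeat. Plumbing assumed.
function startHeartbeat(
  intervalMs = 30_000,
  log: (msg: string) => void = console.log,
): { touch: () => void; stop: () => void } {
  let lastOutput = Date.now();
  const timer = setInterval(() => {
    const silent = Math.round((Date.now() - lastOutput) / 1000);
    log(`Model is thinking... (${silent}s since last output)`);
  }, intervalMs);
  return {
    touch: () => { lastOutput = Date.now(); }, // call on every stdout line
    stop: () => clearInterval(timer),
  };
}
```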

Internal deployment

  • GitLab CI component: include: - component: $CI_SERVER_FQDN/ci/ai/opencode@~latest. The component handles Docker pull, Vault secrets, review execution, and comment posting.
  • Per-repo AGENTS.md supplies local review instructions. Teams can point at an AGENTS.md template URL that gets injected into all agent prompts (org-wide convention propagation without per-repo duplication).
  • Local mode: @opencode-reviewer/local plugin provides /fullreview inside the OpenCode TUI — same agents + prompts + risk assessment, runs on working-tree diff, posts inline.

Control plane

  • reviewer-config Cloudflare Worker + KV returns per-reviewer model assignments + providers block. Per-provider enabled flag filters models pre-selection. Carries failback-chain overrides. Flip-switch re-routes every running CI job within 5 seconds.
  • TrackerClient — fire-and-forget to a separate Cloudflare Worker; 2-second AbortSignal.timeout; prunes pending if >50 queued. Prometheus metrics batched on next microtask, flushed pre-exit via Workers Logging.
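
The TrackerClient behaviour can be sketched from its description: fire-and-forget posts, a 2-second AbortSignal.timeout, and load-shedding past 50 pending requests. The endpoint, payload shape, and injectable fetch are assumptions added for illustration and testing.

```typescript
// Hedged sketch of the fire-and-forget TrackerClient. Shapes assumed.
type FetchLike = (
  url: string,
  init: { method: string; body: string; signal: AbortSignal },
) => Promise<unknown>;

class TrackerClient {
  private pending = new Set<Promise<void>>();

  constructor(
    private endpoint: string,
    private doFetch: FetchLike = fetch as unknown as FetchLike,
  ) {}

  track(event: Record<string, unknown>): void {
    if (this.pending.size > 50) return; // shed load instead of queueing

    const p = this.doFetch(this.endpoint, {
      method: "POST",
      body: JSON.stringify(event),
      signal: AbortSignal.timeout(2_000), // give up after 2 s
    })
      .then(() => undefined)
      .catch(() => undefined)             // fire-and-forget: never throw
      .finally(() => this.pending.delete(p));
    this.pending.add(p);
  }

  flush(): Promise<void> {
    return Promise.all(this.pending).then(() => undefined);
  }
}
```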

Production scale (first 30 days, 2026-03-10 → 2026-04-09)

  • 131,246 review runs across 48,095 MRs in 5,169 repos
  • Avg 2.7 reviews per MR; median 3m 39s, P99 10m 21s
  • Median cost $0.98, P99 $4.45
  • 159,103 findings total, ~1.2 per review (deliberately low)
  • ~120 B tokens processed; 85.7% prompt-cache hit rate
  • 288 break-glass invocations (0.6%)
  • 45+ upstream OpenCode PRs contributed back

See sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale for full percentile breakdowns by tier, reviewer, and model family.

Caveats named in the post

  • No architectural awareness — reviewers see the diff + surrounding code, not why the system was designed that way.
  • No cross-system impact tracking — contract change flagged, but downstream consumers not verified.
  • Subtle concurrency bugs hard to catch from static diffs — reviewer can spot missing locks, not deadlock paths.
  • Cost scales with diff size; coordinator warns when prompt >50% of estimated context window.
  • "Not a replacement for human code review, at least not yet with today's models."