Skip to content

SYSTEM Cited by 2 sources

Cloudflare AI Code Review

Cloudflare AI Code Review is Cloudflare's internal CI-native AI code-review orchestration system, shipped as a GitLab CI component ($CI_SERVER_FQDN/ci/ai/opencode@~latest). Every merge request triggers a OpenCode coordinator agent that spawns up to seven specialised sub-reviewers (security, performance, code quality, documentation, release, AGENTS.md, engineering-codex compliance) through a plugin-composition architecture. The coordinator performs a judge pass (dedup / re-categorise / drop false positives / read source to verify) and posts a single structured review comment to GitLab, with an overall verdict that drives approve / approved_with_comments / unapprove / requested_changes actions via the MCP comment server.

Announced 2026-04-20 after "about a month" of internal use. Part of Cloudflare's Code Orange: Fail Small engineering-resiliency programme.

Architectural shape

  • Coordinator process spawned as Bun.spawn child with prompt piped via stdin (not argv, to avoid ARG_MAX / E2BIG on large MRs). Runs --format jsonJSONL events on stdout, buffered 100 lines / 50 ms before flush.
  • Sub-reviewers launched via the coordinator's spawn_reviewers tool → OpenCode SDK session.create + session.promptAsync. Each runs in its own session with its own agent prompt; free to read source, grep, search the codebase; returns structured XML findings.
  • Plugin composition — each plugin implements ReviewPlugin with bootstrap (concurrent, non-fatal), configure (sequential, fatal), postConfigure (async). Contribute to the build via ConfigureContext rather than mutating the final config — the core assembler merges into opencode.json.
  • No cross-plugin coupling. "The GitLab plugin doesn't read Cloudflare AI Gateway configurations, and the Cloudflare plugin doesn't know anything about GitLab API tokens. All VCS-specific coupling is isolated in a single ci-config.ts file."

See patterns/coordinator-sub-reviewer-orchestration for the full shape and patterns/specialized-reviewer-agents for the domain decomposition.

Plugin roster

Plugin Responsibility
@opencode-reviewer/gitlab GitLab VCS provider, MR data, MCP comment server
@opencode-reviewer/cloudflare AI Gateway config, model tiers, failback chains
@opencode-reviewer/codex Internal compliance vs. engineering RFCs
@opencode-reviewer/braintrust Distributed tracing + observability
@opencode-reviewer/agents-md AGENTS.md staleness / anti-pattern checks
@opencode-reviewer/reviewer-config Remote per-reviewer model overrides via KV Worker
@opencode-reviewer/telemetry Fire-and-forget review tracking
@opencode-reviewer/local /fullreview TUI command for local runs

Risk tiering

Every MR is classified before any model runs — see patterns/ai-review-risk-tiering. A post-filter pipeline strips noise files (lock files, minified assets, .map, @generated headers) — database migrations explicitly exempted.

Tier Criteria Agents Notes
Trivial ≤10 lines, ≤20 files Coordinator + 1 generalised reviewer Coordinator downgraded Opus → Sonnet
Lite ≤100 lines, ≤20 files Coordinator + code quality + documentation + 1 more
Full >100 lines OR >50 files OR security-sensitive paths Coordinator + 7+ specialists Anything touching auth/ / crypto/ always full

Spend distribution (first 30 days):

Tier Reviews Avg cost
Trivial 24,529 $0.20
Lite 27,558 $0.67
Full 78,611 $1.68

Model tiering

Model choice is not monotonic with parameter count — each reviewer gets the model matched to its reasoning demands. All assignments are overridable at runtime via the reviewer-config KV Worker (flip-switch-in-KV → re-route in 5 seconds).

Tier Models Role
Top Claude Opus 4.7, GPT-5.4 Review Coordinator only
Standard Claude Sonnet 4.6, GPT-5.3 Codex Code Quality, Security, Performance
Kimi K2.5 on Workers AI Documentation, Release, AGENTS.md

Resilience

Hystrix-style circuit breaker per model tier with per-family failback chains:

DEFAULT_FAILBACK_CHAIN = {
  "opus-4-7":   "opus-4-6",
  "opus-4-6":   null,
  "sonnet-4-6": "sonnet-4-5",
  "sonnet-4-5": null,
}
  • Opens on failures → 2-minute cooldown → one probe in HALF_OPEN → CLOSED or back to OPEN.
  • Same-family only (never crosses Anthropic ↔ OpenAI).
  • Error classifier decides shouldFailback: retryable API errors → true; auth / context-overflow / abort / structured-output errors → false.

Coordinator-level failback is distinct: orchestrator scans child-process stderr for "overloaded" / "503", rewrites opencode.json's review_coordinator.model on disk, restarts the process.

Three-level timeouts: per-task 5 min (10 for code quality), overall 25 min, retry-budget minimum 2 min. Inactivity (60 s with no output) → kill and mark error. Completion primarily via session.idle events; 3-second polling as fallback.

Prompt engineering

  • Agent-specific.md + REVIEWER_SHARED.md concatenated at runtime. Shared file carries mandatory rules applicable to every reviewer.
  • "What NOT to Flag" is load-bearing. See concepts/what-not-to-flag-prompt — telling the model what to skip is where prompt value accrues.
  • Structured XML output with severity classification: critical / warning / suggestion. Downstream actions are keyed off severity, not advisory text.
  • Prompt-boundary-tag sanitization. Protected tags (mr_body, mr_details, changed_files, previous_review, custom_review_instructions, etc.) stripped from user-controlled content — see concepts/prompt-boundary-sanitization.
  • Shared-context file + per-file patches. Sub-reviewers read shared-mr-context.txt and per-file diff_directory files — not duplicated full context. See concepts/shared-context-fan-out. 85.7% prompt-cache hit rate in production validates the approach.

AGENTS.md reviewer

Dedicated specialised reviewer that scores every MR against the repo's AGENTS.md staleness risk. High/medium/low materiality tiers drive severity of the warning. Also penalises anti-patterns in existing AGENTS.md files: generic filler ("write clean code"), >200-line bloat, tool names without runnable commands.

Operational primitives

  • AI thinking heartbeat"Model is thinking... (Ns since last output)" every 30 s on stdout. Prevents users from mistaking frontier-model deliberation for a hung job.
  • Break glassbreak glass comment on an MR forces approval regardless of AI verdict. Tracked in telemetry; invoked on 288 MRs (0.6%) in first 30 days, doubles as a provider-outage / latent-bug signal.
  • Incremental re-review — coordinator receives last review comment + prior DiffNotes + resolution status; strict rules map fixed/unfixed/user-resolved/user-replied to re-emit behaviours. Avg 2.7 reviews per MR.

Internal deployment

  • GitLab CI componentinclude: - component: $CI_SERVER_FQDN/ci/ai/opencode@~latest. Component handles Docker pull, Vault secrets, review execution, comment posting.
  • Per-repo AGENTS.md drops local review instructions. Teams can point at an AGENTS.md template URL that gets injected into all agent prompts (org-wide convention propagation without per-repo duplication).
  • Local mode: @opencode-reviewer/local plugin provides /fullreview inside the OpenCode TUI — same agents + prompts + risk assessment, runs on working-tree diff, posts inline.

Control plane

  • reviewer-config Cloudflare Worker + KV returns per-reviewer model assignments + providers block. Per-provider enabled flag filters models pre-selection. Carries failback-chain overrides. Flip-switch re-routes every running CI job within 5 seconds.
  • TrackerClient — fire-and-forget to a separate Cloudflare Worker; 2-second AbortSignal.timeout; prunes pending if >50 queued. Prometheus metrics batched on next microtask, flushed pre-exit via Workers Logging.

Production scale (first 30 days, 2026-03-10 → 2026-04-09)

  • 131,246 review runs across 48,095 MRs in 5,169 repos
  • Avg 2.7 reviews per MR; median 3m 39s, P99 10m 21s
  • Median cost $0.98, P99 $4.45
  • 159,103 findings total, ~1.2 per review (deliberately low)
  • ~120 B tokens processed; 85.7% prompt-cache hit rate
  • 288 break-glass invocations (0.6%)
  • 45+ upstream OpenCode PRs contributed back

See sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale for full percentile breakdowns by tier, reviewer, and model family.

Codex enforcement tier (2026-05-01 update)

The 2026-05-01 Code Orange: Fail Small is complete post canonicalises AI Code Review as the enforcement substrate for the internal Codex — Cloudflare's living repository of engineering standards. Rules authored via the RFC process in the format "If you need X, use Y" (with a link back to the RFC for rationale) are checked on every MR across the entire codebase:

Its rules are enforced via AI code reviews that automatically highlight any instance that might diverge from the guidelines, requiring additional manual reviews be performed. This is applied without exception to our entire codebase.

The "engineering-codex" reviewer on the plugin roster is the plug-in point. Two rules were named in 2026-05-01: .unwrap() discipline and "services MUST validate that upstream dependencies are in an expected state before processing." Both trace to specific outage classes (2025-11-18 and 2025-12-05 respectively). The enforcement is framed as shift-left"from 'global outage' to 'rejected merge request.'"

See patterns/codex-enforced-via-ai-code-review for the reusable pattern and systems/cloudflare-codex for the canonical rule-source artefact.

Caveats named in the post

  • No architectural awareness — reviewers see the diff + surrounding code, not why the system was designed that way.
  • No cross-system impact tracking — contract change flagged, but downstream consumers not verified.
  • Subtle concurrency bugs hard to catch from static diffs — reviewer can spot missing locks, not deadlock paths.
  • Cost scales with diff size; coordinator warns when prompt >50% of estimated context window.
  • "Not a replacement for human code review, at least not yet with today's models."
Last updated · 542 distilled / 1,571 read