Code Orange: Fail Small is complete. The result is a stronger Cloudflare network¶
Summary¶
On 2026-05-01, Cloudflare announced the completion of Code Orange: Fail Small — the ~6-month ("two and a bit quarters") organisation-wide engineering-resiliency programme launched in response to the 2025-11-18 and 2025-12-05 global outages. The post is the follow-up structural RCA to both incidents: rather than re-describing the bugs, it catalogs the shipped remediation projects that would have prevented them and the enforcement mechanisms that will prevent their class from recurring.
The programme ships across five tracks: (1) safer configuration changes via a new internal system called Snapstone that brings health-mediated deployment to config changes by default; (2) reducing the impact of failure via systematic review of runtime dependencies + "fail stale" / fail open / fail closed per module + customer-cohort segmentation of core services like the Workers runtime; (3) revised break-glass and incident-management procedures with backup authorisation pathways for 18 key services + a dedicated communications team that drills alongside incident responders; (4) an internal engineering Codex — a living repository of engineering rules codified from RFCs and enforced via AI code review at merge time across the entire codebase; (5) tightened external communication via global changelog + maintenance coordination + predictable-interval customer updates during active incidents.
The post frames the programme as complete but not final: "improving resiliency will never be a 'job done'"; the explicit stance is that the completed work "would have avoided" the November 18 and December 5 outages — a direct evaluation of the remediation backlog against its originating incidents.
Key takeaways¶
- Snapstone is the new canonical configuration-deployment system. The post names an in-house component that "bundles configuration change into a package, and then allows gradual release of the configuration change with health mediation principles" — identical discipline to the Workers runtime's staged software deploys. "Before Snapstone, applying this methodology to config was possible but difficult. It required significant per-team effort and wasn't consistently applied across the network. Snapstone closes this gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default." Critical design property (flexibility): "teams create these configuration units on demand" — any config pattern identified as risky can be "brought into Snapstone" and inherits safe deployment. This canonicalises patterns/config-deployment-as-code-deployment ("apply the same methodology we use when releasing software, for all configuration deployments") and concepts/health-mediated-deployment as a first-class primitive. Directly addresses both 2025-11-18 (doubled feature file) and 2025-12-05 (internal WAF testing-tool disable flag) — both incidents were triggered by config changes that went fleet-wide in seconds via the global configuration system with no canary, no health gating, no automated rollback. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
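The contract the post describes — progressive release, real-time health monitoring, automated rollback by default — can be sketched as a small control loop. Everything below (stage names, callback shapes, the `deploy_config` helper) is an illustrative assumption; the post does not disclose Snapstone's internal architecture.

```python
# Hypothetical sketch of health-mediated configuration rollout, in the
# spirit of what the post describes. Not Cloudflare's implementation.

def deploy_config(config, stages, apply, healthy, rollback):
    """Roll `config` out stage by stage, gating each wave on health.

    apply(stage, config) -- push the configuration unit to one stage
    healthy(stage)       -- True if real-time health signals pass
    rollback(stage)      -- restore the last known good config there
    Returns the list of stages successfully updated ([] on rollback).
    """
    done = []
    for stage in stages:            # e.g. ["canary", "wave-1", "global"]
        apply(stage, config)
        if not healthy(stage):      # health mediation: stop the rollout
            for s in reversed(done + [stage]):
                rollback(s)         # automated rollback of every touched stage
            return []
        done.append(stage)
    return done
```

The point of the loop is the default: a config change can no longer go fleet-wide in seconds without passing a health gate at each wave.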
- "Fail stale" is named as the preferred failure-mode default over fail-open or fail-closed. The post introduces an explicit three-way ladder — "use the last known good configuration where possible ('fail stale'), and if that isn't possible we have reviewed each failure case and implemented 'fail open' or 'fail close' depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic." Fail-stale is a stronger promise than fail-open: keep serving with correct (if outdated) behaviour rather than serving without scoring. Worked example lifted directly from November 2025: "if data were generated again that our system could not read, the system would refuse to use the updated configuration and instead use the old configuration. If the old configuration was not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime." Canonicalised as concepts/fail-stale and extends concepts/fail-open-vs-fail-closed from binary to ternary. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
- Workers runtime is segmented into multiple independent service copies by customer cohort. "The Workers runtime system is segmented into multiple independent services handling different cohorts of traffic, with one handling only traffic for our free customers. Changes are deployed to these segments based on customer cohorts, starting with free customers first. We're also sending updates more quickly and frequently to the least critical segments, and at a slower pace to the most critical segments." The property this buys: "if a change were deployed to the Workers runtime system and it broke traffic, it would now only affect a small percentage of our free customers before being automatically detected and rolled back." Quantified operational datum — "in a seven-day period earlier this month, the deployment process was triggered more than 50 times" with deployments fanning out "in 'waves' as the change propagates to the edge, often in parallel to the following and prior releases." Canonicalises patterns/customer-cohort-segmented-service-instances and concepts/traffic-cohort-segmentation; the free-customers-first ordering makes the free cohort the canary, bounding the impact of a detected break to the least-critical customers. Roadmap: "we're working on extending this pattern of deployment to many more of our systems in the future." (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
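The least-critical-first wave ordering can be sketched in a few lines. Only the free-customer cohort is named in the post; the other cohort names and the function shape below are assumptions for illustration.

```python
# Least-critical-first cohort ordering. Only "free" is named in the
# post; the remaining cohorts are illustrative assumptions.
COHORT_ORDER = ["free", "pro", "business", "enterprise"]

def deploy_by_cohort(change, apply, healthy, rollback, cohorts=COHORT_ORDER):
    """Fan a runtime change out cohort by cohort, least critical first.

    A break detected in an early cohort is rolled back before the
    change ever reaches the more critical segments.
    """
    for cohort in cohorts:
        apply(cohort, change)
        if not healthy(cohort):
            rollback(cohort)
            return cohort          # blast radius bounded to this one cohort
    return None                    # change fully deployed
```

The design choice this encodes: segmentation turns every deploy into a canary run against the cohort whose breakage costs the least.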
- Break-glass pathways broadened from a handful of people to 18 key services, then drilled at scale. Cloudflare's 2025 state: "Cloudflare runs on Cloudflare. We use our own Zero Trust products to secure our infrastructure, but this creates a dependency: if a network-wide outage impacts these tools, we lose the very pathways we need to fix them." Canonicalises concepts/dependency-on-self as the structural hazard. Remediation: "a comprehensive audit of the tools essential for system visibility, debugging, and production changes. We ultimately developed backup authorization pathways for 18 key services, supported by new emergency scripts and proxies." Critically — drilled in practice: "After small-team exercises, we conducted an engineering-wide drill on April 7, 2026, involving more than 200 team members. While automation keeps these pathways functional, drills like these ensure our engineers have the muscle memory to use them under pressure." First wiki canonicalisation of concepts/drill-muscle-memory — the exercise is the test that the pathway actually works when the normal path is gone. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
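The shape of a backup authorisation pathway is simple to sketch: try the normal (self-hosted) path, and fall back to a pre-provisioned emergency credential when it is unreachable. Every name below is a hypothetical stand-in; the post describes "emergency scripts and proxies" without disclosing mechanics.

```python
# Hypothetical break-glass fallback for the "dependency on self" hazard.
# Not Cloudflare's actual tooling.

def authorize(service, primary_auth, break_glass_auth, audit_log):
    """Authorise an operator action, falling back to the emergency
    pathway when the normal Zero Trust path is down."""
    try:
        return primary_auth(service)            # normal self-hosted path
    except ConnectionError:
        credential = break_glass_auth(service)  # pre-provisioned backup pathway
        audit_log.append(f"break-glass used for {service}")  # always audited
        return credential
```

The drill discipline the post names is what keeps the `except` branch honest: automation proves the pathway exists, but only exercising it proves operators can use it under pressure.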
- Dedicated communications team drills alongside incident responders. Second-order problem named explicitly: "Historically, technical observations from the heat of the moment didn't always translate into clear updates for our customers." Remediation: "we established a dedicated communications team to work in lockstep with incident responders during major events. Just as our engineers practiced their 'break glass' procedures, this team used the Code Orange program to drill on streamlining the cadence and clarity of customer updates. By ensuring we have both the tools to see and the structure to speak, we can resolve incidents faster and keep our customers better informed." Customer-facing cadence made concrete: "during an active incident, we now provide updates at predictable intervals (e.g., every 30 or 60 minutes), even if the update is simply, 'We are still testing the fix; no new changes yet.' This allows you to plan your day rather than constantly refreshing a status page." (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
- The Codex codifies engineering rules and enforces them at merge time via AI code review. New internal artefact: "an internal Codex that solidifies all our guidelines in clear and concise rules. The Codex is now mandatory for all engineering and product teams, and has become a central part of Cloudflare internal procedures. Its rules are enforced via AI code reviews that automatically highlight any instance that might diverge from the guidelines, requiring additional manual reviews be performed. This is applied without exception to our entire codebase." The stated thesis: "Build institutional memory that enforces itself." Two concrete rule examples named: "Do not use .unwrap() outside of tests and build.rs" (would have prevented the 2025-11-18 FL2 panic, see concepts/unhandled-rust-panic); "Services MUST validate that upstream dependencies are in an expected state before processing" (would have caught the 2025-12-05 Lua nil-index, see concepts/nil-index-lua-bug). Framed as a flywheel: "expertise becomes standards, standards become enforcement, enforcement raises the floor for everyone." Shift-left framing: "from 'global outage' to 'rejected merge request.' The blast radius of a violation shrinks from millions of affected requests to a single developer getting actionable feedback before their code ever reaches production." Canonicalises concepts/rfc-as-codified-engineering-rule + patterns/codex-enforced-via-ai-code-review and establishes the Codex as the institutional-memory substrate. Extends systems/cloudflare-ai-code-review with a new enforcement tier — rules authored by domain experts through the RFC process, distilled into "If you need X, use Y" rule format with a link back to the RFC. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
- Remediation-backlog-as-public-record discipline has a completed-arc instance. Cloudflare's established post-mortem shape is to name the missing discipline rather than the specific bug, and publish the remediation backlog. The 2025-12-05 post explicitly called out the 2025-11-18 remediation projects as still-incomplete. This 2026-05-01 post closes the loop: "we have now completed the work that would have avoided the November 18, 2025 and December 5, 2025 global outages." That sentence is the evaluation of the backlog against its origin incidents. The discipline is canonicalised as the progressive-configuration-rollout + global-feature-killswitch + harden-ingestion-of-internal-config triad — now all three have shipped-in-production instances, not just stated-remediation instances. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
- Enforcement is systemic, not review-dependent. The Codex's load-bearing property is that AI code review runs on every MR in the entire codebase, without exception. Traditional "shared-knowledge-in-senior-engineers-heads" is a single-reviewer-failure surface: the reviewer who knew the rule isn't on the MR; the reviewer who's on the MR doesn't know the rule; the rule isn't enforced. The AI-review enforcement tier makes the rule independent of which reviewers happen to be attached to the MR. "The Codex integrates with AI-powered agents at every stage of the software development lifecycle, from design review through deployment to incident analysis." This canonicalises patterns/codex-enforced-via-ai-code-review as enforcement-at-every-lifecycle-stage, not just merge-request-review. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
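The "rule independent of reviewer" property can be illustrated with a toy checker for the named `.unwrap()` rule. The real enforcement tier is AI review across the whole lifecycle, not a regex, and the rule set is far richer; the rule structure, scope exclusions, and the RFC pointer format below are sketched from what the post states (the RFC identifier itself is a hypothetical placeholder).

```python
# Toy sketch of a Codex-style rule applied to every changed file in an
# MR. Illustrates the shift-left from "global outage" to "rejected
# merge request"; not Cloudflare's actual enforcement mechanism.
import re

RULES = [
    # "If you need X, use Y" rule shape, with a pointer back to its RFC.
    {"id": "no-unwrap",
     "pattern": re.compile(r"\.unwrap\(\)"),
     "message": "Do not use .unwrap() outside of tests and build.rs",
     "rfc": "RFC-0000"},  # hypothetical reference
]

def review(path, source):
    """Return (path, line, message) findings for one changed file."""
    if path.startswith("tests/") or path.endswith("build.rs"):
        return []                               # rule scope excludes these
    findings = []
    for rule in RULES:
        for lineno, line in enumerate(source.splitlines(), 1):
            if rule["pattern"].search(line):
                findings.append((path, lineno, rule["message"]))
    return findings
```

Because `review` runs on every MR unconditionally, the rule fires whether or not anyone attached to the MR has ever read the originating RFC — which is the institutional-memory point.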
- SLO + global changelog + maintenance-coordination as the soft-infrastructure tier. Beyond the technical remediations, Cloudflare "introduced additional service level objectives (SLOs) to all our services, enforced a global changelog, onboarded all teams to our maintenance coordination system, and improved transparency across the company on our incident 'prevents' ticket backlog." First-wiki-instance of a global changelog as a cross-organisation discipline — every team's changes visible on one feed; a "prevents ticket backlog" is the incident-prevention-work catalog evaluated for completeness. (Source: sources/2026-05-01-cloudflare-code-orange-fail-small-complete)
Systems / concepts / patterns extracted¶
New wiki systems surfaced: systems/snapstone — Cloudflare's internal configuration-deployment system; bundles a config change into a package and applies health-mediated progressive rollout by default; any team can define a new "configuration unit" to bring under Snapstone. systems/cloudflare-codex — the living repository of engineering rules; mandatory for all engineering and product teams; rules authored via the RFC process and enforced via AI code review across the entire codebase.
New wiki concepts surfaced: concepts/health-mediated-deployment — progressive rollout with real-time health monitoring and automated rollback; generalises the Workers-runtime software-deploy discipline to config. concepts/fail-stale — use last-known-good configuration on ingest failure; stronger than fail-open (correct behaviour with stale data vs. degraded behaviour with no data). concepts/traffic-cohort-segmentation — run independent service copies for different cohorts of traffic (e.g., free vs. paid customers); deploy by cohort with least-critical first. concepts/rfc-as-codified-engineering-rule — engineering RFCs distilled into actionable rules ("If you need X, use Y") with back-reference to the RFC; the primitive the Codex is built on. concepts/dependency-on-self — an organisation's runtime dependency on the product it operates; canonicalised here via "Cloudflare runs on Cloudflare. We use our own Zero Trust products to secure our infrastructure, but this creates a dependency." concepts/drill-muscle-memory — the discipline of exercising emergency pathways regularly so operators can use them under incident pressure; 2026-04-07 engineering-wide drill with 200+ team members is the canonical instance. concepts/institutional-memory — the organisational property the Codex is explicitly designed to build: "build institutional memory that enforces itself."
New wiki patterns surfaced: patterns/config-deployment-as-code-deployment — apply the same staged-rollout + health-gating + automated-rollback discipline to configuration changes that already governs code deployments; Snapstone is the system-tier instance. patterns/customer-cohort-segmented-service-instances — independent copies of a critical service handle different customer cohorts; deploys cascade by cohort with least-critical-first ordering for bounded blast radius; Workers runtime is the canonical instance. patterns/codex-enforced-via-ai-code-review — institutional engineering rules codified as a machine-consumable ruleset and enforced via AI code review on every MR; shift-left from global outage to rejected merge request; the Codex + AI Code Review composition is the canonical instance.
Existing wiki pages extended:
- patterns/progressive-configuration-rollout + patterns/global-feature-killswitch + patterns/harden-ingestion-of-internal-config — all three stated-remediation patterns now have shipped-in-production Cloudflare instances via Snapstone; they move from "stated remediation" to "completed remediation" status.
- patterns/global-configuration-push — the antipattern now has an explicit "remediation landed" note: Snapstone composes onto the rapid-delivery channel with health gating rather than replacing it.
- concepts/fail-open-vs-fail-closed — extended from binary to ternary with concepts/fail-stale as the preferred default.
- concepts/global-configuration-system — Snapstone is the strategic-system complement (rapid threat response still available; health-mediated rollout the default).
- concepts/unhandled-rust-panic — the .unwrap()-ban Codex rule is the institutional-enforcement remediation for the 2025-11-18 FL2 panic class.
- concepts/nil-index-lua-bug — the "validate upstream dependency state before processing" Codex rule is the institutional-enforcement remediation for the 2025-12-05 Lua nil-index class (though Lua specifically is addressed more fundamentally by the FL1 → FL2 migration).
- concepts/internally-generated-untrusted-input — the 2025-11-18 "#1 remediation" naming this discipline is Codex-enforced on every MR.
- systems/cloudflare-bot-management — the feature-file generator is the canonical Snapstone-adopted workload from 2025-11-18; health-mediated deployment + ingest validation now wrap what was a raw fleet-wide push.
- systems/cloudflare-workers — the Workers runtime system is segmented into multiple independent services by customer cohort; free-customer-first deployment order + the 50+ deploys-in-7-days operational datum.
- systems/cloudflare-ai-code-review — the Codex enforcement tier added as a new pluggable ruleset alongside the security/performance/code-quality/documentation/release sub-reviewers already canonicalised.
Operational numbers¶
- 50+ Workers-runtime deploys in a 7-day period ("more than 50 times"; canonical datum for concepts/traffic-cohort-segmentation at cadence).
- 18 key services have backup authorisation pathways for break-glass.
- 200+ team members in the April 7, 2026 engineering-wide drill.
- 30- or 60-minute cadence for active-incident customer updates ("predictable intervals").
- Every MR across the entire codebase is reviewed against the Codex via AI code review — "applied without exception".
- ~6 months ("two and a bit quarters") Code Orange duration end-to-end.
Caveats¶
- Marketing-framing layer. "What it means for you" italic summaries per section are customer-facing narrative; the engineering content is in the paragraphs beneath them.
- Snapstone mechanism semi-disclosed. Post names the system, its design contract (bundle + progressive release + health mediation + automated rollback), and its flexibility property ("teams create these configuration units on demand"), but doesn't describe internal architecture, health-signal sources, rollout-staging cadence, rollback-trigger thresholds, storage substrate, or how a team onboards a new configuration unit.
- Codex mechanism semi-disclosed. The post names the artefact (living repository; mandatory; rule format "If you need X, use Y" with RFC link; two concrete rules named) and its enforcement tier (AI code reviews; applied without exception to the entire codebase), but doesn't quantify the rule count, false-positive rate, reviewer-override mechanics, or how rules evolve operationally. It can be inferred from the 2026-04-20 AI Code Review post that the Codex is likely implemented as one of the named sub-reviewers; this is not explicitly stated here.
- Workers-segmentation scope. "One segment handles only free-customer traffic" is the only cohort explicitly named; no enumeration of other segments, cohort-assignment rules, cross-segment routing, or segment-cardinality bound.
- Completed-but-not-final framing. "Improving resiliency will never be a 'job done'" — the post is explicit that Code Orange is the completed programme, not a claim of perfect resilience. The softer disciplines ("preventing drift and regressions over time", SLOs, the global changelog, maintenance coordination) get lighter coverage than the five core tracks.
- Evaluation claim asymmetric. The claim "we have now completed the work that would have avoided" 2025-11-18 and 2025-12-05 is the organisation's own evaluation; a confirmatory test (e.g., re-running a synthetic doubled feature-file or executing-rule-killswitch scenario against the post-Code-Orange stack) is not disclosed.
- AI-code-review failure modes not addressed. The Codex enforcement tier inherits the AI-code-review failure modes (hallucinated findings, missed rule applications, prompt-injection on MR-body content) canonicalised in the 2026-04-20 AI Code Review post. The 2026-05-01 post does not revisit these. The "additional manual reviews" escape hatch for flagged divergences is named but not quantified.
- Workers-first scope of customer-cohort-segmentation. The stated roadmap is to extend the pattern "to many more of our systems"; the post doesn't name which ones are next nor what the target coverage is.
Source¶
- Original: https://blog.cloudflare.com/code-orange-fail-small-complete/
- Raw markdown: raw/cloudflare/2026-05-01-code-orange-fail-small-is-complete-the-result-is-a-stronger-70dd3b18.md
Related¶
- sources/2025-11-18-cloudflare-outage-on-november-18-2025 — origin incident #1; the remediation evaluated here is against this.
- sources/2025-12-05-cloudflare-outage-on-december-5-2025 — origin incident #2; the second post-mortem that named the 11-18 remediation projects as still-incomplete.
- sources/2025-07-16-cloudflare-1111-incident-on-july-14-2025 — sibling Cloudflare post-mortem that established the "name the missing discipline, not just the bug" public-post-mortem shape; progressive-configuration-rollout was first stated there as the missing discipline.
- sources/2026-04-20-cloudflare-orchestrating-ai-code-review-at-scale — Cloudflare's CI-native AI code review system; the Codex enforcement tier is a new pluggable ruleset sitting on top of this substrate; the 2026-04-20 post was already explicitly positioned "as part of Code Orange: Fail Small".
- systems/snapstone — the canonical configuration-deployment system introduced by this post.
- systems/cloudflare-codex — the living engineering-rules registry introduced by this post.
- concepts/health-mediated-deployment
- concepts/fail-stale
- concepts/traffic-cohort-segmentation
- concepts/rfc-as-codified-engineering-rule
- patterns/config-deployment-as-code-deployment
- patterns/customer-cohort-segmented-service-instances
- patterns/codex-enforced-via-ai-code-review