Snapstone¶
Snapstone is Cloudflare's internal configuration-deployment system, introduced publicly in the 2026-05-01 post Code Orange: Fail Small is complete. Snapstone brings health-mediated progressive deployment to configuration changes, by default and across teams; previously this discipline required significant per-team engineering and was therefore applied inconsistently.
Design contract¶
Directly from the 2026-05-01 post:
Snapstone is a system that bundles configuration change into a package, and then allows gradual release of the configuration change with health mediation principles.
Three properties make it load-bearing:
- Bundle the change into a package. A single atomic unit the rollout orchestrator can advance through stages.
- Gradual release with health mediation. Canary → small cohort → progressive fan-out, gated by real-time health signals at each step.
- Automated rollback on regression. Bad config detected at any stage is reverted without paging a human.
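The three properties above compose into a single control loop: advance the package one cohort at a time, dwell, check health, and on the first regression revert everything deployed so far without paging anyone. The post does not disclose Snapstone's orchestrator, so this is a minimal sketch under assumed interfaces (the cohort names, dwell behaviour, and callback shapes are all illustrative, not Snapstone internals):

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    deployed_cohorts: list   # cohorts the package reached before stopping
    rolled_back: bool        # True if a health gate failed


def progressive_rollout(
    package: str,
    cohorts: Sequence[str],                 # e.g. ["canary", "1%", "10%", "100%"]
    deploy: Callable[[str, str], None],     # push package to one cohort
    healthy: Callable[[str], bool],         # real-time health gate for a cohort
    rollback: Callable[[str, str], None],   # revert package on one cohort
    dwell_seconds: float = 0.0,             # let health signals settle
) -> RolloutResult:
    """Advance a config package through cohorts, gating each step on health.

    On the first unhealthy cohort, revert every cohort deployed so far,
    in reverse order, with no human in the loop.
    """
    done: list = []
    for cohort in cohorts:
        deploy(package, cohort)
        done.append(cohort)
        time.sleep(dwell_seconds)
        if not healthy(cohort):
            for c in reversed(done):
                rollback(package, c)
            return RolloutResult(done, rolled_back=True)
    return RolloutResult(done, rolled_back=False)
```

The key design point the post does commit to is that rollback is automatic: a bad config detected at any stage never requires a page.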
This is the same discipline Cloudflare already applied to software deploys on the Workers runtime; Snapstone ports it to the configuration plane across all teams — the move canonicalised as patterns/config-deployment-as-code-deployment.
Flexibility: configuration unit on demand¶
Snapstone's load-bearing product feature is flexibility:
What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, Snapstone allows teams to dynamically define any unit of configuration that needs health mediation, whether that's a data file like the one that caused the November 18 outage, or a control flag in our global configuration system like the one involved in the December 5 outage. Teams create these configuration units on demand, and Snapstone ensures they are deployed safely everywhere they're used.
The consequence is strategic:
This gives us something we didn't have before: when a risk review or operational experience identifies a dangerous configuration pattern, the fix is straightforward — bring it into Snapstone, and the configuration pattern immediately inherits safe deployment.
The operational workflow is: risk review surfaces a configuration pattern with unbounded blast radius → team onboards it as a Snapstone configuration unit → pattern inherits staged rollout + health gating + automated rollback without per-team implementation effort.
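The post discloses that teams declare configuration units on demand but not the onboarding mechanics (manifests, schemas, and DSLs are all explicitly undisclosed, per the section below). Purely to make the workflow concrete, here is a hypothetical declaration; every field name, default, and the example values are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class ConfigUnit:
    """Hypothetical Snapstone configuration-unit declaration (illustrative only)."""
    name: str
    artifact: str                        # path or key of the config payload
    health_signals: list                 # metrics that gate each rollout stage
    max_stage_error_rate: float = 0.01   # assumed rollback threshold
    cohorts: list = field(
        default_factory=lambda: ["canary", "1%", "10%", "100%"]
    )


# A team onboarding a dangerous pattern surfaced by risk review might write:
bot_features = ConfigUnit(
    name="bot-management-feature-file",
    artifact="bots/features.bin",
    health_signals=["http_5xx_rate", "bot_score_null_rate"],
)
```

Once declared, the unit inherits staged rollout, health gating, and automated rollback with no further per-team implementation effort, which is the strategic point of the flexibility claim.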
Canonical workloads it covers¶
Named instances from the 2026-05-01 post, both mapped directly to incidents:
- Bot Management feature file (see systems/cloudflare-bot-management). The 2025-11-18 trigger: a ClickHouse permission migration caused the feature-file generator to emit a file with doubled rows, which propagated fleet-wide in seconds because the feature-file channel had no staged rollout. Under Snapstone, the doubled file would be detected as degraded during the first cohort's rollout and reverted before reaching production. See sources/2025-11-18-cloudflare-outage-on-november-18-2025.
- Control flags in the global configuration system — e.g., the internal WAF testing-tool disable flag from 2025-12-05 that triggered a seven-year-old Lua nil-index bug in FL1's rulesets engine. Under Snapstone, the flag flip would be released gradually with health monitoring; the first cohort's 5xx spike would trigger automated rollback within a minute or two, not after the flag had hit 28% of traffic. See sources/2025-12-05-cloudflare-outage-on-december-5-2025.
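In both incidents Snapstone's health gating would catch the regression at the first cohort; an ingest-side guard (the patterns/harden-ingestion-of-internal-config layer) could reject the 2025-11-18 doubled feature file even earlier. A minimal sketch of such a guard, with an assumed size limit and duplicate check that are not disclosed Snapstone or Bot Management internals:

```python
def validate_feature_file(rows: list, max_features: int = 200) -> None:
    """Illustrative pre-deploy check in the spirit of the 2025-11-18 failure:
    a feature file whose generator emitted doubled rows should be rejected
    before any fan-out. The limit and the duplicate rule are assumptions."""
    if len(rows) > max_features:
        raise ValueError(f"feature file too large: {len(rows)} > {max_features}")
    if len(set(rows)) != len(rows):
        raise ValueError("feature file contains duplicate rows")
```

The two layers are complementary: ingest hardening rejects malformed payloads it can recognise, while health-mediated rollout catches the failures no validator anticipated.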
The stated scope is not specific incidents — it's any configuration unit a team or risk review brings in.
Composes with the global configuration system¶
Snapstone is additive, not a replacement. The rapid-delivery global configuration system still exists and is still required for genuine threat-response scenarios (DDoS mitigations, zero-day WAF rules, fleet-wide bad-IP blacklists) where a canary measured in minutes defeats the purpose of rapid response. Snapstone becomes the default for any configuration change that doesn't justify the rapid channel, and risk reviews now have a straightforward disposition for configuration units identified as dangerous: onboard them to Snapstone.
See patterns/global-configuration-push for the antipattern framing of rapid-only; Snapstone is the structural compensating pattern along with patterns/global-feature-killswitch and patterns/harden-ingestion-of-internal-config.
Relationship to the Cloudflare remediation triad¶
The three stated remediation patterns from the 2025-11-18 and 2025-12-05 post-mortems — patterns/progressive-configuration-rollout, patterns/global-feature-killswitch, and patterns/harden-ingestion-of-internal-config — all have shipped-in-production instances after Code Orange. Snapstone is the system-tier realisation of the first of these; the other two compose onto it (killswitch is orthogonal to the rollout channel; ingest hardening is downstream of whatever channel delivers the config). Together with concepts/fail-stale (the preferred failure-mode default), the four patterns constitute Cloudflare's completed config-plane defense-in-depth posture.
What the post does not disclose¶
- Internal architecture — storage substrate, orchestrator, deployment topology, rollout-stage granularity.
- Health-signal sources — which metrics gate which stages; SLO wiring.
- Rollback thresholds — error-rate / latency / other triggers; quiescence windows; automated-vs-human rollback disposition.
- Onboarding mechanics — how a team declares a new configuration unit; what manifests / schemas / DSLs are involved.
- Rollout-cadence defaults — cohort sizes, dwell times, per-cohort health-gate ceremonies.
- Scale — how many configuration units are now onboarded; what % of configuration changes now flow through Snapstone vs the rapid channel.
Future Cloudflare posts (or deep-dives in the Code Orange lineage) would be the natural disclosure surface for these mechanisms.
Seen in¶
- sources/2026-05-01-cloudflare-code-orange-fail-small-complete — canonical wiki instance; the system is introduced and named for the first time publicly in this post. Design contract + flexibility property + mapped incident workloads + compose-with-global-config-system framing all come from the post.
Related¶
- concepts/health-mediated-deployment
- concepts/global-configuration-system
- concepts/fail-stale
- patterns/progressive-configuration-rollout
- patterns/config-deployment-as-code-deployment
- patterns/global-configuration-push
- patterns/harden-ingestion-of-internal-config
- patterns/global-feature-killswitch
- systems/cloudflare-bot-management
- systems/cloudflare-workers
- sources/2025-11-18-cloudflare-outage-on-november-18-2025
- sources/2025-12-05-cloudflare-outage-on-december-5-2025