
Snapstone

Snapstone is Cloudflare's internal configuration-deployment system, introduced publicly in the 2026-05-01 Code Orange: Fail Small is complete post. Snapstone brings health-mediated progressive deployment to configuration changes, by default and across teams, in a way that previously required significant per-team engineering and was therefore inconsistently applied.

Design contract

Directly from the 2026-05-01 post:

Snapstone is a system that bundles configuration change into a package, and then allows gradual release of the configuration change with health mediation principles.

Three properties make it load-bearing:

  1. Bundle the change into a package. A single atomic unit the rollout orchestrator can advance through stages.
  2. Gradual release with health mediation. Canary → small cohort → progressive fan-out, gated by real-time health signals at each step.
  3. Automated rollback on regression. Bad config detected at any stage is reverted without paging a human.
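The three properties can be sketched as a single rollout loop. This is a hypothetical illustration, not Snapstone's actual implementation: the post does not disclose the orchestrator, stage granularity, or rollback mechanics, so the stage fractions and function signatures below are assumptions.

```python
# Hypothetical sketch of health-mediated progressive deployment.
# Stage fractions and signatures are assumptions; Snapstone's
# internals are not disclosed in the 2026-05-01 post.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ConfigPackage:
    """A configuration change bundled into one atomic unit (property 1)."""
    unit: str       # e.g. "bot-management/feature-file"
    version: str

# Fan-out stages as fleet fractions: canary -> small cohort -> everyone.
STAGES = [0.01, 0.05, 0.25, 1.0]

def deploy(pkg: ConfigPackage,
           apply: Callable[[ConfigPackage, float], None],
           healthy: Callable[[float], bool],
           rollback: Callable[[ConfigPackage], None]) -> bool:
    """Advance pkg through the stages, gated by a real-time health
    signal at each step (property 2). On any regression, revert
    automatically without paging a human (property 3)."""
    for fraction in STAGES:
        apply(pkg, fraction)
        if not healthy(fraction):
            rollback(pkg)
            return False
    return True
```

A bad package is applied to the 1% canary, fails its health gate, and is rolled back before any later stage runs; a healthy package walks all four stages to full fleet coverage.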

This is the same discipline Cloudflare already applied to software deploys on the Workers runtime; Snapstone ports it to the configuration plane across all teams — the move canonicalised as patterns/config-deployment-as-code-deployment.

Flexibility: configuration unit on demand

Snapstone's load-bearing product feature is flexibility:

What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, Snapstone allows teams to dynamically define any unit of configuration that needs health mediation, whether that's a data file like the one that caused the November 18 outage, or a control flag in our global configuration system like the one involved in the December 5 outage. Teams create these configuration units on demand, and Snapstone ensures they are deployed safely everywhere they're used.

The consequence is strategic:

This gives us something we didn't have before: when a risk review or operational experience identifies a dangerous configuration pattern, the fix is straightforward — bring it into Snapstone, and the configuration pattern immediately inherits safe deployment.

The operational workflow is: risk review surfaces a configuration pattern with unbounded blast radius → team onboards it as a Snapstone configuration unit → pattern inherits staged rollout + health gating + automated rollback without per-team implementation effort.
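The onboarding step of that workflow might look something like the following. The manifest shape, field names, and registry are entirely hypothetical; as the post itself leaves onboarding mechanics undisclosed, this only illustrates the claim that declaring a unit is all a team does to inherit safe deployment.

```python
# Hypothetical shape of an on-demand configuration-unit declaration.
# The real manifests, schemas, or DSLs are not disclosed.
from dataclasses import dataclass, field

@dataclass
class ConfigUnit:
    name: str                         # e.g. "waf/testing-tool-disable-flag"
    kind: str                         # "data-file" or "control-flag"
    owners: list[str] = field(default_factory=list)

REGISTRY: dict[str, ConfigUnit] = {}

def onboard(unit: ConfigUnit) -> None:
    """Registering the unit is the whole per-team effort: staged
    rollout, health gating, and automated rollback are inherited
    from the platform, not reimplemented by the owning team."""
    REGISTRY[unit.name] = unit
```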

Canonical workloads it covers

Named instances from the 2026-05-01 post, both mapped directly to incidents:

  • Bot Management feature file (see systems/cloudflare-bot-management). The 2025-11-18 trigger: a ClickHouse permission migration caused the feature-file generator to emit a file with doubled rows, which propagated fleet-wide in seconds because the feature-file channel had no staged rollout. Under Snapstone, the doubled file would be detected as degraded during the first cohort's rollout and reverted before reaching production. See sources/2025-11-18-cloudflare-outage-on-november-18-2025.
  • Control flags in the global configuration system — e.g., the internal WAF testing-tool disable flag from 2025-12-05 that triggered a seven-year-old Lua nil-index bug in FL1's rulesets engine. Under Snapstone, the flag flip would be released gradually with health monitoring; the first cohort's 5xx spike would trigger automated rollback within a minute or two, not after the flag had hit 28% of traffic. See sources/2025-12-05-cloudflare-outage-on-december-5-2025.

The stated scope is not specific incidents — it's any configuration unit a team or risk review brings in.

Composes with the global configuration system

Snapstone is additive, not a replacement. The rapid-delivery global configuration system still exists and is still required for genuine threat-response scenarios (DDoS mitigations, zero-day WAF rules, fleet-wide bad-IP blocklists) where a canary measured in minutes defeats the purpose of rapid response. Snapstone becomes the default for any configuration change that doesn't justify the rapid channel, and risk reviews now have a straightforward disposition for configuration units identified as dangerous: onboard to Snapstone.
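The resulting routing decision is deliberately simple: rapid channel only when the change genuinely justifies it, Snapstone otherwise. The classification labels below are assumptions illustrating that default, not a disclosed taxonomy.

```python
# Hypothetical channel-selection default. The category names are
# assumptions drawn from the post's examples of rapid-channel use.
RAPID_JUSTIFICATIONS = {"ddos-mitigation", "zero-day-waf-rule", "bad-ip-blocklist"}

def channel_for(change_kind: str) -> str:
    """Genuine threat-response changes keep the rapid channel, where a
    minutes-long canary would defeat the purpose; everything else
    defaults to Snapstone's health-mediated staged rollout."""
    return "rapid" if change_kind in RAPID_JUSTIFICATIONS else "snapstone"
```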

See patterns/global-configuration-push for the antipattern framing of rapid-only; Snapstone is the structural compensating pattern along with patterns/global-feature-killswitch and patterns/harden-ingestion-of-internal-config.

Relationship to the Cloudflare remediation triad

The three stated remediation patterns from the 2025-11-18 and 2025-12-05 post-mortems — patterns/progressive-configuration-rollout, patterns/global-feature-killswitch, and patterns/harden-ingestion-of-internal-config — all have shipped-in-production instances after Code Orange. Snapstone is the system-tier realisation of the first of these; the other two compose onto it (killswitch is orthogonal to the rollout channel; ingest hardening is downstream of whatever channel delivers the config). Together with concepts/fail-stale (the preferred failure-mode default), the four patterns constitute Cloudflare's completed config-plane defense-in-depth posture.

What the post does not disclose

  • Internal architecture — storage substrate, orchestrator, deployment topology, rollout-stage granularity.
  • Health-signal sources — which metrics gate which stages; SLO wiring.
  • Rollback thresholds — error-rate / latency / other triggers; quiescence windows; automated-vs-human rollback disposition.
  • Onboarding mechanics — how a team declares a new configuration unit; what manifests / schemas / DSLs are involved.
  • Rollout-cadence defaults — cohort sizes, dwell times, per-cohort health-gate ceremonies.
  • Scale — how many configuration units are now onboarded; what % of configuration changes now flow through Snapstone vs the rapid channel.

Future Cloudflare posts (or deep-dives in the Code Orange lineage) would be the natural disclosure surface for these mechanisms.

Seen in

  • sources/2026-05-01-cloudflare-code-orange-fail-small-complete — canonical wiki instance; the system is introduced and named for the first time publicly in this post. Design contract + flexibility property + mapped incident workloads + compose-with-global-config-system framing all come from the post.