Skip to content

PATTERN Cited by 1 source

Global feature killswitch

An orthogonal fast-off lever that can disable a feature / module / subsystem globally in seconds — independent of the config-delivery pipeline that feeds that feature. If a bad config file has poisoned a module, the killswitch takes the module out of the hot path without waiting to clean up the config.

The pattern

  • Every feature / module that runs on the hot path has an explicit enable flag read at request time.
  • The enable flag is itself distributed through a global configuration system with seconds-to-fleet propagation — same speed as threat-response config but separate from the config that parameterizes the feature.
  • Flipping the enable flag off causes the module to skip — serve the request as if the module weren't installed (fail open).
  • Well-defined SOP for invocation; invocation is audited.

Why you need it on top of progressive rollout

  • Progressive configuration rollout bounds how fast bad config reaches the fleet.
  • Global feature killswitch bounds how fast a feature consuming bad config can be removed from the hot path once the problem is identified.

The two patterns compose: progressive rollout prevents most incidents; killswitch bounds the incidents it doesn't prevent.

Canonical instance

Cloudflare's stated 2025-11-18 remediation project #2: "Enabling more global kill switches for features." From sources/2025-11-18-cloudflare-outage-on-november-18-2025 — the feature file was regenerated every 5 minutes and propagated fleet-wide; by the time the team identified Bot Management as the cause (13:37 UTC), there was no orthogonal way to take the bots module off the hot path other than inserting a known-good config file + restarting the core proxy. A module-level killswitch would have shortened the outage window by hours.

Cloudflare already has a killswitch subsystem for the rulesets engine (see concepts/killswitch-subsystem) — the 2025-12-05 incident is itself an example of the rulesets killswitch working correctly to disable the internal WAF test-rules sub-ruleset. The 2025-11-18 remediation is about extending the pattern to more features — the Bot Management module didn't have one.

Invocation discipline

  • Exercised regularly (chaos engineering / game days) so the operator muscle memory is there when you need it.
  • Exercise itself does not cause customer impact — killswitch off + resume + killswitch on should be a no-op for well-designed fail-open modules.
  • Not-yet-exercised code paths are latent hazards (see the concepts/latent-misconfiguration|12-05 incident which detonated on a never-before-used killswitch code path).

Seen in

Last updated · 200 distilled / 1,178 read