CONCEPT Cited by 1 source

Killswitch subsystem

A killswitch subsystem is a mechanism inside a rules engine, feature-flag system, or policy engine that allows a specific rule / flag / policy to be disabled rapidly, ideally fleet-wide within seconds, without going through the normal change-management path.

The canonical shape:

  • The rule / flag / policy has a stable identifier.
  • A small, fast-propagating config channel (often a global configuration system) carries the disable signal.
  • The evaluator checks the killswitch state before running the rule; a set killswitch causes the evaluator to skip evaluating the rule.
  • Operators invoke the killswitch via a well-defined SOP (the ability to "mute" a misbehaving rule is the point).
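The shape above can be sketched in a few lines of Python. This is a minimal illustration, not any particular engine's implementation: `Rule`, `KillswitchState`, and `evaluate` are hypothetical names, and an in-memory set stands in for the fast-propagating config channel.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Rule:
    rule_id: str                     # stable identifier the killswitch targets
    matches: Callable[[dict], bool]  # hypothetical predicate over a request
    action: str                      # e.g. "block" or "log"


@dataclass
class KillswitchState:
    """Stand-in for the config channel: the set of disabled rule ids."""
    disabled: Set[str] = field(default_factory=set)

    def is_set(self, rule_id: str) -> bool:
        return rule_id in self.disabled


def evaluate(rules: List[Rule], request: dict, ks: KillswitchState) -> List[dict]:
    results = []
    for rule in rules:
        if ks.is_set(rule.rule_id):
            # Killswitch set: skip evaluation, but still emit a result with
            # the same fields as an evaluated rule.
            results.append({"rule_id": rule.rule_id, "skipped": True,
                            "matched": False, "action": None})
            continue
        matched = rule.matches(request)
        action: Optional[str] = rule.action if matched else None
        results.append({"rule_id": rule.rule_id, "skipped": False,
                        "matched": matched, "action": action})
    return results
```

In this sketch the operator's "mute" is `ks.disabled.add("r1")`: the next evaluation skips rule `r1` and leaves every other rule untouched.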

Why it exists

Killswitches are the fast-rollback path of a rules / flags / policies system. If a newly-deployed rule or a freshly-tuned threshold is misbehaving — generating false positives, causing latency, or blocking legitimate traffic — operators need to disable that specific thing without reverting unrelated changes, without a CI/CD cycle, and often within a minute of detection. A killswitch is that path.

It's the specialisation of patterns/emergency-bypass to rules / flags / policies: faster than the normal change path, logged after-the-fact.

The dangerous-to-correctness corner: first-time paths

Any given rule, and any given version of that rule, rarely has its killswitch invoked, and most rules never do. As a result, the evaluator code path that runs when a killswitch is set is frequently less tested than the main evaluation path. Some branches of the killswitch path may never have executed in production before the first time the killswitch is applied to a rule of a specific shape.

This is what happened on 2025-12-05 in Cloudflare's rulesets engine: the killswitch subsystem had been used many times before, just never on a rule with action=execute. When it finally was, an untested post-processing branch dereferenced a field that the rule-skipped branch had not populated. The result was a Lua nil-index exception and HTTP 500 responses for ~28% of Cloudflare traffic for ~25 minutes.
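The shape of that failure can be reproduced in miniature. The Python sketch below is an analogy only, not Cloudflare's Lua code: the function names and the result layout are hypothetical, and Python's `KeyError` stands in for the Lua nil-index exception.

```python
def evaluate_rule(rule: dict, killswitched: bool) -> dict:
    if killswitched:
        # Skipped rules carry no per-action payload.
        return {"rule_id": rule["id"], "action": rule["action"], "skipped": True}
    result = {"rule_id": rule["id"], "action": rule["action"], "skipped": False}
    if rule["action"] == "execute":
        # Only the evaluated branch populates the "execute" field.
        result["execute"] = {"results": ["sub-ruleset-ok"]}
    return result


def post_process_unsafe(result: dict) -> int:
    # Assumes every execute-action result has an "execute" field.
    # Raises KeyError for a skipped rule, the analogue of the nil index.
    if result["action"] == "execute":
        return len(result["execute"]["results"])
    return 0


def post_process_safe(result: dict) -> int:
    # Checks the skipped flag before touching fields that exist only on
    # evaluated rules.
    if result["action"] == "execute" and not result["skipped"]:
        return len(result["execute"]["results"])
    return 0
```

The unsafe variant works for every combination except (action=execute, killswitch set), which is why heavy use of the killswitch on other action types would not expose it.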

Generalizable design discipline

  • Treat the killswitch path as a hot path. A killswitch is rarely set on any one rule, but once it is set, the set-to-disabled branch runs on every request that reaches that rule. Across N rules × M versions, that is not a cold code path.
  • Invariants must hold in both set and unset states. Rule-result post-processing should not assume fields present only on evaluated rules; the skipped-rule result must be a valid input to every downstream stage.
  • Fuzz the killswitch-on dimension. Every product of (rule shape, action type) × (killswitch set / unset) should have coverage; any action type that's ever allowed in a rule should be tested with the killswitch on.
  • Fail-open under evaluator error. See concepts/fail-open-vs-fail-closed: an exception in the rules engine on a killswitched rule should not fail the request — it should degrade to "don't score this" or "apply a known-good default."
  • Stage killswitch propagation. Even "fast" doesn't have to mean "simultaneous." A killswitch that propagates in seconds to 1% of POPs, then 10%, then fleet-wide gives the evaluator error a chance to surface before it's 100%.
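Several of the disciplines above can be sketched together. The block below is a hedged illustration under stated assumptions: the action list, the result layout, and every function name are hypothetical, and the 1% / 10% / 100% stages simply mirror the figures in the last bullet.

```python
import itertools
from typing import Callable, List, Tuple

ACTIONS = ["block", "log", "skip", "execute"]  # hypothetical action types


def evaluate_rule(action: str, killswitched: bool) -> dict:
    """Return the same result shape whether or not the killswitch is set."""
    result = {"action": action, "skipped": killswitched,
              "execute": {"results": []}}
    if action == "execute" and not killswitched:
        result["execute"]["results"].append("sub-ruleset-ok")
    return result


def post_process(result: dict) -> int:
    return len(result["execute"]["results"])


def check_matrix() -> List[Tuple[str, bool]]:
    """Run post-processing over every (action, killswitch) combination."""
    covered = []
    for action, killswitched in itertools.product(ACTIONS, [False, True]):
        post_process(evaluate_rule(action, killswitched))  # must not raise
        covered.append((action, killswitched))
    return covered


def handle_request(action: str, killswitched: bool,
                   evaluator: Callable[[str, bool], dict] = evaluate_rule) -> dict:
    """Fail open: an evaluator error degrades to "don't score this"."""
    try:
        return {"status": 200, "score": post_process(evaluator(action, killswitched))}
    except Exception:
        return {"status": 200, "score": None}


def rollout_stages(pops: List[str], percents=(1, 10, 100)) -> List[List[str]]:
    """Cumulative POP lists per stage; each stage has at least one POP."""
    return [pops[:max(1, len(pops) * pct // 100)] for pct in percents]
```

`check_matrix()` is the exhaustive version of the fuzzing bullet and stays cheap because the product is small. Whether failing open is acceptable depends on what the rule protects, so the default in `handle_request` is one choice among several rather than a recommendation for every engine.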

Canonical wiki instance

sources/2025-12-05-cloudflare-outage-on-december-5-2025 — Cloudflare's rulesets engine killswitch subsystem, used many times before without incident, ran its killswitch-on-action=execute path for the first time in production on 2025-12-05. A post-processing dereference of rule_result.execute.results (skipped → absent) produced a nil-index Lua exception on FL1. Rust FL2 was unaffected.
