CONCEPT Cited by 1 source
Killswitch subsystem¶
A killswitch subsystem is a mechanism inside a rules engine, feature-flag system, or policy engine that allows a specific rule / flag / policy to be disabled rapidly, ideally fleet-wide within seconds, without going through the normal change-management path.
The canonical shape:
- The rule / flag / policy has a stable identifier.
- A small, fast-propagating config channel (often a global configuration system) carries the disable signal.
- The evaluator checks the killswitch state before running the rule; a set killswitch causes the evaluator to skip evaluating the rule.
- Operators invoke the killswitch via a well-defined SOP (the ability to "mute" a misbehaving rule is the point).
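The shape above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular engine's implementation: `Rule`, `KillswitchStore`, and `evaluate` are hypothetical names, and the in-memory set stands in for the fast-propagating config channel.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str                       # stable identifier the killswitch targets
    predicate: Callable[[dict], bool]  # hypothetical: True when the rule matches


@dataclass
class KillswitchStore:
    # Stand-in for the fast config channel: the set of disabled rule IDs.
    disabled: set = field(default_factory=set)

    def kill(self, rule_id: str) -> None:
        self.disabled.add(rule_id)

    def is_killed(self, rule_id: str) -> bool:
        return rule_id in self.disabled


def evaluate(rules, store, request):
    """Return {rule_id: match result}; None marks a rule that was skipped."""
    results = {}
    for rule in rules:
        # The killswitch is checked before the rule runs.
        if store.is_killed(rule.rule_id):
            results[rule.rule_id] = None  # skipped, predicate never runs
            continue
        results[rule.rule_id] = rule.predicate(request)
    return results


rules = [
    Rule("block-admin", lambda req: req["path"].startswith("/admin")),
    Rule("block-big", lambda req: req["size"] > 1000),
]
store = KillswitchStore()
request = {"path": "/admin/login", "size": 10}

before = evaluate(rules, store, request)
store.kill("block-admin")  # operator mutes the misbehaving rule
after = evaluate(rules, store, request)
```

Only the killed rule changes behaviour; the other rule keeps evaluating normally, which is the property that distinguishes a killswitch from a full rollback.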
Why it exists¶
Killswitches are the fast-rollback path of a rules / flags / policies system. If a newly-deployed rule or a freshly-tuned threshold is misbehaving — generating false positives, causing latency, or blocking legitimate traffic — operators need to disable that specific thing without reverting unrelated changes, without a CI/CD cycle, and often within a minute of detection. A killswitch is that path.
It's the specialisation of patterns/emergency-bypass to rules / flags / policies: faster than the normal change path, logged after the fact.
The dangerous-to-correctness corner: first-time paths¶
Because a killswitch is rarely invoked for any given (rule, rule version) pair, and most rules never have their killswitch fired at all, the evaluator code path that runs when a killswitch is set is frequently less tested than the main evaluation path. Some branches of the killswitch path may never have executed in production before the first time the killswitch is applied to a rule of a specific shape.
This is what happened on 2025-12-05 in
Cloudflare's rulesets
engine: the killswitch subsystem had been used many times
before, just never on a rule with action=execute. When it
finally was, an untested post-processing branch dereferenced a
field that the rule-skipped branch had not populated. The
result was a Lua nil-index exception and HTTP 500 responses
for ~28% of Cloudflare traffic for ~25 minutes.
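The failure shape can be reproduced in miniature. The sketch below is a Python analogue written for illustration, not the Lua code involved in the incident: the function names and result layout are hypothetical, and a Python `KeyError` stands in for the Lua nil-index exception.

```python
def evaluate_rule(rule, killswitched):
    """Hypothetical evaluator: a skipped rule gets no 'execute' sub-result."""
    result = {"rule_id": rule["id"], "action": rule["action"]}
    if killswitched:
        result["skipped"] = True
        return result  # 'execute' is never populated on this branch
    result["skipped"] = False
    if rule["action"] == "execute":
        result["execute"] = {"results": ["sub-ruleset evaluated"]}
    return result


def post_process_unsafe(result):
    # Assumes every action=execute result was actually evaluated.
    if result["action"] == "execute":
        return result["execute"]["results"]  # KeyError when the rule was skipped
    return []


def post_process_safe(result):
    # Treats the skipped result as a valid input.
    if result["skipped"]:
        return []
    if result["action"] == "execute":
        return result["execute"]["results"]
    return []


execute_rule = {"id": "r1", "action": "execute"}
block_rule = {"id": "r2", "action": "block"}

# Killswitch on a non-execute rule: the unsafe path happens to work.
block_outcome = post_process_unsafe(evaluate_rule(block_rule, killswitched=True))

# Killswitch on an execute rule: the never-exercised branch fails.
skipped = evaluate_rule(execute_rule, killswitched=True)
try:
    post_process_unsafe(skipped)
    unsafe_outcome = "ok"
except KeyError:
    unsafe_outcome = "KeyError"

safe_outcome = post_process_safe(skipped)
```

The unsafe version passes for every combination except the one that had never been exercised, which is why prior successful uses of the killswitch gave no warning.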
Generalizable design discipline¶
- Treat the killswitch path as a hot path. Even if it runs rarely per rule, the N rules × M versions × ~always-off product still means the set-to-disabled branch runs on every request for every rule with the flag set. It is not a cold code path.
- Invariants must hold in both set and unset states. Rule-result post-processing should not assume fields present only on evaluated rules; the skipped-rule result must be a valid input to every downstream stage.
- Fuzz the killswitch-on dimension. Every product of (rule shape, action type) × (killswitch set / unset) should have coverage; any action type that's ever allowed in a rule should be tested with the killswitch on.
- Fail-open under evaluator error. See concepts/fail-open-vs-fail-closed: an exception in the rules engine on a killswitched rule should not fail the request; it should degrade to "don't score this" or "apply a known-good default."
- Stage killswitch propagation. Even "fast" doesn't have to mean "simultaneous." A killswitch that propagates in seconds to 1% of POPs, then 10%, then fleet-wide gives an evaluator error a chance to surface before it affects 100% of the fleet.
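Two of the points above, matrix coverage and fail-open, can be sketched together. The Python below is an illustration only: the action list, the result layout, and all function names are assumptions made for the example, and the fail-open default shown ("default-allow") is just one possible known-good fallback.

```python
import itertools

ACTIONS = ["block", "log", "skip", "execute"]  # hypothetical action types


def evaluate_rule(action, killswitched):
    # Hypothetical evaluator: every result carries the same keys, so a
    # skipped rule is still a valid input for later stages.
    evaluated = not killswitched
    sub_results = ["sub-ruleset ran"] if evaluated and action == "execute" else []
    return {"action": action, "skipped": killswitched, "sub_results": sub_results}


def post_process(result):
    # Relies only on keys present in both the evaluated and skipped shapes.
    return len(result["sub_results"])


def handle_request(action, killswitched, post=post_process):
    # Fail open: an exception in the rules path degrades to a default verdict
    # instead of failing the request.
    try:
        post(evaluate_rule(action, killswitched))
        return "verdict-applied"
    except Exception:
        return "default-allow"


def coverage_matrix():
    # Every (action type) x (killswitch state) pair, checked without the
    # fail-open wrapper so a broken combination is reported, not masked.
    failures = []
    for action, killswitched in itertools.product(ACTIONS, [False, True]):
        try:
            post_process(evaluate_rule(action, killswitched))
        except Exception as exc:
            failures.append((action, killswitched, repr(exc)))
    return failures


def broken_post(result):
    # Deliberately repeats the mistake of assuming an 'execute' field exists.
    return result["execute"]["results"]


matrix_failures = coverage_matrix()
fail_open_outcome = handle_request("execute", True, post=broken_post)
```

The matrix check runs without the fail-open wrapper on purpose: fail-open limits the blast radius in production, but in tests it would hide exactly the combination the matrix exists to find.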
Canonical wiki instance¶
sources/2025-12-05-cloudflare-outage-on-december-5-2025 —
Cloudflare's rulesets engine killswitch subsystem, used many
times before without incident, ran its
killswitch-on-action=execute path for the first time in
production on 2025-12-05. A post-processing dereference of
rule_result.execute.results (skipped → absent) produced a
nil-index Lua exception on FL1. Rust FL2 was unaffected.
Seen in¶
- sources/2025-12-05-cloudflare-outage-on-december-5-2025 — canonical instance; the killswitch itself worked correctly (skipped the rule), but the post-evaluation code assumed the rule had been evaluated.