CONCEPT Cited by 1 source
Kernel panic from scale¶
A kernel panic from scale is the production failure mode in which kernel code paths that work correctly at small or moderate state sizes trigger a panic once per-subsystem state grows beyond the sizes the code has ever been tested at. The code didn't change; the state distribution did.
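To make "the code didn't change; the state distribution did" concrete, here is a minimal sketch of a load routine that does a linear duplicate check per inserted peer. It is purely illustrative and is not WireGuard's kernel code; the function name `scan_steps_to_load` and the list-backed table are assumptions made for the example.

```python
def scan_steps_to_load(n_peers: int) -> int:
    """Count comparisons needed to load n_peers into a table that checks
    for duplicates with a linear scan (hypothetical, for illustration)."""
    table = []
    steps = 0
    for peer in range(n_peers):
        for existing in table:      # linear scan of everything loaded so far
            steps += 1
            if existing == peer:    # duplicate found: skip the insert
                break
        else:
            table.append(peer)      # no duplicate: insert the new peer
    return steps


small = scan_steps_to_load(100)     # n(n-1)/2 comparisons
large = scan_steps_to_load(1_000)
```

The same unchanged function does about 100 times more work for 10 times more state, which is why a path that is unremarkable on a small host can become pathological on the largest ones.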
Signature¶
- Panic traces consistently point to the same subsystem.
- Panic cadence correlates with state size (peer count, connection count, table size), not with load spikes.
- Post-mortem reveals slow codepaths in that subsystem that are quadratic, or that iterate over data structures without limits tuned for production scale.
- Smaller hosts don't hit it; the subset of the fleet with the largest state sizes does.
Canonical wiki instance¶
Fly.io's WireGuard gateways ran into kernel panics as stale peer counts approached hundreds of thousands per host:
"The high stale peer count made kernel WireGuard operations very slow — especially loading all the peers back into the kernel after a gateway server reboot — as well as some kernel panics." (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
The panics are a symptom of concepts/kernel-state-capacity-limit: the kernel was holding more of one kind of object than anyone had ever tested.
Design response is the same as for kernel-state-capacity limits¶
Keep the hot set of kernel state small. See concepts/jit-peer-provisioning for the WireGuard-specific worked example. Alongside removing the delivery-guarantee problem on the push path, the JIT rewrite had a secondary benefit: the panics stopped, because the kernel no longer held enough peers to hit the pathological codepaths.
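A minimal sketch of the "small hot set" idea, assuming peers are installed when they first show activity and removed after an idle timeout. The class `PeerHotSet`, its callbacks, and the timeout value are hypothetical and do not describe Fly.io's actual implementation; the `install` and `remove` callables stand in for whatever mechanism pushes state into the kernel.

```python
import time


class PeerHotSet:
    """Tracks which peers are installed and evicts the idle ones (illustrative)."""

    def __init__(self, idle_timeout_s, install, remove, clock=time.monotonic):
        self._idle_timeout_s = idle_timeout_s
        self._install = install      # hypothetical hook: add peer to the kernel
        self._remove = remove        # hypothetical hook: drop peer from the kernel
        self._clock = clock
        self._last_seen = {}         # public key -> time of last activity

    def on_handshake(self, pubkey):
        """Install the peer on first sight, then refresh its activity time."""
        if pubkey not in self._last_seen:
            self._install(pubkey)
        self._last_seen[pubkey] = self._clock()

    def evict_idle(self):
        """Remove peers idle for longer than the timeout; return their keys."""
        now = self._clock()
        stale = [k for k, t in self._last_seen.items()
                 if now - t > self._idle_timeout_s]
        for key in stale:
            self._remove(key)
            del self._last_seen[key]
        return stale

    def __len__(self):
        return len(self._last_seen)


# Usage with a fake clock and a set standing in for kernel state.
kernel = set()
now = [0.0]
hot = PeerHotSet(60.0, kernel.add, kernel.discard, clock=lambda: now[0])
hot.on_handshake("peer-a")
now[0] = 50.0
hot.on_handshake("peer-b")
now[0] = 100.0
evicted = hot.evict_idle()   # peer-a has been idle for 100 s, peer-b for 50 s
```

The point of the design is that installed state tracks the number of recently active peers rather than the number of peers ever provisioned, so it stays inside the range the kernel code is known to handle.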
Contrast¶
- Kernel panic from a code bug: fixed by a code change, unrelated to load.
- Kernel panic from hardware: ECC errors, machine-check exceptions (MCE), or driver faults; not correlated with state size.
- Kernel panic from scale: fixed by keeping the data distribution inside the range the code actually works for, usually through design changes in the user-space system that feeds the kernel.
Seen in¶
- sources/2024-03-12-flyio-jit-wireguard-peers — canonical wiki instance.
Related¶
- concepts/kernel-state-capacity-limit — the underlying cause.
- concepts/jit-peer-provisioning — the design response in the Fly.io instance.
- systems/wireguard — the specific kernel subsystem at fault in the Fly.io instance.
- systems/fly-gateway — the production host.