CONCEPT Cited by 1 source
JIT peer provisioning¶
Just-in-time (JIT) peer provisioning is the architectural move of not installing a network peer into a kernel or data-plane primitive until the first packet from that peer arrives, and of aggressively evicting peers once they go idle.
Definition¶
Given a multi-tenant networking primitive (WireGuard, IPsec, whatever), the classical design is push: some control plane installs every legitimate peer into every host that might serve it, ahead of time. This works up to the point where the kernel / data-plane primitive hits its state-capacity wall — slow reloads, slow config operations, eventually kernel panics. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
JIT peer provisioning inverts this:
- Don't push. The control plane holds peer configs authoritatively; the data plane holds zero peers initially.
- Catch first-packet. The data plane exposes or synthesises a "connection attempt arrived" event — via a native subscription API if available, or by sniffing the data-plane stream otherwise (see patterns/bpf-filter-for-api-event-source).
- Pull. On that event, identify the peer (in WireGuard's case this requires unwrapping a Noise handshake because Noise hides identities; in simpler protocols, the identity is cleartext), call the control plane's internal API for that peer's config, and install it in the kernel.
- Evict aggressively. Idle peers are pulled from the kernel by a cron job; the next connection re-installs them via the catch-and-pull steps above.
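The pull side of this loop can be sketched in a few lines. Everything here is a hypothetical stand-in — `ControlPlane`, `DataPlane`, and `handle_first_packet` are illustrative names, not Fly.io's code; a real gateway would hook a kernel event source and install via Netlink:

```python
# Minimal sketch of the JIT pull flow. All names are hypothetical;
# a real gateway wires a kernel event source and Netlink here.
import time

class ControlPlane:
    """Authoritative peer store (e.g. SQLite behind an internal API)."""
    def __init__(self, configs):
        self._configs = configs          # pubkey -> peer config

    def lookup(self, pubkey):
        return self._configs[pubkey]     # idempotent read; safe to retry

class DataPlane:
    """Stand-in for the kernel device: holds only installed peers."""
    def __init__(self):
        self.peers = {}                  # pubkey -> (config, installed_at)

    def install(self, pubkey, config):
        self.peers[pubkey] = (config, time.monotonic())

def handle_first_packet(pubkey, data_plane, control_plane):
    """Catch-first-packet handler: pull the config and install on demand."""
    if pubkey not in data_plane.peers:
        config = control_plane.lookup(pubkey)   # the 'pull' step
        data_plane.install(pubkey, config)

cp = ControlPlane({"abc": {"allowed_ips": ["10.0.0.2/32"]}})
dp = DataPlane()
assert dp.peers == {}                    # data plane starts empty
handle_first_packet("abc", dp, cp)
assert "abc" in dp.peers                 # installed just in time
```

Note that the handler is idempotent — a duplicate first-packet event is a no-op — which is what makes the retry-driven pull safe.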
Why it matters¶
Kernel state is a capacity wall¶
"What you can't do is store them all in the Linux kernel." (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
Push-based designs scale peer count until the kernel can no longer hold the state efficiently. At Fly.io's gateway scale that wall sat in the hundreds of thousands of stale peers per host, and manifested as:
- Slow WireGuard kernel operations overall.
- Pathologically slow reload-on-reboot (peers must be reloaded into the kernel before traffic can resume).
- Kernel panics.
A user-space store like SQLite has no such wall — "you could store every WireGuard peer everybody has ever used at Fly.io in a single SQLite database, easily."
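To make the contrast concrete, here is a sketch showing a fleet-scale peer set living comfortably in SQLite. The schema and row count are illustrative (the 550k figure echoes the pre-JIT per-gateway peak below), not Fly.io's actual store:

```python
# Illustrative only: a user-space SQLite store holding a fleet-scale
# peer set. Schema and counts are assumptions, not Fly.io's.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE peers (pubkey TEXT PRIMARY KEY, config TEXT)")
conn.executemany(
    "INSERT INTO peers VALUES (?, ?)",
    ((f"peer-{i}", '{"allowed_ips": ["10.0.0.2/32"]}') for i in range(550_000)),
)

# Point lookup by pubkey is an indexed read, independent of table size --
# no reload-on-reboot wall, no panic mode.
row = conn.execute(
    "SELECT config FROM peers WHERE pubkey = ?", ("peer-123456",)
).fetchone()
count = conn.execute("SELECT COUNT(*) FROM peers").fetchone()[0]
```

The point is not SQLite specifically; it is that user-space storage scales with disk, while kernel data-plane state scales with whatever the kernel's config machinery can tolerate.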
Eviction is free¶
Under push, eviction is destructive — the peer will never re-push itself, so evicting it silently breaks that peer's next connection until the control plane intervenes. Under JIT, eviction is cheap because the next connection pulls the peer back automatically. This flips eviction from "destructive action requiring careful policy" to "cron-frequency housekeeping."
Control-plane delivery guarantees become irrelevant¶
The Fly.io failure mode was that NATS dropped the peer-install RPC between the API and the gateway. Under JIT, there is no push, so no dropped-push failure mode. The pull is idempotent and retries naturally (WireGuard retries handshakes).
Key challenges¶
- Native event surface may be absent. Linux WireGuard Netlink has no "handshake arrived" event. Manufacturing one is a data-plane-introspection problem — BPF filter on the packet stream.
- Identity may be crypto-hidden. WireGuard (Noise) hides the initiator's public key from wire observers. Unwrapping requires the interface private key + a Noise first-leg implementation (~200 lines at Fly). Cheap per-connection, but not free.
- Install latency races the first retry. The first handshake arrives before the peer exists in the kernel, so it is dropped; the install must finish before the client's next handshake retry, or the connect slips by another retry interval. Fly accelerates this further with patterns/initiator-responder-role-inversion: install the peer in the initiator role so the kernel sends a handshake back to the client at install time, rather than waiting for the client's next retry.
- Rate-limit / cache control-plane lookups. A concepts/rate-limited-cache on the gateway is essential — otherwise retry storms translate into API storms.
- Eviction policy. Aggressive eviction is the point, but thrashing (evict, immediately re-install for active peer) is a waste of install latency. Fly.io runs a cron — cadence not disclosed — and accepts the occasional re-install.
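The rate-limited-cache challenge is the one most amenable to a small sketch: serve installs from cache, and allow at most one upstream lookup per key per interval, so a handshake retry storm degrades into cache hits rather than API calls. All parameters and names below are illustrative, not Fly.io's:

```python
# Sketch of a rate-limited cache shielding the control plane.
# ttl / min_interval values are assumptions for illustration.
import time

class RateLimitedCache:
    def __init__(self, lookup, ttl=30.0, min_interval=1.0):
        self._lookup = lookup       # upstream control-plane call
        self._ttl = ttl
        self._min_interval = min_interval
        self._cache = {}            # key -> (value, fetched_at)
        self._last_try = {}         # key -> last upstream attempt

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(key)
        if hit and now - hit[1] < self._ttl:
            return hit[0]                       # fresh cache hit
        if now - self._last_try.get(key, float("-inf")) < self._min_interval:
            return hit[0] if hit else None      # rate-limited: stale or miss
        self._last_try[key] = now
        value = self._lookup(key)               # the one upstream call
        self._cache[key] = (value, now)
        return value

calls = []
cache = RateLimitedCache(lambda k: calls.append(k) or f"cfg:{k}")
assert cache.get("abc", now=0.0) == "cfg:abc"   # first lookup hits upstream
assert cache.get("abc", now=0.5) == "cfg:abc"   # retry served from cache
assert calls == ["abc"]                          # storm never reaches the API
```

The same structure also bounds the cost of the Noise-unwrap step: garbage handshakes that decrypt to unknown keys get a rate-limited negative path instead of a per-packet API call.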
Seen in¶
- sources/2024-03-12-flyio-jit-wireguard-peers — canonical wiki instance; Fly.io gateway fleet; pre-JIT state wall of ~550k stale peers per gateway, post-JIT "rounds to none."
Related¶
- systems/wireguard — the kernel substrate this concept is the scaling pattern for.
- systems/fly-gateway — the canonical production host.
- systems/wggwd — the daemon implementation.
- systems/linux-netlink — the install RPC surface.
- concepts/kernel-state-capacity-limit — the underlying problem this concept addresses.
- concepts/rate-limited-cache — required control-plane shield.
- patterns/jit-provisioning-on-first-packet — the architectural pattern.
- patterns/pull-on-demand-replacing-push — the broader system-shape move.
- patterns/state-eviction-cron — the cleanup primitive that becomes cheap under JIT.