Fly.io — JIT WireGuard¶
Summary¶
Fly.io's 2024-03-12 post describes replacing push-based WireGuard
peer provisioning with a Just-In-Time (JIT) pull-on-first-packet
model on its fleet of gateway servers. Every flyctl invocation
conjures a TCP/IP stack and speaks WireGuard to a regional
gateway; previously, the gateway's
wggwd daemon received peer configs pushed over
NATS from the Fly GraphQL API and installed them
in the Linux kernel via Netlink. Two
operational failures drove the rewrite: (1) NATS dropped
messages (no delivery guarantee), so a flyctl that received its
peer config over GraphQL could hit a gateway on which the peer
had not yet been installed; (2) CI-ephemeral peers never
reconnect — stale peers accumulated to hundreds of thousands
per gateway, which made kernel WireGuard operations (especially
reload-on-reboot) pathologically slow and caused kernel panics.
The fix is architectural: gateways pull peer configs from the
API only when a handshake arrives, install them in the kernel
on demand, and aggressively cron-evict stale ones. The
implementation pivots on a non-obvious primitive — Linux WireGuard
Netlink has no "new-connection" event, so Fly manufactures one by
sniffing handshake-initiation packets with a BPF filter
(udp and dst port 51820 and udp[8] = 1), running enough of the
Noise Protocol to decrypt the initiator's
public key (since Noise hides
identities), rate-limit-caching in SQLite, and calling an internal
HTTP API to fetch + install the peer. A separate latency trick —
install-and-initiate
back — has the gateway respond to a fresh initiation by sending
its own handshake once the peer finally lands in the kernel,
because WireGuard is symmetric about which end initiates.
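The sniffed packet's shape is fixed by the protocol, which is what makes the one-byte BPF compare possible. As a minimal sketch (field sizes taken from the WireGuard paper's handshake-initiation message; the function and field names are ours, not Fly's code), a parser that classifies a UDP payload as an initiation and slices out the fields the Noise decrypt step needs:

```python
import struct

# Handshake-initiation layout per the WireGuard paper (148 bytes total):
#   type (1) | reserved (3) | sender_index (4) | unencrypted_ephemeral (32)
#   | encrypted_static (32 + 16-byte poly1305 tag)
#   | encrypted_timestamp (12 + 16) | mac1 (16) | mac2 (16)
MSG_INITIATION = 1
INITIATION_LEN = 148

def parse_initiation(payload: bytes):
    """Return the fields of a handshake-initiation UDP payload, or None.

    The first plaintext byte is the message type -- the same one-byte
    compare the BPF filter (udp[8] = 1) performs before userspace sees
    the packet at all.
    """
    if len(payload) != INITIATION_LEN or payload[0] != MSG_INITIATION:
        return None
    sender_index = struct.unpack_from("<I", payload, 4)[0]
    return {
        "sender_index": sender_index,
        "ephemeral": payload[8:40],          # initiator's ephemeral pubkey
        "encrypted_static": payload[40:88],  # the hidden identity; Noise-encrypted
        "encrypted_timestamp": payload[88:116],
        "mac1": payload[116:132],
        "mac2": payload[132:148],
    }

# A fake 148-byte initiation: correct type byte, zeroed fields.
fake = bytes([MSG_INITIATION]) + bytes(147)
assert parse_initiation(fake) is not None
assert parse_initiation(bytes(148)) is None  # wrong type byte
```

Note that `encrypted_static` is useless without the decrypt step: extracting the initiator's public key from it is exactly the ~200-line Noise first-leg described below.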
Key takeaways¶
- WireGuard scales as far as the Linux kernel can hold peer state. "What you can't do is store them all in the Linux kernel." At Fly.io's fleet scale, stale peers reached the low hundreds of thousands per gateway and triggered kernel WireGuard slowness + kernel panics. The kernel is the capacity wall, not SQLite, and not the RDBMS tier. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- NATS lost messages; Fly.io is migrating internal control planes off it. "NATS is fast, but doesn't guarantee delivery. Back in 2022, Fly.io was pretty big on NATS internally. We've moved away from it." The specific failure mode on the WireGuard path: flyctl received its peer config from the GraphQL API, but the gateway hadn't installed it yet — because the push RPC was dropped. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- Push-provisioning-then-never-cleaning was the structural problem; pull-on-demand was the structural fix. Migrating from "API pushes config to gateway" to "gateway pulls config when it sees a handshake" means stale peers can be evicted ruthlessly — on cron — because any evicted peer gets restored the next time its owner connects. This flips the question from "which peers do we enable in the kernel" to "no peers persist in the kernel beyond their active use window". Canonical pull-replacing-push instance. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- Linux WireGuard Netlink has no "incoming connection" event; manufacture one via BPF packet-sniffing. The kernel interface exposes config RPCs, not connection-arrival notifications. Fly snatches handshake-initiation packets on the host with a packet socket + BPF filter: udp and dst port 51820 and udp[8] = 1. The WireGuard paper specifies the handshake-initiation type as a single plaintext byte, so the filter is a one-byte compare. For WireGuard-over-WebSockets traffic (Fly's default customer transport, for "people who have trouble talking end-to-end in UDP"), they hook the WireSockets packet-receive function directly — same event semantics, different transport. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- Noise's identity-hiding means parsing the initiator's public key requires running the handshake crypto. WireGuard is built on Trevor Perrin's Noise Protocol Framework, which "goes way out of its way to hide identities during handshakes". To identify who's connecting, Fly runs the first leg of the Noise handshake using the interface's private key (readable from Netlink by a privileged process). ~200 lines of code; not trivially free, but cheap enough to do per handshake. The reference implementation is published as a gist. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- Rate-limited SQLite cache in front of the control-plane API. Decrypted public keys drive an internal HTTP API fetch to the control plane; a rate-limited SQLite cache on the gateway suppresses repeat lookups when WireGuard retries (which happens frequently, because the first handshake often arrives before the peer is installed). Canonical concepts/rate-limited-cache on the control-plane-fetch side of a JIT provisioner. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- Initiator/responder role inversion is a sub-RTT install trick. WireGuard is point-to-point — "It's a pure point-to-point protocol; peers connect to each other when they have traffic to send. The first peer to connect is called the initiator, and the peer it connects to is the responder." After sniffing an initiation, Fly has the initiator's 4-tuple, including the ephemeral source port. Once the peer is installed, the gateway takes the initiator role and has the kernel originate a WireGuard handshake back to flyctl. "This works; the protocol doesn't care a whole lot who's the server and who's the client." Net effect: the connection comes up as fast as install-time allows, instead of waiting for flyctl's next retry. A Jason Donenfeld tip — credited inline in the post. (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
- Outcome: hundreds of thousands of stale peers → "rounds to none". The accompanying Grafana chart of kernel_stale_wg_peer_count across all gateways shows flat baselines "between 0 and 50,000" with a topline near "just under 550,000", then — as each gateway is flipped over to JIT — "each line in turn jumps sharply down to the bottom" until all datapoints are indistinguishable from zero. Gateways now "hold a lot less state, are faster at setting up peers, and can be rebooted without having to wait for many unused peers to be loaded back into the kernel." (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
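The rate-limited cache step can be sketched in a few lines. Schema, the `min_interval` knob, and the `fetch` hook here are our own assumptions (the post discloses no cache parameters); only "rate-limited SQLite cache in front of the control-plane fetch" comes from the source:

```python
import sqlite3
import time

class PeerCache:
    """Hedged sketch of a rate-limited SQLite lookup cache.

    Suppresses repeat control-plane fetches while WireGuard retries a
    handshake. min_interval is an assumed knob, not a disclosed number.
    """

    def __init__(self, fetch, min_interval=5.0, path=":memory:"):
        self.fetch = fetch            # pubkey -> peer config (control-plane call)
        self.min_interval = min_interval
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS peers "
            "(pubkey TEXT PRIMARY KEY, config TEXT, last_fetch REAL)")

    def lookup(self, pubkey):
        row = self.db.execute(
            "SELECT config, last_fetch FROM peers WHERE pubkey = ?",
            (pubkey,)).fetchone()
        now = time.monotonic()
        if row and (row[0] is not None or now - row[1] < self.min_interval):
            # Cache hit, or a negative result still inside the rate-limit
            # window -- either way, no control-plane round trip.
            return row[0]
        config = self.fetch(pubkey)   # miss: go to the control plane
        self.db.execute(
            "INSERT OR REPLACE INTO peers VALUES (?, ?, ?)",
            (pubkey, config, now))
        return config
```

The negative-caching branch is what absorbs WireGuard's fast retries: an unknown pubkey produces one API call per window, not one per retransmitted handshake.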
Architectural details¶
Before (push-based)¶
- The flyctl background agent generates a fresh WireGuard peer configuration (pubkey + address) from the Fly GraphQL API.
- The API forwards the peer config to the appropriate regional gateway (e.g. ord for Chicago) via a NATS RPC.
- wggwd on the gateway saves the config to SQLite and installs it in the kernel via Netlink (using WireGuard's Go libraries, i.e. wgctrl-go).
- wggwd ACKs the API; the API replies to flyctl.
- flyctl connects; the kernel has the peer because the ACK implies it.
Failure mode 1. NATS drops the push. flyctl gets a valid
GraphQL response; the gateway has never heard of the peer; the
connection stalls.
Failure mode 2. CI-ephemeral flyctl invocations never reuse
their peer. Stale peers accumulate in the kernel. Scale eats
performance and reliability:
- "High stale peer count made kernel WireGuard operations very slow — especially loading all the peers back into the kernel after a gateway server reboot — as well as some kernel panics."
After (JIT, pull-on-first-packet)¶
- A handshake-initiation packet arrives at the gateway (either as a raw UDP packet, or inside a WebSocket frame for WireGuard-over-WebSockets customers).
- A BPF filter (udp and dst port 51820 and udp[8] = 1) on a packet socket catches it; WebSocket-delivered packets go through a hook in the WireSockets daemon.
- Run the first leg of the Noise handshake using the interface private key (fetched from Netlink; privileged process only) to decrypt the initiator's static public key. "The code to do this is fussy, but it's relatively short (about 200 lines)."
- Rate-limited SQLite cache lookup on the pubkey. If hit, done; if miss, make an internal HTTP API request to fetch the peer config for that pubkey.
- Install the peer via Netlink, using the 4-tuple the initiation packet gave us (including flyctl's ephemeral source port), in the initiator role. The kernel originates a WireGuard handshake back to flyctl on that 4-tuple. Jason Donenfeld tip: "the protocol doesn't care a whole lot who's the server and who's the client."
- Stale peers are evicted by a cron job — cheap now, because any evicted peer that's still wanted comes right back via step 1.
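The steps above can be glued together in a dozen lines. Every callable here is a stand-in (the real daemon speaks Netlink and Fly's internal HTTP API; these names are ours), but the control flow is the one the post describes:

```python
def handle_initiation(packet, decrypt_pubkey, cache_lookup, install_peer):
    """Hedged sketch of the JIT pull-on-first-packet loop.

    decrypt_pubkey: runs the first Noise leg with the interface private key.
    cache_lookup:   rate-limited SQLite cache in front of the control plane.
    install_peer:   Netlink install, in the initiator role, on the sniffed
                    4-tuple. All three are stand-ins for privileged
                    gateway-side code, not real APIs.
    """
    pubkey = decrypt_pubkey(packet["payload"])
    if pubkey is None:
        return False                  # not a well-formed initiation
    config = cache_lookup(pubkey)
    if config is None:
        return False                  # unknown peer; let the handshake die
    # Install with the initiator's source 4-tuple so the kernel can
    # originate the handshake back immediately (role inversion).
    install_peer(config, endpoint=(packet["src_ip"], packet["src_port"]))
    return True
```

The cron eviction step doesn't appear in the hot path at all — that is the point: eviction is safe precisely because this function restores any peer that still matters.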
Why the role-inversion trick matters¶
Without it, step 4's API fetch races flyctl's WireGuard retry
timer. WireGuard retries fast, so correctness isn't at stake; but
the first successful handshake is delayed by at least one retry
interval. Installing the peer as initiator means the kernel
sends the next handshake packet itself, the instant the install
lands — "as fast as they can possibly be installed."
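The latency win is simple arithmetic. In this toy model, the 5-second figure is WireGuard's REKEY_TIMEOUT handshake-retransmission interval from the paper; the install times and the assumption that retries land on exact multiples of it are ours:

```python
import math

REKEY_TIMEOUT = 5.0  # seconds between handshake retransmissions (WireGuard paper)

def time_to_connect(install_secs, role_inversion):
    """Toy model: when does the first successful handshake complete?

    Without role inversion, the gateway must wait for flyctl's next
    retransmission after the peer lands in the kernel; with it, the
    kernel initiates the instant the install finishes.
    """
    if role_inversion:
        return install_secs
    # Next retry at the first multiple of REKEY_TIMEOUT >= install time.
    return math.ceil(install_secs / REKEY_TIMEOUT) * REKEY_TIMEOUT

assert time_to_connect(0.3, role_inversion=True) == 0.3
assert time_to_connect(0.3, role_inversion=False) == 5.0
```

So even a sub-second API-fetch-plus-install costs a full retry interval without the trick, which matches the post's "at least one retry interval" framing.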
What made this possible¶
- Noise's identity-hiding is a feature, but it cost ~200 lines of crypto code on the gateway. Any simpler protocol (where the handshake carries the identity in plaintext) wouldn't have needed this step.
- Netlink exposes the interface private key to privileged processes. Without that, the gateway couldn't decrypt the handshake initiator it's receiving.
- WireGuard-over-WebSockets was already the customer default. "People who have trouble talking end-to-end in UDP" — so most traffic was already flowing through a Fly-controlled WireSockets daemon that could hook packet receive.
Numbers disclosed¶
- Stale peers per gateway: baseline 0–50,000; topline ~550,000 pre-fix. Grafana chart reading, qualitative — no exact pre/post fleet-wide sum.
- Post-fix: "rounds to none." Indistinguishable from zero on the same chart axis.
- Noise-handshake decode code size: ~200 lines.
- BPF filter: udp and dst port 51820 and udp[8] = 1 — three primitives + one byte compare.
- "A few weeks" of production use. The Grafana chart is from "the day of the switchover." No long-window stability claim.
- WireGuard UDP port: 51820 (the protocol default).
Numbers not disclosed¶
- No per-gateway CPU / memory deltas (before/after).
- No Netlink install latency numbers for JIT installs.
- No absolute handshake-rate numbers on the gateway fleet.
- No reboot-duration numbers before/after (the post implies long-before and short-after but publishes no concrete time).
- No throughput impact on live WireGuard traffic from the BPF sniffer.
- No disclosed cap on cache eviction rate.
- No named cron cadence for peer eviction.
- No kernel-panic incident count or specific CVE/bug references.
Caveats¶
- Qualitative post overall. The Grafana chart is the only quantitative artefact; everything else is narrative.
- Borderline Tier-3 scope. Passes the Fly.io Tier-3 filter (AGENTS.md) because the whole post is production-infrastructure architecture — kernel interactions, NATS retirement, BPF-level data-plane work, protocol-level role inversion, production incident (kernel panics from stale peers) — not product PR.
- Reference implementation partial. The Noise-unwrap code is published as a gist; the JIT daemon code itself is not open source.
- Specific to Linux WireGuard + Fly's gateway topology. The pattern generalises (pull-on-first-packet, BPF-event-source, role inversion) but the exact Netlink-and-Noise recipe is Linux-kernel-WireGuard shaped.
- Farewell note: the post's author, Lillian, is leaving Fly.io — flagged in the editor's note; irrelevant to the architecture but worth recording as wiki provenance.
Relationship to existing wiki¶
- Siblings at Fly.io: Fly Kubernetes now uses the internal 6PN WireGuard mesh (systems/fly-wireguard-mesh) as the CNI substitute — that mesh is a different WireGuard substrate (Fly Machine ↔ Fly Machine, internal, always-on) than the external customer-facing gateway mesh this post describes (flyctl ↔ gateway, transient, per-CI-job). Tigris is unrelated at the substrate level but shares the Fly.io platform scope.
- Cross-company sibling on "pull-replacing-push at scale": the FKS beta post's "translate K8s primitives into existing Fly.io primitives rather than reimplementing them" framing shares the same cost-minimising impulse — don't run the thing you don't have to run.
- concepts/packet-sniffing-as-event-source is a novel wiki concept introduced by this ingest — no prior Fly / Cloudflare / etc. wiki page covers "use BPF on the data plane to manufacture a control-plane event your primitive doesn't expose."
- patterns/initiator-responder-role-inversion is a novel wiki pattern — no prior page captures "install the peer in the opposite role from natural to cut a handshake RTT."
- concepts/noise-protocol is a new wiki concept — prior wiki coverage of crypto handshakes was TLS-shaped; Noise's identity-hiding discipline is a distinct design axis.
Source¶
- Original: https://fly.io/blog/jit-wireguard-peers/
- Raw markdown:
raw/flyio/2024-03-12-jit-wireguard-85220ba4.md
Related¶
- companies/flyio — publisher (Tier-3 platform blog).
- systems/wireguard — underlying protocol + kernel subsystem.
- systems/fly-gateway — the daemon fleet this post is about.
- systems/wggwd — the gateway-side WireGuard manager, pre- and post-JIT.
- systems/linux-netlink — kernel config RPC used for install + key extraction.
- systems/nats — dropped-from-the-WireGuard-path NATS reference.
- concepts/jit-peer-provisioning — the central concept.
- concepts/noise-protocol — the identity-hiding handshake framework.
- concepts/wireguard-handshake — the specific wire format this post sniffs + decodes.
- concepts/packet-sniffing-as-event-source — the data-plane-to-control-plane event bridge.
- concepts/rate-limited-cache — the SQLite-backed lookup cache on the gateway.
- concepts/kernel-state-capacity-limit — why "store everything in the kernel" didn't scale.
- concepts/identity-hiding-handshake — the property that makes JIT-identify non-trivial.
- concepts/kernel-panic-from-scale — production incident class.
- patterns/jit-provisioning-on-first-packet — the reusable architectural move.
- patterns/initiator-responder-role-inversion — the sub-RTT install trick.
- patterns/bpf-filter-for-api-event-source — the sniff-to-event pattern.
- patterns/pull-on-demand-replacing-push — the system-shape rewrite.
- patterns/state-eviction-cron — the cheap-because-pull cleanup pattern.