
Fly gateway

Fly gateways are a fleet of dozens of servers around the world whose sole job is to accept incoming WireGuard connections from flyctl (and other external clients) and connect them to the appropriate private networks inside Fly.io.

"We operate dozens of 'gateway' servers around the world, whose sole purpose is to accept incoming WireGuard connections and connect them to the appropriate private networks." (Source: sources/2024-03-12-flyio-jit-wireguard-peers)

Gateways are regional (ord for Chicago, and so on). The stack on each gateway:

  • Linux kernel with WireGuard enabled + Netlink config surface.
  • wggwd — the Fly-authored gateway-side WireGuard manager daemon.
  • SQLite — local peer-config store and, in the JIT design, the rate-limit cache for API lookups.
  • WireSockets daemon — terminates the WireGuard-over-WebSockets transport Fly.io defaults customers to, so packets arrive at the gateway regardless of the customer's end-to-end UDP path.

Role under JIT WireGuard (2024-03-12)

Post-JIT-rewrite, the gateway's role inverts: instead of passively receiving peer pushes from the API, it pulls each peer's config on the first packet it sees:

  1. Sniff handshake-initiation packets on the data plane via a BPF filter (udp and dst port 51820 and udp[8] = 1, i.e. the first byte of the UDP payload is WireGuard message type 1) — or an equivalent hook inside the WireSockets daemon for WebSocket-delivered traffic.
  2. Decrypt the initiator's static public key by running the first leg of the Noise handshake (requires the interface private key, fetched from Netlink — privileged process only, ~200 lines of code).
  3. Consult a rate-limited SQLite cache on the pubkey; on miss, make an internal HTTP API request to the Fly control plane for the peer config.
  4. Install the peer via Netlink, in the initiator role, so the kernel originates a WireGuard handshake back to flyctl immediately (canonical role inversion).
  5. A cron job aggressively removes stale peers from the kernel — cheap because the next connection will re-pull the peer anyway.
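The filter in step 1 keys on the first byte of the UDP payload, which is WireGuard's message-type field; handshake initiations are type 1 and exactly 148 bytes. A user-space equivalent of that predicate, as a minimal sketch:

```python
# Sketch: user-space equivalent of the BPF predicate
# "udp and dst port 51820 and udp[8] = 1".
# A WireGuard handshake-initiation message is exactly 148 bytes and starts
# with message type 1 followed by three reserved zero bytes.

HANDSHAKE_INITIATION = 1
INITIATION_LEN = 148  # 8 hdr + 32 ephemeral + 48 static + 28 timestamp + 32 macs

def is_handshake_initiation(udp_payload: bytes) -> bool:
    return (
        len(udp_payload) == INITIATION_LEN
        and udp_payload[0] == HANDSHAKE_INITIATION
        and udp_payload[1:4] == b"\x00\x00\x00"
    )
```

The length check matters because other WireGuard message types (cookie replies, transport data) share the port; only initiations need to trigger the pull path.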
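Step 2 is the responder's first leg of the Noise_IKpsk2 handshake, as specified in the WireGuard whitepaper: mix the ephemeral into the chaining key, do one Diffie-Hellman against the interface private key, and AEAD-decrypt the encrypted-static field. This is a minimal illustration using the third-party cryptography package, not wggwd's actual code:

```python
# Sketch of step 2: recover the initiator's static public key from a
# handshake-initiation message (per the WireGuard whitepaper's Noise_IKpsk2
# construction). Illustrative only; not the gateway's real implementation.
import hashlib
import hmac

from cryptography.hazmat.primitives.asymmetric.x25519 import (
    X25519PrivateKey,
    X25519PublicKey,
)
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

CONSTRUCTION = b"Noise_IKpsk2_25519_ChaChaPoly_BLAKE2s"
IDENTIFIER = b"WireGuard v1 zx2c4 Jason@zx2c4.com"

def blake2s(*chunks: bytes) -> bytes:
    return hashlib.blake2s(b"".join(chunks)).digest()

def hmac_b2s(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.blake2s).digest()

def kdf2(key: bytes, ikm: bytes):
    t0 = hmac_b2s(key, ikm)
    t1 = hmac_b2s(t0, b"\x01")
    return t1, hmac_b2s(t0, t1 + b"\x02")

def extract_initiator_static(gw_key: X25519PrivateKey, packet: bytes) -> bytes:
    """Decrypt the encrypted-static field of a handshake initiation using
    the gateway interface's private key (the one fetched over Netlink)."""
    assert packet[0] == 1 and len(packet) == 148, "not a handshake initiation"
    ephemeral, encrypted_static = packet[8:40], packet[40:88]
    gw_pub = gw_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
    ck = blake2s(CONSTRUCTION)
    h = blake2s(ck, IDENTIFIER)
    h = blake2s(h, gw_pub)
    ck = hmac_b2s(hmac_b2s(ck, ephemeral), b"\x01")  # KDF1(ck, ephemeral)
    h = blake2s(h, ephemeral)
    dh = gw_key.exchange(X25519PublicKey.from_public_bytes(ephemeral))
    ck, k = kdf2(ck, dh)
    # AEAD nonce: 4 zero bytes plus a 64-bit little-endian counter (0 here);
    # the current chaining hash h authenticates the field as associated data.
    return ChaCha20Poly1305(k).decrypt(b"\x00" * 12, encrypted_static, h)
```

Only a process holding the interface private key can perform this decryption, which is why the article confines it to a small privileged component.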
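Step 3's lookup path can be sketched as a SQLite table that caches both hits and misses, with a per-pubkey interval gating control-plane calls. The schema, interval, and fetch callback here are invented for illustration; wggwd's actual store is not public:

```python
# Illustrative sketch of step 3: SQLite-backed peer cache with a per-pubkey
# rate limit on control-plane lookups. Schema and column names hypothetical.
import sqlite3
import time

SCHEMA = """CREATE TABLE IF NOT EXISTS peers (
    pubkey TEXT PRIMARY KEY,
    config TEXT,             -- NULL means "last lookup missed"
    last_lookup REAL NOT NULL
)"""

def lookup_peer(db, pubkey, fetch_from_api, min_interval=30.0, now=None):
    """Return the peer config for pubkey, consulting the control plane at
    most once per min_interval seconds for unknown keys."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT config, last_lookup FROM peers WHERE pubkey = ?", (pubkey,)
    ).fetchone()
    if row and row[0] is not None:
        return row[0]                    # cache hit
    if row and now - row[1] < min_interval:
        return None                      # recent miss: rate-limited, skip API
    config = fetch_from_api(pubkey)      # internal HTTP call, stubbed here
    db.execute(
        "INSERT OR REPLACE INTO peers VALUES (?, ?, ?)", (pubkey, config, now)
    )
    db.commit()
    return config
```

Caching misses is what makes the rate limit meaningful: garbage handshakes from scanners hit the local database, not the control plane.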
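The sweep in step 5 can be approximated in user space from the output of wg show <iface> dump (tab-separated, one peer per line after the interface line; the fifth peer column is the latest-handshake epoch, 0 if the peer never completed a handshake). A sketch, with an illustrative idle cutoff:

```python
# Sketch of step 5: select stale peers from `wg show <iface> dump` output.
# The idle cutoff is illustrative; the real sweep lives in a cron job.
import time

def stale_peers(wg_dump: str, max_idle: float, now=None):
    """Return pubkeys of peers whose latest handshake is older than
    max_idle seconds (a 0 timestamp means the peer never shook hands)."""
    now = time.time() if now is None else now
    stale = []
    for line in wg_dump.strip().splitlines()[1:]:  # skip the interface line
        fields = line.split("\t")
        pubkey, latest_handshake = fields[0], int(fields[4])
        if latest_handshake == 0 or now - latest_handshake > max_idle:
            stale.append(pubkey)
    return stale
```

Removal itself would then be wg set <iface> peer <pubkey> remove (or the Netlink equivalent), and it is safe to do aggressively: the next handshake initiation simply re-pulls the peer.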

Historical role (pre-JIT)

"Until a few weeks ago, our gateways ran on a pretty simple system." — push-based: the Fly GraphQL API forwarded every new flyctl-generated peer config to the appropriate gateway over NATS; wggwd installed it and never cleaned it up. Two failure modes drove the rewrite: NATS dropped messages (so install races the GraphQL reply) and stale peers accumulated to the low hundreds of thousands per host (slow kernel, kernel panics). (Source: sources/2024-03-12-flyio-jit-wireguard-peers)
