Fly Proxy¶
Fly Proxy is Fly.io's in-house proxy — the
shared edge- and private-network proxy fabric backing Fly Machines'
ingress, load balancing, and internal *.flycast private endpoints.
Role in FKS¶
Under FKS, the Fly Proxy is the
implementation backing Kubernetes Service objects. Per the
primitive-mapping table in
the FKS beta
post:
Services → The Fly Proxy
When `kubectl expose pod ... --port=8080` runs, FKS creates a standard
ClusterIP Service object with an IPv6 ClusterIP (e.g.
`fdaa:0:48c8:0:1::1a`). Service annotations expose the internal
plumbing: `fly.io/clusterip-allocator: configured` and
`service.fly.io/sync-version: <n>` — a bespoke allocator / sync loop
on FKS's side reconciles K8s Service objects into the Fly Proxy's
routing table.
The three ways to reach the Service described in the post — direct IPv6, flycast hostname, and in-cluster CoreDNS — all land on the Fly Proxy for the actual forwarding step.
Machine lifecycle — autostart / autostop¶
Fly Proxy owns the start/stop lifecycle of Fly Machines when autostart / autostop is enabled on a Fly App. The proxy detects an inbound request for a stopped Machine and starts it; it detects that a running Machine has been idle past its configured window and stops it. The app tier is unaware of either transition.
This is the load-bearing primitive for GPU cost control on Fly.io: a GPU Machine left running costs serious money, but the Fly Proxy can stop it during silence and resume it on the next internal request — see the autostart-and-stop docs. The pattern generalises beyond GPU to any expensive-to-run Machine whose workload tolerates a cold-start on the first request after idle.
Canonical instance on this wiki: patterns/proxy-autostop-for-gpu-cost-control — Fly Proxy stopping an Ollama Machine behind Flycast so it incurs no cost during silence and wakes on internal requests from the PocketBase app tier. "If there haven't been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it." (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)
Seen in¶
- sources/2024-03-07-flyio-fly-kubernetes-does-more-now — named as the "Service" implementation under FKS; the target of FKS's Service-sync reconciler; and the backing for flycast-based private load balancing.
- sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description — Fly Proxy as the autostart/autostop controller for a GPU Ollama Machine. The cold-start tail the proxy chooses to accept (~45 s on a100-40gb + LLaVA-34b) is the trade the GPU-cost-control pattern makes.
- sources/2025-04-08-flyio-our-best-customers-are-now-robots — Fly Proxy's dynamic request routing as a robot attractant for MCP SSE workloads. Canonical wiki statement: "More recent MCP flows involve repeated and potentially long-lived (SSE) connections. To make this work in a multitenant environment, you want these connections to hit the same (stateful) instance. So we think it's possible that the control we give over request routing is a robot attractant." Fly's dynamic request routing lets the tenant pin connection-level affinity per client, which is exactly what multitenant MCP-SSE deployments need (long-lived SSE session affinity). Canonical instance of patterns/session-affinity-for-mcp-sse and an RX data point on the networking axis.
Anycast router architecture — the state-distribution problem¶
Per the 2025-05-28 parking_lot debugging retrospective, the
load-bearing difficulty of fly-proxy is not the proxy
logic per se — it's managing millions of connections for
millions of apps across 30+ regions where Machines can start
in <1 s and terminate instantly:
"This is the hard problem: managing millions of connections for millions of apps. It's a lot of state to manage, and it's in constant flux. We refer to this as the 'state distribution problem', but really, it quacks like a routing protocol." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
RIB and FIB¶
Analogous to a network router's RIB (system of record) and
FIB (fast-forwarding table), fly-proxy pairs:
- Corrosion2 — global state in CRDT-backed SQLite, disseminated host-to-host by SWIM gossip (the RIB); updates propagate within milliseconds.
- The Catalog —
fly-proxy's in-memory aggregation of Corrosion state for fast routing decisions (the FIB): "a record of everything in Corrosion a proxy might need to know about to forward requests."
Regionalization¶
Historically the Catalog was a global broadcast domain —
every proxy received updates for every Fly Machine. The 2024
outage (the `if let`
read-lock-over-both-arms bug) was triggered by an update
about an app nobody used, which had no business reaching
most proxies. Fly is mid-migration to a regionalized Catalog
where most updates stay within the region. "Why? Because it
scales, and fixing it turns out to be a huge lift. It's a
lift we're still making!"
Catalog RWLock + Rust concurrency stack¶
The Catalog is protected by
`parking_lot` RwLocks. Readers (request-forwarding
threads) dominate; writers (Corrosion update consumers)
arrive only on state updates.
2024 outage class — `if let` read-guard held over both the `if`
and `else` arms: canonical
if-let-lock-scope-bug
instance. Fixed by eliminating `if let`-over-lock shapes
fleet-wide.
2024 safety net — Watchdog on an internal REPL control channel kills the proxy when the channel becomes nonresponsive (deadlock / dead-loop / exhaustion). Canonical REPL-channel liveness probe + watchdog-bounce instance. Core dump collected on every kill.
2025 class — a `parking_lot` `try_write_for` timeout +
reader-release wake-path race caused a
bitwise double-free that
corrupted the lock word to `0xFFFFFFFFFFFFFFFF` and produced
artificial deadlocks (every thread waiting; no owner).
Fixed upstream:
parking_lot PR #466
— Fly.io's sixth upstream-the-fix
instance on the wiki.
Permanent improvements from the 2025 incident¶
- All RAII lock guards replaced with explicit closures (see patterns/raii-to-explicit-closure-for-lock-visibility).
- All Catalog write locks use `try_write_for` bounded acquisition with labeled-log + metric telemetry (see patterns/lock-timeout-for-contention-telemetry).
- Last-and-current writer identity tracked with context information (app IDs, call-site). "Next time we have a deadlock, we should have all the information we need to identify the actors without `gdb` stack traces."
Seen in (additional)¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Primary architectural source on fly-proxy's role as Fly.io's Anycast router (complement to the edge / service-discovery role surfaced by earlier Fly Kubernetes and Ollama posts). Documents the RIB/FIB split with Corrosion, the state-distribution problem as the load-bearing difficulty, the 2024 global Anycast outage from an `if let` read-lock-scope bug, the watchdog + REPL-channel safety net that made deadlocks nonlethal, the regionalization effort to confine routing updates to regions, the lazy-loading Catalog refactor that exposed the 2025 `parking_lot` bug, and the RWLock-contention / lock-word-corruption debugging arc — canonical instance of patterns/upstream-the-fix (Fly.io's second Rust upstream) + concepts/descent-into-madness-debugging.
- sources/2025-10-22-flyio-corrosion — confirmation of the RIB/FIB framing from the RIB side. The 2024-09-01 contagious deadlock in fly-proxy is the framing outage of the canonical Corrosion post; the fix catalogue (Tokio watchdogs fleet-wide, Antithesis investment, whole-object republish, regionalization) directly benefits fly-proxy's consumer-side reliability. fly-proxy is the canonical FIB consumer of the RIB that post describes.