
Fly Proxy

Fly Proxy is Fly.io's in-house proxy — the shared edge- and private-network proxy fabric backing Fly Machines' ingress, load balancing, and internal *.flycast private endpoints.

Role in FKS

Under FKS, the Fly Proxy is the implementation backing Kubernetes Service objects. Per the primitive-mapping table in the FKS beta post:

Services → The Fly Proxy

When kubectl expose pod ... --port=8080 runs, FKS creates a standard ClusterIP Service object with an IPv6 ClusterIP (e.g. fdaa:0:48c8:0:1::1a). Service annotations expose the internal plumbing: fly.io/clusterip-allocator: configured and service.fly.io/sync-version: <n> — a bespoke allocator / sync loop on FKS's side reconciles K8s Service objects into the Fly Proxy's routing table.
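The sync loop's job can be sketched as a small reconcile step. All names below are invented for illustration; the real allocator / sync loop lives on FKS's side and writes into the Fly Proxy's actual routing state:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Service {
    name: String,
    cluster_ip: String, // e.g. an IPv6 ClusterIP like "fdaa:0:48c8:0:1::1a"
    port: u16,
}

/// Reconcile the desired set of K8s Service objects into a proxy-side
/// routing table, returning how many entries were added or updated.
fn reconcile(services: &[Service], table: &mut HashMap<String, (String, u16)>) -> usize {
    let mut changed = 0;
    for svc in services {
        let entry = (svc.cluster_ip.clone(), svc.port);
        if table.get(&svc.name) != Some(&entry) {
            table.insert(svc.name.clone(), entry);
            changed += 1;
        }
    }
    // Drop routes for Services that no longer exist.
    table.retain(|name, _| services.iter().any(|s| &s.name == name));
    changed
}

fn main() {
    let mut table = HashMap::new();
    let services = vec![Service {
        name: "web".into(),
        cluster_ip: "fdaa:0:48c8:0:1::1a".into(),
        port: 8080,
    }];
    assert_eq!(reconcile(&services, &mut table), 1); // first sync writes the route
    assert_eq!(reconcile(&services, &mut table), 0); // steady state: nothing to do
}
```

The `service.fly.io/sync-version: <n>` annotation suggests exactly this kind of level-triggered loop: each pass converges the table, and an unchanged input is a no-op.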

The three ways to reach the Service described in the post — direct IPv6, flycast hostname, and in-cluster CoreDNS — all land on the Fly Proxy for the actual forwarding step.

Machine lifecycle — autostart / autostop

Fly Proxy owns the start/stop lifecycle of Fly Machines when autostart / autostop is enabled on a Fly App. The proxy detects inbound requests for a stopped Machine and starts it; it detects that a running Machine has gone quiet for the configured idle period and stops it. The app tier is unaware.
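A minimal model of that decision loop, with invented names and no real Machines API:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum MachineState {
    Stopped,
    Running,
}

/// Proxy-side lifecycle sketch: an inbound request wakes a stopped Machine;
/// idling past the threshold stops a running one. The app never participates.
struct ManagedMachine {
    state: MachineState,
    last_request: Instant,
}

impl ManagedMachine {
    fn new() -> Self {
        Self { state: MachineState::Stopped, last_request: Instant::now() }
    }

    fn on_request(&mut self) {
        if self.state == MachineState::Stopped {
            self.state = MachineState::Running; // autostart
        }
        self.last_request = Instant::now();
    }

    fn on_idle_tick(&mut self, idle_timeout: Duration) {
        if self.state == MachineState::Running && self.last_request.elapsed() >= idle_timeout {
            self.state = MachineState::Stopped; // autostop: releases CPU/GPU/RAM
        }
    }
}

fn main() {
    let mut m = ManagedMachine::new();
    m.on_request();
    assert_eq!(m.state, MachineState::Running);
    m.on_idle_tick(Duration::ZERO); // zero timeout: stop on the very next tick
    assert_eq!(m.state, MachineState::Stopped);
}
```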

This is the load-bearing primitive for GPU cost control on Fly.io: a GPU Machine left running costs serious money, but the Fly Proxy can stop it during silence and resume it on the next internal request — see the autostart-and-stop docs. The pattern generalises beyond GPU to any expensive-to-run Machine whose workload tolerates a cold-start on the first request after idle.

Canonical instance on this wiki: patterns/proxy-autostop-for-gpu-cost-control — Fly Proxy stopping an Ollama Machine behind Flycast so it incurs no cost during silence and wakes on internal requests from the PocketBase app tier. "If there haven't been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it." (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)

Seen in

Anycast router architecture — the state-distribution problem

Per the 2025-05-28 parking_lot debugging retrospective, the load-bearing difficulty of fly-proxy is not the proxy logic per se — it's managing millions of connections for millions of apps across 30+ regions where Machines can start in <1 s and terminate instantly:

"This is the hard problem: managing millions of connections for millions of apps. It's a lot of state to manage, and it's in constant flux. We refer to this as the 'state distribution problem', but really, it quacks like a routing protocol." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)

RIB and FIB

Analogous to a network router's RIB (system of record) and FIB (fast-forwarding table), fly-proxy pairs:

  • Corrosion2 — global CRDT-SQLite SWIM-gossip state (the RIB); updates propagate host-to-host in millisecond intervals.
  • The Catalog — fly-proxy's in-memory aggregation of Corrosion state for fast routing decisions (the FIB): "a record of everything in Corrosion a proxy might need to know about to forward requests."
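The RIB→FIB relationship can be sketched as a consumer that folds gossip updates into an in-memory routing map. The struct shapes here are invented; the real Catalog is far richer:

```rust
use std::collections::HashMap;

/// A RIB-side update as it might arrive via Corrosion gossip (fields invented).
struct RibUpdate {
    app: String,
    machine: String,
    addr: String,
    alive: bool,
}

/// FIB-side sketch: only what a proxy needs to forward a request, kept hot.
#[derive(Default)]
struct Catalog {
    // app -> (machine -> addr)
    routes: HashMap<String, HashMap<String, String>>,
}

impl Catalog {
    fn apply(&mut self, u: RibUpdate) {
        let machines = self.routes.entry(u.app).or_default();
        if u.alive {
            machines.insert(u.machine, u.addr);
        } else {
            machines.remove(&u.machine); // Machines can terminate instantly
        }
    }

    fn route(&self, app: &str) -> Option<&String> {
        self.routes.get(app)?.values().next()
    }
}

fn main() {
    let mut c = Catalog::default();
    c.apply(RibUpdate {
        app: "a".into(),
        machine: "m1".into(),
        addr: "[fdaa::1]:443".into(),
        alive: true,
    });
    assert_eq!(c.route("a").map(|s| s.as_str()), Some("[fdaa::1]:443"));
    c.apply(RibUpdate { app: "a".into(), machine: "m1".into(), addr: String::new(), alive: false });
    assert_eq!(c.route("a"), None);
}
```

The constant-flux property the post describes is exactly the apply-stream: Machines starting in under a second and dying instantly means this map churns continuously on every host.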

Regionalization

Historically the Catalog was a global broadcast domain — every proxy received updates for every Fly Machine. The 2024 outage (the if let read-lock-over-both-arms bug) was triggered by an update about an app nobody used, which had no business reaching most proxies. Fly is mid-migration to a regionalized Catalog where most updates stay within the region. "Why? Because it scales, and fixing it turns out to be a huge lift. It's a lift we're still making!"
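At its core, the regionalized model amounts to a relevance filter at each proxy (a sketch; field names invented):

```rust
/// A Catalog update, scoped either to one region or globally (names invented).
struct CatalogUpdate {
    region: Option<String>, // None = globally scoped
    app: String,
}

/// Under regionalization, a proxy applies an update only if it is global or
/// scoped to the proxy's own region; the old model broadcast everything.
fn applies_here(update: &CatalogUpdate, my_region: &str) -> bool {
    match &update.region {
        None => true,
        Some(r) => r == my_region,
    }
}

fn main() {
    let u = CatalogUpdate { region: Some("fra".into()), app: "quiet-app".into() };
    assert!(applies_here(&u, "fra"));
    assert!(!applies_here(&u, "iad")); // most proxies never see this app's churn
    let _ = u.app;
}
```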

Catalog RWLock + Rust concurrency stack

The Catalog is protected by parking_lot RWLocks. Readers (request-forwarding threads) dominate; writers (Corrosion update consumers) arrive on state updates.
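The access pattern looks like the sketch below. It uses `std::sync::RwLock` so it runs with no dependencies; fly-proxy itself uses parking_lot's `RwLock`, which has the same `read()`/`write()` shape minus the `Result` wrapper:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let catalog: Arc<RwLock<HashMap<String, String>>> =
        Arc::new(RwLock::new(HashMap::new()));

    // Writer path: a Corrosion-style update consumer takes the write lock briefly.
    catalog.write().unwrap().insert("app".into(), "[fdaa::1]:443".into());

    // Reader path: many request-forwarding threads take read locks concurrently.
    let readers: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&catalog);
            thread::spawn(move || c.read().unwrap().get("app").cloned())
        })
        .collect();

    for h in readers {
        assert_eq!(h.join().unwrap(), Some("[fdaa::1]:443".to_string()));
    }
}
```

The read-mostly skew is the point: lookups are cheap and parallel, but any writer that lingers, or any reader that holds its guard longer than intended, stalls the whole forwarding path.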

2024 outage class — if let read-guard over both if and else arms: canonical if-let-lock-scope-bug instance. Fixed by eliminating if let-over-lock shapes fleet-wide.
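The bug shape, reconstructed as a sketch with invented names: in pre-2024-edition Rust, a temporary created in an `if let` scrutinee lives until the end of the whole `if`/`else` statement, so a read guard taken inline is still held inside the `else` arm. (The Rust 2024 edition changed `if let` temporary scopes to drop the guard before `else`, defusing exactly this shape.)

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// BUGGY SHAPE: the temporary read guard from `catalog.read()` lives through
// both arms, so the else arm runs while the read lock is still held.
fn buggy(catalog: &RwLock<HashMap<String, String>>, key: &str) {
    if let Some(addr) = catalog.read().unwrap().get(key) {
        forward(addr);
    } else {
        // Read guard still alive here: taking the write lock now self-deadlocks
        // on std's RwLock, and can deadlock under writer-priority policies too.
        // catalog.write().unwrap().insert(key.to_string(), resolve(key));
    }
}

// FIXED SHAPE: bind the lookup first so the guard drops before branching.
fn fixed(catalog: &RwLock<HashMap<String, String>>, key: &str) {
    let cached = catalog.read().unwrap().get(key).cloned(); // guard dropped here
    match cached {
        Some(addr) => forward(&addr),
        None => {
            catalog.write().unwrap().insert(key.to_string(), resolve(key));
        }
    }
}

fn forward(_addr: &str) {}
fn resolve(key: &str) -> String { format!("addr-for-{key}") }

fn main() {
    let catalog = RwLock::new(HashMap::new());
    buggy(&catalog, "k"); // falls through the empty else arm; guard then drops
    fixed(&catalog, "k"); // safely upgrades to the write lock and inserts
    assert_eq!(catalog.read().unwrap().get("k").cloned(), Some("addr-for-k".to_string()));
}
```

Eliminating the shape fleet-wide, rather than fixing one call site, is the right response precisely because the hazard is syntactic: any `if let` over a lock method reintroduces it.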

2024 safety net — Watchdog on an internal REPL control channel kills the proxy when the channel becomes nonresponsive (deadlock / dead-loop / exhaustion). Canonical REPL-channel liveness probe + watchdog-bounce instance. Core dump collected on every kill.
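The probe half of that safety net can be sketched with a plain channel round-trip (names invented; the real watchdog kills the proxy and collects a core dump where this sketch merely reports failure):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Liveness probe over an internal control channel: send a reply handle,
/// expect an ack within the deadline. A wedged proxy (deadlock, dead loop,
/// exhaustion) never answers, and the watchdog would bounce the process.
fn watchdog_check(ctl: &mpsc::Sender<mpsc::Sender<()>>, deadline: Duration) -> bool {
    let (reply_tx, reply_rx) = mpsc::channel();
    if ctl.send(reply_tx).is_err() {
        return false; // control channel gone: treat as dead
    }
    reply_rx.recv_timeout(deadline).is_ok()
}

fn main() {
    // Healthy responder: acks every probe, standing in for the REPL channel.
    let (ctl_tx, ctl_rx) = mpsc::channel::<mpsc::Sender<()>>();
    thread::spawn(move || {
        for reply in ctl_rx {
            let _ = reply.send(());
        }
    });
    assert!(watchdog_check(&ctl_tx, Duration::from_secs(1)));

    // Wedged responder: nobody is listening, so the probe fails fast.
    let (dead_tx, dead_rx) = mpsc::channel::<mpsc::Sender<()>>();
    drop(dead_rx);
    assert!(!watchdog_check(&dead_tx, Duration::from_secs(1)));
}
```

The design choice worth noting is kill-and-dump rather than recover-in-place: a nonresponsive channel is treated as unrecoverable, and the core dump preserves the evidence that made the 2025 postmortem possible.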

2025 class — parking_lot try_write_for timeout + reader-release wake-path race caused a bitwise double-free that corrupted the lock word to 0xFFFFFFFFFFFFFFFF and produced artificial deadlocks (every thread waiting; no owner). Fixed upstream: parking_lot PR #466 — Fly.io's sixth upstream-the-fix instance on the wiki.
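parking_lot's `try_write_for(timeout)` returns `Option<guard>` instead of blocking indefinitely; that timed wake path is where the race lived. The pattern can be approximated over std's `RwLock` (which has no timed acquire) with a polling sketch, enough to show why a writer would want to give up under reader pressure:

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Approximation of parking_lot's timed write acquire: poll `try_write`
/// until a deadline, then give up with None rather than stalling forever.
fn try_write_for<'a, T>(
    lock: &'a RwLock<T>,
    timeout: Duration,
) -> Option<std::sync::RwLockWriteGuard<'a, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Ok(guard) = lock.try_write() {
            return Some(guard);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::thread::yield_now();
    }
}

fn main() {
    let lock = RwLock::new(0u32);
    assert!(try_write_for(&lock, Duration::from_millis(50)).is_some());

    let reader = lock.read().unwrap(); // a held read guard blocks writers
    assert!(try_write_for(&lock, Duration::from_millis(50)).is_none());
    drop(reader);
}
```

parking_lot implements the real thing with parked threads and token handoff rather than polling; the bug was in the interaction between that timeout expiry and a releasing reader's wake path, not in the pattern itself.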

Permanent improvements from the 2025 incident

Seen in (additional)

  • sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Primary architectural source on fly-proxy's role as Fly.io's Anycast router (complement to the edge / service-discovery role surfaced by earlier Fly Kubernetes and Ollama posts). Documents the RIB/FIB split with Corrosion, the state-distribution problem as the load-bearing difficulty, the 2024 global Anycast outage from an if let read-lock-scope bug, the watchdog + REPL-channel safety net that made deadlocks nonlethal, the regionalization effort to confine routing updates to regions, the lazy-loading Catalog refactor that exposed the 2025 parking_lot bug, and the RWLock-contention / lock-word-corruption debugging arc — canonical instance of patterns/upstream-the-fix (Fly.io's second Rust upstream) + concepts/descent-into-madness-debugging.

  • sources/2025-10-22-flyio-corrosion — confirmation of the RIB/FIB framing from the RIB side. The 2024-09-01 contagious deadlock in fly-proxy is the framing outage of the canonical Corrosion post; the fix catalogue (Tokio watchdogs fleet-wide, Antithesis investment, whole-object republish, regionalization) directly benefits fly-proxy's consumer-side reliability. fly-proxy is the canonical FIB consumer of the RIB that post describes.
