Fly Proxy¶
Fly Proxy is Fly.io's in-house proxy — the
shared edge- and private-network proxy fabric backing Fly Machines'
ingress, load balancing, and internal *.flycast private endpoints.
Role in FKS¶
Under FKS, the Fly Proxy is the
implementation backing Kubernetes Service objects. Per the
primitive-mapping table in
the FKS beta
post:
Services → The Fly Proxy
When `kubectl expose pod ... --port=8080` runs, FKS creates a standard
ClusterIP Service object with an IPv6 ClusterIP (e.g.
`fdaa:0:48c8:0:1::1a`). Service annotations expose the internal
plumbing: `fly.io/clusterip-allocator: configured` and
`service.fly.io/sync-version: <n>` — a bespoke allocator / sync loop
on FKS's side reconciles K8s Service objects into the Fly Proxy's
routing table.
The three ways to reach the Service described in the post — direct IPv6, flycast hostname, and in-cluster CoreDNS — all land on the Fly Proxy for the actual forwarding step.
Machine lifecycle — autostart / autostop¶
Fly Proxy owns the start/stop lifecycle of Fly Machines when autostart / autostop is enabled on a Fly App. The proxy detects an inbound request for a stopped Machine and starts it; it detects that a running Machine has been idle past its configured window and stops it. The app tier is unaware of either transition.
This is the load-bearing primitive for GPU cost control on Fly.io: a GPU Machine left running costs serious money, but the Fly Proxy can stop it during silence and resume it on the next internal request — see the autostart-and-stop docs. The pattern generalises beyond GPU to any expensive-to-run Machine whose workload tolerates a cold-start on the first request after idle.
Canonical instance on this wiki: patterns/proxy-autostop-for-gpu-cost-control — Fly Proxy stopping an Ollama Machine behind Flycast so it incurs no cost during silence and wakes on internal requests from the PocketBase app tier. "If there haven't been any requests for a few minutes, the Fly Proxy stops the Ollama Machine, which releases the CPU, GPU, and RAM allocated to it." (Source: sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description)
Seen in¶
- sources/2024-03-07-flyio-fly-kubernetes-does-more-now — named as the "Service" implementation under FKS; the target of FKS's Service-sync reconciler; and the backing for flycast-based private load balancing.
- sources/2024-05-09-flyio-picture-this-open-source-ai-for-image-description — Fly Proxy as the autostart/autostop controller for a GPU Ollama Machine. The cold-start tail the proxy chooses to accept (~45 s on a100-40gb + LLaVA-34b) is the trade the GPU-cost-control pattern makes.
- sources/2025-04-08-flyio-our-best-customers-are-now-robots — Fly Proxy's dynamic request routing as a robot attractant for MCP SSE workloads. Canonical wiki statement: "More recent MCP flows involve repeated and potentially long-lived (SSE) connections. To make this work in a multitenant environment, you want these connections to hit the same (stateful) instance. So we think it's possible that the control we give over request routing is a robot attractant." Fly's dynamic request routing lets the tenant pin connection-level affinity per client, which is exactly what multitenant MCP-SSE deployments need (long-lived SSE session affinity). Canonical instance of patterns/session-affinity-for-mcp-sse and an RX data point on the networking axis.
Anycast router architecture — the state-distribution problem¶
Per the 2025-05-28 parking_lot debugging retrospective, the
load-bearing difficulty of fly-proxy is not the proxy
logic per se — it's managing millions of connections for
millions of apps across 30+ regions where Machines can start
in <1 s and terminate instantly:
"This is the hard problem: managing millions of connections for millions of apps. It's a lot of state to manage, and it's in constant flux. We refer to this as the 'state distribution problem', but really, it quacks like a routing protocol." (Source: sources/2025-05-28-flyio-parking-lot-ffffffffffffffff)
RIB and FIB¶
Analogous to a network router's RIB (system of record) and
FIB (fast-forwarding table), fly-proxy pairs:
- Corrosion2 — global state in CRDT-backed SQLite, disseminated host-to-host by SWIM gossip (the RIB); updates propagate within milliseconds.
- The Catalog —
fly-proxy's in-memory aggregation of Corrosion state for fast routing decisions (the FIB): "a record of everything in Corrosion a proxy might need to know about to forward requests."
Regionalization¶
Historically the Catalog was a global broadcast domain —
every proxy received updates for every Fly Machine. The 2024
outage (the `if let`
read-lock-over-both-arms bug) was triggered by an update
about an app nobody used, which had no business reaching
most proxies. Fly is mid-migration to a regionalized Catalog
where most updates stay within the region. "Why? Because it
scales, and fixing it turns out to be a huge lift. It's a
lift we're still making!"
Catalog RWLock + Rust concurrency stack¶
The Catalog is protected by
`parking_lot` RwLocks. Readers (request-forwarding
threads) dominate; writers (Corrosion update consumers)
arrive only on state updates.
2024 outage class — `if let` read-guard held over both the `if`
and `else` arms: canonical
if-let-lock-scope-bug
instance. Fixed by eliminating `if let`-over-lock shapes
fleet-wide.
2024 safety net — Watchdog on an internal REPL control channel kills the proxy when the channel becomes nonresponsive (deadlock / dead-loop / exhaustion). Canonical REPL-channel liveness probe + watchdog-bounce instance. Core dump collected on every kill.
2025 class — a `parking_lot` `try_write_for` timeout +
reader-release wake-path race caused a
bitwise double-free that
corrupted the lock word to `0xFFFFFFFFFFFFFFFF` and produced
artificial deadlocks (every thread waiting; no owner).
Fixed upstream:
parking_lot PR #466
— Fly.io's sixth upstream-the-fix
instance on the wiki.
Permanent improvements from the 2025 incident¶
- All RAII lock guards replaced with explicit closures (see patterns/raii-to-explicit-closure-for-lock-visibility).
- All Catalog write locks use `try_write_for` bounded acquisition with labeled-log + metric telemetry (see patterns/lock-timeout-for-contention-telemetry).
- Last-and-current writer identity tracked with context information (app IDs, call-site). "Next time we have a deadlock, we should have all the information we need to identify the actors without `gdb` stack traces."
Seen in (additional)¶
- sources/2025-05-28-flyio-parking-lot-ffffffffffffffff — Primary architectural source on fly-proxy's role as Fly.io's Anycast router (complement to the edge / service-discovery role surfaced by earlier Fly Kubernetes and Ollama posts). Documents the RIB/FIB split with Corrosion, the state-distribution problem as the load-bearing difficulty, the 2024 global Anycast outage from an `if let` read-lock-scope bug, the watchdog + REPL-channel safety net that made deadlocks nonlethal, the regionalization effort to confine routing updates to regions, the lazy-loading Catalog refactor that exposed the 2025 `parking_lot` bug, and the RWLock-contention / lock-word-corruption debugging arc — canonical instance of patterns/upstream-the-fix (Fly.io's second Rust upstream) + concepts/descent-into-madness-debugging.
- sources/2025-10-22-flyio-corrosion — confirmation of the RIB/FIB framing from the RIB side. The 2024-09-01 contagious deadlock in fly-proxy is the framing outage of the canonical Corrosion post; the fix catalogue (Tokio watchdogs fleet-wide, Antithesis investment, whole-object republish, regionalization) directly benefits fly-proxy's consumer-side reliability. fly-proxy is the canonical FIB consumer of the RIB that post describes.