Fly.io — Taming a Voracious Rust Proxy¶
Summary¶
A Fly.io incident retrospective (2025-02-26, Tier 3) tracing a
CPU-runaway + HTTP-error-spike incident on a couple of IAD edge
servers to a TLS close-notify state-machine bug in tokio-rustls
that put a fly-proxy TlsStream into a busy-polling spurious-wakeup
loop. The bug was triggered when a partner — Tigris Data — ran a
load test whose connections closed with buffered data still on the
underlying socket. The flamegraph showed Rust tracing's Subscriber
dominating CPU time, which was the giveaway (entering/exiting a
tokio span is supposed to be almost free, so if it dominates, the
traced code is doing almost nothing and the containing Future is
being poll'd in a tight loop). The fix was [upstream rustls PR
1950](https://github.com/rustls/rustls/pull/1950/files) — canonical
patterns/upstream-the-fix / patterns/flamegraph-to-upstream-fix instance.
Key takeaways¶
- Symptom: two edge tripwires tripped in IAD — elevated fly-proxy HTTP errors plus skyrocketing CPU utilisation on a couple of hosts. Bouncing fly-proxy cleared it; it came back. "For some number of hours, we're in an annoying steady-state of getting paged and bouncing proxies." (Source: article body.)
- Diagnostic move: Pavel (proxy team) pulled a flamegraph from an angry proxy. A huge chunk was dominated by Rust tracing's Subscriber — "fuckin' weird", because entering/exiting a span in a Tokio stack is supposed to be very fast. If it dominates the profile, the traced code must be doing next to nothing — the whole Future is being poll'd in a tight loop.
- Async-Rust primer embedded in the post, worth preserving on the wiki: a Future is a state machine exposing one op, poll. Tokio drives it by passing a Waker — the Waker is the handle the Future uses to tell Tokio "something happened, poll me again." AsyncRead builds on Futures and returns Ready every time there's data ready — the caller keeps poll_read-ing until it stops being Ready.
- Two footguns in this design (explicit in the article): (a) a poll of a Pending Future that accidentally trips its own Waker — infinite loop; (b) an AsyncRead whose poll_read returns Ready without actually progressing its underlying state machine — also an infinite loop, because the caller keeps asking. The profile pattern at Fly.io was the second: "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O" (Source: article body). Canonical wiki instance of concepts/spurious-wakeup-busy-loop.
- Suspect narrowing via the fully-qualified Future type in the flamegraph:

      &mut fp_io::copy::Duplex<
          &mut fp_io::reusable_reader::ReusableReader<
              fp_tcp::peek::PeekableReader<
                  tokio_rustls::server::TlsStream<
                      fp_tcp_metered::MeteredIo<
                          fp_tcp::peek::PeekableReader<
                              fp_tcp::permitted::PermittedTcpStream>>>>>,
          connect::conn::Conn<tokio::net::tcp::stream::TcpStream>>

  Fly's own wrapper types don't touch the Waker directly; that left Duplex (not recently changed, can't reproduce) and TlsStream (from Rustls via tokio-rustls) — which does have to reach into the async executor.
- Root cause: rustls/tokio-rustls#72 — on orderly TLS shutdown, a CloseNotify alert record declares that no more data will be sent; but if the underlying socket still has buffered bytes, TlsStream mishandles its Waker and falls into a busy-loop. Canonical concepts/tls-close-notify edge case — the close_notify-with-buffered-trailer scenario is rare enough that it didn't show in normal traffic.
- Trigger: Tigris Data — Fly's object-storage partner — was running a load test. Traffic volume was modest (tens of thousands of connections), but each connection sent a small HTTP body and terminated early, which was enough to make some fraction hit the "close_notify happened before EOF" state. Fly asked Tigris to stop the load test while investigating; it resumed after the fix was deployed — "no spin-outs."
- Fix: upstream rustls PR #1950 — described as "pretty straightforward." Canonical patterns/upstream-the-fix instance — the bug is in a shared ecosystem primitive used by everybody who does TLS in Rust, so the fix goes upstream, not around it.
- Lessons the post draws (sharp, recorded verbatim):
  - "Keep your dependencies updated. Unless you shouldn't keep your dependencies updated." Always update for vulnerabilities (this was technically a DoS vulnerability) and important bugfixes; otherwise "updating for the hell of it might also destabilize your project." The real problem is having the process and test infrastructure to metabolise updates confidently — not the updates themselves. (Canonical wiki instance of patterns/dependency-update-discipline.)
  - "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often. So that's something we'll go do now." — canonical wiki statement of patterns/spurious-wakeup-metric as a cheap instrumentation primitive that would have caught this earlier.
Architectural / systems content¶
Edge vs worker split (recap of prior Fly.io framing, useful
on the source page for context): Fly.io's hardware fleet is
"roughly divided into two kinds of servers: edges, which
receive incoming requests from the Internet, and workers, which
run Fly Machines. Edges exist almost solely to run a Rust
program called fly-proxy, the router at the heart of our
Anycast network." This is the one-line
wiki statement of the fly-proxy edge role.
Incident-response process (light detail): Fly.io uses Rootly as their incident-management tool — "we ❤️ Rootly for this, seriously check out Rootly, an infra MVP here for years now." Not ingesting as its own system page, but worth noting on fly-proxy's Seen-in for incident-process color.

An incident channel was spun up, and responders quickly concluded "while something hinky was definitely going on, the platform was fine" — edge HTTP errors + CPU were localised to two hosts in one region, and bouncing the proxy cleared it. The incident is the sequence page → bounce → comes back → repeat, and that repetition is the signal that triggered the deeper investigation (Pavel pulling a profile) rather than continuing to mitigate tactically.
Duplex — Fly's own proxy I/O state machine — gets a brief
characterisation from the post: "Duplex is a beast. It's the
core I/O state machine for proxying between connections. It's
not easy to reason about in specificity. But: it also doesn't do
anything directly with a Waker; it's built around AsyncRead
and AsyncWrite." Named but not deep-dived.
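As an analogy for the copy loop at Duplex's core, here is a blocking, std-only sketch (not Fly's code): pump one direction until read returns Ok(0). In the async version, blocking read becomes poll_read, and a stream that reports readiness without ever making progress traps this loop forever:

```rust
use std::io::{self, Read, Write};

/// Blocking analogue of a one-directional proxy copy loop: read from
/// `from`, write everything to `to`, stop only at EOF (read returning 0).
fn pump<R: Read, W: Write>(mut from: R, mut to: W) -> io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = from.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // EOF is the loop's only normal exit
        }
        to.write_all(&buf[..n])?;
        total += n as u64;
    }
}

fn main() -> io::Result<()> {
    // In-memory stand-ins for the two sides of a proxied connection.
    let mut out = Vec::new();
    let n = pump(&b"hello from the edge"[..], &mut out)?;
    println!("pumped {} bytes", n);
    Ok(())
}
```

A full duplex runs this in both directions at once; the async variant swaps the blocking read for poll_read, which is exactly the surface the tokio-rustls bug poisoned.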
Numbers disclosed¶
- Traffic volume at trigger: "tens of thousands of connections, tops" (Tigris load test) — the key qualitative framing is that the trigger did not require high volume.
- Affected hosts: "a couple of hosts in IAD" — geographically localised.
- Time-to-mitigation: "For some number of hours, we're in an annoying steady-state of getting paged and bouncing proxies" — a bounce-and-wait cycle measured in hours, pre-root-cause.
- No CPU % numbers, no error-rate numbers, no request-rate numbers, no kernel-time vs user-time split from the flamegraph (qualitative: "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O").
Numbers not disclosed¶
- No fly-proxy fleet-wide QPS / connection-count.
- No rustls / tokio-rustls version pre-fix.
- No before/after CPU utilisation.
- No percentage of connections that hit the close_notify-with-buffered-trailer state.
- No Duplex internal structure.
- No production rollout cadence for the rustls upgrade.
- No list of downstream impacted customers (only Tigris is named, and explicitly as "not the cause, just the trigger").
Caveats¶
- This is a Tier 3 Fly.io post but passes the AGENTS.md scope filter squarely: production-incident retrospective, distributed-systems internals (tokio scheduling + async-rust state machines), a named bug + upstream fix (tokio-rustls + rustls), and two reusable lessons. Not a product-PR post.
- The async-Rust primer is a teaching summary, not a formal spec — readers who want precise Future / Waker / AsyncRead semantics should read Tokio's documentation and the std::future RFC.
- The post credits Pavel by first name but does not give full attribution — we cite the post, not the individual.
- "Rustls mishandles its Waker" is the post's characterisation; the authoritative technical detail is in issue #72 and PR #1950.
Relationship to existing wiki¶
- Extends systems/fly-proxy — previously stubbed as the FKS Service backend; this source adds the edge-router framing ("edges exist almost solely to run fly-proxy") and the first production-incident Seen-in.
- Extends systems/tigris — adds the incident-trigger context (Tigris as a Fly.io partner running load tests whose close-pattern exposed a tokio-rustls bug in Fly.io's edge path). Not a Tigris fault; a Tigris trigger.
- Extends concepts/anycast — adds another Fly.io citation for the "fly-proxy, the router at the heart of our Anycast network" framing.
- Extends patterns/upstream-the-fix — adds a Rust-ecosystem instance (Fly.io → rustls PR #1950) to the prior Cloudflare (V8 / Node.js / OpenNext) and Datadog (containerd / kubernetes / go-cmp) instances. The pattern now spans three language ecosystems.
- Introduces seven new concept pages (concepts/async-rust-future, concepts/rust-waker, concepts/asyncread-contract, concepts/spurious-wakeup-busy-loop, and concepts/tls-close-notify, plus concepts/cpu-busy-loop-incident and concepts/flamegraph-profiling) and three new systems (systems/rustls, systems/tokio-rustls, systems/tokio) — the Rust-async-ecosystem primitives this story sits on.
- Introduces three new patterns (patterns/flamegraph-to-upstream-fix, patterns/dependency-update-discipline, patterns/spurious-wakeup-metric).
Source¶
- Original: https://fly.io/blog/taming-rust-proxy/
- Raw markdown:
raw/flyio/2025-02-26-taming-a-voracious-rust-proxy-c894518a.md
Related¶
- systems/fly-proxy
- systems/rustls
- systems/tokio-rustls
- systems/tokio
- systems/tigris
- concepts/async-rust-future
- concepts/rust-waker
- concepts/asyncread-contract
- concepts/spurious-wakeup-busy-loop
- concepts/tls-close-notify
- concepts/cpu-busy-loop-incident
- concepts/flamegraph-profiling
- concepts/anycast
- patterns/flamegraph-to-upstream-fix
- patterns/upstream-the-fix
- patterns/dependency-update-discipline
- patterns/spurious-wakeup-metric
- companies/flyio