FLYIO 2025-02-26 Tier 3

Fly.io — Taming a Voracious Rust Proxy

Summary

A Fly.io incident retrospective (2025-02-26, Tier 3) tracing a CPU-runaway and HTTP-error-spike incident on a couple of IAD edge servers to a TLS close-notify state-machine bug in tokio-rustls that put a fly-proxy TlsStream into a busy-polling spurious-wakeup loop. The bug was triggered when a partner, Tigris Data, ran a load test whose connections closed with buffered data still on the underlying socket. The flamegraph showed Rust tracing's Subscriber dominating CPU time, which was the giveaway (entering/exiting a tokio span is supposed to be almost free, so if it dominates, the traced code is doing almost nothing and the containing Future is being poll'd in a tight loop). The fix was [upstream rustls PR #1950](https://github.com/rustls/rustls/pull/1950/files); canonical patterns/upstream-the-fix / patterns/flamegraph-to-upstream-fix instance.

Key takeaways

  1. Symptom: two edge tripwires tripped in IAD — elevated fly-proxy HTTP errors + skyrocketing CPU utilisation on a couple of hosts. Bouncing fly-proxy cleared it; it came back. "For some number of hours, we're in an annoying steady-state of getting paged and bouncing proxies." (Source: article body.)
  2. Diagnostic move: Pavel (proxy team) pulled a flamegraph from an angry proxy. A huge chunk was dominated by Rust tracing's Subscriber, which was "fuckin' weird", because entering/exiting a span in a Tokio stack is supposed to be very fast. If it dominates the profile, the traced code must be doing next to nothing: the whole Future is being poll'd in a tight loop.
  3. Async-Rust primer embedded in the post worth preserving on the wiki: a Future is a state machine exposing one op, poll. Tokio drives it by passing a Waker, which is the handle the Future uses to tell Tokio "something happened, poll me again." AsyncRead builds on Future and returns Ready whenever data is available; the caller keeps calling poll_read until it stops returning Ready.
  4. Two footguns in this design (explicit in the article): (a) a poll of a Pending Future that accidentally trips its own Waker, producing an infinite loop; (b) an AsyncRead whose poll_read returns Ready without actually progressing its underlying state machine, also an infinite loop because the caller keeps asking. The profile pattern at Fly.io was the second: "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O" (Source: article body). Canonical wiki instance of concepts/spurious-wakeup-busy-loop.
  5. Suspect narrowing via the fully-qualified Future type in the flamegraph:
    &mut fp_io::copy::Duplex<&mut fp_io::reusable_reader::ReusableReader<
      fp_tcp::peek::PeekableReader<
        tokio_rustls::server::TlsStream<
          fp_tcp_metered::MeteredIo<
            fp_tcp::peek::PeekableReader<
              fp_tcp::permitted::PermittedTcpStream>>>>>,
      connect::conn::Conn<tokio::net::tcp::stream::TcpStream>>
    
    Fly's own wrapper types don't touch Waker directly; that left Duplex (not recently changed, can't reproduce) and TlsStream (from Rustls via tokio-rustls) — which does have to reach into the async executor.
  6. Root cause: rustls/tokio-rustls#72 — on orderly TLS shutdown with a CloseNotify Alert record, the sender has declared no more data will be sent; but if the underlying socket still has buffered bytes, TlsStream mishandles its Waker and falls into a busy-loop. Canonical concepts/tls-close-notify edge case — the close_notify-with-buffered-trailer scenario is rare enough that it didn't show in normal traffic.
  7. Trigger: Tigris Data — Fly's object-storage partner — was running a load test. Traffic volume was modest (tens of thousands of connections) but each connection sent a small HTTP body and terminated early, which was enough to make some fraction hit the "close_notify happened before EOF" state. Fly asked Tigris to stop the load test while investigating; resumed after deploying the fix — "no spin-outs."
  8. Fix: upstream rustls PR #1950 — described as "pretty straightforward." Canonical patterns/upstream-the-fix instance — the bug is in a shared ecosystem primitive used by everybody who does TLS in Rust, so the fix goes upstream, not around it.
  9. Lessons the post draws on itself (sharp, recorded verbatim):
     • "Keep your dependencies updated. Unless you shouldn't keep your dependencies updated." Always update for vulnerabilities (this was technically a DoS vulnerability) and important bugfixes; otherwise "updating for the hell of it might also destabilize your project." The real problem is the process and test infrastructure to metabolise updates confidently, not the updates themselves. (Canonical wiki instance of patterns/dependency-update-discipline.)
     • "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often. So that's something we'll go do now." Canonical wiki statement of patterns/spurious-wakeup-metric as a cheap instrumentation primitive that would have caught this earlier.
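The Future/Waker mechanics from the primer (items 3 and 4) can be sketched with a std-only toy, no tokio required. `SelfWaking` and `noop_waker` are illustrative names, not fly-proxy code; the point is footgun (a): a Future that trips its own Waker while still Pending gets re-polled immediately, forever.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Footgun (a): a Future that wakes its own Waker on every poll while
/// returning Pending. An executor will re-poll it in a tight loop,
/// burning CPU without doing any real work.
struct SelfWaking {
    polls: u32,
}

impl Future for SelfWaking {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls += 1;
        if self.polls >= 5 {
            return Poll::Ready(()); // bail out so the demo terminates
        }
        // The bug: signalling "poll me again" even though no progress was made.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// A do-nothing Waker, just enough to drive poll() by hand.
fn noop_waker() -> Waker {
    fn noop(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = SelfWaking { polls: 0 };
    let mut pinned = Pin::new(&mut fut);
    let mut spins = 0u32;
    // Mirrors the executor's behaviour when the Waker fires the instant
    // poll returns Pending: poll again immediately.
    while pinned.as_mut().poll(&mut cx).is_pending() {
        spins += 1;
    }
    println!("Pending polls before Ready: {spins}"); // prints 4
}
```

Footgun (b), the one Fly.io actually hit, has the same busy-loop shape but from the caller's side: poll_read keeps reporting Ready, so the copy loop keeps asking.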
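The spurious-wakeup-metric lesson can be sketched as a cheap Future wrapper; `PollCounter` and `PENDING_POLLS` are hypothetical names for illustration, not fly-proxy internals. The cost is one relaxed atomic increment per Pending poll, and runaway growth of the counter between Ready results is the busy-loop signature.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical sketch of patterns/spurious-wakeup-metric: count every poll
/// that comes back Pending. In production this would feed a metric; a counter
/// that explodes while the Future makes no progress flags a spin-out.
static PENDING_POLLS: AtomicU64 = AtomicU64::new(0);

struct PollCounter<F> {
    inner: F,
}

impl<F: Future + Unpin> Future for PollCounter<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        match Pin::new(&mut self.inner).poll(cx) {
            Poll::Ready(v) => Poll::Ready(v),
            Poll::Pending => {
                // Cheap: one relaxed atomic increment per Pending poll.
                PENDING_POLLS.fetch_add(1, Ordering::Relaxed);
                Poll::Pending
            }
        }
    }
}

/// A do-nothing Waker to drive the demo by hand.
fn noop_waker() -> Waker {
    fn noop(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    // A toy Future that stays Pending three times, then completes.
    struct PendingN(u32);
    impl Future for PendingN {
        type Output = ();
        fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
            if self.0 == 0 {
                Poll::Ready(())
            } else {
                self.0 -= 1;
                Poll::Pending
            }
        }
    }
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = PollCounter { inner: PendingN(3) };
    while Pin::new(&mut fut).poll(&mut cx).is_pending() {}
    println!("Pending polls counted: {}", PENDING_POLLS.load(Ordering::Relaxed)); // prints 3
}
```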

Architectural / systems content

Edge vs worker split (recap of prior Fly.io framing, useful on the source page for context): Fly.io's hardware fleet is "roughly divided into two kinds of servers: edges, which receive incoming requests from the Internet, and workers, which run Fly Machines. Edges exist almost solely to run a Rust program called fly-proxy, the router at the heart of our Anycast network." This is the one-line wiki statement of the fly-proxy edge role.

Incident-response process (light detail): Fly.io uses Rootly as their incident-management tool — "we ❤️ Rootly for this, seriously check out Rootly, an infra MVP here for years now." Not ingesting as its own system page but worth noting on fly-proxy's Seen-in for incident-process color. Incident channel spun up, responders quickly concluded "while something hinky was definitely going on, the platform was fine" — edge HTTP errors + CPU were localised to two hosts in one region; bouncing the proxy cleared it. The incident is the sequence of page → bounce → comes back → repeat, which is the signal that triggered the deeper investigation (Pavel pulling a profile) rather than continuing to mitigate tactically.

Duplex — Fly's own proxy I/O state machine — gets a brief characterisation from the post: "Duplex is a beast. It's the core I/O state machine for proxying between connections. It's not easy to reason about in specificity. But: it also doesn't do anything directly with a Waker; it's built around AsyncRead and AsyncWrite." Named but not deep-dived.

Numbers disclosed

  • Traffic volume at trigger: "tens of thousands of connections, tops" (Tigris load test) — the key qualitative framing is that the trigger did not require high volume.
  • Affected hosts: "a couple of hosts in IAD" — geographically localised.
  • Time-to-mitigation: "For some number of hours, we're in an annoying steady-state of getting paged and bouncing proxies" — bounce-and-wait cycle measured in hours, pre-root-cause.
  • No CPU % numbers, no error-rate numbers, no request-rate numbers, no kernel-time vs user-time split from the flamegraph (qualitative: "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O").

Numbers not disclosed

  • No fly-proxy fleet-wide QPS / connection-count.
  • No rustls / tokio-rustls version pre-fix.
  • No before/after CPU utilisation.
  • No percentage of connections that hit the close_notify-with-buffered-trailer state.
  • No Duplex internal structure.
  • No production rollout cadence for the rustls upgrade.
  • No list of downstream impacted customers (only Tigris is named, and explicitly as "not the cause, just the trigger").

Caveats

  • This is a Tier 3 Fly.io post but passes the AGENTS.md scope filter squarely: production-incident retrospective, distributed-systems internals (tokio scheduling + async-rust state machines), a named bug + upstream fix (tokio-rustls + rustls), and two reusable lessons. Not a product-PR post.
  • The async-Rust primer is a teaching summary, not a formal spec — readers who want precise Future / Waker / AsyncRead semantics should read Tokio's documentation and the std::future RFC.
  • The post credits Pavel by first name but does not give full attribution — we cite the post, not the individual.
  • "Rustls mishandles its Waker" is the post's characterisation; the authoritative technical detail is in tokio-rustls issue #72 and rustls PR #1950.

Relationship to existing wiki
