FLYIO 2025-02-26 Tier 3

Fly.io — Taming a Voracious Rust Proxy

Summary

A Fly.io incident retrospective (2025-02-26, Tier 3) tracing a CPU-runaway and HTTP-error-spike incident on a couple of IAD edge servers to a TLS close-notify state-machine bug in tokio-rustls that put a fly-proxy TlsStream into a busy-polling spurious-wakeup loop. The bug was triggered when a partner, Tigris Data, ran a load test whose connections closed with buffered data still on the underlying socket. The flamegraph showed Rust tracing's Subscriber dominating CPU time, which was the giveaway (entering/exiting a tokio span is supposed to be almost free, so if it dominates, the traced code is doing almost nothing and the containing Future is being poll'd in a tight loop). The fix was [upstream rustls PR #1950](https://github.com/rustls/rustls/pull/1950/files); canonical patterns/upstream-the-fix / patterns/flamegraph-to-upstream-fix instance.

Key takeaways

  1. Symptom: two edge tripwires tripped in IAD — elevated fly-proxy HTTP errors + skyrocketing CPU utilisation on a couple of hosts. Bouncing fly-proxy cleared it; it came back. "For some number of hours, we're in an annoying steady-state of getting paged and bouncing proxies." (Source: article body.)
  2. Diagnostic move: Pavel (proxy team) pulled a flamegraph from an angry proxy. A huge chunk was dominated by Rust tracing's Subscriber, which was "fuckin' weird", because entering/exiting a span in a Tokio stack is supposed to be very fast. If it dominates the profile, the traced code must be doing next to nothing: the whole Future is being poll'd in a tight loop.
  3. Async-Rust primer embedded in the post worth preserving on the wiki: a Future is a state machine exposing one op, poll. Tokio drives it by passing a Waker, which is the handle the Future uses to tell Tokio "something happened, poll me again." AsyncRead builds on Future and returns Ready whenever data is available; the caller keeps calling poll_read until it stops returning Ready.
  4. Two footguns in this design (explicit in the article): (a) a poll of a Pending Future that accidentally trips its own Waker, producing an infinite loop; (b) an AsyncRead whose poll_read returns Ready without actually progressing its underlying state machine, also an infinite loop because the caller keeps asking. The profile pattern at Fly.io was the second: "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O" (Source: article body). Canonical wiki instance of concepts/spurious-wakeup-busy-loop.
  5. Suspect narrowing via the fully-qualified Future type in the flamegraph:
    &mut fp_io::copy::Duplex<&mut fp_io::reusable_reader::ReusableReader<
      fp_tcp::peek::PeekableReader<
        tokio_rustls::server::TlsStream<
          fp_tcp_metered::MeteredIo<
            fp_tcp::peek::PeekableReader<
              fp_tcp::permitted::PermittedTcpStream>>>>>,
      connect::conn::Conn<tokio::net::tcp::stream::TcpStream>>
    
    Fly's own wrapper types don't touch Waker directly; that left Duplex (not recently changed, can't reproduce) and TlsStream (from Rustls via tokio-rustls) — which does have to reach into the async executor.
  6. Root cause: rustls/tokio-rustls#72 — on orderly TLS shutdown with a CloseNotify Alert record, the sender has declared no more data will be sent; but if the underlying socket still has buffered bytes, TlsStream mishandles its Waker and falls into a busy-loop. Canonical concepts/tls-close-notify edge case — the close_notify-with-buffered-trailer scenario is rare enough that it didn't show in normal traffic.
  7. Trigger: Tigris Data — Fly's object-storage partner — was running a load test. Traffic volume was modest (tens of thousands of connections) but each connection sent a small HTTP body and terminated early, which was enough to make some fraction hit the "close_notify happened before EOF" state. Fly asked Tigris to stop the load test while investigating; resumed after deploying the fix — "no spin-outs."
  8. Fix: upstream rustls PR #1950 — described as "pretty straightforward." Canonical patterns/upstream-the-fix instance — the bug is in a shared ecosystem primitive used by everybody who does TLS in Rust, so the fix goes upstream, not around it.
  9. Lessons the post draws on itself (sharp, recorded verbatim):
     • "Keep your dependencies updated. Unless you shouldn't keep your dependencies updated." Always update for vulnerabilities (this was technically a DoS vulnerability) and important bugfixes; otherwise "updating for the hell of it might also destabilize your project." The real problem is the process and test infrastructure to metabolise updates confidently, not the updates themselves. (Canonical wiki instance of patterns/dependency-update-discipline.)
     • "Spurious wakeups should be easy to spot, and triggering a metric when they happen should be cheap, because they're not supposed to happen often. So that's something we'll go do now." Canonical wiki statement of patterns/spurious-wakeup-metric as a cheap instrumentation primitive that would have caught this earlier.
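The Future/Waker mechanics from the primer (items 3 and 4) can be sketched with a std-only toy, no tokio required. `SelfWaking` and `noop_waker` are illustrative names, not fly-proxy code; the point is footgun (a): a Future that trips its own Waker while still Pending gets re-polled immediately, forever.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Footgun (a): a Future that wakes its own Waker on every poll while
/// returning Pending. An executor will re-poll it in a tight loop,
/// burning CPU without doing any real work.
struct SelfWaking {
    polls: u32,
}

impl Future for SelfWaking {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls += 1;
        if self.polls >= 5 {
            return Poll::Ready(()); // bail out so the demo terminates
        }
        // The bug: signalling "poll me again" even though no progress was made.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// A do-nothing Waker, just enough to drive poll() by hand.
fn noop_waker() -> Waker {
    fn noop(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = SelfWaking { polls: 0 };
    let mut pinned = Pin::new(&mut fut);
    let mut spins = 0u32;
    // Mirrors the executor's behaviour when the Waker fires the instant
    // poll returns Pending: poll again immediately.
    while pinned.as_mut().poll(&mut cx).is_pending() {
        spins += 1;
    }
    println!("Pending polls before Ready: {spins}"); // prints 4
}
```

Footgun (b), the one Fly.io actually hit, has the same busy-loop shape but from the caller's side: poll_read keeps reporting Ready, so the copy loop keeps asking.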
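The spurious-wakeup-metric lesson can be sketched as a cheap Future wrapper; `PollCounter` and `PENDING_POLLS` are hypothetical names for illustration, not fly-proxy internals. The cost is one relaxed atomic increment per Pending poll, and runaway growth of the counter between Ready results is the busy-loop signature.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Hypothetical sketch of patterns/spurious-wakeup-metric: count every poll
/// that comes back Pending. In production this would feed a metric; a counter
/// that explodes while the Future makes no progress flags a spin-out.
static PENDING_POLLS: AtomicU64 = AtomicU64::new(0);

struct PollCounter<F> {
    inner: F,
}

impl<F: Future + Unpin> Future for PollCounter<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        match Pin::new(&mut self.inner).poll(cx) {
            Poll::Ready(v) => Poll::Ready(v),
            Poll::Pending => {
                // Cheap: one relaxed atomic increment per Pending poll.
                PENDING_POLLS.fetch_add(1, Ordering::Relaxed);
                Poll::Pending
            }
        }
    }
}

/// A do-nothing Waker to drive the demo by hand.
fn noop_waker() -> Waker {
    fn noop(_: *const ()) {}
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    // A toy Future that stays Pending three times, then completes.
    struct PendingN(u32);
    impl Future for PendingN {
        type Output = ();
        fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
            if self.0 == 0 {
                Poll::Ready(())
            } else {
                self.0 -= 1;
                Poll::Pending
            }
        }
    }
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = PollCounter { inner: PendingN(3) };
    while Pin::new(&mut fut).poll(&mut cx).is_pending() {}
    println!("Pending polls counted: {}", PENDING_POLLS.load(Ordering::Relaxed)); // prints 3
}
```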

Architectural / systems content

Edge vs worker split (recap of prior Fly.io framing, useful on the source page for context): Fly.io's hardware fleet is "roughly divided into two kinds of servers: edges, which receive incoming requests from the Internet, and workers, which run Fly Machines. Edges exist almost solely to run a Rust program called fly-proxy, the router at the heart of our Anycast network." This is the one-line wiki statement of the fly-proxy edge role.

Incident-response process (light detail): Fly.io uses Rootly as their incident-management tool — "we ❤️ Rootly for this, seriously check out Rootly, an infra MVP here for years now." Not ingesting as its own system page but worth noting on fly-proxy's Seen-in for incident-process color. Incident channel spun up, responders quickly concluded "while something hinky was definitely going on, the platform was fine" — edge HTTP errors + CPU were localised to two hosts in one region; bouncing the proxy cleared it. The incident is the sequence of page → bounce → comes back → repeat, which is the signal that triggered the deeper investigation (Pavel pulling a profile) rather than continuing to mitigate tactically.

Duplex — Fly's own proxy I/O state machine — gets a brief characterisation from the post: "Duplex is a beast. It's the core I/O state machine for proxying between connections. It's not easy to reason about in specificity. But: it also doesn't do anything directly with a Waker; it's built around AsyncRead and AsyncWrite." Named but not deep-dived.

Numbers disclosed

  • Traffic volume at trigger: "tens of thousands of connections, tops" (Tigris load test) — the key qualitative framing is that the trigger did not require high volume.
  • Affected hosts: "a couple of hosts in IAD" — geographically localised.
  • Time-to-mitigation: "For some number of hours, we're in an annoying steady-state of getting paged and bouncing proxies" — bounce-and-wait cycle measured in hours, pre-root-cause.
  • No CPU % numbers, no error-rate numbers, no request-rate numbers, no kernel-time vs user-time split from the flamegraph (qualitative: "samples that almost terminate in libc, but spend next to no time in the kernel doing actual I/O").

Numbers not disclosed

  • No fly-proxy fleet-wide QPS / connection-count.
  • No rustls / tokio-rustls version pre-fix.
  • No before/after CPU utilisation.
  • No percentage of connections that hit the close_notify-with-buffered-trailer state.
  • No Duplex internal structure.
  • No production rollout cadence for the rustls upgrade.
  • No list of downstream impacted customers (only Tigris is named, and explicitly as "not the cause, just the trigger").

Caveats

  • This is a Tier 3 Fly.io post but passes the AGENTS.md scope filter squarely: production-incident retrospective, distributed-systems internals (tokio scheduling + async-rust state machines), a named bug + upstream fix (tokio-rustls + rustls), and two reusable lessons. Not a product-PR post.
  • The async-Rust primer is a teaching summary, not a formal spec — readers who want precise Future / Waker / AsyncRead semantics should read Tokio's documentation and the std::future RFC.
  • The post credits Pavel by first name but does not give full attribution — we cite the post, not the individual.
  • "Rustls mishandles its Waker" is the post's characterisation; the authoritative technical detail is in tokio-rustls issue #72 and rustls PR #1950.

Relationship to existing wiki
