How we found a bug in the hyper HTTP library
Summary
Cloudflare's Images team discovered a race condition in hyper (Rust's most widely used HTTP library) that caused silent response truncation on their Images binding. After a December 2025 rearchitecture that replaced the FL intermediary with local Unix sockets, larger image responses were intermittently cut short — the HTTP response returned 200 OK with a correct Content-Length, but only a fraction of the body arrived. After six weeks of debugging across application tracing, distributed tracing, and finally kernel-level strace, the root cause was identified as a discarded Poll::Pending in hyper's HTTP/1 dispatch loop. The fix was four lines of code; the investigation touched every layer of the stack.
Key Takeaways
- The bug was invisible to application-level observability. Tracing, logging, and HTTP status codes all reported success. The Images service genuinely believed it had written the full response. Only `strace`, which records raw syscalls, revealed the premature `shutdown(fd, SHUT_WR)` immediately after a partial `sendto`.
- A performance improvement surfaced a latent bug. The December 2025 migration from FL (network sockets) to a local Unix-socket intermediary made the system faster overall, but the new reader consumed data slightly slower than FL in certain windows. This few-milliseconds difference in backpressure was enough to fill the socket buffer and trigger the race condition that had existed in hyper for years.
- `let _ = expr` in Rust discards `Poll::Pending` silently. In hyper's `dispatch.rs`, the `poll_flush` result was discarded with `let _`. When the socket buffer was full and the flush returned `Poll::Pending`, the loop went on to check `wants_read_again()`, got `false`, returned `Poll::Ready(Ok(()))`, and so triggered shutdown with data still buffered. This is a Rust async footgun: discarding a `Poll` value without checking it can silently skip incomplete I/O.
- Approximately 219 KB was consistently the amount delivered in failing requests. This matched the kernel socket buffer size in production, confirming the hypothesis: only the initial chunk that fit in the socket buffer was sent before the connection was closed.
- The bug reproduced only under real concurrency on the production path. Local curl, integration tests on macOS/Debian VMs, and replayed pcap traces never triggered it. The production Workers runtime reader paused just long enough (on the order of milliseconds) to fill the socket outbound buffer at the critical moment.
- `strace` observability changes the timing. Broadening the syscall filter slowed the process enough to shift the timing window and make the bug disappear, a classic Heisenbug. Only a narrow filter kept overhead low enough to still trigger the failure.
- Two fixes were explored. The initial fix returned `Poll::Pending` from the dispatch loop when flush was incomplete, which was correct for Cloudflare's use case but problematic for keepalive connections. The upstream-accepted fix places the flush check in `poll_shutdown()`, ensuring all buffered data drains before the socket is closed, without altering the dispatch loop's polling semantics.
- A deterministic test was essential for the upstream contribution. A custom TCP stream wrapper that accepted 8 KB and then returned `Poll::Pending` on all subsequent writes let the team write a test that reliably triggered the race without timing sensitivity.
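The deterministic-test idea in the last takeaway can be illustrated with a minimal, std-only sketch. It is not the test from the upstream contribution: `StallingWriter`, `write_until_stalled`, and the inherent `poll_write` method are hypothetical names, and a real test would presumably implement the async runtime's write trait on the wrapper rather than a hand-rolled method.

```rust
use std::io;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical writer that accepts `limit` bytes in total, then stalls forever.
struct StallingWriter {
    accepted: Vec<u8>,
    limit: usize,
}

impl StallingWriter {
    /// Mimics the shape of an async write: Ready(Ok(n)) or Pending.
    fn poll_write(&mut self, _cx: &mut Context<'_>, buf: &[u8]) -> Poll<io::Result<usize>> {
        let room = self.limit - self.accepted.len();
        if room == 0 {
            // Simulates a full socket buffer: every later write is Pending.
            return Poll::Pending;
        }
        let n = room.min(buf.len());
        self.accepted.extend_from_slice(&buf[..n]);
        Poll::Ready(Ok(n))
    }
}

/// Waker that does nothing; enough for polling by hand in a test.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pushes `body` until the writer stalls.
/// Returns the bytes accepted and whether the writer stalled.
fn write_until_stalled(body: &[u8], limit: usize) -> (usize, bool) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut w = StallingWriter { accepted: Vec::new(), limit };
    let mut off = 0;
    while off < body.len() {
        match w.poll_write(&mut cx, &body[off..]) {
            Poll::Ready(Ok(n)) => off += n,
            Poll::Ready(Err(_)) => break,
            Poll::Pending => return (off, true),
        }
    }
    (off, false)
}
```

Because the stall depends only on the byte count, a 64 KB body against an 8 KB limit stalls at the same point on every run, with no dependence on scheduler or socket timing.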
Architectural Details
Request path (post-rearchitecture)
Client → Workers Runtime → Intermediary (local worker binding)
→ [Unix socket] → Images service (hyper) → [encodes image]
→ [hyper writes response to socket] → Intermediary → Workers Runtime → Client
The race condition sequence (failing case)
- Images service finishes encoding; hands full response (e.g. 14.9 MB) to hyper as one in-memory block.
- Hyper buffers the response; marks write state as `Writing::Closed` (encoding complete).
- `poll_flush` is called: the socket accepts ~219 KB, the buffer is full, and the call returns `Poll::Pending`.
- `let _` discards the `Pending` signal.
- `wants_read_again()` returns `false` (full request already consumed).
- Loop returns `Poll::Ready(Ok(()))`, which signals "connection work is done."
- `poll_shutdown()` fires → `shutdown(fd, SHUT_WR)` syscall issued.
- Client receives 219 KB + EOF, despite expecting 14.9 MB.
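The sequence can be modelled with a small std-only sketch. This is not hyper's code: `Model`, `poll_loop_discarding`, `poll_loop_checked`, and `run` are hypothetical names, the byte counts are rounded stand-ins for the figures above, and the model assumes the caller shuts the socket down as soon as the dispatch step reports `Ready`. The checked variant corresponds roughly to the initial fix described in the takeaways.

```rust
use std::sync::Arc;
use std::task::{ready, Context, Poll, Wake, Waker};

/// Hypothetical model of the connection state in the failing sequence.
struct Model {
    buffered: usize,    // response bytes still held by the HTTP layer
    socket_room: usize, // bytes the socket buffer can still take
    delivered: usize,   // bytes handed to the socket so far
    shut_down: bool,    // whether shutdown(SHUT_WR) was "issued"
}

impl Model {
    /// Moves as much as fits; Pending means bytes are still buffered.
    fn poll_flush(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        let n = self.buffered.min(self.socket_room);
        self.buffered -= n;
        self.socket_room -= n;
        self.delivered += n;
        if self.buffered == 0 { Poll::Ready(()) } else { Poll::Pending }
    }

    /// The request was fully consumed, so there is nothing left to read.
    fn wants_read_again(&self) -> bool {
        false
    }

    /// Dispatch step with the flush result discarded (the failing shape).
    fn poll_loop_discarding(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        let _ = self.poll_flush(cx); // Pending is silently dropped here
        if !self.wants_read_again() {
            return Poll::Ready(()); // reports "connection work is done"
        }
        Poll::Pending
    }

    /// Same step, but Pending from the flush is propagated to the caller.
    fn poll_loop_checked(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        ready!(self.poll_flush(cx)); // returns Poll::Pending early
        if !self.wants_read_again() {
            return Poll::Ready(());
        }
        Poll::Pending
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls one dispatch step and returns (bytes delivered, shutdown issued).
fn run(checked: bool) -> (usize, bool) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut m = Model { buffered: 14_900_000, socket_room: 219_000, delivered: 0, shut_down: false };
    let done = if checked { m.poll_loop_checked(&mut cx) } else { m.poll_loop_discarding(&mut cx) };
    if done.is_ready() {
        m.shut_down = true; // assumed: caller shuts down once dispatch is Ready
    }
    (m.delivered, m.shut_down)
}
```

In this model both variants deliver the same first chunk. The only difference is whether shutdown happens while the remaining bytes are still buffered.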
The fix (upstream PR #4018)
pub(crate) fn poll_shutdown(
    &mut self,
    cx: &mut Context<'_>,
) -> Poll<io::Result<()>> {
    ready!(self.poll_flush(cx)?);
    Pin::new(&mut self.io).poll_shutdown(cx)
}
Ensures all buffered data is flushed before issuing shutdown. Leaves the dispatch loop unchanged, avoiding keepalive regressions.
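To see why guarding at the shutdown boundary is sufficient, here is a minimal std-only sketch of the same structure. It is an illustration under stated assumptions, not hyper's implementation: `Conn`, `per_poll`, `drive_guarded`, and `drive_unguarded` are hypothetical, and the mock socket simply accepts a fixed number of bytes on each flush.

```rust
use std::sync::Arc;
use std::task::{ready, Context, Poll, Wake, Waker};

/// Hypothetical connection: `buffered` bytes wait in user space, and the
/// mock socket accepts at most `per_poll` bytes each time it is flushed.
struct Conn {
    buffered: usize,
    per_poll: usize,
    unsent_at_shutdown: Option<usize>, // bytes still buffered when shutdown ran
}

impl Conn {
    fn poll_flush(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        let n = self.buffered.min(self.per_poll);
        self.buffered -= n;
        if self.buffered == 0 { Poll::Ready(()) } else { Poll::Pending }
    }

    /// Shutdown without draining first (the pre-fix shape).
    fn poll_shutdown_unguarded(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        self.unsent_at_shutdown = Some(self.buffered);
        Poll::Ready(())
    }

    /// Shutdown that drains first, mirroring the structure of the fix.
    fn poll_shutdown_guarded(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        ready!(self.poll_flush(cx)); // Pending here postpones the shutdown
        self.unsent_at_shutdown = Some(self.buffered);
        Poll::Ready(())
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Re-polls the guarded shutdown until Ready (a real executor would wait for
/// a wakeup instead of looping). Requires per_poll > 0.
/// Returns (polls needed, bytes lost at shutdown).
fn drive_guarded(total: usize, per_poll: usize) -> (usize, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut conn = Conn { buffered: total, per_poll, unsent_at_shutdown: None };
    let mut polls = 0;
    loop {
        polls += 1;
        if conn.poll_shutdown_guarded(&mut cx).is_ready() {
            break;
        }
    }
    (polls, conn.unsent_at_shutdown.unwrap_or(0))
}

/// Flushes once with the result discarded, then shuts down immediately.
/// Returns the bytes lost at shutdown.
fn drive_unguarded(total: usize, per_poll: usize) -> usize {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut conn = Conn { buffered: total, per_poll, unsent_at_shutdown: None };
    let _ = conn.poll_flush(&mut cx);
    let _ = conn.poll_shutdown_unguarded(&mut cx);
    conn.unsent_at_shutdown.unwrap_or(0)
}
```

The guard sits at the one place where losing buffered bytes becomes irreversible, so callers that reach shutdown with data pending are simply polled again until the buffer is empty.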
Operational Numbers
- Response truncation: ~219 KB delivered out of multi-MB responses (socket buffer size limit)
- Reproduction rate: 19/25 requests failed in one batch run
- Bug existed in hyper: 0.14.x through 1.8.x (multiple major versions, years)
- Time to fix: 6 weeks investigation, 4 lines of code
Caveats
- The article doesn't disclose the exact socket buffer size configuration in production.
- The bug affects only HTTP/1.1 connections; HTTP/2 uses a different write path in hyper.
- Cloudflare's production still runs an internal fork while awaiting the upstream release.
Source
- Original: https://blog.cloudflare.com/hyper-bug/
- Raw markdown: `raw/cloudflare/2026-06-22-how-we-found-a-bug-in-the-hyper-http-library-14861279.md`
Related
- concepts/backpressure — socket buffer filling is the backpressure trigger
- concepts/race-condition — the core bug class
- systems/hyper — the affected system
- systems/cloudflare-images — the Cloudflare product that surfaced the bug
- systems/cloudflare-workers — the runtime environment
- patterns/unix-socket-local-bypass — the architectural change that exposed the latent bug
- patterns/deterministic-test-for-timing-bug — the testing pattern used for the fix
- patterns/fix-at-shutdown-boundary — placing the guard at the narrowest point