PATTERN Cited by 1 source

Userspace port of kernel primitive — inherited-bug risk

Pattern

When porting a kernel-space primitive to user space, the callback / event boundaries available in the kernel are rarely available one-for-one in user space — and follow-up bug fixes to the kernel primitive may not be visible to the userspace implementer years later when the userspace bug surfaces.

Two distinct risks compose:

  1. Callback-shape mismatch. The kernel version hooks into a specific OS event (e.g. CA_EVENT_TX_START in Linux TCP for the "about to transmit" moment). The user-space version has to approximate that hook from a different call site (e.g. inside a send-path function), and the approximation may be structurally weaker.
  2. Follow-up-fix visibility gap. If the kernel primitive receives a follow-up bug fix within days or weeks, a userspace port made years later may still inherit only the original fix. The reviewer of the port sees the "canonical" kernel commit and reads it as the definitive version; the week-later correction lives in a different commit, possibly by a different author, and may not be linked from the primary.

Canonical instance: the 2020 port of Linux TCP CUBIC's "after idle" optimisation into quiche's on_packet_sent() inherited the 2017 Linux fix but not the 1-week-later follow-up that said "do not set epoch_start in the future" (Source: sources/2026-05-12-cloudflare-when-idle-isnt-idle-how-a-linux-kernel-optimization-became-a-quic-bug).

The canonical instance in detail

2017. Jana Iyengar reports: Linux TCP CUBIC inflates cwnd dangerously after an application-idle period because delta_t = now − epoch_start grows during idleness.

Neal Cardwell corrects an initial "reset epoch" proposal: resetting would restart the growth curve from cwnd's current value, behaving like a loss. The accepted fix (30927520dbae) shifts epoch_start forward by the idle duration, preserving the growth-curve shape.

~1 week later. A follow-up commit (c2e7204d180f) fixes a bug in the first fix: "Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start is normally set at ACK processing time, not at send time. Doing a proper fix would need to add an additional state variable, and does not seem worth the trouble, given CUBIC bug has been there forever before Jana noticed it. Let's simply not set epoch_start in the future, otherwise bictcp_update() could overflow and CUBIC would again grow cwnd too fast."

The fix ships as a guard: if the arithmetic would set epoch_start > now, clamp it.
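The two kernel steps (shift, then clamp) can be sketched as follows. This is a minimal illustration in Rust, not the kernel's C code: the CubicState struct, the on_tx_start function, and the use of plain millisecond counters instead of jiffies are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the CUBIC state touched by the idle fix.
/// Times are plain millisecond counters, not kernel jiffies.
struct CubicState {
    /// Start of the current growth epoch; 0 means "no epoch yet".
    epoch_start: u64,
}

/// Called when transmission restarts after the sender went idle.
/// `last_send` is the time of the previous transmission.
fn on_tx_start(state: &mut CubicState, now: u64, last_send: u64) {
    let idle = now.saturating_sub(last_send);
    if state.epoch_start != 0 && idle > 0 {
        // First fix: shift the epoch forward by the idle time so the
        // growth curve keeps its shape.
        state.epoch_start += idle;
        // Follow-up fix: never leave epoch_start in the future.
        if state.epoch_start > now {
            state.epoch_start = now;
        }
    }
}

fn main() {
    // epoch_start was set at ACK time (t = 900), after the last send (t = 800).
    let mut state = CubicState { epoch_start: 900 };
    on_tx_start(&mut state, 1_000, 800);
    // Unclamped, the result would be 900 + 200 = 1100, which is after `now`.
    println!("epoch_start = {}", state.epoch_start); // prints "epoch_start = 1000"
}
```

The example picks an epoch_start later than the last send on purpose: that is the case the follow-up commit describes, where the epoch is set at ACK time and the idle measurement starts at send time.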

2020. Cloudflare ports the 2017 CUBIC-after-idle optimisation into quiche. Because QUIC runs in user space, there is no CA_EVENT_TX_START callback. The port instead puts the idle-detection logic inside on_packet_sent(), using bytes_in_flight == 0 as the idle predicate and now - last_sent_time as the delta:

// cubic.rs — on_packet_sent() (2020 port, buggy)
fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
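    // Idle predicate: nothing is in flight when this packet is sent.
    if bytes_in_flight == 0 {
        // Idle time is measured from the previous send, not from the last ACK.
        let delta = now - self.last_sent_time;
        // No guard: the shifted start time can end up later than `now`.
        self.congestion_recovery_start_time += delta;
    }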
    self.last_sent_time = now;
}

This port never received the 1-week-later kernel guard. Worse, the callback-shape mismatch (no ACK-processing hook) changes the semantics of bytes_in_flight == 0 from "application was idle between ACK and send" to "transient drain between ACK and next send at minimum cwnd". Those are not the same condition.

2026-05-12. Cloudflare's CI integration test (patterns/adversarial-corner-case-test-for-recovery) surfaces the CUBIC minimum-cwnd death spiral. The fix adds a last_ack_time state variable to approximate the kernel's ACK-processing anchor — essentially the "additional state variable" the 2017 follow-up commit mentioned but declined to add.
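One way an ACK anchor of this kind could work is sketched below. It is not the actual quiche patch: the IdleTracker type, the choice of "the later of last send and last ACK" as the anchor, and the clamp to now are assumptions made for illustration, and only the field name last_ack_time comes from the description above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical recovery state; field names mirror the text, not quiche.
struct IdleTracker {
    last_sent_time: Option<Instant>,
    last_ack_time: Option<Instant>,
    epoch_start: Option<Instant>,
}

impl IdleTracker {
    fn on_ack_received(&mut self, now: Instant) {
        self.last_ack_time = Some(now);
    }

    fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant) {
        if bytes_in_flight == 0 {
            // Assumed anchor: the later of the last send and the last ACK.
            let anchor = match (self.last_sent_time, self.last_ack_time) {
                (Some(s), Some(a)) => Some(s.max(a)),
                (s, a) => s.or(a),
            };
            if let (Some(anchor), Some(epoch)) = (anchor, self.epoch_start) {
                let idle = now.saturating_duration_since(anchor);
                // Shift by the idle time, but never past `now`.
                self.epoch_start = Some((epoch + idle).min(now));
            }
        }
        self.last_sent_time = Some(now);
    }
}

fn main() {
    let t0 = Instant::now();
    let rtt = Duration::from_millis(50);
    let mut tracker = IdleTracker {
        last_sent_time: Some(t0),
        last_ack_time: None,
        epoch_start: Some(t0),
    };

    // Transient drain: the ACK empties the pipe and the next send follows at once.
    let t1 = t0 + rtt;
    tracker.on_ack_received(t1);
    tracker.on_packet_sent(0, t1);
    assert_eq!(tracker.epoch_start, Some(t0)); // no shift

    // Real idleness: the application pauses for 2 s after the next ACK.
    let t_ack = t1 + rtt;
    tracker.on_ack_received(t_ack);
    tracker.on_packet_sent(0, t_ack + Duration::from_secs(2));
    assert_eq!(tracker.epoch_start, Some(t0 + Duration::from_secs(2)));
}
```

With the send-only anchor, the first case would have shifted the epoch by a full RTT even though the sender was never idle; measuring from the ACK makes the transient drain count as zero idle time.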

Structural causes of the gap

  1. Different call-site primitives. Kernel TCP exposes distinct congestion-control hooks such as cwnd_event (which receives events like CA_EVENT_TX_START) and pkts_acked, among others. User-space QUIC has on_packet_sent, on_ack_received, and on_packet_lost as the closest equivalents, which overlap mostly but not exactly. Any logic that depends on the fine-grained distinction between those hooks will port incorrectly.
  2. Follow-up commits don't self-link. Git history shows parent commits but not child commits. A reviewer finding the 2017 primary commit by title or by the Cloudflare blog's link to it will not automatically see the correction commit unless they check git log --follow on the file.
  3. The bug in the uncorrected code is invisible in normal operation. It only fires at minimum cwnd, a corner of the state space, so the reviewed PR passes normal tests, the port ships, and the bug lies dormant for years.
  4. Bug visibility differs by substrate. At the kernel level, CUBIC's epoch-in-the-future bug manifests under conditions similar to those of quiche's death spiral, but the Linux community caught it within a week. Cloudflare caught the quiche version after six years, partly because QUIC adoption at minimum-cwnd-reaching regimes took time, and partly because qlog-based diagnosis is newer.

Defensive disciplines

  • When porting, read the file history, not just the named commit. Check for follow-up commits that cite the primary in their commit message, for example by searching the log for the primary's short hash with git log --grep.
  • Map the kernel's callback semantics to userspace anchors explicitly. If the kernel primitive runs on ACK processing and the userspace port runs on send, document that the semantic changes and argue why that's OK — or fix the port to approximate the ACK anchor (e.g. via last_ack_time).
  • Test the ported primitive at the same corner cases the kernel tests it in. CUBIC's post-loss minimum-cwnd regime is exactly the scenario the kernel's idle-period fix addresses; the port's regression test should hit it.
  • Include the kernel-primitive authors as reviewers when possible. The people who wrote the fix and the fix-of-the-fix are most likely to notice if a port misses the subtlety.
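A corner-case test of the kind recommended above can be very small. The sketch below is a toy model, not quiche's test suite: the Anchor enum, the epoch_age_after function, and the millisecond clock are hypothetical. It simulates a sender that is never idle but whose window is so small that every ACK drains bytes_in_flight to zero, then checks how far the clock has advanced past the epoch start.

```rust
/// Which timestamp the idle measurement starts from (hypothetical names).
#[derive(Clone, Copy)]
enum Anchor {
    LastSend,
    LastAck,
}

/// Simulates `rounds` round trips of a busy sender at minimum window.
/// Returns how far "now" ends up ahead of the epoch start, in milliseconds.
fn epoch_age_after(rounds: u64, rtt_ms: u64, anchor: Anchor) -> u64 {
    let mut now = 0u64;
    let mut epoch_start = 0u64;
    let mut last_sent = 0u64; // first flight sent at t = 0

    for _ in 0..rounds {
        // The ACK for the previous flight arrives one RTT after it was sent.
        now = last_sent + rtt_ms;
        let last_ack = now;

        // The next flight goes out immediately, with nothing in flight.
        let idle = match anchor {
            Anchor::LastSend => now - last_sent,
            Anchor::LastAck => now - last_ack,
        };
        // Shift and clamp, applied identically in both variants.
        epoch_start = (epoch_start + idle).min(now);
        last_sent = now;
    }
    now - epoch_start
}

fn main() {
    let send_anchored = epoch_age_after(100, 50, Anchor::LastSend);
    let ack_anchored = epoch_age_after(100, 50, Anchor::LastAck);
    println!("send-anchored epoch age: {} ms", send_anchored);
    println!("ack-anchored epoch age:  {} ms", ack_anchored);
    // Send-anchored: the epoch chases the clock, so the growth curve never advances.
    assert_eq!(send_anchored, 0);
    // ACK-anchored: no idle time is counted, so the epoch ages normally.
    assert_eq!(ack_anchored, 5_000);
}
```

In this model the clamp is present in both variants, so the difference comes from the anchor alone. A test against the real implementation would assert on cwnd growth rather than on an epoch age, but the scenario to reproduce is the same.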

Generalisation beyond CCAs

The pattern applies to any kernel primitive being ported to user space:

  • io_uring usage patterns ported from kernel async-I/O idioms.
  • epoll / kqueue event-loop abstractions in user-space frameworks.
  • Raw-socket / AF_PACKET protocol implementations moved into user-space packet processors.
  • Scheduler heuristics (e.g. completely-fair-scheduler logic) adapted into user-space cooperative-scheduling runtimes.

In every case: the kernel's events and primitives don't map one-to-one to user-space callsites; the first port is rarely the last; and the kernel community's follow-up fixes are the teacher you want to learn from.

Seen in