
CONCEPT Cited by 1 source

qlog QUIC instrumentation

Definition

qlog is the standardised JSON event-log format for QUIC and HTTP/3, maintained as an IETF QUIC WG draft and adopted by the major open-source QUIC implementations (including quiche). Each event (packet send, packet receive, loss event, congestion-state transition, flow-control update, RTT estimator sample) is emitted as a timestamped JSON record with a structured schema. The result is an event-by-event trace of what the transport did, comparable across implementations and consumable by visualiser tools such as qvis (qvis.quictools.info).
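To illustrate what "a JSON record with a structured schema" looks like in practice, the sketch below parses two hand-written records. The event names and fields (`recovery:metrics_updated`, `recovery:congestion_state_updated`, `congestion_window`, `bytes_in_flight`, `old`, `new`) are modelled on the qlog drafts and should be treated as assumptions: exact names and the container layout vary between draft versions and implementations. The sample values are invented, and `parse_events` is a hypothetical helper, not part of any qlog library.

```python
import json

# Two hand-written records, one JSON object per line.
# Names and values are illustrative, not captured from a real connection.
SAMPLE = (
    '{"time": 12.5, "name": "recovery:metrics_updated", '
    '"data": {"congestion_window": 13500, "bytes_in_flight": 2700}}\n'
    '{"time": 19.2, "name": "recovery:congestion_state_updated", '
    '"data": {"old": "congestion_avoidance", "new": "recovery"}}\n'
)


def parse_events(text):
    """Parse one JSON event per non-empty line.

    A leading record-separator byte (0x1E), as used by JSON text
    sequences, is stripped if present.
    """
    events = []
    for line in text.split("\n"):
        line = line.strip("\x1e \t\r")
        if line:
            events.append(json.loads(line))
    return events


events = parse_events(SAMPLE)
for ev in events:
    # Event names are "category:type" pairs in this sketch.
    category, _, event_type = ev["name"].partition(":")
    print(ev["time"], category, event_type, ev["data"])
```

Because each record is plain JSON, the same few lines of parsing work regardless of which implementation produced the trace, provided the event names match.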

Why it matters for QUIC debugging

QUIC bugs often live in subtle state-machine interactions that are invisible in throughput metrics. The 2026-05-12 Cloudflare quiche CUBIC minimum-cwnd death spiral post is a canonical illustration:

  • A throughput dashboard would show the connection isn't making progress. It would not show that CUBIC is flipping between congestion_avoidance and recovery every ~14 ms.
  • A CPU profiler would show quiche is busy processing ACKs. It would not show that cwnd is locked at two packets.
  • A static code review can read the exact on_packet_sent() logic that causes the bug and not see the problem — because the problem only manifests when bytes_in_flight == 0 holds on every ACK cycle, a state that requires adversarial conditions to reach.

Of these tools, qlog is the one that makes this kind of bug visible. Cloudflare's post makes the role explicit:

"We instrumented quiche's qlog output with packet loss events and built visualizations to understand what was happening inside the congestion controller." (Source: sources/2026-05-12-cloudflare-when-idle-isnt-idle-how-a-linux-kernel-optimization-became-a-quic-bug)

The three observations that together diagnosed the bug (999 state transitions in 6.7 s, a ~14 ms oscillation period, and a cwnd locked at 2,700 bytes) all came from qlog-derived visualisations.
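A quick consistency check on those figures, assuming (this is an interpretation, not something the source states) that one full oscillation consists of two state transitions: congestion_avoidance to recovery and back. The input numbers are the ones quoted above; the variable names are illustrative.

```python
transitions = 999     # state transitions reported in the post
window_s = 6.7        # observation window, seconds
locked_cwnd = 2700    # bytes
packets = 2           # "cwnd is locked at two packets"

ms_per_transition = window_s / transitions * 1000
# Assumption: one oscillation = two transitions (CA -> recovery -> CA).
ms_per_cycle = 2 * ms_per_transition
bytes_per_packet = locked_cwnd / packets

print(round(ms_per_transition, 1), round(ms_per_cycle, 1), bytes_per_packet)
# prints: 6.7 13.4 1350.0
```

Under that assumption the transition count implies a cycle of roughly 13.4 ms, in line with the ~14 ms oscillation period, and a two-packet cwnd of 2,700 bytes implies 1,350 bytes per packet.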

What qlog events cover

Canonical event categories in the qlog schema:

  • Transport: packet sends/receives, versions, datagrams, stream state, flow-control windows.
  • Recovery: loss events, retransmissions, cwnd / ssthresh / bytes_in_flight snapshots, congestion-state transitions, RTT estimator samples (smoothed RTT, min RTT, latest RTT), probe timeouts.
  • Security: TLS handshake events, key updates.
  • HTTP/3: frame encoding/decoding, stream multiplexing decisions.

For CCA debugging specifically: every cwnd change, every bytes_in_flight adjustment, every state transition (slow-start ↔ congestion-avoidance ↔ recovery) is recordable, with timestamps precise enough to reconstruct the ACK clock.
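A minimal sketch of that reconstruction, run on synthetic events rather than a real trace. The event names (`recovery:congestion_state_updated`, `recovery:metrics_updated`) and data fields are assumptions modelled on the qlog drafts, and the helper functions are hypothetical. The synthetic trace flips state every 7 ms with cwnd pinned at 2,700 bytes, loosely imitating the pattern described in the Cloudflare post.

```python
def state_transitions(events):
    """Return (time, old, new) for each congestion-state change."""
    return [
        (e["time"], e["data"].get("old"), e["data"]["new"])
        for e in events
        if e["name"] == "recovery:congestion_state_updated"
    ]


def cwnd_series(events):
    """Return (time, cwnd) samples from metrics updates that carry a cwnd."""
    return [
        (e["time"], e["data"]["congestion_window"])
        for e in events
        if e["name"] == "recovery:metrics_updated"
        and "congestion_window" in e["data"]
    ]


def mean_gap(times):
    """Mean interval between consecutive timestamps, or None if fewer than two."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Synthetic trace: times are in milliseconds, values are invented.
states = ["congestion_avoidance", "recovery"]
events = []
for i in range(6):
    t = 7.0 * i
    events.append({
        "time": t,
        "name": "recovery:congestion_state_updated",
        "data": {"old": states[i % 2], "new": states[(i + 1) % 2]},
    })
    events.append({
        "time": t,
        "name": "recovery:metrics_updated",
        "data": {"congestion_window": 2700, "bytes_in_flight": 0},
    })

transitions = state_transitions(events)
gap_ms = mean_gap([t for t, _, _ in transitions])
distinct_cwnd = {cwnd for _, cwnd in cwnd_series(events)}
print(len(transitions), gap_ms, distinct_cwnd)
```

A short mean gap between transitions combined with a single distinct cwnd value is the kind of signature that a throughput dashboard averages away but an event-level trace exposes directly.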

Why JSON + IETF standardisation matters

Two properties are load-bearing:

  1. Cross-implementation comparison. Because qlog is an open standard, the same visualiser can ingest traces from quiche, quinn, ngtcp2, mvfst, and picoquic. When a bug appears in one implementation, comparing traces against another implementation under the same conditions is a first-line diagnostic move. (Cloudflare's CUBIC-vs-Reno diagnostic in the 2026-05-12 post is adjacent to this workflow — same library, different CCA, same qlog format.)
  2. Sharable artefacts. A qlog trace is a self-contained text file; it can be attached to a bug report, shared with a peer implementation team, or archived for longitudinal analysis. The qvis family of visualisers renders traces in-browser.
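Because every trace shares one schema, a side-by-side comparison can be reduced to a few counters. The sketch below summarises two already-parsed event lists, for example one run per congestion controller or per implementation. The event names are assumptions modelled on the qlog drafts, `summarise` and `compare` are hypothetical helpers, and both traces are synthetic.

```python
from collections import Counter


def summarise(events):
    """Reduce a list of parsed events to a few comparable counters."""
    names = Counter(e["name"] for e in events)
    return {
        "events": len(events),
        "state_changes": names["recovery:congestion_state_updated"],
        "packets_lost": names["recovery:packet_lost"],
    }


def compare(trace_a, trace_b):
    """Pair up each counter as (value in trace_a, value in trace_b)."""
    a, b = summarise(trace_a), summarise(trace_b)
    return {key: (a[key], b[key]) for key in a}


# Synthetic traces: one oscillating heavily, one stable. Values are invented.
oscillating = (
    [{"name": "recovery:congestion_state_updated", "data": {}}] * 10
    + [{"name": "recovery:packet_lost", "data": {}}] * 5
)
stable = (
    [{"name": "recovery:congestion_state_updated", "data": {}}] * 2
    + [{"name": "recovery:packet_lost", "data": {}}] * 5
)

report = compare(oscillating, stable)
for key, (left, right) in report.items():
    print(f"{key:<14} {left:>4} {right:>4}")
```

With equal loss counts but very different state-change counts, a report like this points at the congestion controller rather than the network as the place to look next.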

Structural comparison to TCP observability

TCP's equivalent observability substrate is the packet capture (pcap) plus kernel counters (ss, netstat, /proc/net/tcp). Two important differences:

  • qlog is a structured event stream, not a packet-level capture. CCA state transitions, cwnd samples, and loss events are first-class records; in TCP they must be reconstructed from packet captures and kernel counters.
  • qlog runs in user space, alongside the quiche library. TCP observability relies on kernel-level tools (tcpdump, BPF, ss), several of which typically need elevated privileges.

The user-space CCA consequence: quiche's user-space congestion control means qlog can trivially capture every CCA state change the implementer cares to emit — which is exactly what made the 2026-05-12 bug findable at all.

Seen in

  • sources/2026-05-12-cloudflare-when-idle-isnt-idle-how-a-linux-kernel-optimization-became-a-quic-bug — canonical wiki instance. qlog instrumentation (plus custom visualisations on top of it) is how Cloudflare diagnosed the CUBIC minimum-cwnd death spiral: the 999 state-transition count, the 14 ms oscillation period, the 2,700-byte locked cwnd, and the congestion_avoidance ↔ recovery flip pattern were all extracted from qlog time series. "After weeks of instrumenting qlogs and analyzing visualizations to find the root cause, the solution required changing just three lines of code."