Intertoken latency (ITL / time-per-token)¶
Definition¶
Intertoken latency (ITL), also called time-per-output-token (TPOT), is the LLM-serving latency metric measuring the delay between successive output tokens during streamed generation. If TTFT is how long a user waits to see something, ITL is how fast text streams once it starts — the reading-speed metric.
Canonical wiki instance with real numbers: Cloudflare Workers AI, 2026-04-16 — p90 ITL dropped from ~100 ms with high variance to 20-30 ms (≈3× improvement) after shifting to PD disaggregation, on the same number of GPUs. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
What it measures¶
Each decode step produces one token. ITL is the wall-clock time between the emission of token t and token t+1. In streaming, this is the gap between adjacent SSE events. Contributing factors:
- Per-step decode compute — one forward pass through the model, dominated by HBM bandwidth loading weights + KV cache (memory-bound).
- KV cache read cost — grows with context length.
- Queueing — if the replica serves other requests, the per-step slot is shared.
- Pipeline-stage-bubble cost under pipeline parallelism.
- Network / SSE-framing overhead between server and client.
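Measured client-side, ITL is just the set of gaps between token-arrival timestamps. A minimal sketch of computing mean and p90 ITL from a stream's timestamps (the helper name and example numbers are illustrative, not from the source):

```python
import statistics

def itl_stats(event_times):
    """Compute intertoken-latency stats from a list of per-token
    arrival timestamps (seconds). Each gap between adjacent
    tokens is one ITL sample."""
    gaps = sorted(b - a for a, b in zip(event_times, event_times[1:]))
    p90 = gaps[int(0.9 * (len(gaps) - 1))]  # nearest-rank p90
    return {
        "mean_s": statistics.mean(gaps),
        "p90_s": p90,
        "stdev_s": statistics.pstdev(gaps),
    }

# Example: tokens arriving ~30 ms apart, with one 120 ms stall.
times = [0.000, 0.030, 0.060, 0.180, 0.210]
stats = itl_stats(times)
```

Note that a single stall barely moves the p90 here but does move the mean and stdev — which is why tail/variance metrics and mean ITL can tell different stories.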
Why "memory-bound" matters for ITL¶
Decode is fundamentally memory-bandwidth-bound: every forward pass re-loads the (sharded) model weights and the (growing) KV cache from HBM into the compute units. Operating in this memory-bound regime means:
- Batching raises utilisation — multiple in-flight requests share each weight load.
- Memory bandwidth, not FLOPs, is the ceiling — decode's arithmetic intensity is too low to exploit extra compute, so adding compute units doesn't help.
This is the reason PD disaggregation matters: separating decode from prefill lets decode-side servers be provisioned for HBM bandwidth rather than compute, and eliminates the per-step contention with compute-hungry prefills arriving on the same GPU.
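The memory-bound model gives a back-of-envelope lower bound on per-token decode latency: each decode step must stream the weights plus the request's KV cache from HBM at least once, so step time ≥ bytes moved / bandwidth. A sketch with illustrative numbers (the model size, KV size, and bandwidth below are assumptions, not Cloudflare-disclosed figures):

```python
def decode_step_floor_ms(param_bytes, kv_bytes, hbm_gbps):
    """Lower bound on per-token decode latency for a memory-bound
    step: weights + KV cache must stream from HBM once per step.
    Sizes in bytes, bandwidth in GB/s; returns milliseconds."""
    bytes_moved = param_bytes + kv_bytes
    return bytes_moved / (hbm_gbps * 1e9) * 1e3

# Illustrative: a 70B-parameter model at ~1 byte/param (FP8),
# an assumed 5 GB KV cache, on an H100-class ~3350 GB/s HBM part.
floor = decode_step_floor_ms(70e9, 5e9, 3350)  # ~22 ms
```

Under these assumed numbers the bandwidth floor already lands in the low tens of milliseconds — the same regime as the 20-30 ms p90 figure — which is why provisioning decode replicas for HBM bandwidth rather than compute is the right axis.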
Cloudflare's 2026-04 result¶
"p90 time per token went from ~100 ms with high variance to 20-30 ms, a 3x improvement in intertoken latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
- Same GPU count.
- Request volume increased during the window.
- Variance dropped — the high variance beforehand meant some users saw much worse than p90; after PD disaggregation that spread shrank.
Speculative decoding's effect on ITL¶
Speculative decoding with a drafter (EAGLE-3 in Cloudflare's case) doesn't change per-forward-pass latency directly; it changes the number of accepted output tokens per target-model ("expert") forward pass. If the drafter achieves a 3:1 acceptance ratio, three output tokens are emitted per expert forward pass — effectively dividing ITL by the same factor.
Cloudflare does not disclose the acceptance ratio, so the ITL contribution of speculative decoding specifically is not quantified.
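The acceptance-ratio arithmetic can be sketched directly. All numbers below are hypothetical (the source discloses neither step times nor the acceptance ratio); the 3:1 ratio echoes the illustrative figure above:

```python
def effective_itl_ms(target_step_ms, draft_overhead_ms, tokens_per_cycle):
    """Effective ITL under speculative decoding: each verification
    cycle costs one target-model forward pass plus the (cheap)
    drafter passes, and emits tokens_per_cycle accepted tokens on
    average. Simplified: real acceptance is stochastic per step."""
    cycle_ms = target_step_ms + draft_overhead_ms
    return cycle_ms / tokens_per_cycle

# Hypothetical: 90 ms target step, 6 ms of drafting overhead,
# 3 accepted tokens per cycle -> 32 ms effective ITL.
eff = effective_itl_ms(90.0, 6.0, 3.0)
```

The sketch also shows the failure mode: if acceptance drops toward 1:1, the drafter overhead makes effective ITL worse than plain decoding.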
Design implications¶
- Long contexts inflate ITL via KV-cache-read cost — mitigated by efficient KV-cache layout (paged, tensor-parallel sharding).
- Tail variance is a sharper symptom than mean ITL for user experience — visible as "chatty starts, then pauses."
- Request count on a decode replica directly inflates per-request ITL via per-step round-robin sharing.
Related metrics¶
- TTFT — first-token cost; prefill-dominated.
- Total generation latency ≈ TTFT + (output_length − 1) × ITL — the first token's cost is counted in TTFT; each remaining token adds one intertoken gap.
- Tokens-per-second throughput — per request, effectively the reciprocal of ITL under load; what Infire's "20% higher tokens per second throughput on unconstrained systems" number refers to.
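The total-latency decomposition as a one-liner, with illustrative numbers (none of these values come from the source):

```python
def total_latency_ms(ttft_ms, itl_ms, output_tokens):
    """Total streamed-generation latency: time to first token,
    plus one intertoken gap per remaining output token."""
    return ttft_ms + itl_ms * (output_tokens - 1)

# Illustrative: 500 ms TTFT, 25 ms ITL, 200 output tokens.
t = total_latency_ms(500.0, 25.0, 200)  # 5475 ms
```

For long outputs the ITL term dominates, which is why a 3× ITL improvement moves end-to-end latency far more than a comparable TTFT improvement would.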
Caveats¶
- Cloudflare's ~100 ms → 20-30 ms p90 number is relative, with no breakdown by context length or model size.
- Same GPU count is stated but the PD disaggregation refactor may have redistributed those GPUs between prefill and decode roles — the per-role utilisation shift isn't characterised.
- Interaction with speculative decoding, session affinity, and model-size changes not decomposed.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance with specific p90 numbers.