Intertoken latency (ITL / time-per-token)¶
Definition¶
Intertoken latency (ITL), also called time-per-output-token (TPOT), is the LLM-serving latency metric measuring the delay between successive output tokens during streamed generation. If TTFT is how long a user waits to see something, ITL is how fast text streams once it starts — the reading-speed metric.
Canonical wiki instance with real numbers: Cloudflare Workers AI, 2026-04-16 — p90 ITL dropped from ~100 ms with high variance to 20-30 ms (≈3× improvement) after shifting to PD disaggregation, on the same number of GPUs. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
What it measures¶
Each decode step produces one token. ITL is the wall-clock time between the emission of token t and token t+1. In streaming, this is the gap between adjacent SSE events. Contributing factors:
- Per-step decode compute — one forward pass through the model, dominated by HBM bandwidth loading weights + KV cache (memory-bound).
- KV cache read cost — grows with context length.
- Queueing — if the replica serves other requests, the per-step slot is shared.
- Pipeline-stage-bubble cost under pipeline parallelism.
- Network / SSE-framing overhead between server and client.
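Measured client-side, ITL is just the set of gaps between token-arrival timestamps. A minimal sketch of computing mean and p90 ITL from a stream's timestamps (the helper name and example numbers are illustrative, not from the source):

```python
import statistics

def itl_stats(event_times):
    """Compute intertoken-latency stats from a list of per-token
    arrival timestamps (seconds). Each gap between adjacent
    tokens is one ITL sample."""
    gaps = sorted(b - a for a, b in zip(event_times, event_times[1:]))
    p90 = gaps[int(0.9 * (len(gaps) - 1))]  # nearest-rank p90
    return {
        "mean_s": statistics.mean(gaps),
        "p90_s": p90,
        "stdev_s": statistics.pstdev(gaps),
    }

# Example: tokens arriving ~30 ms apart, with one 120 ms stall.
times = [0.000, 0.030, 0.060, 0.180, 0.210]
stats = itl_stats(times)
```

Note that a single stall barely moves the p90 here but does move the mean and stdev — which is why tail/variance metrics and mean ITL can tell different stories.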
Why "memory-bound" matters for ITL¶
Decode is fundamentally memory-bandwidth-bound: every forward pass re-loads the (sharded) model weights and the (growing) KV cache from HBM into the compute units. Operating in this memory-bound regime means:
- Batching raises utilisation — multiple in-flight requests share each weight load.
- Memory bandwidth, not FLOPs, is the ceiling — decode's arithmetic intensity is too low to exploit extra compute, so adding compute units doesn't help.
This is the reason PD disaggregation matters: separating decode from prefill lets decode-side servers be provisioned for HBM bandwidth rather than compute, and eliminates the per-step contention with compute-hungry prefills arriving on the same GPU.
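The memory-bound model gives a back-of-envelope lower bound on per-token decode latency: each decode step must stream the weights plus the request's KV cache from HBM at least once, so step time ≥ bytes moved / bandwidth. A sketch with illustrative numbers (the model size, KV size, and bandwidth below are assumptions, not Cloudflare-disclosed figures):

```python
def decode_step_floor_ms(param_bytes, kv_bytes, hbm_gbps):
    """Lower bound on per-token decode latency for a memory-bound
    step: weights + KV cache must stream from HBM once per step.
    Sizes in bytes, bandwidth in GB/s; returns milliseconds."""
    bytes_moved = param_bytes + kv_bytes
    return bytes_moved / (hbm_gbps * 1e9) * 1e3

# Illustrative: a 70B-parameter model at ~1 byte/param (FP8),
# an assumed 5 GB KV cache, on an H100-class ~3350 GB/s HBM part.
floor = decode_step_floor_ms(70e9, 5e9, 3350)  # ~22 ms
```

Under these assumed numbers the bandwidth floor already lands in the low tens of milliseconds — the same regime as the 20-30 ms p90 figure — which is why provisioning decode replicas for HBM bandwidth rather than compute is the right axis.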
Cloudflare's 2026-04 result¶
"p90 time per token went from ~100 ms with high variance to 20-30 ms, a 3x improvement in intertoken latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
- Same GPU count.
- Request volume increased during the window.
- Variance dropped — the high variance beforehand meant some users saw much worse than p90; after PD disaggregation that spread shrank.
Speculative decoding's effect on ITL¶
Speculative decoding with a drafter (EAGLE-3 in Cloudflare's case) doesn't change per-forward-pass latency directly; it changes the number of accepted output tokens per target-model ("expert") forward pass. If the drafter achieves a 3:1 acceptance ratio, three output tokens are emitted per expert forward pass — effectively dividing ITL by the same factor.
Cloudflare does not disclose the acceptance ratio, so the ITL contribution of speculative decoding specifically is not quantified.
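The acceptance-ratio arithmetic can be sketched directly. All numbers below are hypothetical (the source discloses neither step times nor the acceptance ratio); the 3:1 ratio echoes the illustrative figure above:

```python
def effective_itl_ms(target_step_ms, draft_overhead_ms, tokens_per_cycle):
    """Effective ITL under speculative decoding: each verification
    cycle costs one target-model forward pass plus the (cheap)
    drafter passes, and emits tokens_per_cycle accepted tokens on
    average. Simplified: real acceptance is stochastic per step."""
    cycle_ms = target_step_ms + draft_overhead_ms
    return cycle_ms / tokens_per_cycle

# Hypothetical: 90 ms target step, 6 ms of drafting overhead,
# 3 accepted tokens per cycle -> 32 ms effective ITL.
eff = effective_itl_ms(90.0, 6.0, 3.0)
```

The sketch also shows the failure mode: if acceptance drops toward 1:1, the drafter overhead makes effective ITL worse than plain decoding.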
Design implications¶
- Long contexts inflate ITL via KV-cache-read cost — mitigated by efficient KV-cache layout (paged, tensor-parallel sharding).
- Tail variance is a sharper symptom than mean ITL for user experience — visible as "chatty starts, then pauses."
- Request count on a decode replica directly inflates per-request ITL via per-step round-robin sharing.
Related metrics¶
- TTFT — first-token cost; prefill-dominated.
- Total generation latency ≈ TTFT + (output_length − 1) × ITL — the first token's cost is counted in TTFT; each remaining token adds one intertoken gap.
- Tokens-per-second throughput — per request, effectively the reciprocal of ITL under load; what Infire's "20% higher tokens per second throughput on unconstrained systems" number refers to.
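The total-latency decomposition as a one-liner, with illustrative numbers (none of these values come from the source):

```python
def total_latency_ms(ttft_ms, itl_ms, output_tokens):
    """Total streamed-generation latency: time to first token,
    plus one intertoken gap per remaining output token."""
    return ttft_ms + itl_ms * (output_tokens - 1)

# Illustrative: 500 ms TTFT, 25 ms ITL, 200 output tokens.
t = total_latency_ms(500.0, 25.0, 200)  # 5475 ms
```

For long outputs the ITL term dominates, which is why a 3× ITL improvement moves end-to-end latency far more than a comparable TTFT improvement would.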
Caveats¶
- Cloudflare's ~100 ms → 20-30 ms p90 number is relative, with no breakdown by context length or model size.
- Same GPU count is stated but the PD disaggregation refactor may have redistributed those GPUs between prefill and decode roles — the per-role utilisation shift isn't characterised.
- Interaction with speculative decoding, session affinity, and model-size changes not decomposed.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance with specific p90 numbers.