CONCEPT

Intertoken latency (ITL / time-per-token)

Definition

Intertoken latency (ITL), also called time-per-output-token (TPOT), is an LLM-serving latency metric that measures the delay between successive output tokens during streamed generation. If TTFT is how long a user waits to see something, ITL is how fast text streams once it starts: the reading-speed metric.

Canonical wiki instance with real numbers: Cloudflare Workers AI, 2026-04-16: p90 ITL went from ~100 ms with high variance to 20-30 ms (≈3× improvement) after shifting to PD disaggregation, using the same number of GPUs. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

What it measures

Each decode step produces one token. ITL is the wall-clock time between the emission of token t and token t+1. In streaming, this is the gap between adjacent SSE events. Contributing factors:

  • Per-step decode compute — one forward pass through the model, dominated by HBM bandwidth loading weights + KV cache (memory-bound).
  • KV cache read cost — grows with context length.
  • Queueing — if the replica serves other requests, the per-step slot is shared.
  • Pipeline-stage-bubble cost under pipeline parallelism.
  • Network / SSE-framing overhead between server and client.
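The gap-between-adjacent-events definition can be sketched directly. This is an illustrative measurement harness, not Cloudflare's tooling: `mock_token_stream` is a stand-in for a real streamed response, and the sleep duration is arbitrary.

```python
import time

def mock_token_stream(n_tokens=5, step_s=0.01):
    """Stand-in for a streamed LLM response (e.g. adjacent SSE events)."""
    for i in range(n_tokens):
        time.sleep(step_s)  # simulated per-step decode time
        yield f"tok{i}"

def measure_itl(stream):
    """Record the wall-clock gap between each pair of successive tokens."""
    gaps = []
    prev = None
    for _ in stream:
        now = time.monotonic()
        if prev is not None:
            gaps.append(now - prev)
        prev = now
    return gaps

gaps = measure_itl(mock_token_stream())
p90 = sorted(gaps)[int(0.9 * len(gaps))]  # crude p90 over the observed gaps
```

Note that a client-side measurement like this folds network and SSE-framing overhead into each gap, which is exactly why those appear in the factor list above.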

Why "memory-bound" matters for ITL

Decode is fundamentally memory-bandwidth-bound: every forward pass re-streams the (sharded) model weights plus the (growing) KV cache from HBM into the compute units. The memory-bound regime means:

  • Batching raises utilisation — multiple in-flight requests share weight-loads.
  • Arithmetic intensity (FLOPs per byte moved) is the ceiling; decode sits well below the hardware's compute/bandwidth break-even point, so adding compute doesn't help.

This is the reason PD disaggregation matters: separating decode from prefill lets decode-side servers be provisioned for HBM bandwidth rather than compute, and eliminates the per-step contention with compute-hungry prefills arriving on the same GPU.
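A back-of-envelope model makes the memory-bound floor concrete: a decode step can't finish faster than the time to stream the bytes it must touch. The numbers below are assumptions for illustration (a 70B-parameter model at 2 bytes/weight, ~3.35 TB/s HBM bandwidth, 2 GB of KV cache per request), not figures from the Cloudflare source.

```python
def decode_step_floor_ms(weight_bytes, kv_cache_bytes, hbm_bw_bytes_per_s, batch=1):
    """Lower bound on per-step decode time in the memory-bound regime:
    each step streams the weights once (shared across the batch) plus
    every in-flight request's KV cache from HBM."""
    bytes_moved = weight_bytes + batch * kv_cache_bytes
    return bytes_moved / hbm_bw_bytes_per_s * 1e3

# Illustrative, assumed numbers: 70e9 params * 2 bytes, 2 GB KV cache,
# ~3.35 TB/s HBM -> a floor of roughly 40 ms per decode step.
floor = decode_step_floor_ms(70e9 * 2, 2e9, 3.35e12)
```

Because the weight term is shared across the batch, raising `batch` grows `bytes_moved` only by the KV-cache term while multiplying tokens produced per step, which is the batching-raises-utilisation point above.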

Cloudflare's 2026-04 result

"p90 time per token went from ~100 ms with high variance to 20-30 ms, a 3x improvement in intertoken latency." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

  • Same GPU count.
  • Request volume increased during the window.
  • Variance dropped — high-variance before meant some users saw much worse than p90; after PD disaggregation that spread shrank.

Speculative decoding's effect on ITL

Speculative decoding with a drafter (EAGLE-3 in Cloudflare's case) doesn't change the latency of an individual target-model forward pass, but it changes the number of accepted output tokens per target-model forward pass. If on average three drafted tokens are accepted per verification pass, three output tokens are emitted per target forward pass, effectively dividing ITL by three.

Cloudflare does not disclose the acceptance ratio, so the ITL contribution of speculative decoding specifically is not quantified.
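The acceptance-ratio arithmetic is simple enough to state as a one-liner. The 60 ms step time and 3-tokens-per-pass figure below are hypothetical (chosen to mirror the 3:1 example above), since the source does not disclose the actual acceptance ratio.

```python
def effective_itl_ms(target_pass_ms, tokens_per_pass):
    """With speculative decoding, effective ITL is the target-model
    step time divided by the average accepted tokens per pass."""
    return target_pass_ms / tokens_per_pass

# Hypothetical: a 60 ms target-model step with 3 tokens accepted per
# pass yields a 20 ms effective ITL.
eff = effective_itl_ms(60.0, 3)
```

In practice the acceptance ratio varies per request and per position, so the effective ITL is an average, not a constant.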

Design implications

  • Long contexts inflate ITL via KV-cache-read cost — mitigated by efficient KV-cache layout (paged, tensor-parallel sharding).
  • Tail variance is a sharper symptom than mean ITL for user experience — visible as "chatty starts, then pauses."
  • Request count on a decode replica directly inflates per-request ITL via per-step round-robin sharing.
  • TTFT — first-token cost; prefill-dominated.
  • Total generation latency = TTFT + (output_length × ITL).
  • Tokens-per-second throughput is effectively the reciprocal of ITL under load; this is what Infire's "20% higher tokens per second throughput on unconstrained systems" number refers to.
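The decomposition above can be checked with a small worked example. The 500 ms TTFT, 25 ms ITL, and 400-token output length are illustrative assumptions, not numbers from the source; the formula follows the document's `TTFT + output_length × ITL`.

```python
def total_latency_s(ttft_s, itl_s, output_tokens):
    """Total streamed-generation latency per the decomposition above:
    time to first token plus one ITL-sized gap per output token."""
    return ttft_s + itl_s * output_tokens

# Illustrative: 500 ms TTFT + 400 tokens * 25 ms ITL -> 10.5 s end to end.
total = total_latency_s(0.5, 0.025, 400)
```

The split explains why PD disaggregation targets both terms separately: prefill capacity drives TTFT, while decode-side bandwidth and contention drive the ITL term.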

Caveats

  • Cloudflare's ~100 ms → 20-30 ms p90 number is a relative before/after comparison; no conditional breakdown by context length or model size is given.
  • Same GPU count is stated but the PD disaggregation refactor may have redistributed those GPUs between prefill and decode roles — the per-role utilisation shift isn't characterised.
  • Interaction with speculative decoding, session affinity, and model-size changes not decomposed.
