CONCEPT Cited by 1 source

Time to first token (TTFT)

Definition

Time to first token (TTFT) is the LLM-serving latency metric measuring the delay between a request arriving at the server and the first output token being emitted back to the client. TTFT dominates user-perceived responsiveness for streaming LLM interactions — the user sees something happening as soon as the first token appears; delay before that is a conversation "pause."

Canonical wiki instance at the serving-stack level: Cloudflare Workers AI reports p90 TTFT dropped after shifting to PD disaggregation, using the same GPU count. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

What goes into TTFT

TTFT = time from request arrival through delivery of token[0] to the client. Breakdown:

  1. Admission + routing latency — LB decision, connection establishment.
  2. Request-transfer time — prompt payload over the network.
  3. KV cache preparation:
     • Warm cache hit (e.g. via session affinity or prefix routing): cheap.
     • Cold cache: a full prefill pass over the prompt (compute-bound, scales with prompt length).
  4. First decode forward pass — produces token[0].
  5. Response network transfer — first token back to the client.

Prefill dominates cold-prompt TTFT. A 10K-token prompt can take seconds to prefill on a sizeable model, making that single step the entire TTFT budget.
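The breakdown above can be sketched as a toy cost model. All constants here are illustrative assumptions (not measured Workers AI numbers); the point is that the prefill term scales with the uncached portion of the prompt and swamps the fixed costs.

```python
# Toy cold-vs-warm TTFT model for the five-step breakdown above.
# Every constant is an illustrative assumption, not a measured value.

PREFILL_TOKENS_PER_SEC = 8_000  # assumed prefill throughput for a large model
ROUTING_MS = 5                  # step 1: admission + LB decision
TRANSFER_MS = 20                # step 2: prompt payload over the network
FIRST_DECODE_MS = 30            # step 4: one decode forward pass
RESPONSE_MS = 10                # step 5: first token back to the client

def ttft_ms(prompt_tokens: int, cached_prefix_tokens: int = 0) -> float:
    """Estimate TTFT: only the uncached suffix of the prompt is prefilled."""
    to_prefill = max(prompt_tokens - cached_prefix_tokens, 0)
    prefill_ms = to_prefill / PREFILL_TOKENS_PER_SEC * 1000  # step 3
    return ROUTING_MS + TRANSFER_MS + prefill_ms + FIRST_DECODE_MS + RESPONSE_MS

cold = ttft_ms(10_000)                              # full 10K-token prefill
warm = ttft_ms(10_000, cached_prefix_tokens=9_500)  # only 500 new tokens
```

Under these assumed numbers the cold request is dominated by its ~1.25 s prefill term, while the warm request pays only the delta — the same asymmetry the prose describes.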

Why it matters for agent workloads

Agent workloads have large, growing prompts (system prompt + tools + conversation history + MCP server metadata). Cold-cache TTFT scales with prompt length; every missed KV-cache-hit turns a snappy turn into a long pause.

Cloudflare's three TTFT-oriented optimisations:

  • PD disaggregation — prefill on a node tuned for compute; decode on a separate node tuned for memory bandwidth. Eliminates prefill-vs-decode contention so prefill latency variance drops.
  • Session affinity — warm-KV reuse across turns; turn 2+ of a session prefills only the newly appended delta tokens instead of the full prompt prefix.
  • Multi-node shared KV cache (Mooncake Transfer Engine + Mooncake Store + LMCache) — extends warm-cache residency, more turns hit.
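Session affinity can be sketched as stable hashing of a session id to a serving node, so that later turns land where the first turn left its KV cache. This is a minimal hypothetical router, not Cloudflare's implementation; node names and the hash scheme are assumptions.

```python
# Minimal session-affinity router sketch (hypothetical): requests carrying
# the same session id always land on the same node, so turn 2+ finds the
# warm KV cache that turn 1 left behind.
import hashlib

NODES = ["decode-node-0", "decode-node-1", "decode-node-2"]  # assumed fleet

def route(session_id: str) -> str:
    # Stable hash -> deterministic node choice per session.
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]
```

A shared multi-node KV store (the third bullet) relaxes exactly this constraint: a cache hit no longer requires routing to one specific node.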

The Cloudflare 2026-04 result

"Here's a graph of our p90 Time to First Token drop after shifting traffic to our new PD disaggregated architecture, whilst request volume increased, using the same quantity of GPUs. We see a significant improvement in the tail latency variance." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

  • P90 TTFT dropped (graph only — no absolute number disclosed).
  • Same GPU count.
  • Request volume increased during the window.
  • Tail-latency variance reduced — an independent metric beyond the mean.

Tail latency variance as the real win

TTFT's p50 is often tolerable; p99 under contention is what makes agents feel broken. PD disaggregation's biggest effect in the Workers AI result was lower variance in the tail, not just a lower mean — reflecting that co-located prefill + decode on a single machine produced contention spikes that disaggregation eliminates. See concepts/tail-latency-at-scale.

Companion metric

  • Intertoken latency (ITL) — the between-token gap during generation. TTFT is the first-token cost; ITL is the per-token cost afterward. Cloudflare's 2026-04 numbers: p90 ITL dropped from ~100 ms (high variance) to 20-30 ms, roughly a 3× improvement.
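Both metrics fall out of one list of token arrival timestamps: TTFT is the first entry, ITL is the gap sequence after it. A minimal sketch with invented timestamps (seconds since the request was sent):

```python
# Derive TTFT and ITL from a stream of token arrival timestamps,
# measured in seconds since the request was sent. Numbers are illustrative.
def ttft_and_itl(token_times: list) -> tuple:
    ttft = token_times[0]                                      # first-token cost
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # per-token gaps
    return ttft, itl

times = [0.45, 0.47, 0.50, 0.52]  # hypothetical arrivals for 4 tokens
ttft, itl = ttft_and_itl(times)
# ttft is the 0.45 s first-token delay; itl holds the 20-30 ms gaps after it
```

This also shows why the two metrics decouple: a cold prefill inflates only `ttft`, while decode-side contention inflates only the `itl` gaps.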

Caveats

  • Cloudflare's graphs are not raw numbers. No absolute p90 TTFT is disclosed.
  • Workload / prompt-length distribution for the measurement is not specified.
  • TTFT is workload-dependent — heavily influenced by prompt length, which differs wildly across production traffic.
