CONCEPT
Time to first token (TTFT)¶
Definition¶
Time to first token (TTFT) is the LLM-serving latency metric measuring the delay between a request arriving at the server and the first output token being emitted back to the client. TTFT dominates user-perceived responsiveness for streaming LLM interactions — the user sees something happening as soon as the first token appears; delay before that is a conversation "pause."
Canonical wiki instance at the serving-stack level: Cloudflare Workers AI reports p90 TTFT dropped after shifting to PD disaggregation, using the same GPU count. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
What goes into TTFT¶
TTFT = time from request admission through completion of the first forward pass that emits token[0]. Breakdown:
- Admission + routing latency — LB decision, connection establishment.
- Request-transfer time — prompt payload over the network.
- KV cache preparation:
  - If a warm cache is hit (e.g. via session affinity or prefix routing): cheap.
  - If cold: a full prefill pass over the prompt (compute-bound, scales with prompt length).
- First decode forward pass — produces token[0].
- Response network transfer — first token back to client.
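From the client side, the components above collapse into one observable number: the wall-clock gap to the first streamed token. A minimal measurement sketch (the `stream` iterator is a hypothetical stand-in for any streaming API wrapper), which also captures the inter-token gaps used by the companion ITL metric:

```python
import time

def measure_ttft(stream):
    """Measure TTFT and inter-token gaps for an iterable of tokens.

    `stream` is any iterator yielding tokens as the server emits them
    (e.g. an SSE wrapper); the name is illustrative, not a real API.
    """
    start = time.monotonic()
    ttft = None
    gaps = []
    prev = None
    for _token in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start       # first token: this is TTFT
        else:
            gaps.append(now - prev)  # subsequent tokens: inter-token latency
        prev = now
    return ttft, gaps
```

Note this measures end-to-end TTFT as the user experiences it, so it includes admission, transfer, and prefill indistinguishably; separating those requires server-side instrumentation.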
Prefill dominates cold-prompt TTFT. A 10K-token prompt can take seconds to prefill on a sizeable model, making that single step essentially the entire TTFT budget.
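A back-of-envelope estimate makes the "seconds" claim plausible, using the standard approximation that a transformer forward pass costs about 2 FLOPs per parameter per token. All the concrete numbers below (model size, GPU throughput, utilization) are illustrative assumptions, not figures from the source:

```python
def prefill_seconds(params_b, prompt_tokens, gpu_tflops=1000.0, mfu=0.4):
    """Rough prefill-time lower bound: forward FLOPs ~= 2 * params * tokens.

    gpu_tflops and mfu (model FLOPs utilization) are assumptions; real
    prefill is also shaped by attention cost, batching, and parallelism.
    """
    flops = 2 * params_b * 1e9 * prompt_tokens
    return flops / (gpu_tflops * 1e12 * mfu)

# 10K-token prompt, hypothetical 70B model, one ~1 PFLOP/s GPU at 40% MFU:
# 2 * 70e9 * 10_000 / (1e15 * 0.4) = 3.5 s of pure prefill compute.
```

Even this optimistic single-number model lands in whole seconds, which is why a KV-cache hit that skips prefill changes the TTFT picture entirely.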
Why it matters for agent workloads¶
Agent workloads have large, growing prompts (system prompt + tools + conversation history + MCP server metadata). Cold-cache TTFT scales with prompt length, so every KV-cache miss turns a snappy turn into a long pause.
Cloudflare's three TTFT-oriented optimisations:
- PD disaggregation — prefill on a node tuned for compute; decode on a separate node tuned for memory bandwidth. Eliminates prefill-vs-decode contention so prefill latency variance drops.
- Session affinity — warm-KV reuse across turns; the second and later turns of a session pay only a short delta-prefill over the new tokens instead of a full prefill over the whole prefix.
- Multi-node shared KV cache (Mooncake Transfer Engine + Mooncake Store + LMCache) — extends warm-cache residency beyond a single node, so more turns hit a warm cache.
The Cloudflare 2026-04 result¶
"Here's a graph of our p90 Time to First Token drop after shifting traffic to our new PD disaggregated architecture, whilst request volume increased, using the same quantity of GPUs. We see a significant improvement in the tail latency variance." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
- P90 TTFT dropped (graph only — no absolute number disclosed).
- Same GPU count.
- Request volume increased during the window.
- Tail-latency variance reduced — an independent metric beyond the mean.
Tail latency variance as the real win¶
TTFT's p50 is often tolerable; p99 under contention is what makes agents feel broken. The biggest effect of PD disaggregation in the Workers AI result was lower variance in the tail, not just a lower mean, reflecting that co-located prefill + decode produced contention spikes that disaggregation eliminates. See concepts/tail-latency-at-scale.
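Why a mean or p50 hides this is easy to show with a tiny simulation; the sample values below are invented for illustration, not from the Cloudflare data:

```python
import statistics

# Invented TTFT samples (ms): fast median, occasional contention spikes.
ttft_ms = [120] * 95 + [2500] * 5

p50 = statistics.quantiles(ttft_ms, n=100)[49]  # index 49 = 50th percentile
p99 = statistics.quantiles(ttft_ms, n=100)[98]  # index 98 = 99th percentile
# p50 stays at 120 ms (feels fine) while p99 sits at 2500 ms (feels broken).
```

A fix that only lowered the mean could leave the 2500 ms spikes intact; reducing tail variance is the part users actually notice.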
Companion metric¶
- Intertoken latency (ITL) — the gap between successive tokens during generation. TTFT is the first-token cost; ITL is the per-token cost afterward. Cloudflare's 2026-04 numbers: p90 ITL ~100 ms (high variance) → 20-30 ms (3× improvement).
Caveats¶
- Cloudflare's graphs are not raw numbers. No absolute p90 TTFT is disclosed.
- Workload / prompt-length distribution for the measurement is not specified.
- TTFT is workload-dependent — heavily influenced by prompt length, which differs wildly across production traffic.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance.