CONCEPT Cited by 1 source

Prefill/decode (PD) disaggregation

Definition

Prefill/decode disaggregation (or PD disaggregation) is an LLM-serving architectural pattern that separates the two stages of LLM inference — prefill (processing the input prompt into a KV cache) and decode (generating output tokens one at a time) — onto different inference servers, so that each tier can be tuned and scaled independently. A dedicated load balancer routes each request to a prefill server first, then transfers the KV cache to a decode server, which streams the generated tokens back to the client.
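
The two-hop flow can be sketched end-to-end. Everything below (function names, stand-in tokens) is an illustrative skeleton, not a real serving stack:

```python
# Illustrative sketch of the PD-disaggregated request flow.
# All names are hypothetical; tensors are replaced by placeholder tuples.

def prefill(prompt_tokens):
    """Compute-bound pass: one KV-cache entry per prompt token."""
    return [("k", "v") for _ in prompt_tokens]  # stand-in for real KV tensors

def decode(kv_cache, max_new_tokens):
    """Memory-bound loop: one token per forward pass, extending the cache."""
    output = []
    for i in range(max_new_tokens):
        token = f"tok{i}"            # stand-in for a sampled token
        kv_cache.append(("k", "v"))  # cache grows by one entry per step
        output.append(token)
    return output

def serve(prompt_tokens, max_new_tokens):
    kv = prefill(prompt_tokens)        # hop 1: prefill server builds the cache
    # ... KV cache is transferred prefill -> decode between the hops ...
    return decode(kv, max_new_tokens)  # hop 2: decode server streams tokens

print(serve(["hello", "world"], 3))  # ['tok0', 'tok1', 'tok2']
```

In the monolithic case both functions run on the same GPU; disaggregation places them on separate tiers with the KV cache as the hand-off.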

Canonical wiki instance: Cloudflare Workers AI 2026-04-16 — after shifting to PD disaggregation, p90 inter-token latency dropped from ~100 ms with high variance to 20-30 ms (~3× improvement) on the same number of GPUs, even as request volume increased. p90 TTFT also dropped. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why disaggregate

LLM inference has two structurally different compute regimes:

  • Prefill — compute-bound: processes N input tokens in one parallel forward pass and populates the KV cache; gated by FLOPs (attention matmuls over the whole prompt).
  • Decode — memory-bound: generates one output token per forward pass; gated by HBM bandwidth (weights loaded per pass, KV reads per step).
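
A rough arithmetic-intensity comparison shows why the two regimes differ. The model size and GPU numbers below are illustrative assumptions, not figures from the source:

```python
# Back-of-envelope: compare the arithmetic intensity (FLOPs per byte moved)
# of prefill vs decode against a GPU's machine balance. All numbers assumed.

PARAMS = 70e9        # hypothetical 70B-parameter dense model
DTYPE_BYTES = 2      # fp16/bf16 weights
PEAK_FLOPS = 1e15    # ~1 PFLOP/s dense fp16 (H100-class, rough)
HBM_BW = 3.35e12     # ~3.35 TB/s HBM bandwidth (H100-class, rough)

def intensity(tokens_per_pass):
    flops = 2 * PARAMS * tokens_per_pass  # ~2 FLOPs per weight per token
    bytes_moved = PARAMS * DTYPE_BYTES    # weights read once per pass
    return flops / bytes_moved            # (ignores KV/activation traffic)

machine_balance = PEAK_FLOPS / HBM_BW     # ~299 FLOPs/byte

print(intensity(2048))  # prefill, 2048-token prompt: 2048 FLOPs/byte
print(intensity(1))     # decode, 1 token per pass:      1 FLOP/byte
```

Prefill's intensity sits far above the machine balance (compute-bound); decode's sits far below it (memory-bound) — the structural gap that disaggregation exploits.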

From the source:

"Prefill is usually compute bound, while decode is memory bound. This means that the parts of the GPU that are used in each stage are different, and since prefill is always done before decode, the stages block one another. Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

When both stages live on one GPU:

  • When a large prompt is prefilling (compute-bound), HBM bandwidth sits idle.
  • When ongoing decodes are running (memory-bound), compute sits idle.
  • New prefill requests block decodes because they compete for the same scheduler — and vice-versa.

Disaggregation turns each stage into a specialisable tier: prefill nodes sized for compute throughput, decode nodes sized for HBM bandwidth + KV capacity. "It allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even to run on heterogeneous hardware."

The load-balancer has to do non-trivial work

Unlike a standard stateless HTTP load balancer, the PD disaggregation balancer must:

  1. Route each request through prefill first, then decode — two routing decisions, not one.
  2. Pass KV-transfer metadata from prefill to decode — different inference engines need different KV-transfer initiation payloads. "Different inference servers require different information to initiate the KV cache transfer."
  3. Rewrite the decode server's streaming response — SSE streams from decode must be augmented with prefill-side metadata (e.g. cached-token counts) before reaching the client. "It must rewrite the responses (including streaming SSE) of the decode server to include information from the prefill server such as cached tokens."
  4. Do token-aware admission — track in-flight prefill-tokens vs decode-tokens per endpoint separately, because they burn different resources. "The load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly."
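
Step 4 above can be sketched as a minimal token-aware balancer. The class, pool names, and least-loaded policy are illustrative assumptions, not Cloudflare's implementation:

```python
# Hypothetical sketch: track in-flight prefill tokens and decode tokens per
# endpoint separately, and route each hop to the least-loaded endpoint in
# the matching pool.

class TokenAwareBalancer:
    def __init__(self, prefill_pool, decode_pool):
        self.inflight = {ep: 0 for ep in prefill_pool + decode_pool}
        self.prefill_pool = prefill_pool
        self.decode_pool = decode_pool

    def pick(self, pool, tokens):
        # Least-loaded by estimated in-flight tokens, not request count.
        ep = min(pool, key=lambda e: self.inflight[e])
        self.inflight[ep] += tokens
        return ep

    def finish(self, ep, tokens):
        self.inflight[ep] -= tokens

    def route(self, prompt_tokens, expected_output_tokens):
        # Hop 1: prefill load is measured in prompt tokens.
        p = self.pick(self.prefill_pool, prompt_tokens)
        # Hop 2: decode load is measured in (estimated) output tokens.
        d = self.pick(self.decode_pool, expected_output_tokens)
        return p, d

lb = TokenAwareBalancer(["p0", "p1"], ["d0", "d1"])
print(lb.route(4096, 256))  # ('p0', 'd0')
print(lb.route(100, 256))   # ('p1', 'd1') -- spread away from loaded nodes
```

The two separate counters matter: a 32k-token prompt is heavy for the prefill tier but says little about decode load, and vice versa.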

These responsibilities are substantial enough that PD disaggregation is not a drop-in change — it requires a purpose-built load balancer.

(Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Measured effect at Workers AI

  • p90 inter-token latency: ~100 ms with high variance → 20-30 ms (~3× improvement)
  • p90 TTFT dropped (graph shown in post; no absolute number)
  • Same GPU count, higher request volume during the transition
  • Significant improvement in tail-latency variance (not just mean)

(Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

KV cache is the cross-boundary artifact

Prefill's output is the KV cache; that's what decode needs to generate. Disaggregation makes KV transfer between nodes a load-bearing operation — it has to be fast enough that TTFT doesn't regress. The standard answer is RDMA over NVLink (intra-node) or NVMe-oF / RoCE / InfiniBand (inter-node); in Cloudflare's stack, Mooncake Transfer Engine is the substrate.
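
As a back-of-envelope illustration of why the transfer is load-bearing, here is a rough KV-size and transfer-time estimate. The model shape and link speeds are assumptions, not figures from the source:

```python
# Illustrative arithmetic: size of one request's KV cache and the time to
# move it over a given link. All shapes and numbers are assumed.

LAYERS = 80       # hypothetical model depth
KV_HEADS = 8      # grouped-query attention KV heads
HEAD_DIM = 128
DTYPE_BYTES = 2   # fp16

def kv_bytes(tokens):
    # K and V per layer: tokens * kv_heads * head_dim * dtype_bytes, x2 for K+V
    return 2 * LAYERS * tokens * KV_HEADS * HEAD_DIM * DTYPE_BYTES

def transfer_ms(tokens, link_gbps):
    # Ideal wire time, ignoring protocol overhead and contention.
    return kv_bytes(tokens) / (link_gbps * 1e9 / 8) * 1e3

prompt = 8192
print(f"{kv_bytes(prompt) / 1e9:.2f} GB")             # ~2.68 GB of KV cache
print(f"{transfer_ms(prompt, 400):.1f} ms @ 400 Gb/s")
print(f"{transfer_ms(prompt, 25):.1f} ms @ 25 Gb/s")
```

Even under these toy assumptions, the transport choice moves the transfer from tens of milliseconds to nearly a second — directly visible in TTFT.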

See concepts/rdma-kv-transfer, concepts/kv-cache.

Tuning levers

  • Prefill:decode node ratio — bias toward prefill for input-heavy workloads (agentic), toward decode for generation-heavy workloads (creative writing, long-form generation).
  • Heterogeneous hardware — prefill nodes can run on compute-rich older-gen GPUs; decode nodes need HBM capacity + bandwidth. "Run on heterogeneous hardware."
  • KV-transfer transport choice — NVLink vs NVMe-oF per topology.
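
The first lever can be made concrete with a hypothetical sizing calculation; the per-node throughput numbers are invented for illustration:

```python
# Hypothetical sizing sketch: derive a prefill:decode node ratio from a
# traffic mix. Per-node throughput figures are assumptions.

PREFILL_TOKS_PER_NODE = 50_000  # assumed prefill tokens/s one node sustains
DECODE_TOKS_PER_NODE = 5_000    # assumed decode tokens/s one node sustains

def node_ratio(input_toks_per_s, output_toks_per_s):
    prefill_nodes = input_toks_per_s / PREFILL_TOKS_PER_NODE
    decode_nodes = output_toks_per_s / DECODE_TOKS_PER_NODE
    return prefill_nodes / decode_nodes

# Agentic traffic: huge prompts, short answers -> bias toward prefill.
print(node_ratio(input_toks_per_s=500_000, output_toks_per_s=20_000))  # 2.5
# Creative writing: short prompts, long generations -> bias toward decode.
print(node_ratio(input_toks_per_s=50_000, output_toks_per_s=50_000))   # 0.1
```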

Open questions

  • Prefill:decode node ratio Cloudflare uses for agentic traffic — not disclosed.
  • KV transfer latency budget vs TTFT — not characterised.
  • Back-pressure from decode-tier saturation back to prefill — mechanism not disclosed.
  • Interaction with speculative decoding — the drafter is typically co-located with the expert; how the drafter runs on a disaggregated decode tier is not discussed.
  • Multi-region hand-off — within-cluster PD disaggregation is described; cross-region case not.
