Disaggregated inference stages¶
Pattern¶
Disaggregated inference stages is the serving-side architectural pattern of splitting the distinct stages of a multi-stage inference pipeline onto separate tiers of servers, each tuned and scaled independently for its stage's compute characteristics. A stage-aware load balancer routes requests through the stages and transfers per-stage state (notably the KV cache) between them.
The canonical wiki instance is prefill/decode disaggregation — LLM inference split into a prefill tier (processes input → populates KV cache, compute-bound) and a decode tier (generates output tokens, memory-bound), running on potentially different hardware in the same cluster.
From the source (Cloudflare Workers AI 2026-04-16):
"There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound. … Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
Preconditions¶
For the pattern to apply:
- The workload has structurally distinct stages with different resource profiles (compute vs memory, CPU vs GPU, SSD vs DRAM).
- Co-locating the stages leads to contention — the stages block each other or waste the GPU's non-bottleneck resources.
- A per-stage state artifact exists that can be transferred cheaply enough to not blow the end-to-end latency budget (for LLMs: KV cache via RDMA).
- Load shape varies such that independent scaling pays off (input-heavy vs output-heavy traffic in the LLM case).
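The third precondition is the one that most often fails quietly, so it is worth a back-of-envelope check before committing to the pattern. A minimal sketch, assuming a hypothetical 70B-class model with grouped-query attention (all dimensions, link speeds, and the budget are illustrative, not figures from the source):

```python
# Feasibility check for the state-transfer precondition: can the per-stage
# artifact (here, an LLM KV cache) move between tiers without blowing the
# latency budget? All figures below are illustrative assumptions.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V tensors, per layer, per KV head, per token (fp16 = 2 bytes).
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

def transfer_ms(nbytes, link_gbps):
    # Wire time for a one-shot transfer over a link of `link_gbps` Gbit/s.
    return nbytes * 8 / (link_gbps * 1e9) * 1e3

# Hypothetical model: 80 layers, 8 KV heads, head_dim 128, 8k-token prompt.
kv = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)

budget_ms = 100  # assumed share of the TTFT budget allotted to the transfer
for link_gbps in (400, 100, 10):  # NVLink-class, InfiniBand-class, plain 10GbE
    t = transfer_ms(kv, link_gbps)
    verdict = "fits" if t <= budget_ms else "blows"
    print(f"{link_gbps:>4} Gbit/s: {t:8.1f} ms -> {verdict} a {budget_ms} ms budget")
```

Under these assumptions the cache is ~2.7 GB, which fits the budget on an RDMA-class link but not on commodity Ethernet — which is why the pattern's preconditions and its "when not to use" list both hinge on interconnect bandwidth.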
Structure¶
         ┌────────────────────────┐
client ─→│     Stage-aware LB     │
         │     (token-aware,      │
         │    response rewrite)   │
         └──┬───────────┬───────┬─┘
           1↓          2↓       │ (streaming)
        ┌────────┐  ┌────────┐  │
        │ Stage 1│  │ Stage 2│  │
        │  tier  │─→│  tier  │──┘
        └────────┘KV└────────┘
             ↑           ↑
         sized for   sized for
          compute    memory BW
The load balancer is non-trivial — it does:
- Two-hop routing — stage 1 first, stage 2 second (not just round-robin).
- Stage-state transfer coordination — different engines require different KV-transfer init payloads.
- Response rewrite — combines stage-1 metadata (cached-token counts) into stage-2's streaming response before reaching the client.
- Token-aware admission — tracks in-flight tokens per stage pool separately, because the resources are different.
See concepts/token-aware-load-balancing for the companion primitive.
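The responsibilities above can be condensed into a toy two-hop router. Every interface, field name, and token budget here is hypothetical — a sketch of the shape of the work, not any real balancer's API:

```python
# Toy stage-aware load balancer: two-hop routing, token-aware admission,
# and response rewrite. All interfaces and field names are invented.

from dataclasses import dataclass

@dataclass
class StagePool:
    name: str
    token_budget: int  # max in-flight tokens this tier will accept
    in_flight: int = 0

    def admit(self, tokens: int) -> bool:
        # Token-aware admission: tracked per pool, because the two tiers
        # bottleneck on different resources.
        if self.in_flight + tokens > self.token_budget:
            return False
        self.in_flight += tokens
        return True

    def release(self, tokens: int) -> None:
        self.in_flight -= tokens

@dataclass
class Balancer:
    prefill: StagePool
    decode: StagePool

    def handle(self, prompt_tokens, max_output_tokens, run_prefill, run_decode):
        # Admit against both pools up front: a request that prefills but
        # cannot decode would waste stage-1 work and a KV transfer.
        if not self.prefill.admit(prompt_tokens):
            return {"status": 503, "error": "prefill pool over token budget"}
        if not self.decode.admit(max_output_tokens):
            self.prefill.release(prompt_tokens)
            return {"status": 503, "error": "decode pool over token budget"}
        try:
            # Hop 1: prefill tier populates the KV cache and returns a
            # handle plus metadata (e.g. cached-token counts).
            meta = run_prefill(prompt_tokens)
            # Hop 2: decode tier generates from the transferred KV cache.
            output = run_decode(meta["kv_handle"], max_output_tokens)
            # Response rewrite: fold stage-1 metadata into stage-2's reply.
            return {"status": 200,
                    "cached_tokens": meta["cached_tokens"],
                    "output": output}
        finally:
            self.prefill.release(prompt_tokens)
            self.decode.release(max_output_tokens)

# Stubbed tiers stand in for real prefill/decode servers.
lb = Balancer(StagePool("prefill", 10_000), StagePool("decode", 2_000))
resp = lb.handle(1_000, 200,
                 run_prefill=lambda n: {"kv_handle": "kv:req-1", "cached_tokens": 0},
                 run_decode=lambda handle, limit: ["tok"] * 3)
print(resp["status"])
```

Note the rollback on decode rejection and the `finally` releases — the per-pool accounting is exactly the part that makes this "substantial new code, not a drop-in transform" (see Costs).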
Benefits¶
- Independent scaling — bias prefill:decode node ratio for the observed workload (input-heavy for agentic, output-heavy for long-form generation).
- Independent tuning — per-stage batch size, KV allocation, memory policies.
- Heterogeneous hardware — stage 1 can run on compute-rich older GPUs, stage 2 on newer GPUs with more HBM + bandwidth; "it allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even to run on heterogeneous hardware."
- Reduced contention variance — co-located stages block each other unpredictably; disaggregation eliminates that variance source.
Cloudflare's measured result: p90 TTFT dropped, p90 ITL ~100 ms → 20-30 ms (3× improvement), same GPU count, higher request volume, significant improvement in tail-latency variance. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
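The independent-scaling benefit has a simple sizing consequence: each tier is provisioned against its own token stream rather than against a blended average. A sketch, with invented per-node throughput and traffic figures:

```python
# Sizing the prefill:decode node ratio from observed traffic.
# All throughput figures below are illustrative assumptions.

from math import ceil

def tier_sizes(input_tok_per_s, output_tok_per_s,
               prefill_tok_per_s_per_node, decode_tok_per_s_per_node):
    # Nodes per tier so each tier keeps up with its own token stream.
    return (ceil(input_tok_per_s / prefill_tok_per_s_per_node),
            ceil(output_tok_per_s / decode_tok_per_s_per_node))

# Agentic / input-heavy traffic: long prompts, short replies.
print(tier_sizes(500_000, 40_000, 100_000, 10_000))   # (5, 4)

# Long-form generation: short prompts, long replies.
print(tier_sizes(100_000, 200_000, 100_000, 10_000))  # (1, 20)
```

The same hardware pool serves prefill-biased and decode-biased ratios as traffic shifts; a co-located design would have to over-provision for whichever stage dominates.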
Costs¶
- LB complexity — response rewrite across stages + state-transfer coordination + token-aware admission is substantial new code, not a drop-in transform.
- State-transfer budget — KV transfer between prefill and decode adds to TTFT; requires high-bandwidth interconnect (RDMA — NVLink intra-node, InfiniBand / RoCE / NVMe-oF inter-node) to stay under budget.
- Operational surface area — two tiers of servers to operate, monitor, version, and fail-over independently.
- Cold-start amplification — both tiers have to be ready; a single-tier outage can cascade.
When to use¶
- Long-prompt / input-heavy workloads (agents, RAG, summarisation): the prefill-vs-decode contention is acute.
- Extra-large models where single-machine placement already requires multi-GPU coordination — PD disaggregation becomes a natural extension.
- Workload shape varies unpredictably — independent scaling saves over-provisioning.
When not to use¶
- Small models where a single replica handles prefill + decode without contention — PD disaggregation adds LB complexity without meaningful return.
- Short prompts, long outputs — prefill is cheap, contention doesn't materialise.
- Low-bandwidth topologies — without RDMA, inter-stage KV transfer cost dominates, negating the win.
Generalisation beyond LLM inference¶
The pattern applies broadly to multi-stage compute pipelines with per-stage resource-profile asymmetry:
- Retrieval-then-rank search pipelines (BM25 / vector candidate generation = I/O bound; cross-encoder rerank = compute bound).
- Encode-then-decode in speech / translation models.
- Admission-then-worker in classical serving (nginx admission + backend worker pools).
Each has the same shape: stage-distinct resource needs + cheap inter-stage state transfer + independent scaling opportunity.
Seen in¶
- sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models — canonical wiki instance; Cloudflare Workers AI PD disaggregation for Kimi K2.5 and other large models on Infire.
Related¶
- concepts/prefill-decode-disaggregation — the concept behind the LLM-specific realisation.
- concepts/token-aware-load-balancing — companion admission primitive.
- concepts/kv-cache / concepts/rdma-kv-transfer — the state-transfer mechanism.
- concepts/memory-bound-vs-compute-bound — the roofline framing.
- patterns/independent-scaling-tiers — parent pattern family.
- systems/workers-ai / systems/infire / systems/mooncake-transfer-engine
- companies/cloudflare