
Disaggregated inference stages

Pattern

Disaggregated inference stages is the serving-side architectural pattern of splitting the distinct stages of a multi-stage inference pipeline onto separate tiers of servers, each tuned and scaled independently for its stage's compute characteristics. A stage-aware load balancer routes requests through the stages and transfers per-stage state (notably the KV cache) between them.

The canonical wiki instance is prefill/decode disaggregation — LLM inference split into a prefill tier (processes input → populates KV cache, compute-bound) and a decode tier (generates output tokens, memory-bound), running on potentially different hardware in the same cluster.

From the source (Cloudflare Workers AI 2026-04-16):

"There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound. … Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
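A toy sketch makes the asymmetry concrete (hypothetical code, not a real engine): prefill touches every input token in one batched pass, while decode re-reads the whole KV cache on every step to emit a single token.

```python
# Toy illustration of the two stages' resource profiles. All names and
# data structures here are stand-ins, not a real inference engine.

def prefill(input_tokens):
    """Process ALL input tokens in one batched pass (compute-bound).
    Returns the populated KV cache plus the first sampled token."""
    kv_cache = [("kv", t) for t in input_tokens]  # stand-in for real tensors
    first_token = len(input_tokens)               # stand-in for sampling
    return kv_cache, first_token

def decode(kv_cache, first_token, n_new):
    """Generate tokens one at a time (memory-bandwidth-bound): each step
    reads the entire KV cache but does comparatively little compute."""
    out = [first_token]
    for _ in range(n_new - 1):
        _ = len(kv_cache)            # every step touches the full cache
        nxt = out[-1] + 1            # stand-in for sampling the next token
        kv_cache.append(("kv", nxt)) # cache grows with each generated token
        out.append(nxt)
    return out

# Disaggregated serving runs prefill() and decode() on separate tiers,
# shipping kv_cache between them (e.g. over RDMA).
kv, first = prefill([1, 2, 3, 4])
tokens = decode(kv, first, n_new=3)
```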

Preconditions

For the pattern to apply:

  1. The workload has structurally distinct stages with different resource profiles (compute vs memory, CPU vs GPU, SSD vs DRAM).
  2. Co-locating the stages leads to contention — the stages block each other or waste the GPU's non-bottleneck resources.
  3. A per-stage state artifact exists that can be transferred cheaply enough to not blow the end-to-end latency budget (for LLMs: KV cache via RDMA).
  4. Load shape varies such that independent scaling pays off (input-heavy vs output-heavy traffic in the LLM case).

Structure

        ┌────────────────────┐
client →│ Stage-aware LB     │
        │ (token-aware,      │
        │  response rewrite) │
        └─┬─────────┬─────┬──┘
         1↓        2↓     │ (streaming)
    ┌──────┐   ┌────────┐ │
    │Stage1│──→│Stage2  │→┘
    │tier  │KV │tier    │
    └──────┘   └────────┘
       ↑           ↑
       │           │
    sized for   sized for
    compute     memory BW

The load balancer is non-trivial — it does:

  1. Two-hop routing — stage 1 first, stage 2 second (not just round-robin).
  2. Stage-state transfer coordination — different engines require different KV-transfer init payloads.
  3. Response rewrite — combines stage-1 metadata (cached-token counts) into stage-2's streaming response before it reaches the client.
  4. Token-aware admission — tracks in-flight tokens per stage pool separately, because the resources are different.

See concepts/token-aware-load-balancing for the companion primitive.
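A minimal sketch of the two-hop flow, assuming hypothetical `prefill`/`decode` node APIs; the engine interfaces, field names, and token accounting below are illustrative, not Cloudflare's code:

```python
# Hypothetical stage-aware load balancer: two-hop routing, per-pool
# token-aware admission, and response rewrite across stages.
import itertools

class StageAwareLB:
    def __init__(self, prefill_pool, decode_pool):
        self._pf = itertools.cycle(prefill_pool)  # stand-in for real placement
        self._dc = itertools.cycle(decode_pool)
        # Token-aware admission: in-flight tokens tracked per pool,
        # because the two tiers bottleneck on different resources.
        self.inflight = {"prefill": 0, "decode": 0}

    def handle(self, request):
        n_in = len(request["input_tokens"])

        # Hop 1: a prefill node processes the prompt and returns a KV-cache
        # handle plus metadata (e.g. cached-token counts).
        self.inflight["prefill"] += n_in
        kv_handle, meta = next(self._pf).prefill(request)
        self.inflight["prefill"] -= n_in

        # Hop 2: a decode node pulls the KV cache (e.g. via RDMA) and
        # streams output tokens back through the LB.
        self.inflight["decode"] += n_in
        for chunk in next(self._dc).decode(kv_handle, request):
            # Response rewrite: fold stage-1 metadata into the stream
            # so the client sees one coherent response.
            chunk["usage"] = {**chunk.get("usage", {}), **meta}
            yield chunk
        self.inflight["decode"] -= n_in
```

The rewrite step is what lets the client remain oblivious to the split: it sees a single streaming response even though usage metadata originated on a different tier than the tokens.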

Benefits

  • Independent scaling — bias prefill:decode node ratio for the observed workload (input-heavy for agentic, output-heavy for long-form generation).
  • Independent tuning — per-stage batch size, KV allocation, memory policies.
  • Heterogeneous hardware — stage 1 can run on compute-rich older GPUs, stage 2 on newer GPUs with more HBM + bandwidth; "it allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even to run on heterogeneous hardware."
  • Reduced contention variance — co-located stages block each other unpredictably; disaggregation eliminates that variance source.
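A back-of-envelope sizing sketch shows how the prefill:decode node ratio flips with workload shape; all throughput and token numbers below are invented for illustration:

```python
# Hypothetical capacity arithmetic: nodes of each type needed per unit
# request rate, expressed as a prefill:decode ratio. The per-node
# throughput figures are made up, not measured values.
def node_ratio(avg_input_tokens, avg_output_tokens,
               prefill_tok_per_s, decode_tok_per_s):
    prefill_work = avg_input_tokens / prefill_tok_per_s   # node-seconds/request
    decode_work = avg_output_tokens / decode_tok_per_s
    return prefill_work / decode_work

# Input-heavy (agentic/RAG): long prompts, short answers.
agentic = node_ratio(8000, 300, prefill_tok_per_s=20000, decode_tok_per_s=100)
# Output-heavy (long-form generation): the balance tilts toward decode.
longform = node_ratio(500, 2000, prefill_tok_per_s=20000, decode_tok_per_s=100)
```

With co-located stages this ratio is fixed at whatever one replica happens to provide; disaggregation lets operators dial it per workload.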

Cloudflare's measured result: p90 TTFT dropped, p90 ITL ~100 ms → 20-30 ms (3× improvement), same GPU count, higher request volume, significant improvement in tail-latency variance. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Costs

  • LB complexity — response rewrite across stages + state-transfer coordination + token-aware admission is substantial new code, not a drop-in transform.
  • State-transfer budget — KV transfer between prefill and decode adds to TTFT; requires high-bandwidth interconnect (RDMA — NVLink intra-node, InfiniBand / RoCE / NVMe-oF inter-node) to stay under budget.
  • Operational surface area — two tiers of servers to operate, monitor, version, and fail-over independently.
  • Cold-start amplification — both tiers have to be ready; a single-tier outage can cascade.
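The state-transfer budget can be estimated from first principles. A rough check, assuming a Llama-3-70B-like model shape and round-number link speeds (neither from the source), shows why RDMA-class bandwidth is needed to keep KV transfer out of the TTFT budget:

```python
# KV cache size per token = 2 (K and V) * n_layers * n_kv_heads
#                           * head_dim * bytes_per_element.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def transfer_ms(prompt_tokens, bytes_per_token, link_gbytes_per_s):
    return prompt_tokens * bytes_per_token / (link_gbytes_per_s * 1e9) * 1e3

# Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_tok = kv_bytes_per_token(80, 8, 128)   # 327,680 bytes (~320 KB) per token

# An 8k-token prompt over a ~400 Gb/s RDMA link (~50 GB/s) vs a slow path.
over_rdma = transfer_ms(8000, per_tok, 50)  # tens of ms: tolerable for TTFT
over_slow = transfer_ms(8000, per_tok, 8)   # hundreds of ms: dominates TTFT
```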

When to use

  • Long-prompt / input-heavy workloads (agents, RAG, summarisation): the prefill-vs-decode contention is acute.
  • Extra-large models where single-machine placement already requires multi-GPU coordination — PD disaggregation becomes a natural extension.
  • Workload shape varies unpredictably — independent scaling saves over-provisioning.

When not to use

  • Small models where a single replica handles prefill + decode without contention — PD disaggregation adds LB complexity without meaningful return.
  • Short prompts, long outputs — prefill is cheap, contention doesn't materialise.
  • Low-bandwidth topologies — without RDMA, inter-stage KV transfer cost dominates, negating the win.

Generalisation beyond LLM inference

The pattern applies broadly to multi-stage compute pipelines with per-stage resource-profile asymmetry:

  • Retrieval-then-rank search pipelines (BM25 / vector candidate generation = I/O bound; cross-encoder rerank = compute bound).
  • Encode-then-decode in speech / translation models.
  • Admission-then-worker in classical serving (nginx admission + backend worker pools).

Each has the same shape: stage-distinct resource needs + cheap inter-stage state transfer + independent scaling opportunity.

Seen in
