
Disaggregated inference stages

Pattern

Disaggregated inference stages is the serving-side architectural pattern of splitting the distinct stages of a multi-stage inference pipeline onto separate tiers of servers, each tuned and scaled independently for its stage's compute characteristics. A stage-aware load balancer routes requests through the stages and transfers per-stage state (notably the KV cache) between them.

The canonical wiki instance is prefill/decode disaggregation — LLM inference split into a prefill tier (processes input → populates KV cache, compute-bound) and a decode tier (generates output tokens, memory-bound), running on potentially different hardware in the same cluster.

From the source (Cloudflare Workers AI 2026-04-16):

"There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound. … Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)
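A toy sketch makes the asymmetry concrete (hypothetical code, not a real engine): prefill touches every input token in one batched pass, while decode re-reads the whole KV cache on every step to emit a single token.

```python
# Toy illustration of the two stages' resource profiles. All names and
# data structures here are stand-ins, not a real inference engine.

def prefill(input_tokens):
    """Process ALL input tokens in one batched pass (compute-bound).
    Returns the populated KV cache plus the first sampled token."""
    kv_cache = [("kv", t) for t in input_tokens]  # stand-in for real tensors
    first_token = len(input_tokens)               # stand-in for sampling
    return kv_cache, first_token

def decode(kv_cache, first_token, n_new):
    """Generate tokens one at a time (memory-bandwidth-bound): each step
    reads the entire KV cache but does comparatively little compute."""
    out = [first_token]
    for _ in range(n_new - 1):
        _ = len(kv_cache)            # every step touches the full cache
        nxt = out[-1] + 1            # stand-in for sampling the next token
        kv_cache.append(("kv", nxt)) # cache grows with each generated token
        out.append(nxt)
    return out

# Disaggregated serving runs prefill() and decode() on separate tiers,
# shipping kv_cache between them (e.g. over RDMA).
kv, first = prefill([1, 2, 3, 4])
tokens = decode(kv, first, n_new=3)
```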

Preconditions

For the pattern to apply:

  1. The workload has structurally distinct stages with different resource profiles (compute vs memory, CPU vs GPU, SSD vs DRAM).
  2. Co-locating the stages leads to contention — the stages block each other or waste the GPU's non-bottleneck resources.
  3. A per-stage state artifact exists that can be transferred cheaply enough to not blow the end-to-end latency budget (for LLMs: KV cache via RDMA).
  4. Load shape varies such that independent scaling pays off (input-heavy vs output-heavy traffic in the LLM case).

Structure

        ┌────────────────────┐
client →│ Stage-aware LB     │
        │ (token-aware,      │
        │  response rewrite) │
        └─┬─────────┬─────┬──┘
         1↓        2↓     │ (streaming)
    ┌──────┐   ┌────────┐ │
    │Stage1│──→│Stage2  │→┘
    │tier  │KV │tier    │
    └──────┘   └────────┘
       ↑           ↑
       │           │
    sized for   sized for
    compute     memory BW

The load balancer is non-trivial — it does:

  1. Two-hop routing — stage 1 first, stage 2 second (not just round-robin).
  2. Stage-state transfer coordination — different engines require different KV-transfer init payloads.
  3. Response rewrite — combines stage-1 metadata (cached-token counts) into stage-2's streaming response before it reaches the client.
  4. Token-aware admission — tracks in-flight tokens per stage pool separately, because the resources are different.

See concepts/token-aware-load-balancing for the companion primitive.
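A minimal sketch of the two-hop flow, assuming hypothetical `prefill`/`decode` node APIs; the engine interfaces, field names, and token accounting below are illustrative, not Cloudflare's code:

```python
# Hypothetical stage-aware load balancer: two-hop routing, per-pool
# token-aware admission, and response rewrite across stages.
import itertools

class StageAwareLB:
    def __init__(self, prefill_pool, decode_pool):
        self._pf = itertools.cycle(prefill_pool)  # stand-in for real placement
        self._dc = itertools.cycle(decode_pool)
        # Token-aware admission: in-flight tokens tracked per pool,
        # because the two tiers bottleneck on different resources.
        self.inflight = {"prefill": 0, "decode": 0}

    def handle(self, request):
        n_in = len(request["input_tokens"])

        # Hop 1: a prefill node processes the prompt and returns a KV-cache
        # handle plus metadata (e.g. cached-token counts).
        self.inflight["prefill"] += n_in
        kv_handle, meta = next(self._pf).prefill(request)
        self.inflight["prefill"] -= n_in

        # Hop 2: a decode node pulls the KV cache (e.g. via RDMA) and
        # streams output tokens back through the LB.
        self.inflight["decode"] += n_in
        for chunk in next(self._dc).decode(kv_handle, request):
            # Response rewrite: fold stage-1 metadata into the stream
            # so the client sees one coherent response.
            chunk["usage"] = {**chunk.get("usage", {}), **meta}
            yield chunk
        self.inflight["decode"] -= n_in
```

The rewrite step is what lets the client remain oblivious to the split: it sees a single streaming response even though usage metadata originated on a different tier than the tokens.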

Benefits

  • Independent scaling — bias prefill:decode node ratio for the observed workload (input-heavy for agentic, output-heavy for long-form generation).
  • Independent tuning — per-stage batch size, KV allocation, memory policies.
  • Heterogeneous hardware — stage 1 can run on compute-rich older GPUs, stage 2 on newer GPUs with more HBM + bandwidth; "it allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even to run on heterogeneous hardware."
  • Reduced contention variance — co-located stages block each other unpredictably; disaggregation eliminates that variance source.
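A back-of-envelope sizing sketch shows how the prefill:decode node ratio flips with workload shape; all throughput and token numbers below are invented for illustration:

```python
# Hypothetical capacity arithmetic: nodes of each type needed per unit
# request rate, expressed as a prefill:decode ratio. The per-node
# throughput figures are made up, not measured values.
def node_ratio(avg_input_tokens, avg_output_tokens,
               prefill_tok_per_s, decode_tok_per_s):
    prefill_work = avg_input_tokens / prefill_tok_per_s   # node-seconds/request
    decode_work = avg_output_tokens / decode_tok_per_s
    return prefill_work / decode_work

# Input-heavy (agentic/RAG): long prompts, short answers.
agentic = node_ratio(8000, 300, prefill_tok_per_s=20000, decode_tok_per_s=100)
# Output-heavy (long-form generation): the balance tilts toward decode.
longform = node_ratio(500, 2000, prefill_tok_per_s=20000, decode_tok_per_s=100)
```

With co-located stages this ratio is fixed at whatever one replica happens to provide; disaggregation lets operators dial it per workload.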

Cloudflare's measured result: p90 TTFT dropped, p90 ITL ~100 ms → 20-30 ms (3× improvement), same GPU count, higher request volume, significant improvement in tail-latency variance. (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Costs

  • LB complexity — response rewrite across stages + state-transfer coordination + token-aware admission is substantial new code, not a drop-in transform.
  • State-transfer budget — KV transfer between prefill and decode adds to TTFT; requires high-bandwidth interconnect (RDMA — NVLink intra-node, InfiniBand / RoCE / NVMe-oF inter-node) to stay under budget.
  • Operational surface area — two tiers of servers to operate, monitor, version, and fail-over independently.
  • Cold-start amplification — both tiers have to be ready; a single-tier outage can cascade.
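The state-transfer budget can be estimated from first principles. A rough check, assuming a Llama-3-70B-like model shape and round-number link speeds (neither from the source), shows why RDMA-class bandwidth is needed to keep KV transfer out of the TTFT budget:

```python
# KV cache size per token = 2 (K and V) * n_layers * n_kv_heads
#                           * head_dim * bytes_per_element.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def transfer_ms(prompt_tokens, bytes_per_token, link_gbytes_per_s):
    return prompt_tokens * bytes_per_token / (link_gbytes_per_s * 1e9) * 1e3

# Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_tok = kv_bytes_per_token(80, 8, 128)   # 327,680 bytes (~320 KB) per token

# An 8k-token prompt over a ~400 Gb/s RDMA link (~50 GB/s) vs a slow path.
over_rdma = transfer_ms(8000, per_tok, 50)  # tens of ms: tolerable for TTFT
over_slow = transfer_ms(8000, per_tok, 8)   # hundreds of ms: dominates TTFT
```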

When to use

  • Long-prompt / input-heavy workloads (agents, RAG, summarisation): the prefill-vs-decode contention is acute.
  • Extra-large models where single-machine placement already requires multi-GPU coordination — PD disaggregation becomes a natural extension.
  • Workload shape varies unpredictably — independent scaling saves over-provisioning.

When not to use

  • Small models where a single replica handles prefill + decode without contention — PD disaggregation adds LB complexity without meaningful return.
  • Short prompts, long outputs — prefill is cheap, contention doesn't materialise.
  • Low-bandwidth topologies — without RDMA, inter-stage KV transfer cost dominates, negating the win.

Generalisation beyond LLM inference

The pattern applies broadly to multi-stage compute pipelines with per-stage resource-profile asymmetry:

  • Retrieval-then-rank search pipelines (BM25 / vector candidate generation = I/O bound; cross-encoder rerank = compute bound).
  • Encode-then-decode in speech / translation models.
  • Admission-then-worker in classical serving (nginx admission + backend worker pools).

Each has the same shape: stage-distinct resource needs + cheap inter-stage state transfer + independent scaling opportunity.

Seen in
