
CONCEPT Cited by 1 source

Token-aware load balancing (LLM serving)

Definition

Token-aware load balancing is the admission-control / routing primitive for LLM-serving load balancers in which the balancer's per-endpoint load estimator tracks in-flight tokens (typically separately for prefill and decode), not in-flight requests. Each candidate endpoint's load is the sum of token counts across requests currently routed to it but not yet completed; the balancer distributes new requests so that per-endpoint token load stays roughly even.

Canonical wiki instance: Cloudflare Workers AI 2026-04-16 in the PD disaggregation load balancer. "We extended this to implement token-aware load balancing, in which there is a pool of prefill and decode endpoints, and the load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why token-count is the right feature, not request-count

LLM serving traffic has extreme per-request variance on the load a single request puts on a server:

  • Prefill cost scales with prompt length; short prompts and hundred-thousand-token prompts on the same endpoint differ by three orders of magnitude in compute.
  • Decode cost per step is similar across requests (one token per step), but decode duration scales with output length, and KV-cache footprint scales with context length.

A request-count-based balancer (e.g. round-robin or least-connections) treats both extremes equally, producing catastrophic skew when large prompts land unevenly. Token-count-based tracking makes the load estimate proportional to the actual GPU work.

This is the structural companion to token-count-based batching on the inference engine itself: the engine admits up to a token budget per batch; the load balancer routes up to a token budget per endpoint.
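As a minimal sketch of the routing rule (class and method names are illustrative, not Cloudflare's implementation): the balancer keeps an in-flight token counter per endpoint, adds a request's tokens on admission, picks the endpoint with the lowest counter, and subtracts on completion.

```python
class TokenAwareBalancer:
    """Sketch: route each request to the endpoint with the fewest
    in-flight tokens, so load estimates track GPU work, not
    request count. Illustrative only."""

    def __init__(self, endpoints):
        # in-flight token count per endpoint
        self.load = {ep: 0 for ep in endpoints}

    def route(self, request_tokens):
        # least token load, not least connections
        ep = min(self.load, key=self.load.get)
        self.load[ep] += request_tokens
        return ep

    def complete(self, ep, request_tokens):
        # release the request's tokens when it finishes
        self.load[ep] -= request_tokens
```

Under this rule a single 100,000-token prompt occupies one endpoint's budget while subsequent short prompts spill to the others, which is exactly the skew a request-count balancer would miss.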

Separate prefill and decode token counts under PD disaggregation

Under PD disaggregation, the balancer maintains two separate token-load estimates per endpoint — prefill-tokens-in-flight and decode-tokens-in-flight — because the two stages burn different resources (compute-bound vs memory-bound).

A single request contributes:

  • Its prompt-length worth of prefill-tokens to a prefill endpoint
  • Its context-length worth of decode-tokens to a decode endpoint (typically the context accumulates per step)

Routing decisions for prefill and decode are made against the respective pools.
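A dual-counter sketch of the above, assuming two independent pools with token loads keyed by prompt length (prefill) and context length (decode); all names are assumptions, not from the source:

```python
class PDBalancer:
    """Sketch: separate prefill/decode token accounting under
    PD disaggregation. Illustrative only."""

    def __init__(self, prefill_pool, decode_pool):
        # two independent token-load estimates per stage
        self.prefill_load = {ep: 0 for ep in prefill_pool}
        self.decode_load = {ep: 0 for ep in decode_pool}

    def route_prefill(self, prompt_tokens):
        # prefill cost scales with prompt length (compute-bound)
        ep = min(self.prefill_load, key=self.prefill_load.get)
        self.prefill_load[ep] += prompt_tokens
        return ep

    def route_decode(self, context_tokens):
        # decode load tracked by context length (memory-bound:
        # KV-cache footprint, one token per step)
        ep = min(self.decode_load, key=self.decode_load.get)
        self.decode_load[ep] += context_tokens
        return ep
```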

Relation to other admission primitives

Primitive                        Axis                       Side
Round-robin / least-connections  request count              LB
Token-count batching             token count                inference engine / admission into forward pass
Token-aware load balancing       token count                LB / routing across endpoints
Session affinity                 prefix-hash / client hint  LB / routing for cache-hit
Prefix-aware routing             prompt prefix hash         LB / routing for KV-reuse

Token-aware LB is orthogonal to session affinity: session affinity says which endpoint is the preferred target (to hit its warm cache); token-aware LB says how loaded each endpoint currently is (to avoid pile-on). A well-tuned LB combines the two — prefer the session-affinity target unless it's over the token budget, then fall back.
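That combination can be sketched in a few lines; note that `token_budget` is a hypothetical tunable, since the source doesn't disclose per-endpoint caps or whether affinity is a hard or soft preference:

```python
def pick_endpoint(affinity_target, load, token_budget, request_tokens):
    """Sketch: prefer the session-affinity target (warm cache)
    unless admitting this request would exceed its token budget;
    otherwise fall back to the least token-loaded endpoint.
    Illustrative only; `token_budget` is an assumed parameter."""
    if load[affinity_target] + request_tokens <= token_budget:
        return affinity_target
    # affinity target over budget: spread by token load instead
    return min(load, key=load.get)
```

The design choice here is that affinity is a soft preference: cache-hit wins at moderate load, pile-on avoidance wins near the budget.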

Load-balancer responsibilities (Cloudflare instance)

From the source, the PD disaggregation LB at Workers AI does all of:

  1. Token-aware load balancing (this page).
  2. Two-hop routing: prefill → decode per request.
  3. KV-transfer metadata passing: prefill's KV-transfer init payload is translated for the decode server.
  4. Response rewrite: decode's SSE stream augmented with prefill-side metadata (cached-token counts) before reaching the client.

See concepts/prefill-decode-disaggregation for the full scope.

Caveats

  • The Cloudflare post doesn't disclose the exact token-load scoring function — whether it's a simple sum, queue-depth-adjusted, or EWMA-based.
  • Interaction with session affinity not detailed — whether affinity is a hard or soft preference under token-load imbalance.
  • Per-endpoint token-budget caps not disclosed — whether admission blocks or spills to other endpoints under overload.
  • Lifecycle of the token-count estimate not characterised — when tokens are "subtracted" from a decode endpoint's load (on per-token emission? on request completion?).
