
CONCEPT Cited by 1 source

Token-aware load balancing (LLM serving)

Definition

Token-aware load balancing is the admission-control / routing primitive for LLM-serving load balancers in which the balancer's per-endpoint load estimator tracks in-flight tokens (typically separately for prefill and decode), not in-flight requests. Each candidate endpoint's load is the sum of token counts across requests currently routed to it but not yet completed; the balancer distributes new requests so that per-endpoint token load stays roughly even.

Canonical wiki instance: Cloudflare Workers AI 2026-04-16 in the PD disaggregation load balancer. "We extended this to implement token-aware load balancing, in which there is a pool of prefill and decode endpoints, and the load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly." (Source: sources/2026-04-16-cloudflare-building-the-foundation-for-running-extra-large-language-models)

Why token-count is the right feature, not request-count

LLM serving traffic has extreme per-request variance on the load a single request puts on a server:

  • Prefill cost scales with prompt length; short prompts and hundred-thousand-token prompts on the same endpoint differ by three orders of magnitude in compute.
  • Decode cost per step is similar across requests (one token per step), but decode duration scales with output length, and KV-cache footprint scales with context length.

A request-count-based balancer (e.g. round-robin or least-connections) treats both extremes equally, producing catastrophic skew when large prompts land unevenly. Token-count-based tracking makes the load estimate proportional to the actual GPU work.

This is the structural companion to token-count-based batching on the inference engine itself: the engine admits up to a token budget per batch; the load balancer routes up to a token budget per endpoint.
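As a minimal sketch of the routing rule (class and method names are illustrative, not Cloudflare's implementation): the balancer keeps an in-flight token counter per endpoint, adds a request's tokens on admission, picks the endpoint with the lowest counter, and subtracts on completion.

```python
class TokenAwareBalancer:
    """Sketch: route each request to the endpoint with the fewest
    in-flight tokens, so load estimates track GPU work, not
    request count. Illustrative only."""

    def __init__(self, endpoints):
        # in-flight token count per endpoint
        self.load = {ep: 0 for ep in endpoints}

    def route(self, request_tokens):
        # least token load, not least connections
        ep = min(self.load, key=self.load.get)
        self.load[ep] += request_tokens
        return ep

    def complete(self, ep, request_tokens):
        # release the request's tokens when it finishes
        self.load[ep] -= request_tokens
```

Under this rule a single 100,000-token prompt occupies one endpoint's budget while subsequent short prompts spill to the others, which is exactly the skew a request-count balancer would miss.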

Separate prefill and decode token counts under PD disaggregation

Under PD disaggregation, the balancer maintains two separate token-load estimates per endpoint — prefill-tokens-in-flight and decode-tokens-in-flight — because the two stages burn different resources (compute-bound vs memory-bound).

A single request contributes:

  • Its prompt-length worth of prefill-tokens to a prefill endpoint
  • Its context-length worth of decode-tokens to a decode endpoint (typically the context accumulates per step)

Routing decisions for prefill and decode are made against the respective pools.
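A dual-counter sketch of the above, assuming two independent pools with token loads keyed by prompt length (prefill) and context length (decode); all names are assumptions, not from the source:

```python
class PDBalancer:
    """Sketch: separate prefill/decode token accounting under
    PD disaggregation. Illustrative only."""

    def __init__(self, prefill_pool, decode_pool):
        # two independent token-load estimates per stage
        self.prefill_load = {ep: 0 for ep in prefill_pool}
        self.decode_load = {ep: 0 for ep in decode_pool}

    def route_prefill(self, prompt_tokens):
        # prefill cost scales with prompt length (compute-bound)
        ep = min(self.prefill_load, key=self.prefill_load.get)
        self.prefill_load[ep] += prompt_tokens
        return ep

    def route_decode(self, context_tokens):
        # decode load tracked by context length (memory-bound:
        # KV-cache footprint, one token per step)
        ep = min(self.decode_load, key=self.decode_load.get)
        self.decode_load[ep] += context_tokens
        return ep
```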

Relation to other admission primitives

Primitive                        Axis                       Side
Round-robin / least-connections  request count              LB
Token-count batching             token count                inference engine / admission into forward pass
Token-aware load balancing       token count                LB / routing across endpoints
Session affinity                 prefix-hash / client hint  LB / routing for cache-hit
Prefix-aware routing             prompt prefix hash         LB / routing for KV-reuse

Token-aware LB is orthogonal to session affinity: session affinity says which endpoint is the preferred target (to hit its warm cache); token-aware LB says how loaded each endpoint currently is (to avoid pile-on). A well-tuned LB combines the two — prefer the session-affinity target unless it's over the token budget, then fall back.
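That combination can be sketched in a few lines; note that `token_budget` is a hypothetical tunable, since the source doesn't disclose per-endpoint caps or whether affinity is a hard or soft preference:

```python
def pick_endpoint(affinity_target, load, token_budget, request_tokens):
    """Sketch: prefer the session-affinity target (warm cache)
    unless admitting this request would exceed its token budget;
    otherwise fall back to the least token-loaded endpoint.
    Illustrative only; `token_budget` is an assumed parameter."""
    if load[affinity_target] + request_tokens <= token_budget:
        return affinity_target
    # affinity target over budget: spread by token load instead
    return min(load, key=load.get)
```

The design choice here is that affinity is a soft preference: cache-hit wins at moderate load, pile-on avoidance wins near the budget.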

Load-balancer responsibilities (Cloudflare instance)

From the source, the PD disaggregation LB at Workers AI does all of:

  1. Token-aware load balancing (this page).
  2. Two-hop routing: prefill → decode per request.
  3. KV-transfer metadata passing: prefill's KV-transfer init payload is translated for the decode server.
  4. Response rewrite: decode's SSE stream augmented with prefill-side metadata (cached-token counts) before reaching the client.

See concepts/prefill-decode-disaggregation for the full scope.

Caveats

  • The Cloudflare post doesn't disclose the exact token-load scoring function — whether it's a simple sum, queue-depth-adjusted, or EWMA-based.
  • Interaction with session affinity not detailed — whether affinity is a hard or soft preference under token-load imbalance.
  • Per-endpoint token-budget caps not disclosed — whether admission blocks or spills to other endpoints under overload.
  • Lifecycle of the token-count estimate not characterised — when tokens are "subtracted" from a decode endpoint's load (on per-token emission? on request completion?).
