Prefix-aware routing (LLM inference)¶
Definition¶
Prefix-aware routing is the inference-request routing strategy that sends requests sharing a common prompt prefix to the same model-serving replica, so the KV cache computed for the shared prefix on the first request can be reused on subsequent requests — instead of every replica independently recomputing the prefix's K/V tensors.
It is a specialisation of concepts/workload-aware-routing for LLM-inference workloads: the routing feature is the prefix hash of the request's prompt, and the backend-fit criterion is "does this replica already have the prefix's KV state hot?"
Why the prefix is the right feature¶
LLM serving traffic has a characteristic distribution:
- System prompts are shared across all requests to a product (a code-assistant, a chat agent, a document-Q&A feature) — thousands of tokens of identical prefix per request.
- Session / conversation context is identical across the multi-turn requests in one session — every follow-up request shares the full prior conversation as prefix.
- Few-shot exemplars are reused across many distinct queries in the same application path.
Across these cases, the first N tokens of incoming requests repeat at very high rates across independent requests. The KV cache for those N tokens is expensive to compute (full forward pass) but trivially reusable if the serving replica can find it in-cache.
Shape-agnostic routing (round-robin, least-connections) puts same-prefix requests on different replicas; each replica rebuilds the prefix KV from scratch. Prefix-aware routing puts same-prefix requests on the same replica; the second request onward hits the cache.
Three strategies the HyperPod Inference Operator documents¶
"The installation automatically configures intelligent routing capabilities with multiple strategies (prefix-aware, KV-aware, round-robin) to maximize cache efficiency and minimize inference latency based on workload characteristics."
- Prefix-aware — hash the request's prompt prefix, route via consistent hashing to a replica. Cheap to implement (hash computation + routing table lookup), doesn't require telemetry from replicas. Per-replica cache locality correlates with prefix locality of reference — as long as the same prefix keeps arriving, it stays resident.
- KV-aware — the router reads actual KV-cache occupancy telemetry from each replica (which prefixes are hot right now) and routes to the replica that currently has the best prefix match. More expressive than prefix-hash — it handles eviction correctly (if a prefix got evicted, the router knows and doesn't keep sending there). Requires a replica → router telemetry channel.
- Round-robin — cache-agnostic baseline. Useful as the comparison point and as a fallback when cache locality is irrelevant (e.g. every request has a unique long prompt with no shared prefix).
The operator picks one strategy at install time based on expected workload characteristics; the post names the three strategies but discloses no internals on the selection algorithm, backpressure behaviour, or the fallback when prefix-aware routing tips a replica into OOM.
Consistent-hashing realisation (the standard shape)¶
The typical implementation is consistent hashing over the prefix:
- Hash the first N tokens of the request (or the first N tokens that exceed a threshold of shared-prefix probability).
- Look up the replica on the hash ring.
- Route.
Consistent hashing means:
- Same prefix → same replica (cache hit).
- Replica failure / add / remove remaps only ~1/R of the prefix space (for R replicas) to other replicas, not the full traffic set.
- Replica load is naturally distributed because the prefix space is usually much larger than the replica set.
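The steps above can be sketched as a small consistent-hash router keyed on the prompt prefix. Everything here is illustrative — the token count, virtual-node count, and replica names are assumptions, not values disclosed in the announcement post:

```python
import hashlib
from bisect import bisect_right

class PrefixRouter:
    """Sketch of consistent hashing over the prompt prefix.

    Same first N_PREFIX_TOKENS -> same ring point -> same replica,
    so the second request onward finds the prefix's KV state hot.
    """

    N_PREFIX_TOKENS = 512  # hypothetical: how many leading tokens form the key
    VNODES = 64            # virtual nodes per replica to smooth ring balance

    def __init__(self, replicas):
        self.ring = []  # sorted list of (ring point, replica name)
        for replica in replicas:
            for v in range(self.VNODES):
                self.ring.append((self._hash(f"{replica}#{v}"), replica))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, prompt_tokens):
        # Key on the first N tokens only; the suffix doesn't affect routing.
        prefix = ",".join(map(str, prompt_tokens[: self.N_PREFIX_TOKENS]))
        point = self._hash(prefix)
        # Walk clockwise to the first ring point at or after the key's point.
        i = bisect_right(self.ring, (point, "")) % len(self.ring)
        return self.ring[i][1]
```

Because only the first N tokens feed the hash, two requests that share a long system prompt but diverge later land on the same replica; and because the ring has many virtual nodes per replica, removing one replica remaps only its own arcs of the prefix space.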
Known tension: a hot prefix (one system prompt shared across the entire application) funnels all traffic to one replica, which will tip it over. Production implementations combine prefix hashing with overflow / shadow replicas or hot-prefix detection + replication — neither is disclosed in the announcement post.
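One common shape for the overflow mitigation (an assumed pattern — the post discloses nothing about how the operator handles hot prefixes) is to prefer the prefix's hashed home replica but spill to the next replica when the home is saturated:

```python
import hashlib

def overflow_route(prefix: str, replicas: list, inflight: dict, cap: int = 8):
    """Hot-prefix overflow sketch: route to the prefix's home replica unless
    its in-flight request count has hit `cap`, then walk to the next replica.
    Replica names, `inflight` bookkeeping, and `cap` are all illustrative."""
    h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
    for step in range(len(replicas)):
        candidate = replicas[(h + step) % len(replicas)]
        if inflight.get(candidate, 0) < cap:
            return candidate
    # Every replica is saturated: fall back to the home replica anyway,
    # preserving cache locality under global overload.
    return replicas[h % len(replicas)]
```

The trade is deliberate: each spill step sacrifices cache locality for that request (the overflow replica must build the prefix KV once) in exchange for not tipping the home replica over.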
The ceiling: prefix-aware routing only pays off if the KV cache is reusable¶
Prefix-aware routing assumes the replica can actually retain the prefix's KV state across requests. Two things go wrong:
- No KV-cache-reuse support in the serving library → cache is discarded between requests; routing same-prefix requests to the same replica buys nothing.
- HBM too small for the prefix working set → LRU evicts the prefix between requests; again, same-replica routing buys nothing.
This is why managed tiered KV cache and prefix-aware routing ship together in the HyperPod Inference Operator — tiering makes the prefix survive across requests (fallback to DRAM / NVMe instead of eviction); routing ensures the prefix's second consumer hits the same replica that has it hot.
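A toy two-tier cache makes the interaction concrete. This is an assumed shape, not the operator's disclosed internals: evictions from the small fast tier (HBM) demote entries to a larger slow tier (DRAM/NVMe) instead of dropping them, so a routed-back request still finds the prefix's KV state:

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: LRU eviction from the fast tier
    demotes to the slow tier rather than discarding."""

    def __init__(self, hbm_slots: int):
        self.hbm = OrderedDict()   # prefix_hash -> KV blob (fast tier)
        self.dram = {}             # demoted entries (slow tier)
        self.hbm_slots = hbm_slots

    def put(self, prefix_hash, kv):
        self.hbm[prefix_hash] = kv
        self.hbm.move_to_end(prefix_hash)
        while len(self.hbm) > self.hbm_slots:
            victim, blob = self.hbm.popitem(last=False)  # LRU eviction
            self.dram[victim] = blob                     # demote, don't drop

    def get(self, prefix_hash):
        if prefix_hash in self.hbm:
            self.hbm.move_to_end(prefix_hash)            # refresh recency
            return self.hbm[prefix_hash]
        if prefix_hash in self.dram:
            kv = self.dram.pop(prefix_hash)              # promote on hit
            self.put(prefix_hash, kv)
            return kv
        return None  # true miss: caller must recompute the prefix KV
```

Without the demotion step, `get` after eviction returns None and prefix-aware routing buys nothing; with it, the slow-tier hit is still far cheaper than a full prefill forward pass.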
Comparison with other routing strategies¶
| Strategy | Routing feature | When it wins |
|---|---|---|
| Round-robin / least-conn | None | Stateless workloads; no cache to amortise |
| Consistent hashing (session-sticky) | Session / user ID | Per-user state on the server (classic web session) |
| Workload-aware | Query shape (tables, body text, source header) | Heterogeneous backend clusters tuned for different query shapes |
| Prefix-aware (this) | Prompt prefix hash | LLM inference with shared-prefix traffic |
| KV-aware | Replica cache-occupancy telemetry | LLM inference with unpredictable prefix distribution |
Prefix-aware routing shares lineage with session-sticky routing — both use a consistent-hash key to pin request → replica. The key difference is that prefix stickiness is value-based (compute the key from payload content), not identity-based (use a session cookie). Two different sessions with the same system prompt route to the same replica — deliberately.
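The value-based / identity-based distinction is just two ways of deriving the sticky key. A minimal sketch (field names and the prefix length are illustrative):

```python
import hashlib

def session_sticky_key(request: dict) -> str:
    # Identity-based: the key comes from WHO is asking (session cookie / id).
    return request["session_id"]

def prefix_sticky_key(request: dict, n_chars: int = 2048) -> str:
    # Value-based: the key is derived from WHAT is being asked (payload content).
    prefix = request["prompt"][:n_chars]
    return hashlib.sha256(prefix.encode()).hexdigest()
```

Two sessions sharing a system prompt get different session keys but the same prefix key, which is exactly the deliberate behaviour described above.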
Seen in¶
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — sole source at time of writing. Names the three strategies (prefix-aware / KV-aware / round-robin) as install-time options; no algorithm internals disclosed.
Related¶
- concepts/kv-cache — the primitive that makes prefix-aware routing pay off.
- concepts/workload-aware-routing — the general pattern this specialises.
- concepts/locality-aware-scheduling — the analogous primitive at the scheduler layer (place task near data); prefix-aware routing is the analogue at the request-router layer (place request near cache).
- systems/sagemaker-hyperpod-inference-operator — the canonical production consumer.