Prefix-aware routing (LLM inference)¶
Definition¶
Prefix-aware routing is the inference-request routing strategy that sends requests sharing a common prompt prefix to the same model-serving replica, so the KV cache computed for the shared prefix on the first request can be reused on subsequent requests — instead of every replica independently recomputing the prefix's K/V tensors.
It is a specialisation of concepts/workload-aware-routing for LLM-inference workloads: the routing feature is the prefix hash of the request's prompt, and the backend-fit criterion is "does this replica already have the prefix's KV state hot?"
Why the prefix is the right feature¶
LLM serving traffic has a characteristic distribution:
- System prompts are shared across all requests to a product (a code-assistant, a chat agent, a document-Q&A feature) — thousands of tokens of identical prefix per request.
- Session / conversation context is identical across the multi-turn requests in one session — every follow-up request shares the full prior conversation as prefix.
- Few-shot exemplars are reused across many distinct queries in the same application path.
Across these cases, the first N tokens of incoming requests repeat at very high rates across independent requests. The KV cache for those N tokens is expensive to compute (full forward pass) but trivially reusable if the serving replica can find it in-cache.
Shape-agnostic routing (round-robin, least-connections) puts same-prefix requests on different replicas; each replica rebuilds the prefix KV from scratch. Prefix-aware routing puts same-prefix requests on the same replica; the second request onward hits the cache.
Three strategies the HyperPod Inference Operator documents¶
"The installation automatically configures intelligent routing capabilities with multiple strategies (prefix-aware, KV-aware, round-robin) to maximize cache efficiency and minimize inference latency based on workload characteristics."
- Prefix-aware — hash the request's prompt prefix, route via consistent hashing to a replica. Cheap to implement (hash computation + routing table lookup), doesn't require telemetry from replicas. Per-replica cache locality correlates with prefix locality of reference — as long as the same prefix keeps arriving, it stays resident.
- KV-aware — the router reads actual KV-cache occupancy telemetry from each replica (which prefixes are hot right now) and routes to the replica that currently has the best prefix match. More expressive than prefix-hash — it handles eviction correctly (if a prefix got evicted, the router knows and doesn't keep sending there). Requires a replica → router telemetry channel.
- Round-robin — cache-agnostic baseline. Useful as the comparison point and as a fallback when cache locality is irrelevant (e.g. every request has a unique long prompt with no shared prefix).
The operator picks one strategy at install time based on expected workload characteristics; the post names the three strategies but discloses no internals on the selection algorithm, backpressure behaviour, or the fallback when prefix-aware routing tips a replica into OOM.
Consistent-hashing realisation (the standard shape)¶
The typical implementation is consistent hashing over the prefix:
- Hash the first N tokens of the request (or the first N tokens that exceed a threshold of shared-prefix probability).
- Look up the replica on the hash ring.
- Route.
Consistent hashing means:
- Same prefix → same replica (cache hit).
- Replica failure / add / remove remaps only ~1/R of the prefix space (for R replicas) to other replicas, not the full traffic set.
- Replica load is naturally distributed because the prefix space is usually much larger than the replica set.
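The steps above can be sketched as a small consistent-hash router keyed on the prompt prefix. Everything here is illustrative — the token count, virtual-node count, and replica names are assumptions, not values disclosed in the announcement post:

```python
import hashlib
from bisect import bisect_right

class PrefixRouter:
    """Sketch of consistent hashing over the prompt prefix.

    Same first N_PREFIX_TOKENS -> same ring point -> same replica,
    so the second request onward finds the prefix's KV state hot.
    """

    N_PREFIX_TOKENS = 512  # hypothetical: how many leading tokens form the key
    VNODES = 64            # virtual nodes per replica to smooth ring balance

    def __init__(self, replicas):
        self.ring = []  # sorted list of (ring point, replica name)
        for replica in replicas:
            for v in range(self.VNODES):
                self.ring.append((self._hash(f"{replica}#{v}"), replica))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, prompt_tokens):
        # Key on the first N tokens only; the suffix doesn't affect routing.
        prefix = ",".join(map(str, prompt_tokens[: self.N_PREFIX_TOKENS]))
        point = self._hash(prefix)
        # Walk clockwise to the first ring point at or after the key's point.
        i = bisect_right(self.ring, (point, "")) % len(self.ring)
        return self.ring[i][1]
```

Because only the first N tokens feed the hash, two requests that share a long system prompt but diverge later land on the same replica; and because the ring has many virtual nodes per replica, removing one replica remaps only its own arcs of the prefix space.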
Known tension: a hot prefix (one system prompt shared across the entire application) funnels all traffic to one replica, which will tip it over. Production implementations combine prefix hashing with overflow / shadow replicas or hot-prefix detection + replication — neither is disclosed in the announcement post.
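One common shape for the overflow mitigation (an assumed pattern — the post discloses nothing about how the operator handles hot prefixes) is to prefer the prefix's hashed home replica but spill to the next replica when the home is saturated:

```python
import hashlib

def overflow_route(prefix: str, replicas: list, inflight: dict, cap: int = 8):
    """Hot-prefix overflow sketch: route to the prefix's home replica unless
    its in-flight request count has hit `cap`, then walk to the next replica.
    Replica names, `inflight` bookkeeping, and `cap` are all illustrative."""
    h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
    for step in range(len(replicas)):
        candidate = replicas[(h + step) % len(replicas)]
        if inflight.get(candidate, 0) < cap:
            return candidate
    # Every replica is saturated: fall back to the home replica anyway,
    # preserving cache locality under global overload.
    return replicas[h % len(replicas)]
```

The trade is deliberate: each spill step sacrifices cache locality for that request (the overflow replica must build the prefix KV once) in exchange for not tipping the home replica over.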
The ceiling: prefix-aware routing only pays off if the KV cache is reusable¶
Prefix-aware routing assumes the replica can actually retain the prefix's KV state across requests. Two things go wrong:
- No KV-cache-reuse support in the serving library → cache is discarded between requests; routing same-prefix requests to the same replica buys nothing.
- HBM too small for the prefix working set → LRU evicts the prefix between requests; again, same-replica routing buys nothing.
This is why managed tiered KV cache and prefix-aware routing ship together in the HyperPod Inference Operator — tiering makes the prefix survive across requests (fallback to DRAM / NVMe instead of eviction); routing ensures the prefix's second consumer hits the same replica that has it hot.
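A toy two-tier cache makes the interaction concrete. This is an assumed shape, not the operator's disclosed internals: evictions from the small fast tier (HBM) demote entries to a larger slow tier (DRAM/NVMe) instead of dropping them, so a routed-back request still finds the prefix's KV state:

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: LRU eviction from the fast tier
    demotes to the slow tier rather than discarding."""

    def __init__(self, hbm_slots: int):
        self.hbm = OrderedDict()   # prefix_hash -> KV blob (fast tier)
        self.dram = {}             # demoted entries (slow tier)
        self.hbm_slots = hbm_slots

    def put(self, prefix_hash, kv):
        self.hbm[prefix_hash] = kv
        self.hbm.move_to_end(prefix_hash)
        while len(self.hbm) > self.hbm_slots:
            victim, blob = self.hbm.popitem(last=False)  # LRU eviction
            self.dram[victim] = blob                     # demote, don't drop

    def get(self, prefix_hash):
        if prefix_hash in self.hbm:
            self.hbm.move_to_end(prefix_hash)            # refresh recency
            return self.hbm[prefix_hash]
        if prefix_hash in self.dram:
            kv = self.dram.pop(prefix_hash)              # promote on hit
            self.put(prefix_hash, kv)
            return kv
        return None  # true miss: caller must recompute the prefix KV
```

Without the demotion step, `get` after eviction returns None and prefix-aware routing buys nothing; with it, the slow-tier hit is still far cheaper than a full prefill forward pass.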
Comparison with other routing strategies¶
| Strategy | Routing feature | When it wins |
|---|---|---|
| Round-robin / least-conn | None | Stateless workloads; no cache to amortise |
| Consistent hashing (session-sticky) | Session / user ID | Per-user state on the server (classic web session) |
| Workload-aware | Query shape (tables, body text, source header) | Heterogeneous backend clusters tuned for different query shapes |
| Prefix-aware (this) | Prompt prefix hash | LLM inference with shared-prefix traffic |
| KV-aware | Replica cache-occupancy telemetry | LLM inference with unpredictable prefix distribution |
Prefix-aware routing shares lineage with session-sticky routing — both use a consistent-hash key to pin request → replica. The key difference is that prefix stickiness is value-based (compute the key from payload content), not identity-based (use a session cookie). Two different sessions with the same system prompt route to the same replica — deliberately.
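The value-based / identity-based distinction is just two ways of deriving the sticky key. A minimal sketch (field names and the prefix length are illustrative):

```python
import hashlib

def session_sticky_key(request: dict) -> str:
    # Identity-based: the key comes from WHO is asking (session cookie / id).
    return request["session_id"]

def prefix_sticky_key(request: dict, n_chars: int = 2048) -> str:
    # Value-based: the key is derived from WHAT is being asked (payload content).
    prefix = request["prompt"][:n_chars]
    return hashlib.sha256(prefix.encode()).hexdigest()
```

Two sessions sharing a system prompt get different session keys but the same prefix key, which is exactly the deliberate behaviour described above.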
Seen in¶
- sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod — sole source at time of writing. Names the three strategies (prefix-aware / KV-aware / round-robin) as install-time options; no algorithm internals disclosed.
Related¶
- concepts/kv-cache — the primitive that makes prefix-aware routing pay off.
- concepts/workload-aware-routing — the general pattern this specialises.
- concepts/locality-aware-scheduling — the analogous primitive at the scheduler layer (place task near data); prefix-aware routing is the analogue at the request-router layer (place request near cache).
- systems/sagemaker-hyperpod-inference-operator — the canonical production consumer.