

Prefix-aware routing (LLM inference)

Definition

Prefix-aware routing is an inference-request routing strategy that sends requests sharing a common prompt prefix to the same model-serving replica, so the KV cache computed for the shared prefix by the first request can be reused by subsequent ones — instead of every replica independently recomputing the prefix's K/V tensors.

It is a specialisation of concepts/workload-aware-routing for LLM-inference workloads: the routing feature is the prefix hash of the request's prompt, and the backend-fit criterion is "does this replica already have the prefix's KV state hot?".

Why the prefix is the right feature

LLM serving traffic has a characteristic distribution:

  • System prompts are shared across all requests to a product (a code-assistant, a chat agent, a document-Q&A feature) — thousands of tokens of identical prefix per request.
  • Session / conversation context is identical across the multi-turn requests in one session — every follow-up request shares the full prior conversation as prefix.
  • Few-shot exemplars are reused across many distinct queries in the same application path.

Across these cases, the first N tokens of incoming requests repeat at very high rates across independent requests. The KV cache for those N tokens is expensive to compute (full forward pass) but trivially reusable if the serving replica can find it in-cache.

Shape-agnostic routing (round-robin, least-connections) puts same-prefix requests on different replicas; each replica rebuilds the prefix KV from scratch. Prefix-aware routing puts same-prefix requests on the same replica; the second request onward hits the cache.

Three strategies the HyperPod Inference Operator documents

(Source: sources/2026-04-06-aws-unlock-efficient-model-deployment-simplified-inference-operator-setup-on-amazon-sagemaker-hyperpod)

"The installation automatically configures intelligent routing capabilities with multiple strategies (prefix-aware, KV-aware, round-robin) to maximize cache efficiency and minimize inference latency based on workload characteristics."

  1. Prefix-aware — hash the request's prompt prefix, route via consistent hashing to a replica. Cheap to implement (hash computation + routing table lookup), doesn't require telemetry from replicas. Per-replica cache locality correlates with prefix locality of reference — as long as the same prefix keeps arriving, it stays resident.
  2. KV-aware — the router reads actual KV-cache occupancy telemetry from each replica (which prefixes are hot right now) and routes to the replica that currently has the best prefix match. More expressive than prefix-hash — it handles eviction correctly (if a prefix got evicted, the router knows and doesn't keep sending there). Requires a replica → router telemetry channel.
  3. Round-robin — cache-agnostic baseline. Useful as the comparison point and as a fallback when cache locality is irrelevant (e.g. every request has a unique long prompt with no shared prefix).
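A minimal sketch of the KV-aware decision, assuming replicas push their hot prefix-chunk hashes to the router (the telemetry shape, chunk size, and all names here are hypothetical — the post discloses no internals):

```python
import hashlib

def chunk_hashes(tokens, chunk=16):
    """Hash the prompt in fixed-size chunks with a rolling digest, so each
    chunk hash also commits to everything before it (prefix-match friendly)."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % chunk, chunk):
        h.update(repr(tokens[i:i + chunk]).encode())
        hashes.append(h.hexdigest())
    return hashes

def kv_aware_route(tokens, replica_hot_sets):
    """Pick the replica whose reported hot set covers the longest prefix.

    replica_hot_sets: {replica_name: set of chunk-hash strings}, assumed to
    arrive over a replica -> router telemetry channel (not shown here).
    """
    prefix = chunk_hashes(tokens)
    best, best_depth = None, -1
    for replica, hot in replica_hot_sets.items():
        depth = 0
        for h in prefix:            # count matching chunks from the start
            if h not in hot:
                break
            depth += 1
        if depth > best_depth:
            best, best_depth = replica, depth
    return best, best_depth
```

Because the digest is rolling, a replica can only match chunk *k* if it also matched chunks 0..k-1, which is exactly the "best prefix match" semantics described above.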

The operator picks one strategy at install time based on expected workload characteristics; the post names the three strategies but discloses nothing about the selection algorithm, backpressure behaviour, or the fallback when prefix-aware routing tips a replica into OOM.

Consistent-hashing realisation (the standard shape)

The typical implementation is consistent hashing over the prefix:

  1. Hash the first N tokens of the request (or the first N tokens that exceed a threshold of shared-prefix probability).
  2. Look up the replica on the hash ring.
  3. Route.

Consistent hashing means:

  • Same prefix → same replica (cache hit).
  • Replica failure / add / remove moves only ~1/R of prefixes (for R replicas) to other replicas, not the full traffic set.
  • Replica load is naturally distributed because the prefix space is usually much larger than the replica set.
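The three steps above can be sketched as a consistent-hash ring keyed on the prefix hash (the class name, virtual-node count, and 512-token prefix window are all assumptions, not disclosed implementation details):

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

class PrefixHashRing:
    """Consistent-hash ring over prompt-prefix hashes (illustrative sketch).

    Each replica contributes `vnodes` virtual points so load spreads evenly;
    removing a replica remaps only the prefixes that landed on its points.
    """
    def __init__(self, replicas, vnodes=64):
        self.ring = sorted((_h(f"{r}#{i}"), r)
                           for r in replicas for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def route(self, prompt_tokens, prefix_len=512):
        # Step 1: hash the first N tokens of the request.
        key = _h(repr(prompt_tokens[:prefix_len]))
        # Step 2: find the first ring point at or after the key (wrap around).
        idx = bisect.bisect(self.keys, key) % len(self.ring)
        # Step 3: route to that replica.
        return self.ring[idx][1]
```

Two prompts that agree on their first 512 tokens (e.g. the same system prompt plus different user turns) hash to the same key and therefore land on the same replica.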

Known tension: a hot prefix (one system prompt shared across the entire application) funnels all traffic to one replica, which will tip it over. Production implementations combine prefix hashing with overflow / shadow replicas or hot-prefix detection + replication — neither is disclosed in the announcement post.
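One possible shape for hot-prefix mitigation, sketched purely as an assumption (the announcement discloses neither mechanism): count requests per prefix hash, and once a prefix crosses a hotness threshold, rotate it across a small fanout of replicas instead of pinning it to one.

```python
import hashlib
from collections import Counter

class HotPrefixSpreader:
    """Illustrative overflow sketch, not a disclosed implementation: cold
    prefixes pin to their hash-determined home replica; hot prefixes are
    spread round-robin over `fanout` consecutive replicas."""
    def __init__(self, replicas, hot_threshold=100, fanout=3):
        self.replicas = replicas
        self.hot_threshold = hot_threshold
        self.fanout = min(fanout, len(replicas))
        self.counts = Counter()   # requests seen per prefix hash
        self.rr = Counter()       # round-robin cursor per hot prefix

    def route(self, prefix_tokens):
        key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
        self.counts[key] += 1
        home = int(key, 16) % len(self.replicas)
        if self.counts[key] <= self.hot_threshold:
            return self.replicas[home]   # normal path: pin to home replica
        # Hot path: rotate across `fanout` replicas starting at the home slot,
        # trading some cache locality for headroom on the hot replica.
        offset = self.rr[key] % self.fanout
        self.rr[key] += 1
        return self.replicas[(home + offset) % len(self.replicas)]
```

In practice the window would need to decay (so a prefix can cool down again), and the extra replicas would each warm their own copy of the prefix KV — the replication cost the text mentions.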

The ceiling: prefix-aware routing only pays off if the KV cache is reusable

Prefix-aware routing assumes the replica can actually retain the prefix's KV state across requests. Two things go wrong:

  • No KV-cache-reuse support in the serving library → cache is discarded between requests; routing same-prefix requests to the same replica buys nothing.
  • HBM too small for the prefix working set → LRU evicts the prefix between requests; again, same-replica routing buys nothing.

This is why managed tiered KV cache and prefix-aware routing ship together in the HyperPod Inference Operator — tiering makes the prefix survive across requests (fallback to DRAM / NVMe instead of eviction); routing ensures the prefix's second consumer hits the same replica that has it hot.
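A toy model of why tiering keeps routing useful: on HBM pressure, the prefix KV is demoted to a slower tier instead of discarded, and promoted back on the next same-prefix hit. Structure and names below are illustrative, not the operator's implementation.

```python
class TieredKVCache:
    """Toy tiered KV cache: HBM -> DRAM -> NVMe, modelled as three dicts.
    Evicting from HBM demotes to DRAM; a hit in a lower tier promotes the
    entry back to HBM so the next same-prefix request stays cheap."""
    def __init__(self, hbm_slots=2):
        self.hbm, self.dram, self.nvme = {}, {}, {}
        self.hbm_slots = hbm_slots
        self.order = []                      # LRU order for HBM entries

    def put(self, prefix_hash, kv):
        if len(self.hbm) >= self.hbm_slots:
            victim = self.order.pop(0)       # demote LRU entry, don't drop it
            self.dram[victim] = self.hbm.pop(victim)
        self.hbm[prefix_hash] = kv
        self.order.append(prefix_hash)

    def get(self, prefix_hash):
        if prefix_hash in self.hbm:
            return self.hbm[prefix_hash]
        for tier in (self.dram, self.nvme):
            if prefix_hash in tier:
                kv = tier.pop(prefix_hash)
                self.put(prefix_hash, kv)    # promote back to HBM on reuse
                return kv
        return None                          # true miss: full prefix recompute
```

Without the demotion step, `get` after an eviction would return None and the routing decision would have bought nothing — which is exactly the failure mode described above.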

Comparison with other routing strategies

| Strategy | Routing feature | When it wins |
| --- | --- | --- |
| Round-robin / least-conn | None | Stateless workloads; no cache to amortise |
| Consistent hashing (session-sticky) | Session / user ID | Per-user state on the server (classic web session) |
| Workload-aware | Query shape (tables, body text, source header) | Heterogeneous backend clusters tuned for different query shapes |
| Prefix-aware (this) | Prompt prefix hash | LLM inference with shared-prefix traffic |
| KV-aware | Replica cache-occupancy telemetry | LLM inference with unpredictable prefix distribution |

Prefix-aware routing shares lineage with session-sticky routing — both use a consistent-hash key to pin request → replica. The key difference is that prefix stickiness is value-based (compute the key from payload content), not identity-based (use a session cookie). Two different sessions with the same system prompt route to the same replica — deliberately.
