SYSTEM Cited by 1 source

Axon (Databricks LLM router)

Axon is the Databricks LLM-inference data-plane router — the component that receives post-rate-limited requests and dispatches them across replicas of a target model deployment. First named publicly in the 2026-05-27 Reliable LLM Inference at Scale post:

"To handle traffic across model deployments, the data plane runs a router, which we call Axon, that balances load among replicas of the same model, and an autoscaler that adjusts replica counts." (Source: sources/2026-05-27-databricks-reliable-llm-inference-at-scale)

This is the wiki's first canonical disclosure of the named router sitting between the Databricks rate limiter and the inference runtime on frontier GPUs. Axon is the LLM-specific router, and it is structurally distinct from two other documented paths: the Armeria-RPC client-side P2C path for Databricks intra-cluster RPC, and the EDS/P2C path for the broader Databricks Model Serving platform under the Superhuman 200K-QPS workload. Two structural properties distinguish it from both.

Two structural properties

1. Load metric is model units, not active requests

Axon does not route on active-request count (the standard P2C signal). It routes on server load measured in model units — Databricks' multi-dimensional LLM-request-cost abstraction. The post is explicit that this is a deliberate departure from P2C-with-active-requests:

"In general, load balancing tends to lean on statistical approaches like P2C (power of two choices), which estimate load based on queue size and leverage sampling to reduce the memory and latency overheads of understanding all the possible targets. However, LLM latencies tend to be high, server counts are lower than scaled out CPU systems, and the cost of misrouting is severe. Therefore, LLM serving necessitates a different approach. Today, we use Dicer, Databricks' auto-sharder, to dynamically route workloads across servers… We integrated model units with Dicer so that routing decisions are based on server load in model units rather than traditional request-based heuristics."

The structural argument: request cost varies non-linearly with input and output token counts (decode dominates; long contexts are disproportionately expensive). A request-count load metric masks this by counting cheap and expensive requests as equal, which causes unpredictable hotspots on a small fleet where the cost of each misrouted request is large.
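A minimal sketch of how the two load metrics can disagree. The cost function and its coefficients are illustrative assumptions: the post does not disclose how model units are computed, and this toy version is linear in token counts, whereas the real cost is described as non-linear. The names `Request`, `Server`, and `model_units` are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical coefficients; the real model-unit formula is not disclosed.
PREFILL_COST_PER_TOKEN = 1.0
DECODE_COST_PER_TOKEN = 4.0


@dataclass
class Request:
    input_tokens: int
    output_tokens: int


def model_units(req: Request) -> float:
    """Toy stand-in for a model-unit cost: weighted sum of token counts."""
    return (PREFILL_COST_PER_TOKEN * req.input_tokens
            + DECODE_COST_PER_TOKEN * req.output_tokens)


@dataclass
class Server:
    name: str
    in_flight: list = field(default_factory=list)

    def active_requests(self) -> int:
        return len(self.in_flight)

    def load_in_model_units(self) -> float:
        return sum(model_units(r) for r in self.in_flight)


def pick_by_request_count(servers):
    """Request-count heuristic: fewest in-flight requests wins."""
    return min(servers, key=lambda s: s.active_requests())


def pick_by_model_units(servers):
    """Cost-based heuristic: lowest summed request cost wins."""
    return min(servers, key=lambda s: s.load_in_model_units())


# "a" holds one long-context, long-output request; "b" holds two short ones.
a = Server("a", [Request(input_tokens=100_000, output_tokens=4_000)])
b = Server("b", [Request(200, 50), Request(300, 80)])

print(pick_by_request_count([a, b]).name)  # a (1 request vs 2)
print(pick_by_model_units([a, b]).name)    # b (1,020 units vs 116,000)
```

The request-count metric sends the next request to the server that is already carrying the expensive job, which is the hotspot pattern the paragraph above describes.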

See concepts/non-uniform-llm-request-cost, patterns/cost-based-load-balancing-llm.

2. Stateful sessions for prefix-cache locality + blast radius

Axon implements stateful (sticky) sessions so a workload's requests route to a subset of servers, not the full fleet. This serves two purposes simultaneously:

  • KV prefix-cache locality: coding agents and other latency-sensitive workloads hit the same prefix repeatedly; sticky routing keeps the KV cache / prefix cache warm on the same replicas.
  • Bounded blast radius: a misbehaving workload (request shape that triggers a silent hang or a CPU spike) impacts only its assigned subset; the rest of the fleet is insulated.

Verbatim:

"Dicer also provides stateful sessions, making request routing sticky. A workload's requests go to only a subset of servers, which improves cache hit rates (crucial for latency-sensitive workloads like coding agents) and limits blast radius."

See patterns/stateful-llm-session-routing, concepts/sticky-routing, concepts/blast-radius.
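A minimal sketch of sticky routing to a server subset. It uses rendezvous (highest-random-weight) hashing as a stand-in technique; this is not how Dicer assigns slices, and the routing key, the subset size `k`, and the function names are assumptions made for illustration.

```python
import hashlib


def _score(workload_key: str, server: str) -> int:
    """Deterministic pseudo-random score for a (workload, server) pair."""
    digest = hashlib.sha256(f"{workload_key}|{server}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assigned_subset(workload_key: str, servers: list[str], k: int) -> list[str]:
    """Return the k highest-scoring servers for this workload (rendezvous hashing)."""
    ranked = sorted(servers, key=lambda s: _score(workload_key, s), reverse=True)
    return ranked[:k]


def route(workload_key, servers, load, k=2):
    """Send the request to the least-loaded server inside the workload's subset."""
    subset = assigned_subset(workload_key, servers, k)
    return min(subset, key=lambda s: load.get(s, 0.0))


servers = [f"gpu-{i}" for i in range(8)]
load = {s: 0.0 for s in servers}  # load would be reported in model units

subset = assigned_subset("workload-42", servers, k=2)
targets = {route("workload-42", servers, load) for _ in range(100)}
print(subset, targets)
```

Every request for `workload-42` lands inside the same two-server subset, so prefix caches stay warm there and a misbehaving workload can only affect those two servers. Removing a server outside the subset leaves the assignment unchanged, which is the property that makes rendezvous hashing a convenient illustration.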

Built on Dicer

Axon's routing substrate is Dicer — Databricks' open-source auto-sharder. Dicer's primitives map cleanly to LLM routing:

| Dicer primitive | LLM-routing role |
|---|---|
| SliceKey (hash of routing key) | hash of workload session id (or similar caller key) |
| Slice (range of SliceKeys) | session bucket assigned to a server subset |
| Resource (pod) | LLM serving replica (GPU pod) |
| Assignment (slice → resources) | which server-subset serves which session bucket |
| State transfer during reshard | KV-prefix-cache continuity across rolling restarts |
| Hot-key replication | one workload getting too much traffic → its slice is replicated to multiple replicas |
| Per-key load reporting | per-session model-unit load → Axon's load metric |
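The mapping above can be sketched as a range lookup. This is an illustration of the SliceKey → Slice → Assignment chain only; it does not use Dicer's actual API, and the key-space size, slice boundaries, and pod names are all hypothetical.

```python
import bisect
import hashlib

KEY_SPACE = 2**32  # hypothetical key-space size


def slice_key(session_id: str) -> int:
    """SliceKey: hash of the routing key, here a session id, in [0, KEY_SPACE)."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:4], "big")


# Slices as sorted lower bounds; slice i covers [slice_starts[i], slice_starts[i + 1]).
slice_starts = [0, KEY_SPACE // 4, KEY_SPACE // 2, 3 * KEY_SPACE // 4]

# Assignment: slice index -> replicas. Slice 2 is "hot" and replicated to two pods.
assignment = {
    0: ["gpu-pod-a"],
    1: ["gpu-pod-b"],
    2: ["gpu-pod-c", "gpu-pod-d"],
    3: ["gpu-pod-a"],
}


def slice_index(key: int) -> int:
    """Find the slice whose range contains the key."""
    return bisect.bisect_right(slice_starts, key) - 1


def replicas_for(session_id: str) -> list[str]:
    """Resolve a session to the replicas serving its slice."""
    return assignment[slice_index(slice_key(session_id))]


print(replicas_for("session-123"))
```

Resharding then amounts to changing `slice_starts` and `assignment` rather than rehashing every session, which is what leaves room for state transfer and hot-slice replication.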

The post canonicalises this as Dicer's "In-memory and GPU serving" use case — "LLM per-session KV cache affinity; LoRA-adapter placement on constrained GPUs" — which is now production-proven on Databricks' LLM-serving stack at 125T+ tokens/month scale. See systems/dicer Canonical use cases and Production case studies.

Position in the data plane

Request ──▶ Rate Limiting ──▶ Axon ──▶ Inference Runtime
                              (router  (in-house engine
                               on Dicer  or vLLM, etc.)
                               + MUs +
                               sticky
                               sessions)
  • Above Axon: rate limiting (control-plane decision, enforced at the edge of the data plane).
  • Below Axon: inference runtime (open-source engines like vLLM / proprietary in-house engines), running on frontier GPUs in disaggregated prefill/decode or co-located form (the post does not specify which Axon assumes).
  • Lateral: the autoscaler also reads server load in model units — "the router and autoscaler both consume server load" — the same primitive feeds two control loops, the per-request routing loop and the per-fleet capacity loop. See concepts/model-unit-utilization-ratio.
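A minimal sketch of one load signal feeding both control loops. The per-replica capacity, the 0.7 target utilization, and the function names are assumptions; the post does not disclose the autoscaler's policy.

```python
import math


def utilization(load_mu: list[float], capacity_mu_per_replica: float) -> float:
    """Fleet utilization: total model-unit load over total model-unit capacity."""
    return sum(load_mu) / (len(load_mu) * capacity_mu_per_replica)


def desired_replicas(load_mu, capacity_mu_per_replica,
                     target_utilization=0.7, min_replicas=1):
    """Per-fleet loop: replica count that brings utilization back to the target."""
    needed = sum(load_mu) / (capacity_mu_per_replica * target_utilization)
    return max(min_replicas, math.ceil(needed))


def least_loaded(load_mu):
    """Per-request loop: index of the replica with the lowest model-unit load."""
    return min(range(len(load_mu)), key=lambda i: load_mu[i])


# Per-replica load in model units, as reported by the servers (hypothetical values).
load = [800.0, 950.0, 900.0]

print(least_loaded(load))              # 0
print(desired_replicas(load, 1000.0))  # 4
```

With three replicas at roughly 88% utilization, the router still picks replica 0 for the next request, while the autoscaler asks for a fourth replica to get back under the assumed 70% target.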

Operating envelope (disclosed)

| Metric | Value |
|---|---|
| Tokens/month routed (total platform) | 125T+ |
| Models routed across | Frontier open-source (Kimi, Qwen) + proprietary (OpenAI, Gemini, Claude) |
| Hardware | Frontier GPUs (NVL72-class single-spine racks + others) |
| Customers named | Superhuman, YipitData, Fox Sports |
| GPU savings (autoscaler downstream of Axon, bursty workloads) | >80% vs static peak |

Comparison to other LLM routers

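  • vs P2C-with-active-requests (Databricks Model Serving / EDS path for non-LLM, lower-latency CPU-style workloads): set aside for LLM serving on the structural argument above. P2C remains correct in its own regime of short requests and large server counts, just not for LLM serving, where latencies are high, server counts are low, and misrouting is costly.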
  • vs token-aware load balancing (Cloudflare Workers AI per concepts/token-aware-load-balancing) — similar in spirit (estimate per-endpoint cost in tokens, not requests), but Axon's cost unit is the aggregated multi-dimensional model unit with separate prefill/decode coefficients and prefix-caching/multimodal modifiers, not raw token counts.
  • vs concepts/prefix-aware-routing (KV-prefix-locality scheduling at the request scheduler altitude) — Axon's stateful sessions are the workload-level version of prefix-aware routing, one altitude up. Prefix-aware routing colocates requests sharing a prefix; Axon colocates workloads sharing a session, on the premise that within-workload prefix sharing is the dominant case for coding-agent-class traffic.

Open questions

  • What is Axon's exact routing key? "Workload" could be tenant, endpoint, session id, or some composite — the post does not say.
  • How is the server-subset size chosen? Cache hit rate and blast-radius bounding both favor a small subset; load balancing favors a larger one. The trade-off curve is not disclosed.
  • What happens when a session's assigned subset is overloaded? Spill-over policy unspecified.
  • Is Axon multi-region? The post does not address whether Axon is per-region or global, and how cross-region capacity is handled.
  • What's the relationship between Axon and the EDS/P2C path documented for the rest of Databricks Model Serving? The Superhuman 2026-05-08 post describes EDS+P2C; this post describes Axon+Dicer. Neither can be the complete story on its own: possibly Axon sits behind EDS/P2C, or supersedes it for LLM endpoints, or runs on a parallel deployment shape. Not disclosed.
