

Sticky routing

Sticky routing is the property that the same logical key continues to be routed to the same node across routing-map updates, redeployments, and rolling restarts (within the bounds of necessary rebalancing).

It is the routing-layer analogue of session affinity, but applied to stateful backend components rather than user sessions. The purpose is state preservation: when a node holds in-memory state keyed on some identifier, sticky routing keeps that identifier on the same node so the state stays warm and useful.
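
A minimal sketch of the property, with no claim about Dicer's internals: a consistent-hash ring maps each key to one node, lookups are deterministic, and most keys keep their owner when membership changes. Node names, the vnode count, and the key format are illustrative.

```go
// Minimal sketch of the sticky property (not Dicer's implementation):
// a consistent-hash ring maps each logical key to one node, and the
// mapping for most keys survives a membership change.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint32          // sorted hash points of virtual nodes
	owner  map[uint32]string // hash point -> node name
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hash(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// route returns the node owning key: the first ring point at or after hash(key).
func (r *ring) route(key string) string {
	h := hash(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	before := newRing([]string{"agg-0", "agg-1", "agg-2"}, 64)
	after := newRing([]string{"agg-0", "agg-1", "agg-2", "agg-3"}, 64) // scale-out

	moved := 0
	for i := 0; i < 1000; i++ {
		key := fmt.Sprintf("pod_id=%d", i)
		if before.route(key) != after.route(key) {
			moved++
		}
	}
	// Most keys keep their owner; only roughly 1/4 move to the new node.
	fmt.Printf("keys moved after adding a node: %d / 1000\n", moved)
}
```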

Where it's the load-bearing property

  • Stateful stream aggregators — a Telegraf aggregator computing a 1-minute percentile over metric{pod_id=...} needs every sample for that pod to arrive at the same aggregator. Redeploying that aggregator means either (a) handing state off to the successor or (b) accepting a cold start. Sticky routing plus minimal assignment change keeps the partition on the same successor node (see the sketch after this list).
  • In-memory KV caches — canonical use case for systems/dicer; moving a key to a different pod means a cache miss.
  • LLM KV cache affinity — routing a conversation's follow-up calls to the pod that holds its KV cache.
  • Soft leader election — one pod owns a keyspace partition; sticky routing ensures its identity is stable even as the cluster autoscales.
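
Why the aggregator case is state-sensitive, as a minimal sketch (the types and names are hypothetical, not Telegraf's or Dicer's API): per-pod window state lives only in the owning node's memory, so a key that moves mid-window starts from an empty window on its new owner.

```go
// Hypothetical sketch: per-pod percentile windows live in the memory of the
// node that owns the key, so every sample for that pod_id must keep landing
// on the same node to stay warm.
package main

import "fmt"

// aggregator holds in-memory windows keyed by pod_id; names are illustrative.
type aggregator struct {
	name    string
	windows map[string][]float64 // pod_id -> samples in the current 1-minute window
}

func (a *aggregator) observe(podID string, v float64) {
	if a.windows == nil {
		a.windows = map[string][]float64{}
	}
	a.windows[podID] = append(a.windows[podID], v)
}

func main() {
	warm := &aggregator{name: "agg-0"}
	warm.observe("pod-42", 0.8)
	warm.observe("pod-42", 1.3)

	// If routing moves pod-42 to a different aggregator mid-window, the
	// successor starts from an empty window: a cold start unless state is
	// handed off explicitly.
	cold := &aggregator{name: "agg-1"}
	fmt.Println("warm window:", len(warm.windows["pod-42"]), "samples")
	fmt.Println("cold window:", len(cold.windows["pod-42"]), "samples")
}
```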

Alternatives and trade-offs

  • Kafka-backed partitioning gives explicit durability — aggregator state is rebuildable from the Kafka log on any node. Cost: higher storage, compute, and ingestion latency.
  • Consistent hashing without the sticky property — rebalancing minimises the number of keys that move, but does move them, and moved keys lose their warm state.
  • Sticky routing via auto-sharder — Dicer adjusts assignments minimally on health, load, and termination events and signals the library-level listener on the pod so the application can hand off state to the new owner or accept a warm-up period (sketched below). Eventually-consistent assignment (see concepts/eventual-consistency) is a deliberate trade-off: the team accepts some assignment-update lag in exchange for recovery speed and availability.
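
The listener hand-off in the last bullet could look roughly like the following; Dicer's interface is not disclosed in the source, so the interface and method names here are hypothetical.

```go
// Hypothetical sketch of a library-level assignment listener; not Dicer's
// real API. The auto-sharder notifies the old and new owners of a partition
// so the application can hand off state or accept a warm-up period.
package main

import "fmt"

// AssignmentListener is notified when the auto-sharder changes which pod
// owns a key range.
type AssignmentListener interface {
	// OnPartitionAcquired fires on the new owner; it may pull state from
	// the previous owner (warm handoff) or start empty (warm-up period).
	OnPartitionAcquired(partition string, previousOwner string)
	// OnPartitionReleased fires on the old owner so it can stop accepting
	// keys for the partition and serve handoff requests.
	OnPartitionReleased(partition string, newOwner string)
}

type aggregatorApp struct{ name string }

func (a *aggregatorApp) OnPartitionAcquired(p, prev string) {
	fmt.Printf("[%s] acquired %s from %s: requesting state handoff\n", a.name, p, prev)
}

func (a *aggregatorApp) OnPartitionReleased(p, next string) {
	fmt.Printf("[%s] released %s to %s: flushing and serving handoff\n", a.name, p, next)
}

func main() {
	var l AssignmentListener = &aggregatorApp{name: "agg-1"}
	// Simulated assignment-update event after a pod termination.
	l.OnPartitionAcquired("partition-7", "agg-0")
}
```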

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical disclosure in production context. Databricks rejected Kafka-based partitioning ("costly at our scale and adds ingestion delay that impacts real-time usecases") and built on Dicer sticky routing for metric aggregation. "This architecture uses intelligent sticky routing instead of rerouting metrics across aggregators, which addressed the redeployment failure modes." Sustains >1 GB/s per region across thousands of aggregation rules.