Sticky routing¶
Sticky routing is the property that the same logical key continues to be routed to the same node across routing-map updates, redeployments, and rolling restarts (within the bounds of necessary rebalancing).
It is the routing-layer analogue of session affinity, but applied to stateful backend components rather than user sessions. The purpose is state preservation: when a node holds in-memory state keyed on some identifier, sticky routing keeps that identifier on the same node so the state stays warm and useful.
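The property can be made concrete with a minimal sketch (illustrative only; the function, key, and node names are hypothetical and this is not Dicer's API): across an assignment update, only keys whose owner actually left are reassigned, and every other key keeps its node, so that node's in-memory state for the key stays warm.

```python
def rebalance(assignment: dict[str, str], live_nodes: list[str]) -> dict[str, str]:
    """Return a new key -> node map that moves as few keys as possible."""
    updated = {}
    i = 0
    for key, node in assignment.items():
        if node in live_nodes:
            updated[key] = node  # sticky: owner still alive, key stays put
        else:
            # orphaned key: rehome it round-robin across surviving nodes
            updated[key] = live_nodes[i % len(live_nodes)]
            i += 1
    return updated

before = {"pod-a": "n1", "pod-b": "n2", "pod-c": "n1", "pod-d": "n3"}
after = rebalance(before, ["n1", "n2"])  # node n3 was terminated
moved = [k for k in before if before[k] != after[k]]
assert moved == ["pod-d"]  # only the key orphaned by n3's exit moved
```

A real auto-sharder also has to weigh load and health when rehoming orphaned keys; the point of the sketch is only the invariant that surviving assignments are never disturbed.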
Where it's the load-bearing property¶
- Stateful stream aggregators — a Telegraf aggregator computing a 1-minute percentile over metric{pod_id=...} needs every sample for that pod to arrive at the same aggregator. Redeploying that aggregator means either (a) handing state off to the successor or (b) accepting a cold start. Sticky routing plus minimal assignment change keeps the successor node at the same partition.
- In-memory KV caches — the canonical use case for systems/dicer; moving a key to a different pod means a cache miss.
- LLM KV cache affinity — routing a conversation's follow-up calls to the pod that holds its KV cache.
- Soft leader election — one pod owns a keyspace partition; sticky routing ensures its identity is stable even as the cluster autoscales.
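The keyed-routing requirement behind the aggregator case can be sketched as a deterministic key-to-node function (hash-modulo for illustration; the names are hypothetical and this is not Dicer's actual assignment logic):

```python
import hashlib

def aggregator_for(pod_id: str, aggregators: list[str]) -> str:
    """Deterministically pick an aggregator so every sample for pod_id lands on it."""
    h = int(hashlib.sha256(pod_id.encode()).hexdigest(), 16)
    return aggregators[h % len(aggregators)]

aggs = ["agg-0", "agg-1", "agg-2"]
# Every sample carrying the same pod_id routes to the same aggregator,
# so that aggregator sees the full stream and can compute the percentile.
assert aggregator_for("pod-123", aggs) == aggregator_for("pod-123", aggs)
```

Note that naive hash-modulo is sticky only while membership is fixed: changing len(aggregators) reshuffles almost every key, which is exactly the failure mode the alternatives below address.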
Alternatives and trade-offs¶
- Kafka-backed partitioning gives explicit durability — aggregator state is rebuildable from the Kafka log on any node. Cost: higher storage, compute, and ingestion latency.
- Consistent hashing without the sticky property — rebalancing minimises the number of keys that move, but does move them, and moved keys lose their warm state.
- Sticky routing via auto-sharder — Dicer adjusts assignments minimally on health / load / termination events and signals the library-level listener on the pod so the application can hand off state to the new owner or accept a warm-up period. Eventually-consistent assignment (see concepts/eventual-consistency) is a deliberate trade-off — the team accepts some assignment-update lag in exchange for recovery speed and availability.
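The consistent-hashing trade-off can be demonstrated with a small hash ring (a generic sketch of concepts/hash-ring, not the Databricks implementation): removing a node moves only the keys that node owned, but those keys do move, and each arrives cold at its new owner.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Plain consistent hashing: minimal key movement, but no warm-state handoff."""
    def __init__(self, nodes: list[str], vnodes: int = 64):
        # Each node contributes `vnodes` points on the ring for smoother balance.
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key: str) -> str:
        # A key belongs to the first ring point at or after its hash (with wraparound).
        idx = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[idx][1]

before = HashRing(["n1", "n2", "n3"])
after = HashRing(["n1", "n2"])  # n3 removed from the cluster
keys = [f"pod-{i}" for i in range(1000)]
moved = sum(before.owner(k) != after.owner(k) for k in keys)
# Exactly the keys n3 owned (~1/3) move; every moved key loses its warm state.
assert moved == sum(before.owner(k) == "n3" for k in keys)
```

This is the gap an auto-sharder fills: the ring bounds how many keys move, while the handoff signal to the old and new owners decides whether the moved keys restart cold or inherit state.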
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical disclosure in production context. Databricks rejected Kafka-based partitioning ("costly at our scale and adds ingestion delay that impacts real-time usecases") and built on Dicer sticky routing for metric aggregation. "This architecture uses intelligent sticky routing instead of rerouting metrics across aggregators, which addressed the redeployment failure modes." Sustains >1 GB/s per region across thousands of aggregation rules.
Related¶
- systems/dicer — auto-sharder that powers sticky routing at Databricks
- systems/telegraf — canonical sticky-routed consumer
- systems/pantheon
- concepts/hash-ring
- concepts/eventual-consistency
- patterns/sticky-routing-for-aggregator-state