
PATTERN Cited by 1 source

Sticky routing for aggregator state

A stateful stream-aggregation tier holds in-memory running state per input key — counters, percentile reservoirs, histogram buckets. That state is expensive to rebuild and must survive routine lifecycle events: scale-ups, scale-downs, rolling upgrades, bad-host drains.
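
To make the rebuild cost concrete, here is a minimal Go sketch of the kind of per-key running state an aggregator holds. The struct and field names are illustrative assumptions, not Databricks' actual implementation; the point is that the state is the fold of every sample seen in the window, so losing it means re-observing the whole window.

  package main

  import "fmt"

  // perKeyState is a hypothetical sketch of the running state an aggregator
  // holds for one input key: a counter, a sum, and fixed histogram buckets.
  // Rebuilding it means replaying every sample since the window opened.
  type perKeyState struct {
      count   uint64
      sum     float64
      bounds  []float64 // histogram bucket upper bounds
      buckets []uint64  // cumulative counts per upper bound
  }

  func newPerKeyState(bounds []float64) *perKeyState {
      return &perKeyState{bounds: bounds, buckets: make([]uint64, len(bounds))}
  }

  // observe folds one sample into the running state.
  func (s *perKeyState) observe(v float64) {
      s.count++
      s.sum += v
      for i, b := range s.bounds {
          if v <= b {
              s.buckets[i]++
          }
      }
  }

  func main() {
      // One aggregator node keeps a map like this for every key in its slice.
      state := map[string]*perKeyState{}
      bounds := []float64{0.01, 0.1, 1, 10}

      key := "service=checkout,metric=latency_seconds"
      if _, ok := state[key]; !ok {
          state[key] = newPerKeyState(bounds)
      }
      state[key].observe(0.42)
      state[key].observe(3.7)

      fmt.Printf("count=%d sum=%.2f buckets=%v\n", state[key].count, state[key].sum, state[key].buckets)
  }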

The two candidate architectures for preserving state:

  • Kafka-backed partitioning — explicit durability via a write-ahead log. On aggregator restart, replay from Kafka starting at the last checkpoint. Semantically clean but operationally expensive (Kafka cluster, storage, and serialization cost), and it adds ingestion latency incompatible with real-time alerting.
  • Sticky routing + in-memory state — route the same input keys to the same aggregator node across restarts via an auto-sharder that adjusts assignments minimally. State survives locally; no Kafka required. Cheaper, lower-latency, at the cost of relying on the sharder's eventual consistency for correctness.

This pattern is the second architecture, operationalized via Databricks' Dicer auto-sharder.

Shape

  Input stream ─▶  Router (Dicer Clerk)  ─▶  Aggregator N  (in-memory state for assigned slice)
                   ▲                              │
                   │  (assignment updates)        │  (Slicelet reports per-slice load + health)
                   │                              ▼
                   └──────────  Dicer Assigner  ◀─┴──── (pod churn signal)
                                adjusts assignments minimally
  • Router uses the Dicer Clerk library — local cache of the current assignment, no per-request RPC (see the routing sketch after this list).
  • Aggregator runs the Slicelet library, which reports per-slice load back to the Assigner and notifies the application on assignment changes.
  • Assigner is the auto-sharder control plane — it adjusts assignments on health / load / termination events, making only minimal changes.
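
Dicer's Clerk, Slicelet, and Assigner APIs are internal to Databricks, so the Go sketch below only illustrates the routing-side idea under assumed interfaces: a clerk caches the slice-to-aggregator assignment locally, routing is a hash plus a map lookup, and the Assigner pushes revised assignments into that cache rather than being consulted per request.

  package main

  import (
      "fmt"
      "hash/fnv"
      "sync"
  )

  // clerk is a hypothetical stand-in for the Dicer Clerk library: it caches
  // the current slice -> aggregator assignment so each routing decision is a
  // local map lookup, not an RPC to the Assigner.
  type clerk struct {
      mu         sync.RWMutex
      numSlices  uint32
      assignment map[uint32]string // slice -> aggregator address
  }

  // owner returns the aggregator currently assigned to the key's slice.
  func (c *clerk) owner(key string) string {
      h := fnv.New32a()
      h.Write([]byte(key))
      slice := h.Sum32() % c.numSlices

      c.mu.RLock()
      defer c.mu.RUnlock()
      return c.assignment[slice]
  }

  // apply installs a revised assignment pushed by the Assigner.
  func (c *clerk) apply(next map[uint32]string) {
      c.mu.Lock()
      defer c.mu.Unlock()
      c.assignment = next
  }

  func main() {
      c := &clerk{numSlices: 4, assignment: map[uint32]string{
          0: "agg-0", 1: "agg-1", 2: "agg-2", 3: "agg-0",
      }}
      // Route a sample: hash the key to a slice, look up its owner locally.
      fmt.Println(c.owner("service=checkout,metric=latency_seconds"))

      // The Assigner later moves only slice 1; everything else stays put.
      c.apply(map[uint32]string{0: "agg-0", 1: "agg-2", 2: "agg-2", 3: "agg-0"})
      fmt.Println(c.owner("service=checkout,metric=latency_seconds"))
  }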

When an aggregator is redeployed:

  1. Termination signal reaches the Assigner before the pod disappears.
  2. Assigner reassigns affected slices to neighbouring pods.
  3. Application-level state handoff or warm-up happens on the new owner (specific to the stateful streaming application); see the sketch after this list.
  4. Only affected slices are moved; the remaining slices stay put, and their state stays warm.
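
A minimal Go sketch of step 3, assuming a Slicelet-style callback interface; the sliceGained / sliceLost names are hypothetical, not Dicer's real API. The application warms up state for slices it gains and hands off or flushes state for slices it loses, while untouched slices are never notified at all.

  package main

  import "fmt"

  // sliceListener is a hypothetical stand-in for the notification hook the
  // Slicelet library gives the application when the Assigner changes this
  // pod's assignment.
  type sliceListener interface {
      sliceGained(slice uint32) // new owner: warm up or receive handed-off state
      sliceLost(slice uint32)   // old owner: flush or hand off, then drop state
  }

  type aggregator struct {
      state map[uint32]map[string]float64 // per-slice, per-key running state
  }

  func (a *aggregator) sliceGained(slice uint32) {
      if _, ok := a.state[slice]; !ok {
          a.state[slice] = map[string]float64{}
      }
      fmt.Printf("slice %d gained: warming up state\n", slice)
  }

  func (a *aggregator) sliceLost(slice uint32) {
      // Application-specific handoff goes here (e.g. flush partial windows
      // downstream) before local state for the slice is released.
      delete(a.state, slice)
      fmt.Printf("slice %d lost: state handed off and released\n", slice)
  }

  func main() {
      var l sliceListener = &aggregator{state: map[uint32]map[string]float64{}}
      // Simulate the Assigner moving slice 2 onto this pod, then off again.
      l.sliceGained(2)
      l.sliceLost(2)
  }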

Why sticky (not consistent hashing)

Consistent hashing already minimises the number of keys that move on a cluster change. Sticky routing goes further: it explicitly prefers keeping a key on its current node unless load / health pressure forces a move, and it coordinates with the application on the transition.
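
A minimal Go sketch of that preference, with illustrative names rather than Dicer's actual rebalancing logic: a slice keeps its current owner unless health (or, equally, load) pressure forces a move, so only the affected slices change hands and every other slice's in-memory state stays warm.

  package main

  import "fmt"

  // rebalance keeps each slice on its current owner unless that owner is
  // unhealthy; only then is the slice moved, to the least-loaded healthy
  // node. This is the "sticky" preference, not Dicer's real algorithm.
  func rebalance(current map[uint32]string, healthy map[string]bool, load map[string]int) map[uint32]string {
      next := map[uint32]string{}
      for slice, owner := range current {
          if healthy[owner] {
              next[slice] = owner // sticky: no move, state stays warm
              continue
          }
          // Owner is draining or unhealthy: pick the least-loaded healthy node.
          best, bestLoad := "", int(^uint(0)>>1)
          for node, ok := range healthy {
              if ok && load[node] < bestLoad {
                  best, bestLoad = node, load[node]
              }
          }
          next[slice] = best
          load[best]++
      }
      return next
  }

  func main() {
      current := map[uint32]string{0: "agg-0", 1: "agg-1", 2: "agg-1", 3: "agg-2"}
      healthy := map[string]bool{"agg-0": true, "agg-1": false, "agg-2": true} // agg-1 draining
      load := map[string]int{"agg-0": 2, "agg-2": 1}

      // Only slices 1 and 2 move; slices 0 and 3 keep their owners.
      fmt.Println(rebalance(current, healthy, load))
  }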

The distinction matters because in-memory state is not cheap to rebuild for percentile / histogram aggregations, and losing it even briefly causes user-visible metric degradation.

Trade-off vs Kafka

  • Cost — no Kafka cluster, no long-term message storage. At >1 GB/s ingestion in the largest region, Kafka's economics would be unfavourable.
  • Latency — no serialise-to-Kafka + re-consume round trip, which Databricks explicitly flagged as "adds ingestion delay that impacts real-time usecases."
  • Durability — weaker than Kafka. If an aggregator crashes before forwarding its output downstream, some samples may be dropped. The trade-off is justified because the downstream TSDB (Pantheon) continues to receive samples from the surviving replicas and reassigned slices, and the metrics platform already accepts some windowing imprecision (sub-second fidelity is not the goal).
  • Consistency — aggregator assignment is eventually consistent; the Slicelet + Clerk briefly see stale assignment after a change. The team accepts this in exchange for recovery speed and availability.

When to reach for this pattern

  • Stateful stream aggregator, in-memory state per key, rebuild cost high.
  • Real-time / near-real-time downstream consumers — Kafka's latency overhead is too high.
  • Scale where per-message Kafka durability costs more than you need — best-effort windowed metrics, not ledger-like financial events.

When to use Kafka instead

  • Event-sourced systems where exactly-once semantics across restarts are a hard requirement.
  • Workflows requiring durable replay for months / years of state reconstruction.
  • Heterogeneous downstream consumers needing decoupled delivery.

Seen in

  • sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. "These problems are often solved by using a messaging system like Kafka for partitioning assignments and maintaining previous data; this is costly at our scale and adds ingestion delay that impacts real-time usecases. The alternative approach is to store in-memory state in aggregators and reroute metrics between aggregators to honor assignment. However, this leads to data loss when an aggregator is redeployed; in an initial version of our aggregation infrastructure, this behavior made aggregated metrics almost unintelligible to our users. To make this work seamlessly, we instead developed our own aggregation system using Telegraf and Databricks' 'auto-sharder' service Dicer. This architecture uses intelligent sticky routing instead of rerouting metrics across aggregators, which addressed the redeployment failure modes." Sustains >1 GB/s per region, thousands of aggregation rules.