PATTERN Cited by 1 source
Shared state store as topology unifier
Attach a single Kafka Streams state store to two (or more) otherwise-disjoint processing branches specifically to force the framework to merge their sub-topologies into a single unified topology, thereby restoring cross-topic partition colocation guarantees — even when the state store's storage role is incidental or even unnecessary.
The problem this solves
Kafka Streams partitions are assigned per-sub-topology. Two topic consumers that share no operator or state between them form separate sub-topologies, and their partitions are assigned independently — meaning same-key records across the two topics can land on different instances. This breaks:
- Cross-topic cache locality
- Cross-topic deduplication via a node-local map
- Any cross-topic invariant scoped to a single key
…without a loud error. The application runs; results may even stay correct; but the local-cache-shaped optimisation the developer wrote silently degrades into N independent caches that collectively hit the backing store N times per unique key.
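A minimal sketch of the failure shape, with hypothetical topic names: two consumers that share no operator or state store compile into two separate sub-topologies, which `Topology#describe()` makes visible.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class DisjointBranches {
    // Two branches that share nothing: no operator, no state store.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Branch 1: would consult its own node-local cache here.
        builder.stream("topic-a").foreach((k, v) -> { });
        // Branch 2: same key space, but a separate node-local cache.
        builder.stream("topic-b").foreach((k, v) -> { });
        return builder.build();
    }

    public static void main(String[] args) {
        // describe() reports "Sub-topology: 0" and "Sub-topology: 1";
        // the assignor places their partitions independently.
        System.out.println(build().describe());
    }
}
```

Printing `describe()` at startup is the cheapest way to catch this split before it reaches production.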
The pattern
- Define a Kafka Streams state store (`StoreBuilder`; any supported backing: RocksDB, in-memory, custom).
- Attach the store to processors on both branches of the topology — either via `topology.addStateStore(store, "branch1-processor", "branch2-processor")` in the low-level Processor API, or by registering the store with a processor that both branches hand off to in the DSL.
- Use the store in place of the per-branch local cache (Guava, Caffeine, …).
That single shared dependency is what Kafka Streams looks for: once two source-containing subgraphs reference the same store, the framework merges them into one sub-topology, and the partition assignor treats them as co-partitioned.
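A low-level Processor API sketch of the pattern; store, processor, and topic names are illustrative, borrowed from the bullets above. The single `addStateStore` call naming both processors is the shared dependency that merges the two subgraphs.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class SharedStoreUnifier {
    static final String STORE = "shared-cache"; // illustrative name

    // Cache-shaped processor: memoise per key in the shared store.
    static Processor<String, String, Void, Void> cacheProcessor() {
        return new Processor<>() {
            private KeyValueStore<String, String> cache;

            @Override
            public void init(ProcessorContext<Void, Void> ctx) {
                cache = ctx.getStateStore(STORE);
            }

            @Override
            public void process(Record<String, String> rec) {
                if (cache.get(rec.key()) == null) {
                    // expensive external lookup would happen here, once per key
                    cache.put(rec.key(), rec.value());
                }
            }
        };
    }

    public static Topology build() {
        StoreBuilder<KeyValueStore<String, String>> store =
                Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore(STORE),
                        Serdes.String(), Serdes.String());
        ProcessorSupplier<String, String, Void, Void> supplier =
                SharedStoreUnifier::cacheProcessor;

        Topology t = new Topology();
        t.addSource("source-a", "topic-a");
        t.addSource("source-b", "topic-b");
        t.addProcessor("branch1-processor", supplier, "source-a");
        t.addProcessor("branch2-processor", supplier, "source-b");
        // The shared dependency: both processors reference the same store,
        // so the two source-containing subgraphs merge into one sub-topology.
        t.addStateStore(store, "branch1-processor", "branch2-processor");
        return t;
    }
}
```

After `addStateStore`, `build().describe()` reports a single sub-topology containing both sources, which is exactly what makes the assignor co-assign same-index partitions.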
Side effects to accept (or deliberately use)
- Fault tolerance. State-store writes are mirrored to a changelog topic — the store survives task failover. Often a feature; sometimes a cost axis if the cache was truly ephemeral.
- Slight per-op latency. RocksDB-backed stores are slower per operation than a pure in-memory Guava cache. In-memory Kafka Streams state stores close most of that gap but still pay some framework cost.
- Changelog-topic storage and bandwidth. The changelog is sized to the working-set keyspace; retention and cleanup are configurable.
None of these are fatal for a cache-shaped workload — and the Expedia post is clear that the external-API cost the cache was eliminating dominates all of these.
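The changelog knobs mentioned above are ordinary Kafka topic configs, set per store. A sketch, assuming an in-memory cache store (name and retention values illustrative):

```java
import java.util.Map;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class ChangelogTunedStore {
    public static StoreBuilder<KeyValueStore<String, String>> build() {
        return Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore("shared-cache"),
                        Serdes.String(), Serdes.String())
                // Changelog stays on for failover, but cleanup is bounded:
                // "compact" keeps the latest value per key, "delete" expires
                // segments older than the retention window.
                .withLoggingEnabled(Map.of(
                        "cleanup.policy", "compact,delete",
                        "retention.ms", String.valueOf(24L * 60 * 60 * 1000)));
    }
}
```

If the cache really is ephemeral and failover re-population is cheap, `withLoggingDisabled()` removes the changelog cost entirely — at the price of the fault-tolerance side effect listed above.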
Applicability / non-applicability
Use this when:
- Two (or more) Kafka Streams input topics share a keying strategy and need cross-topic colocation for caching, deduplication, or cross-topic aggregation that the framework doesn't explicitly offer.
- You're not joining the streams (if you are, use `KStream.join()` — the framework's co-partitioning check handles unification automatically).
- You've verified, via production behavior or the assignor log, that the topics are in separate sub-topologies.
Don't use this when:
- A built-in join fits — joins already force co-partitioning.
- An intermediate `repartition()` is a better fit (e.g. you actually want to re-key before merging, not just unify the topology shape).
- The expensive backing resource can itself be cheaply sharded — a distributed cache keyed by the record key is a simpler answer when changelog-topic cost is a concern.
Why it generalises
The structural insight is that the shape of the topology DAG — specifically its connected-component structure under shared-operator/shared-store edges — is the primary input to the partition assignor. Developers reason about Kafka Streams as a dataflow; the runtime reasons about it as a graph-partitioning problem. Adding a carefully chosen edge that connects two otherwise-disjoint components is a direct lever on placement, and a shared state store is the lightest-weight way to add it.
From the Expedia post: "Using a shared state store as a unifying element is a powerful pattern — not only for sharing data, but also for influencing execution behavior in a distributed Kafka Streams application."
Seen in
- sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation — Expedia's canonical case; Guava-cache-per-instance → shared state store attached to both branches; sub-topology unified; same-index partitions co-assigned; redundant external-API calls eliminated.