PATTERN Cited by 1 source
Shared state store as topology unifier
Attach a single Kafka Streams state store to two (or more) otherwise-disjoint processing branches specifically to force the framework to merge their sub-topologies into a single unified topology, thereby restoring cross-topic partition colocation guarantees — even when the state store's storage role is incidental or even unnecessary.
The problem this solves
Kafka Streams partitions are assigned per-sub-topology. Two topic consumers that share no operator or state between them form separate sub-topologies, and their partitions are assigned independently — meaning same-key records across the two topics can land on different instances. This breaks:
- Cross-topic cache locality
- Cross-topic deduplication via a node-local map
- Any cross-topic invariant scoped to a single key
…without a loud error. The application runs; results may even stay correct; but the local-cache-shaped optimisation the developer wrote silently degrades into N independent caches that collectively hit the backing store N times per unique key.
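A minimal sketch of the failure shape, with hypothetical topic names: two consumers that share no operator or state store compile into two separate sub-topologies, which `Topology#describe()` makes visible.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

public class DisjointBranches {
    // Two branches that share nothing: no operator, no state store.
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Branch 1: would consult its own node-local cache here.
        builder.stream("topic-a").foreach((k, v) -> { });
        // Branch 2: same key space, but a separate node-local cache.
        builder.stream("topic-b").foreach((k, v) -> { });
        return builder.build();
    }

    public static void main(String[] args) {
        // describe() reports "Sub-topology: 0" and "Sub-topology: 1";
        // the assignor places their partitions independently.
        System.out.println(build().describe());
    }
}
```

Printing `describe()` at startup is the cheapest way to catch this split before it reaches production.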
The pattern
- Define a Kafka Streams state store (`StoreBuilder`; any supported backing: RocksDB, in-memory, custom).
- Attach the store to processors on both branches of the topology — either via `topology.addStateStore(store, "branch1-processor", "branch2-processor")` in the low-level Processor API, or by registering the store with a processor that both branches hand off to in the DSL.
- Use the store in place of the per-branch local cache (Guava, Caffeine, …).
That single shared dependency is what Kafka Streams looks for: once two source-containing subgraphs reference the same store, the framework merges them into one sub-topology, and the partition assignor treats them as co-partitioned.
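A low-level Processor API sketch of the pattern; store, processor, and topic names are illustrative, borrowed from the bullets above. The single `addStateStore` call naming both processors is the shared dependency that merges the two subgraphs.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class SharedStoreUnifier {
    static final String STORE = "shared-cache"; // illustrative name

    // Cache-shaped processor: memoise per key in the shared store.
    static Processor<String, String, Void, Void> cacheProcessor() {
        return new Processor<>() {
            private KeyValueStore<String, String> cache;

            @Override
            public void init(ProcessorContext<Void, Void> ctx) {
                cache = ctx.getStateStore(STORE);
            }

            @Override
            public void process(Record<String, String> rec) {
                if (cache.get(rec.key()) == null) {
                    // expensive external lookup would happen here, once per key
                    cache.put(rec.key(), rec.value());
                }
            }
        };
    }

    public static Topology build() {
        StoreBuilder<KeyValueStore<String, String>> store =
                Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore(STORE),
                        Serdes.String(), Serdes.String());
        ProcessorSupplier<String, String, Void, Void> supplier =
                SharedStoreUnifier::cacheProcessor;

        Topology t = new Topology();
        t.addSource("source-a", "topic-a");
        t.addSource("source-b", "topic-b");
        t.addProcessor("branch1-processor", supplier, "source-a");
        t.addProcessor("branch2-processor", supplier, "source-b");
        // The shared dependency: both processors reference the same store,
        // so the two source-containing subgraphs merge into one sub-topology.
        t.addStateStore(store, "branch1-processor", "branch2-processor");
        return t;
    }
}
```

After `addStateStore`, `build().describe()` reports a single sub-topology containing both sources, which is exactly what makes the assignor co-assign same-index partitions.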
Side effects to accept (or deliberately use)
- Fault tolerance. State-store writes are mirrored to a changelog topic — the store survives task failover. Often a feature; sometimes a cost axis if the cache was truly ephemeral.
- Slight per-op latency. RocksDB-backed stores are slower per operation than a pure in-memory Guava cache. In-memory Kafka Streams state stores close most of that gap but still pay some framework cost.
- Changelog-topic storage and bandwidth. The changelog is sized to the working-set keyspace; retention and cleanup are configurable.
None of these are fatal for a cache-shaped workload — and the Expedia post is clear that the external-API cost the cache was eliminating dominates all of these.
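The changelog knobs mentioned above are ordinary Kafka topic configs, set per store. A sketch, assuming an in-memory cache store (name and retention values illustrative):

```java
import java.util.Map;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class ChangelogTunedStore {
    public static StoreBuilder<KeyValueStore<String, String>> build() {
        return Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore("shared-cache"),
                        Serdes.String(), Serdes.String())
                // Changelog stays on for failover, but cleanup is bounded:
                // "compact" keeps the latest value per key, "delete" expires
                // segments older than the retention window.
                .withLoggingEnabled(Map.of(
                        "cleanup.policy", "compact,delete",
                        "retention.ms", String.valueOf(24L * 60 * 60 * 1000)));
    }
}
```

If the cache really is ephemeral and failover re-population is cheap, `withLoggingDisabled()` removes the changelog cost entirely — at the price of the fault-tolerance side effect listed above.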
Applicability / non-applicability
Use this when:
- Two (or more) Kafka Streams input topics share a keying strategy and need cross-topic colocation for caching, deduplication, or cross-topic aggregation that the framework doesn't explicitly offer.
- You're not joining the streams (if you are, use `KStream.join()` — the framework's co-partitioning check handles unification automatically).
- You've verified, via production behavior or the assignor log, that the topics are in separate sub-topologies.
Don't use this when:
- A built-in join fits — joins already force co-partitioning.
- An intermediate `repartition()` is a better fit (e.g. you actually want to re-key before merging, not just unify the topology shape).
- The expensive backing resource can itself be cheaply sharded — a distributed cache keyed by the record key is a simpler answer when changelog-topic cost is a concern.
Why it generalises
The structural insight is that the shape of the topology DAG — specifically its connected-component structure under shared-operator/shared-store edges — is the primary input to the partition assignor. Developers reason about Kafka Streams as a dataflow; the runtime reasons about it as a graph-partitioning problem. Adding a carefully chosen edge that connects two otherwise-disjoint components is a direct lever on placement, and a shared state store is the lightest-weight way to add it.
From the Expedia post: "Using a shared state store as a unifying element is a powerful pattern — not only for sharing data, but also for influencing execution behavior in a distributed Kafka Streams application."
Seen in
- sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation — Expedia's canonical case; Guava-cache-per-instance → shared state store attached to both branches; sub-topology unified; same-index partitions co-assigned; redundant external-API calls eliminated.