SYSTEM Cited by 1 source
Kafka Streams¶
Kafka Streams is a JVM client library for building stream-processing
applications on top of systems/kafka. Not a separate cluster: the
application is the stream processor, packaged as an ordinary JAR,
scaled by running more instances. The central abstraction is the
Topology — a DAG of source nodes (topics), processor nodes
(map / filter / aggregate / join / user Processor), state
stores, and sinks.
Topology, sub-topology, task (the structural vocabulary)¶
- Topology — the full DAG the application hands to
KafkaStreams. - Sub-topology — a connected
component of the topology under the graph of shared operators /
state stores. Kafka Streams derives sub-topologies implicitly
when the application is constructed. Two source nodes with no
shared operator or state between them form separate
sub-topologies — even when they live in the same
Topologyobject. - Task — the unit of parallelism: one task per sub-topology per
partition index. When two topics consumed in the same
sub-topology have matching partition counts, their partition
iis owned by the same task, giving cross-topic partition colocation.
Partition assignor¶
StreamsPartitionAssignor assigns tasks to application instances on
rebalance. Its co-partitioning check is per-sub-topology: topics
within the same sub-topology are validated for matching partition
counts and assigned together (same index → same task → same
instance). Cross-sub-topology assignments are independent — that
is the substrate rule behind the Expedia bug (see below).
State stores¶
Kafka Streams ships a managed storage abstraction — state stores — attached to processor nodes. They are:
- Local to the task that owns them (RocksDB-backed by default, or in-memory, or pluggable).
- Fault-tolerant via a changelog topic (every write mirrored).
- Restorable on task failover (replay the changelog into a new task's local store).
State stores serve two roles, the second of which is the Expedia post's contribution:
- Storage — hold aggregate state, join buffers, session state.
- Topology-shape structural role — attaching the same state store to two otherwise-disjoint processing branches forces Kafka Streams to merge them into a single sub-topology, which restores cross-topic partition colocation. See patterns/shared-state-store-as-topology-unifier.
Partition colocation — the Expedia case¶
Expedia consumed from two topics (same partition count, similar keys) and relied on a local Guava cache to deduplicate expensive external-API calls across both streams. In production, identical keys landed on different application instances — because the two topics had been wired as independent sub-topologies (no shared state / operator) and Kafka Streams never tried to colocate their partitions. Fix: replace the Guava cache with a Kafka Streams state store attached to both branches, forcing sub-topology unification and restoring colocation. Source: sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation.
When cross-topic colocation is actually guaranteed by default¶
- Explicit joins:
KStream.join(KStream),KStream.join(KTable), etc. perform a co-partitioning check inside the assignor (matching partition counts required, same sub-topology by construction) — you don't need the state-store trick. The Expedia post is careful to distinguish its use case from this one: "this was not a join use case". - Repartition before merge: inserting
through()/repartition()also lands both sides in a shared sub-topology by construction. - All other cross-topic coordination (cache-sharing, cross-topic stats, dedup) is on the developer to structure deliberately.
Seen in¶
- sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation — Expedia production debugging of implicit-two-sub-topology bug; shared state store used purely for topology-shape unification; formulates the rule: "Kafka Streams only guarantees partition colocation across topics if those topics are consumed within the same sub-topology and have the same number of partitions."