Kafka Streams¶

Kafka Streams is a JVM client library for building stream-processing applications on top of systems/kafka. It does not run as a separate cluster: the application itself is the stream processor, packaged as an ordinary JAR and scaled by running more instances. The central abstraction is the Topology, a DAG of source nodes (topics), processor nodes (map / filter / aggregate / join / user Processor), state stores, and sinks.

Topology, sub-topology, task (the structural vocabulary)¶

Topology — the full DAG the application hands to KafkaStreams.
Sub-topology — a connected component of the topology graph: nodes belong to the same sub-topology when they are linked through operators or share a state store. Kafka Streams derives sub-topologies implicitly when the topology is built. Two source nodes with no shared operator or state store between them form separate sub-topologies, even when they live in the same Topology object.
Task — the unit of parallelism: one task per sub-topology per partition index. When two topics consumed in the same sub-topology have matching partition counts, their partition i is owned by the same task, giving cross-topic partition colocation.
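The vocabulary above can be made concrete with a small model. The sketch below is plain Java, not the Kafka Streams API: it treats sources, processors, and state stores as graph nodes and counts connected components with a union-find structure, which is one way to picture how sub-topologies fall out of the wiring. The class name, node names, and store name are hypothetical; only the task-id form `<sub-topology>_<partition>` mirrors what Kafka Streams prints.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: plain Java, not the Kafka Streams API.
class SubTopologyModel {
    // Union-find parent pointers over node names (sources, processors, stores).
    private final Map<String, String> parent = new HashMap<>();

    private String find(String node) {
        parent.putIfAbsent(node, node);
        String p = parent.get(node);
        if (!p.equals(node)) {
            p = find(p);
            parent.put(node, p); // path compression
        }
        return p;
    }

    // Connect two nodes: a stream edge, or a processor attached to a store.
    void connect(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        parent.put(rootA, rootB);
    }

    boolean sameSubTopology(String a, String b) {
        return find(a).equals(find(b));
    }

    // Number of connected components, i.e. sub-topologies in this model.
    int subTopologyCount() {
        Set<String> roots = new HashSet<>();
        for (String node : new ArrayList<>(parent.keySet())) {
            roots.add(find(node));
        }
        return roots.size();
    }

    // One task per sub-topology per partition, named "<subTopology>_<partition>".
    static List<String> taskIds(int subTopologyIndex, int partitions) {
        List<String> ids = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            ids.add(subTopologyIndex + "_" + p);
        }
        return ids;
    }

    public static void main(String[] args) {
        SubTopologyModel model = new SubTopologyModel();
        model.connect("source-a", "proc-a");
        model.connect("source-b", "proc-b");
        System.out.println(model.subTopologyCount()); // 2: two disjoint branches

        // Attaching one store to both processors joins the components.
        model.connect("proc-a", "shared-store");
        model.connect("proc-b", "shared-store");
        System.out.println(model.subTopologyCount()); // 1

        System.out.println(taskIds(0, 3)); // [0_0, 0_1, 0_2]
    }
}
```

In a real application the equivalent inspection would go through the description that a built Topology exposes; the model only illustrates why a shared node changes the component count.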

Partition assignor¶

StreamsPartitionAssignor assigns tasks to application instances on rebalance. Its co-partitioning check is per-sub-topology: topics within the same sub-topology are validated for matching partition counts and assigned together (same index → same task → same instance). Tasks from different sub-topologies are assigned independently of one another; that is the rule behind the Expedia bug (see "Partition colocation" below).
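To see why independent assignment can split matching partition indices across instances, consider the following sketch. It is an assumption-laden stand-in: tasks are handed out round-robin in task-id order, which is not the real StreamsPartitionAssignor algorithm (the real one is sticky and balances load in more elaborate ways). The class and method names are hypothetical. What the model does capture is that tasks, not topic partitions, are the unit being placed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: round-robin placement is an assumption, chosen to be deterministic.
class AssignorModel {
    // Enumerate task ids "<subTopology>_<partition>" for every sub-topology.
    static List<String> tasks(int subTopologies, int partitions) {
        List<String> ids = new ArrayList<>();
        for (int s = 0; s < subTopologies; s++) {
            for (int p = 0; p < partitions; p++) {
                ids.add(s + "_" + p);
            }
        }
        return ids;
    }

    // Map each task id to an instance index, round-robin in list order.
    static Map<String, Integer> assign(List<String> taskIds, int instances) {
        Map<String, Integer> owner = new LinkedHashMap<>();
        for (int i = 0; i < taskIds.size(); i++) {
            owner.put(taskIds.get(i), i % instances);
        }
        return owner;
    }

    public static void main(String[] args) {
        // Two topics in separate sub-topologies, 3 partitions each, 2 instances.
        // Topic A's partition 0 is task 0_0; topic B's partition 0 is task 1_0.
        System.out.println(assign(tasks(2, 3), 2));
        // {0_0=0, 0_1=1, 0_2=0, 1_0=1, 1_1=0, 1_2=1}

        // Same topics in ONE sub-topology: task 0_0 owns partition 0 of both.
        System.out.println(assign(tasks(1, 3), 2));
        // {0_0=0, 0_1=1, 0_2=0}
    }
}
```

In the split case, partition 0 of the first topic lands on instance 0 while partition 0 of the second lands on instance 1. In the unified case there is only one task per partition index, so both topics' partition 0 necessarily share an instance.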

State stores¶

Kafka Streams ships a managed storage abstraction — state stores — attached to processor nodes. They are:

Local to the task that owns them (RocksDB-backed by default, or in-memory, or pluggable).
Fault-tolerant via a changelog topic (every write mirrored).
Restorable on task failover (replay the changelog into a new task's local store).
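The three properties above can be sketched with an in-memory model: every write goes to a local map and is mirrored to an append-only log, and a replacement task rebuilds its local map by replaying that log. This is plain Java for illustration, not the Kafka Streams store API; the class name and the use of a `List` as the changelog are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a List stands in for the changelog topic.
class ChangelogStoreModel {
    private final Map<String, Long> local = new HashMap<>();
    private final List<Map.Entry<String, Long>> changelog;

    ChangelogStoreModel(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
    }

    // Every write updates local state and is mirrored to the changelog.
    void put(String key, Long value) {
        local.put(key, value);
        changelog.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
    }

    Long get(String key) {
        return local.get(key);
    }

    // Failover: a new task replays the changelog in order; the last write per key wins.
    static ChangelogStoreModel restore(List<Map.Entry<String, Long>> changelog) {
        ChangelogStoreModel store = new ChangelogStoreModel(changelog);
        for (Map.Entry<String, Long> entry : changelog) {
            store.local.put(entry.getKey(), entry.getValue());
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> log = new ArrayList<>();
        ChangelogStoreModel original = new ChangelogStoreModel(log);
        original.put("a", 1L);
        original.put("a", 2L);
        original.put("b", 5L);

        // Simulate losing the task: only the changelog survives.
        ChangelogStoreModel restored = ChangelogStoreModel.restore(log);
        System.out.println(restored.get("a")); // 2
        System.out.println(restored.get("b")); // 5
    }
}
```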

State stores serve two roles, the second of which is the Expedia post's contribution:

Storage — hold aggregate state, join buffers, session state.
Topology-shape structural role — attaching the same state store to two otherwise-disjoint processing branches forces Kafka Streams to merge them into a single sub-topology, which restores cross-topic partition colocation. See patterns/shared-state-store-as-topology-unifier.

Partition colocation — the Expedia case¶

Expedia consumed from two topics (same partition count, similar keys) and relied on a local Guava cache to deduplicate expensive external-API calls across both streams. In production, identical keys landed on different application instances — because the two topics had been wired as independent sub-topologies (no shared state / operator) and Kafka Streams never tried to colocate their partitions. Fix: replace the Guava cache with a Kafka Streams state store attached to both branches, forcing sub-topology unification and restoring colocation. Source: sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation.
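The cost of the missing colocation can be illustrated with a small simulation. Each instance keeps its own local cache (a `HashSet` standing in for the Guava cache), and an external call is counted on every cache miss. The ownership maps, topic names, and the key `hotel-1` are hypothetical; the sketch is plain Java and says nothing about how Expedia's code was actually written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: per-instance caches, with a counter for simulated external calls.
class DedupColocationModel {
    // Each record is {topic, partition, key}; ownerOf maps "topic:partition" to an instance.
    static int externalCalls(List<String[]> records, Map<String, Integer> ownerOf, int instances) {
        List<Set<String>> caches = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            caches.add(new HashSet<>());
        }
        int calls = 0;
        for (String[] record : records) {
            int instance = ownerOf.get(record[0] + ":" + record[1]);
            // add() returns true on a cache miss, which triggers an external call.
            if (caches.get(instance).add(record[2])) {
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"topic-a", "0", "hotel-1"});
        records.add(new String[] {"topic-b", "0", "hotel-1"});

        // Separate sub-topologies: partition 0 of each topic on a different instance.
        Map<String, Integer> split = new HashMap<>();
        split.put("topic-a:0", 0);
        split.put("topic-b:0", 1);
        System.out.println(externalCalls(records, split, 2)); // 2

        // Unified sub-topology: one task, hence one instance, owns both partitions.
        Map<String, Integer> colocated = new HashMap<>();
        colocated.put("topic-a:0", 0);
        colocated.put("topic-b:0", 0);
        System.out.println(externalCalls(records, colocated, 2)); // 1
    }
}
```

With split ownership the same key misses both caches and the external API is called twice; with colocated ownership the second record hits the cache populated by the first.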

When cross-topic colocation is actually guaranteed by default¶

Explicit joins: KStream.join(KStream), KStream.join(KTable), etc. trigger a co-partitioning check in the assignor (matching partition counts required, same sub-topology by construction), so the state-store trick is unnecessary. The Expedia post is careful to distinguish its use case from this one: "this was not a join use case".
Repartition before merge: inserting repartition() (or the older through(), deprecated since Kafka 2.6) on both branches before merging them also lands the downstream processing in a shared sub-topology by construction.
All other cross-topic coordination (cache-sharing, cross-topic stats, dedup) is on the developer to structure deliberately.
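The matching-partition-count requirement in these cases follows from how keys map to partitions: the same key reaches the same partition index on two topics only if both use the same partitioning function and the same partition count. The sketch below uses `String.hashCode()` as a stand-in hash purely to stay dependency-free; Kafka's default partitioner hashes the serialized key bytes with murmur2 instead, so the concrete indices here are illustrative only. The class name, method names, and the key `k1` are hypothetical.

```java
// Toy model only: String.hashCode() stands in for Kafka's murmur2-based partitioner.
class CoPartitioningModel {
    // Map a key to a partition index in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // True when the key lands on the same partition index in both topics.
    static boolean sameIndex(String key, int partitionsA, int partitionsB) {
        return partitionFor(key, partitionsA) == partitionFor(key, partitionsB);
    }

    public static void main(String[] args) {
        // Equal partition counts: the index always matches.
        System.out.println(sameIndex("k1", 4, 4)); // true

        // Different partition counts: the same key can land on different indices.
        System.out.println(partitionFor("k1", 4)); // 2
        System.out.println(partitionFor("k1", 6)); // 0
    }
}
```

Matching counts are necessary but not sufficient for colocation: as the Expedia case shows, the two topics must also be consumed within the same sub-topology for index `i` of both to be owned by one task.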

Seen in¶

sources/2024-05-09-highscalability-kafka-101 — canonical upstream-framing source positioning Kafka Streams as "a client library that exposes a high-level API for processing, transforming and enriching data in real time" — "native to Kafka and therefore doesn't require a separate complex deployment strategy" — "it is deployed as a regular Kafka client, usually within your application." Confirms exactly-once processing semantics when input + output are Kafka topics, and names the consumer-group protocol as the horizontal-scale substrate.
sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation — Expedia production debugging of implicit-two-sub-topology bug; shared state store used purely for topology-shape unification; formulates the rule: "Kafka Streams only guarantees partition colocation across topics if those topics are consumed within the same sub-topology and have the same number of partitions."