SYSTEM Cited by 1 source

Kafka Streams

Kafka Streams is a JVM client library for building stream-processing applications on top of systems/kafka. It does not run as a separate cluster: the application itself is the stream processor, packaged as an ordinary JAR and scaled by running more instances. The central abstraction is the Topology, a DAG of source nodes (topics), processor nodes (map / filter / aggregate / join / user Processor), state stores, and sinks.

Topology, sub-topology, task (the structural vocabulary)

  • Topology — the full DAG the application hands to KafkaStreams.
  • Sub-topology — a connected component of the topology: nodes belong to the same component when they are linked through operators or share a state store. Kafka Streams derives sub-topologies implicitly when the topology is built. Two source nodes with no shared operator or state between them form separate sub-topologies, even when they live in the same Topology object.
  • Task — the unit of parallelism: one task per sub-topology per partition index. When two topics consumed in the same sub-topology have matching partition counts, their partition i is owned by the same task, giving cross-topic partition colocation.
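
The task rule above can be made concrete with a small model. This is a sketch in plain Java, not Kafka Streams code: the class TaskModel and the method tasksFor are hypothetical names, and the only detail taken from Kafka Streams is the "<sub-topology id>_<partition>" shape of task ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of task enumeration (hypothetical names; not Kafka Streams code). */
class TaskModel {

    /**
     * One task per partition index of a sub-topology.
     * Ids follow the "<subTopologyId>_<partition>" shape used by Kafka Streams task ids.
     */
    static List<String> tasksFor(int subTopologyId, Map<String, Integer> partitionsByTopic) {
        // Assumption of this sketch: the task count is the largest partition count
        // among the sub-topology's source topics.
        int taskCount = 0;
        for (int partitions : partitionsByTopic.values()) {
            taskCount = Math.max(taskCount, partitions);
        }
        List<String> tasks = new ArrayList<>();
        for (int p = 0; p < taskCount; p++) {
            tasks.add(subTopologyId + "_" + p);
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Both topics in ONE sub-topology: partition i of each topic shares task 0_i.
        System.out.println(tasksFor(0, Map.of("topic-a", 4, "topic-b", 4)));
        // Same topics in TWO sub-topologies: 0_i and 1_i are separate tasks.
        System.out.println(tasksFor(0, Map.of("topic-a", 4)));
        System.out.println(tasksFor(1, Map.of("topic-b", 4)));
    }
}
```

In the first call, four tasks cover eight partitions, so partition i of both topics is processed together. In the split case there are eight tasks, and nothing in the ids ties 0_i to 1_i.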

Partition assignor

StreamsPartitionAssignor assigns tasks to application instances on rebalance. Its co-partitioning check is per-sub-topology: topics within the same sub-topology are validated for matching partition counts and assigned together (same index → same task → same instance). Cross-sub-topology assignments are independent — that is the substrate rule behind the Expedia bug (see below).
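
To see why independent assignment breaks colocation, consider a deliberately naive model. The sketch below is not the StreamsPartitionAssignor algorithm (the real assignor also weighs stickiness and existing state); ToyAssignor and its round-robin policy are assumptions made purely for illustration. It shows only that when each task is placed on its own, nothing forces task 0_i and task 1_i onto the same instance.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Naive round-robin placement (hypothetical; NOT the real StreamsPartitionAssignor). */
class ToyAssignor {

    /** Deals task ids out to instances in order; each task is placed independently. */
    static Map<String, String> assign(List<String> taskIds, List<String> instances) {
        Map<String, String> owner = new LinkedHashMap<>();
        for (int i = 0; i < taskIds.size(); i++) {
            owner.put(taskIds.get(i), instances.get(i % instances.size()));
        }
        return owner;
    }

    public static void main(String[] args) {
        // Two sub-topologies (ids 0 and 1), three partitions each, two instances.
        List<String> tasks = List.of("0_0", "0_1", "0_2", "1_0", "1_1", "1_2");
        Map<String, String> owner = assign(tasks, List.of("instance-A", "instance-B"));
        System.out.println("0_0 -> " + owner.get("0_0")); // instance-A
        System.out.println("1_0 -> " + owner.get("1_0")); // instance-B
    }
}
```

Partition 0 of the first topic and partition 0 of the second end up on different instances here, which is the shape of the failure described in the Expedia case.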

State stores

Kafka Streams ships a managed storage abstraction — state stores — attached to processor nodes. They are:

  • Local to the task that owns them (RocksDB-backed by default, or in-memory, or pluggable).
  • Fault-tolerant via a changelog topic (every write mirrored).
  • Restorable on task failover (replay the changelog into a new task's local store).
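
The three properties above fit in a short model. This is a plain-Java sketch, not the Kafka Streams implementation: ChangelogBackedStore is a hypothetical class, a HashMap stands in for the local RocksDB store, and an in-memory list stands in for the changelog topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy changelog-backed store (hypothetical; not the Kafka Streams implementation). */
class ChangelogBackedStore {
    private final Map<String, String> local = new HashMap<>(); // stands in for RocksDB
    private final List<String[]> changelog;                     // stands in for the changelog topic

    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
    }

    /** Every write goes to the local store and is mirrored to the changelog. */
    void put(String key, String value) {
        local.put(key, value);
        changelog.add(new String[] {key, value});
    }

    String get(String key) {
        return local.get(key);
    }

    /** Failover: a new task rebuilds its local store by replaying the changelog in order. */
    static ChangelogBackedStore restore(List<String[]> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(changelog);
        for (String[] entry : changelog) {
            store.local.put(entry[0], entry[1]); // later entries overwrite earlier ones
        }
        return store;
    }

    public static void main(String[] args) {
        List<String[]> changelog = new ArrayList<>();
        ChangelogBackedStore original = new ChangelogBackedStore(changelog);
        original.put("k1", "v1");
        original.put("k1", "v2");
        original.put("k2", "x");

        // Simulate losing the task: only the changelog survives.
        ChangelogBackedStore recovered = restore(changelog);
        System.out.println(recovered.get("k1")); // v2
        System.out.println(recovered.get("k2")); // x
    }
}
```

Because replay happens in write order, the restored store converges on the last value written for each key.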

State stores serve two roles, the second of which is the Expedia post's contribution:

  1. Storage — hold aggregate state, join buffers, session state.
  2. Topology-shape structural role — attaching the same state store to two otherwise-disjoint processing branches forces Kafka Streams to merge them into a single sub-topology, which restores cross-topic partition colocation. See patterns/shared-state-store-as-topology-unifier.
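
The unifying effect of a shared store can be checked without a running cluster by inspecting Topology.describe(). The sketch below uses the low-level Processor API and assumes a Kafka Streams 3.x dependency on the classpath. The topic names, node names, the store name dedup-store, and the no-op processor are all hypothetical; this is not Expedia's code, only a minimal reproduction of the topology shape.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.Stores;

/** Two disjoint branches, optionally unified by one shared state store (hypothetical names). */
class SharedStoreTopology {

    /** Placeholder processor; a real one would consult the store before calling the external API. */
    static final class NoOpProcessor implements Processor<String, String, String, String> {
        @Override
        public void process(Record<String, String> record) {
            // Intentionally empty: only the topology shape matters in this sketch.
        }
    }

    static Topology build(boolean shareStore) {
        ProcessorSupplier<String, String, String, String> supplier = NoOpProcessor::new;

        Topology topology = new Topology();
        topology.addSource("source-a", "topic-a");
        topology.addProcessor("process-a", supplier, "source-a");
        topology.addSource("source-b", "topic-b");
        topology.addProcessor("process-b", supplier, "source-b");

        if (shareStore) {
            // Connecting ONE store to BOTH processors links the two branches.
            topology.addStateStore(
                Stores.keyValueStoreBuilder(
                    Stores.inMemoryKeyValueStore("dedup-store"),
                    Serdes.String(),
                    Serdes.String()),
                "process-a", "process-b");
        }
        return topology;
    }

    public static void main(String[] args) {
        // Expected: 2 sub-topologies without the shared store, 1 with it.
        System.out.println(build(false).describe().subtopologies().size());
        System.out.println(build(true).describe().subtopologies().size());
    }
}
```

Printing build(true).describe() in full also lists which sources and processors belong to each sub-topology, which makes it a convenient startup check or unit-test assertion for code that relies on colocation.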

Partition colocation — the Expedia case

Expedia consumed from two topics (same partition count, similar keys) and relied on a local Guava cache to deduplicate expensive external-API calls across both streams. In production, identical keys landed on different application instances — because the two topics had been wired as independent sub-topologies (no shared state / operator) and Kafka Streams never tried to colocate their partitions. Fix: replace the Guava cache with a Kafka Streams state store attached to both branches, forcing sub-topology unification and restoring colocation. Source: sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation.

When cross-topic colocation is actually guaranteed by default

  • Explicit joins: KStream.join(KStream), KStream.join(KTable), etc. perform a co-partitioning check inside the assignor (matching partition counts required, same sub-topology by construction) — you don't need the state-store trick. The Expedia post is careful to distinguish its use case from this one: "this was not a join use case".
  • Repartition before merge: inserting repartition() (or the older through(), deprecated since Kafka 2.6) on both sides before merging also lands them in a shared sub-topology by construction.
  • All other cross-topic coordination (cache-sharing, cross-topic stats, dedup) is on the developer to structure deliberately.

Seen in

  • sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation — Expedia production debugging of implicit-two-sub-topology bug; shared state store used purely for topology-shape unification; formulates the rule: "Kafka Streams only guarantees partition colocation across topics if those topics are consumed within the same sub-topology and have the same number of partitions."