SYSTEM Cited by 3 sources
Apache Kafka¶
Distributed, partitioned, replicated append-only log; the canonical open-source streaming-messaging substrate. Producers write keyed records to topics, which are split into partitions for horizontal scale; consumers read from partitions with at-least-once (default) or exactly-once semantics. Keyed records with the same key land on the same partition (hash-of-key ⇒ partition index), which is the foundation that higher layers (like systems/kafka-streams) build partition-local guarantees on top of.
Core primitives (referenced by Kafka Streams analyses)¶
- Topic — named, append-only logical log.
- Partition — unit of parallelism; each topic is split into N partitions; ordering is guaranteed within a partition only.
- Record key — the hash of the key determines the partition. Identical keys land on the same partition of a single topic by construction.
- Consumer group — set of consumers that cooperatively divide partitions of a topic; Kafka Streams is layered on top of this primitive.
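The hash-of-key ⇒ partition-index routing above can be sketched in a few lines. Kafka's actual default partitioner applies murmur2 to the serialized key and takes it modulo the partition count; the md5-based hash below is a deterministic stand-in for illustration only, and `partition_for` is a hypothetical name.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Kafka's default partitioner hashes the serialized key (murmur2,
    masked to non-negative) modulo the topic's partition count; md5
    is used here as a deterministic stand-in for illustration.
    """
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return h % num_partitions

# Identical keys always map to the same partition of a given topic.
assert partition_for(b"user-42", 12) == partition_for(b"user-42", 12)

# Changing the partition count generally changes the mapping, which is
# why growing a topic's partition count breaks existing key locality.
```

This is also why "ordering within a partition only" is useful in practice: all records for one key share a partition, so they are totally ordered relative to each other.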
Cross-topic keying (the Expedia sub-topology lesson)¶
Two topics with identical partition counts and similarly-keyed records do not by themselves guarantee that the same key lands on the same consumer instance across the two topics — that is a property of the consuming framework, not of Kafka itself. Kafka only guarantees same-key-to-same-partition per topic. When systems/kafka-streams is the consumer, the extra colocation guarantee is sub-topology-scoped — see concepts/partition-colocation and sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation.
Batching semantics — what Kafka does and doesn't do¶
Kafka's producer-side batching is byte-count + message-count + time-window within a partition:
- batch.size — max bytes buffered per partition before a batch is dispatched.
- linger.ms — max time to wait for a batch to fill before sending anyway.
- max.in.flight.requests.per.connection — pipeline depth: unacknowledged requests allowed per broker connection.
These compose into a transport-economics batching primitive (saturate TCP, amortise broker bookkeeping), but they cannot express payload-attribute budgets — "Kafka batches by bytes/messages within a partition; token count varies with text and tokenizer, so there is no efficient way to batch requests by Σ token_count_i" (2025-12-18 Voyage AI). For application-specific batching, such as token-count batching for GPU inference, the pattern is to keep Kafka for durability / fan-out / delivery and insert a lightweight aggregator between Kafka and the workers that applies the application's batching logic.
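The aggregator's core loop can be sketched as a re-batching generator: Kafka delivers records by bytes/messages, and this layer regroups them under a token budget before dispatching to GPU workers. `batch_by_token_count`, the whitespace "tokenizer", and the budget value are all illustrative assumptions, not part of any Kafka API.

```python
from typing import Iterable, Iterator

def batch_by_token_count(records: Iterable[str],
                         max_tokens: int) -> Iterator[list[str]]:
    """Group texts so each batch's total token count stays within budget.

    Whitespace splitting stands in for a real tokenizer; in the
    aggregator pattern, `records` would be drained from a Kafka
    consumer and each yielded batch dispatched to a worker.
    """
    batch: list[str] = []
    used = 0
    for text in records:
        n = len(text.split())  # stand-in token count
        if batch and used + n > max_tokens:
            yield batch        # budget would be exceeded: flush first
            batch, used = [], 0
        batch.append(text)
        used += n
    if batch:
        yield batch            # flush the trailing partial batch

stream = ["a b c", "d e f g h", "i", "j k l m n o p"]
print(list(batch_by_token_count(stream, max_tokens=8)))
# → [['a b c', 'd e f g h'], ['i', 'j k l m n o p']]
```

Because the aggregator sits between Kafka and the workers, Kafka's delivery and durability guarantees are untouched; only the unit of dispatch changes.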
Seen in¶
- sources/2025-12-18-mongodb-token-count-based-batching-faster-cheaper-embedding-inference — named alongside RabbitMQ as a general-purpose broker whose native batching primitives don't fit token-count batching of GPU embedding-inference requests. Voyage AI chose a native store (Redis + Lua, see patterns/atomic-conditional-batch-claim) rather than putting an aggregator in front of Kafka — but either path works, and the aggregator variant keeps Kafka's durability.
- sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation — Expedia's production debugging case where "same partition count + similar keying across two topics" was assumed (incorrectly) to imply cross-topic colocation at the consumer-instance level; the missing ingredient turned out to be a Kafka-Streams-layer constraint (shared sub-topology), not a Kafka-broker one.
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform — Kafka as the durable transport middle of Datadog's managed multi-tenant CDC replication platform. Debezium source connectors on Kafka Connect publish Avro-serialised record streams to Kafka topics (validated against a Kafka Schema Registry in backward-compat mode); sink connectors drain topics into Elasticsearch / Postgres / Iceberg / Cassandra / cross-region Kafka. Canonical wiki instance of patterns/debezium-kafka-connect-cdc-pipeline. Cross-region Kafka replication also cited as an in-platform sink for Datadog On-Call data locality + resilience.
Related¶
- systems/kafka-streams — Kafka-native stream-processing framework
- systems/kafka-connect — Kafka-native connector framework (hosts Debezium + sink connectors)
- systems/debezium — Kafka Connect-based CDC source connector family
- systems/kafka-schema-registry — Avro-schema gating integrated with producers + consumers
- concepts/partition-colocation
- concepts/sub-topology
- concepts/change-data-capture
- systems/rabbitmq — sibling general-purpose broker with push-model prefetch batching
- patterns/lightweight-aggregator-in-front-of-broker — canonical shape for application batching (e.g. token-count batching) on top of Kafka
- patterns/debezium-kafka-connect-cdc-pipeline — canonical CDC pipeline shape built on Kafka
- concepts/token-count-based-batching