High Scalability — Kafka 101¶
Summary¶
Long-form explainer by Stanislav Kozlovski (Apache Kafka committer,
guest author for High Scalability, 2024-05-09) distilling Apache Kafka —
originally developed at LinkedIn in 2011 — into a single tour of the
system's architectural substrate. Covers the append-only log as the
fundamental data structure (O(1) writes/reads at head/tail, immutability,
HDD-sequential-IO alignment), the broker / partition / replica core
model, leader-based replication with in-sync-replica semantics and the
acks=0/1/all + min.insync.replicas producer-durability knobs,
consumer groups with partition-exclusivity for in-partition ordering,
the Controller and the move from ZooKeeper
to KRaft (Kafka Raft) for distributed consensus on
cluster metadata, Tiered Storage offloading historical segments to
object stores (e.g. S3) to break the
co-located-storage-with-broker scaling walls, Cruise Control
as the open-source bin-packing rebalancer, and the two framework
extensions Kafka Connect (source/sink
connector runtime) + Kafka Streams (JVM-library
stream processor with exactly-once semantics when input + output are Kafka
topics). Closes with the industry trajectory: "everybody standardizing on
the Kafka API and competing on the underlying implementation" —
Confluent's Kora, Redpanda
(C++ rewrite), WarpStream (S3-heavy, no replication
or broker statefulness). Tier-1 aggregator re-telling of canonical Kafka
material — the underlying claims are textbook Kafka and the post's value
is as a compact structural index into Kafka's design principles for
readers who already know the vocabulary or need a single-page tour.
Key takeaways¶
- Kafka's data model is the log. Topics are sharded into partitions, each partition is a log (immutable, append-only, ordered); O(1) writes/reads at the tail/head make throughput independent of log size, and immutability makes concurrent reads efficient. (Source: sources/2024-05-09-highscalability-kafka-101)
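The log structure this bullet describes can be sketched in a few lines (illustrative Python, not Kafka's implementation): appends go to the tail, reads index by offset, and records are never mutated in place.

```python
# Minimal partition-as-log sketch: an immutable, append-only,
# ordered sequence addressed by monotonic offsets.
class PartitionLog:
    def __init__(self):
        self._records = []                # append-only backing store

    def append(self, record: bytes) -> int:
        # O(1) amortized: write at the tail, hand back the offset.
        self._records.append(record)
        return len(self._records) - 1     # monotonic per-partition offset

    def read(self, offset: int) -> bytes:
        # O(1): the offset indexes directly into the log.
        return self._records[offset]

log = PartitionLog()
first = log.append(b"event-0")
second = log.append(b"event-1")
```

Because existing entries never change, concurrent readers need no coordination with the single writer at the tail.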
- The log was chosen to match HDD physics, not DRAM. Linear reads and writes on an HDD are fast (no seeks); the log is exactly linear. Cross-linked to the companion HDD-economics framing in sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale: HDDs are "6,000,000,000× cheaper per byte (inflation-adjusted) since their inception," while random-access IOPS has been flat at ~120/drive. Kafka's architecture is "optimized for a cost-efficient on-premise deployment of a system that stores a lot of data while also being very performant." See concepts/hdd-sequential-io-optimization.
- Performance is stacked OS optimizations, not clever DRAM tricks.
Kafka persists all records to disk; it does not explicitly keep
records in memory. Speed comes from:
(a) producer-side protocol batching — records grouped per partition
reduce network overhead and become a single linear HDD write
(patterns/batch-over-network-to-broker),
(b) OS read-ahead + write-behind that converts the sequential
access pattern into prefetch + async flush,
(c) OS pagecache for free caching of the tail of the log
(concepts/pagecache-for-messaging),
(d) zero-copy (`sendfile`) from pagecache directly to socket,
bypassing the JVM heap (patterns/zero-copy-sendfile-broker).
Kozlovski is explicit that (d) is less load-bearing in practice
than popularly assumed — encryption/TLS modifies data on the way
out and disables `sendfile`, and CPU is rarely the bottleneck in
well-tuned Kafka deployments (network is).
- Broker / partition / replica is the physical model. Nodes are brokers; each topic-partition has N replicas (replication factor), only one is leader at a time; the replica set members are either in-sync or out-of-sync (concepts/in-sync-replica-set). "Just how the basic storage unit in an operating system is a file, the basic storage unit in Kafka is a replica (of a partition)" — each replica is itself a sequence of log-segment files. Records are addressed by a monotonic per-partition offset. See concepts/kafka-partition.
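The producer-side protocol batching in (a) above can be sketched as grouping pending records per partition, so each group becomes one network request and, on the leader, one linear disk write. Illustrative Python: Kafka's default partitioner hashes the key with murmur2; a deterministic byte-sum stands in here.

```python
from collections import defaultdict

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in partitioner (Kafka uses murmur2 on the key); a plain
    # byte-sum keeps this sketch deterministic and reproducible.
    return sum(key) % num_partitions

def batch_by_partition(records, num_partitions):
    # Group pending (key, value) records per partition so each batch
    # is sent as one request and appended as one sequential write.
    batches = defaultdict(list)
    for key, value in records:
        batches[partition_for(key, num_partitions)].append((key, value))
    return dict(batches)

batches = batch_by_partition(
    [(b"user-a", b"v1"), (b"user-b", b"v2"), (b"user-a", b"v3")],
    num_partitions=4,
)
```

Same key ⇒ same partition ⇒ same batch, which is also what preserves per-key ordering through the producer.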
- Writes go to the leader only; durability is a per-message dial.
Producers configure `acks`: `acks=0` → fire-and-forget; `acks=1` →
ack on leader disk-persist; `acks=all` (default) → ack when all
in-sync replicas persist. Composable with `min.insync.replicas` to
prevent silent degradation to `acks=1` when the ISR shrinks to one.
Canonical concepts/acks-producer-durability statement; underpinning
patterns/leader-based-partition-replication.
- Reads go to any replica; ordering is per-partition. Consumers can
fetch from the closest replica in the network topology. Records
within a partition are ordered; consumer groups enforce
partition-exclusivity — no two consumers in the same group read the
same partition — which is what makes the ordering guarantee end-to-end
(concepts/consumer-group, patterns/consumer-group-partition-exclusivity).
Offsets are persisted in a special internal topic
`__consumer_offsets`, and the partition-leader of that topic is the Group Coordinator for that consumer group. Multiple consumer groups can read the same topic independently — producer/consumer decoupling is the feature that beat classic message-bus designs where reads delete data.
- The Controller is the single source of truth for cluster metadata,
and Kafka is mid-migration off ZooKeeper. One broker at a time is
the active Controller; it handles topic create/delete, partition
reassignment, and — most importantly — leader election on broker
failure. Historically the Controller election + cluster metadata
(alive brokers, topic names, partition assignments) were stored in
ZooKeeper under the
`/controller` zNode with watch-based change notification. Kafka has replaced this with KRaft ("Kafka Raft") — a Raft dialect influenced by Kafka's existing replication protocol — storing cluster metadata as an ordinary Kafka log in a special `__cluster_metadata` topic whose leader is the active Controller; other KRaft quorum members are hot standbys with the metadata log in memory, and all regular brokers replicate the metadata topic asynchronously instead of subscribing to ZooKeeper watches (concepts/kraft-metadata-log). First production-ready KRaft shipped in Kafka 3.3 (October 2022); Kafka 4.0 (expected Q3 2024) will remove ZooKeeper entirely.
- KRaft has two deployment modes. Combined (one broker serves both data and controller roles — ZooKeeper-era shape); isolated (dedicated controller nodes, typically a 3-node quorum). Brokers reach eventual metadata consistency by tailing the log, not by polling a coordinator.
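The metadata-as-a-log idea can be shown with a toy replay loop (record shapes below are hypothetical, not KRaft's actual schema): every broker that tails the same log prefix, in order, derives the same metadata state.

```python
def apply_record(state: dict, record: dict) -> dict:
    # Toy metadata state machine: each log record deterministically
    # transforms a broker's in-memory view of the cluster metadata.
    if record["type"] == "create_topic":
        return {**state, record["name"]: record["partitions"]}
    if record["type"] == "delete_topic":
        return {k: v for k, v in state.items() if k != record["name"]}
    return state

def replay(metadata_log):
    # Brokers reach eventual consistency by tailing this ordered log,
    # not by polling a coordinator or subscribing to watches.
    state = {}
    for record in metadata_log:
        state = apply_record(state, record)
    return state

meta_log = [
    {"type": "create_topic", "name": "orders", "partitions": 3},
    {"type": "create_topic", "name": "clicks", "partitions": 6},
    {"type": "delete_topic", "name": "orders"},
]
```

Determinism of `apply_record` is the whole trick: agreement on the log (via Raft) implies agreement on the state derived from it.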
- Co-locating storage with brokers hits four structural walls at scale, motivating Tiered Storage (patterns/tiered-storage-to-object-store). For a typical 3TB/broker × 3× replication = 9TB-per-partition-set deployment: (i) log recovery after ungraceful shutdown rebuilds local index files — hours to days on a 10TB disk (concepts/log-recovery-time); (ii) historical reads exhaust HDD IOPS (120/drive ceiling from concepts/hdd-sequential-io-optimization); consumers reading the tail hit pagecache, consumers reading the tail-minus-N hit HDD and compete with producers for IOPS; (iii) hard disk failures trigger full 10TB re-replication from leaders, amplifying (ii) during the entire recovery window; an AZ-wide hard failure amplifies further; (iv) partition reassignment (add a broker, remove a broker, or rebalance) copies whole replicas byte-for-byte. Tiered Storage offloads cold/historical segments to an object store (e.g. S3); leader brokers tier on write; both leader and follower brokers can serve historical reads directly from the object store. Kozlovski cites 43% producer-performance improvement when historical consumers are present, in development benchmarks. In Early Access at time of writing.
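The tiered read path reduces to routing by offset (illustrative sketch; the function and parameter names are hypothetical, not Kafka's tiered-storage API):

```python
def serve_fetch(offset, local_start_offset, read_local, read_object_store):
    # Hot tail lives on the broker's local disk (usually pagecache-
    # resident); offsets older than the local log start are served
    # from the object store, sparing local HDD IOPS for producers.
    if offset >= local_start_offset:
        return read_local(offset)
    return read_object_store(offset)

local = {100: b"hot"}     # offsets still on broker disk
tiered = {5: b"cold"}     # offsets offloaded to S3-like storage

hot = serve_fetch(100, 100, local.get, tiered.get)
cold = serve_fetch(5, 100, local.get, tiered.get)
```

This split is what addresses wall (ii): historical consumers stop competing with producers for the ~120 IOPS a hard drive offers.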
- Partition rebalancing is NP-hard; Cruise Control is the canonical open-source answer. "Essentially the NP-hard Bin Packing problem at heart" (concepts/bin-packing). Cruise Control (LinkedIn-originated) reads per-broker metrics from a Kafka topic, builds an in-memory cluster model, runs a greedy heuristic bin-packer over a prioritised set of Goals (replica count, disk usage, network in/out, CPU, etc.), and applies the resulting reassignment incrementally via Kafka's low-level reassignment API. Continuously monitors cluster metrics and auto-triggers rebalance when thresholds are exceeded. Also exposes "add broker / remove broker" as first-class ops because Kafka brokers are stateful even with Tiered Storage.
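A toy version of such a greedy heuristic (largest replica first onto the least-loaded broker — a classic bin-packing approximation; Cruise Control's real optimizer is goal-driven and multi-dimensional):

```python
import heapq

def greedy_assign(replica_sizes: dict, num_brokers: int) -> dict:
    # Sort replicas largest-first, always placing the next one on the
    # currently least-loaded broker (min-heap keyed by load).
    heap = [(0, b) for b in range(num_brokers)]   # (load, broker_id)
    heapq.heapify(heap)
    assignment = {}
    for replica, size in sorted(replica_sizes.items(),
                                key=lambda kv: -kv[1]):
        load, broker = heapq.heappop(heap)
        assignment[replica] = broker
        heapq.heappush(heap, (load + size, broker))
    return assignment

sizes = {"p0": 5, "p1": 4, "p2": 3, "p3": 2, "p4": 1}
assignment = greedy_assign(sizes, num_brokers=2)
loads = {}
for replica, broker in assignment.items():
    loads[broker] = loads.get(broker, 0) + sizes[replica]
```

Exact balance is NP-hard; the heuristic settles for near-balance, and Cruise Control additionally applies moves incrementally because each move is a byte-for-byte replica copy.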
- Kafka Connect is the source/sink framework (systems/kafka-connect). Runs a cluster of workers (standalone or distributed mode); each worker hosts connector plugins authored once, reused forever (ElasticSearch, Snowflake, Postgres, MySQL, BigQuery, …). Source connectors pull data into Kafka; sink connectors push data out of Kafka. Workers use internal Kafka topics for configuration, status, and offset checkpoints; task assignment across workers reuses the Consumer Group protocol. Users install plugins and drive a REST API — integration becomes configuration, not custom code. Community-maintained connectors ship exactly-once, ordering, and fault-tolerance guarantees by default.
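What "integration becomes configuration" looks like in practice: a connector is created by POSTing JSON like the following to a Connect worker's REST API. The file sink class below ships with Apache Kafka; the connector name, topic, and file path are illustrative.

```json
{
  "name": "orders-file-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "file": "/tmp/orders.sink.txt"
  }
}
```

`tasks.max` is the parallelism knob; the worker cluster splits the connector into that many tasks and spreads them via the Consumer Group protocol.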
- Kafka Streams is a client library, not a cluster
(systems/kafka-streams). Embeds in an ordinary JAR/service; the
application is the stream processor. Exposes a high-level
`map`/`filter`/`aggregate`/`join` API plus state stores; scales horizontally by running more application instances using the Consumer Group protocol. Supports exactly-once processing semantics when both input and output are Kafka topics. Deployment is ordinary-application deployment — no additional stream-processing cluster.
- The industry is converging on the Kafka API over Kafka the
binary. Kozlovski's forecast (2024): "evidence points that the
future will be everybody standardizing on the Kafka API and
competing on the underlying implementation". Three named
implementations:
- Kora — Confluent's cloud-native Kafka engine (founded by the original creators);
- Redpanda — C++ rewrite;
- WarpStream — S3-heavy architecture that "leverages S3 heavily, completely avoiding replication and broker statefulness" — the extreme end of the Tiered Storage trajectory.
SaaS/serverless offerings span the spectrum from thin-managed (users still operate the cluster) to fully-abstracted.
Architectural shape¶
Producers
│
▼
┌─────────────────────────────────────────────┐
│ Kafka broker fleet │
│ │
│ topic = partition_0 partition_1 … _N-1 │
│ │ │ │
│ ▼ ▼ │
│ leader leader │
│ + N-1 + N-1 │
│ followers followers │
│ (ISR) (ISR) │
│ │
│ active Controller (one broker) │
│ └─ partition leader elections │
│ └─ reassignment ack │
│ │
│ metadata │
│ ├─ pre-3.3: ZooKeeper /controller zNode │
│ └─ 3.3+ : KRaft quorum (3 nodes) → │
│ __cluster_metadata topic │
│ │
│ storage │
│ ├─ local disk (hot tail) │
│ └─ S3 / object store (cold, tiered) │
└─────────────────────────────────────────────┘
│
▼
Consumer groups
├─ per-group Group Coordinator = leader of
│ corresponding __consumer_offsets partition
└─ partition-exclusivity within a group
Numbers disclosed¶
- 24 notable Kafka releases since 2011 (post date 2024-05-09).
- Codebase growth rate: 24% per release on average.
- Kafka target scale: "millions of messages a second", "terabytes" of storage.
- Typical brokers store 3TB × replication factor 3 = 9TB of historical data per partition set.
- ~10TB local disk is the rough operational ceiling at which the four structural walls (log recovery, historical reads, re-replication, rebalancing) start hurting.
- Log recovery after ungraceful shutdown: "hours if not days" on a ~10TB disk.
- HDD random-access IOPS: ~120/drive (cross-linked to the S3 companion post).
- Tiered Storage improvement: "43% producer performance improvement when historical consumers were present" in dev tests.
- First production-ready KRaft: Kafka 3.3 (Oct 2022).
- ZooKeeper removal: Kafka 4.0 (expected Q3 2024).
- KRaft quorum size: "usually 3" controllers.
- Kafka longevity: "13 years of development" as of 2024.
Numbers not disclosed¶
- No producer-throughput, consumer-throughput, or end-to-end latency numbers on any specific deployment.
- No KRaft metadata-log throughput / convergence-time numbers.
- No Cruise Control rebalance-time-to-converge numbers.
- No Tiered-Storage object-store-read tail-latency numbers (only the single 43%-producer-improvement datapoint).
- No ISR-shrink / leader-election times.
- No Kafka Connect throughput per worker.
- No Kafka Streams application-latency or state-store-recovery times.
- No Confluent/Redpanda/WarpStream head-to-head benchmarks.
Caveats¶
- Tier-1 aggregator explainer voice, not a production retrospective or incident post-mortem. Canonical / textbook Kafka material; the value is as a structural index rather than as a new datapoint.
- Kozlovski is a Kafka committer and author of the 2-Minute Streaming newsletter; the post-level framing reflects the community's consensus but Tiered Storage was in Early Access at publication and production-ready claims should be checked against current upstream status.
- The Kafka-4.0 / ZooKeeper-removal timeline is a 2024 forecast; real-world cutover timing should be verified against current releases before operational decisions.
- Zero-copy (`sendfile`) is famous-but-overstated: Kozlovski is explicit that encryption/TLS (required on any production cluster) modifies bytes on the way out and disables `sendfile`; CPU is not usually the Kafka bottleneck. The win is smaller than 2000s-era Kafka performance posts suggest.
- The survey of Kora / Redpanda / WarpStream is paragraph-long positioning only — no architectural depth on any of them.
- No security content (auth, ACLs, encryption in transit / at rest, audit).
- No schema content (Confluent Schema Registry, Avro, Protobuf compatibility modes) — those live in companion posts.
Relationship to existing wiki¶
- Canonical expanding source for systems/kafka — prior wiki coverage of Kafka was single-use-case framings (Expedia sub-topology partition colocation; Voyage AI token-count batching; Datadog CDC replication). This post gives the ground-up architectural tour the Kafka system page needed a source for.
- Canonical primary source for systems/kafka-connect + systems/kafka-streams — both already had pages built from single-customer framings (Datadog CDC; Expedia sub-topology); this post adds the Apache-upstream structural-vocabulary disclosure.
- New system pages: systems/apache-zookeeper (only referenced obliquely until now — "Kafka used to persist all sorts of metadata in ZooKeeper"), systems/kraft (the Raft-dialect successor to ZooKeeper inside the Kafka deployment), systems/cruise-control (LinkedIn-originated NP-hard rebalancer), systems/confluent-kora, systems/redpanda, systems/warpstream.
- New concepts: concepts/distributed-log (Kafka's core abstraction); concepts/kafka-partition; concepts/in-sync-replica-set; concepts/consumer-group; concepts/leader-follower-replication; concepts/acks-producer-durability; concepts/kraft-metadata-log; concepts/pagecache-for-messaging; concepts/log-recovery-time; concepts/hdd-sequential-io-optimization (cross-linked to the companion S3 post's HDD economics framing).
- New patterns: patterns/append-only-log-as-substrate; patterns/leader-based-partition-replication; patterns/tiered-storage-to-object-store; patterns/consumer-group-partition-exclusivity; patterns/batch-over-network-to-broker; patterns/zero-copy-sendfile-broker.
- Cross-source continuity: companion to sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale — same author (Kozlovski), same aggregator venue, published two months apart; the Kafka-101 post explicitly cites the S3 post for HDD economics. The two posts together form a Tier-1 structural primer pair on storage-at-scale economics (S3 side) and streaming-at-scale on top of that storage (Kafka side). Extends systems/kafka beyond its prior single-use-case framings with the full architectural tour. Complements sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform (Kafka-as-CDC-transport at a platform-integrator, not Kafka-as-substrate). Complements sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation (Kafka Streams sub-topology debugging; this post gives the substrate-level vocabulary that one builds on).
Source¶
- Original: https://highscalability.com/untitled-2/
- Raw markdown:
raw/highscalability/2024-05-09-kafka-101-94bf812e.md
Related¶
- companies/highscalability
- systems/kafka
- systems/kafka-connect
- systems/kafka-streams
- systems/apache-zookeeper
- systems/kraft
- systems/cruise-control
- systems/confluent-kora
- systems/redpanda
- systems/warpstream
- concepts/distributed-log
- concepts/kafka-partition
- concepts/in-sync-replica-set
- concepts/consumer-group
- concepts/leader-follower-replication
- concepts/acks-producer-durability
- concepts/kraft-metadata-log
- concepts/pagecache-for-messaging
- concepts/log-recovery-time
- concepts/hdd-sequential-io-optimization
- patterns/append-only-log-as-substrate
- patterns/leader-based-partition-replication
- patterns/tiered-storage-to-object-store
- patterns/consumer-group-partition-exclusivity
- patterns/batch-over-network-to-broker
- patterns/zero-copy-sendfile-broker
- sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale
- sources/2025-11-04-datadog-replication-redefined-multi-tenant-cdc-platform
- sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation