
CONCEPT Cited by 3 sources

HDD sequential I/O optimization

Definition

HDD sequential I/O optimization is the design stance of laying data out on disk so that the access pattern is linear — contiguous byte runs — and therefore never pays the seek cost that dominates random-access workloads on a spinning drive. HDDs are "slow" because of head-seek and rotational latency (several milliseconds per operation, fundamentally bounded by the physical movement of the read-write head over the platter); they are fast when the head sweeps contiguous sectors.
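A back-of-envelope sketch makes the gap concrete. The figures below are assumed typical values, not from the sources: ~8 ms combined seek + rotational latency, ~200 MB/s sustained sequential transfer.

```python
# Back-of-envelope: why seeks dominate random I/O on an HDD.
# Assumed figures (typical, not from the cited sources):
SEEK_MS = 8.0        # average seek + rotational latency, ms
SEQ_MBPS = 200.0     # sustained sequential transfer rate, MB/s
BLOCK_KB = 4.0       # size of each random read, KiB

def random_read_mbps(seek_ms=SEEK_MS, block_kb=BLOCK_KB):
    """Effective bandwidth when every block costs a full seek."""
    secs_per_block = seek_ms / 1000.0 + (block_kb / 1024.0) / SEQ_MBPS
    return (block_kb / 1024.0) / secs_per_block

rand = random_read_mbps()
print(f"random 4K reads : {rand:6.2f} MB/s")
print(f"sequential scan : {SEQ_MBPS:6.2f} MB/s")
print(f"gap             : {SEQ_MBPS / rand:6.0f}x")
```

With these assumptions, random 4K reads deliver well under 1 MB/s — a gap of hundreds of times versus streaming the same platter sequentially.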

This is the design lever under:

  • Log-structured storage — write sequentially at the tail of a log; never edit in place.
  • Columnar formats on HDD — large contiguous column runs.
  • Kafka's partition-as-log — the canonical production example of an end-to-end system designed around this property.
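A minimal sketch of the shared primitive under all three — a toy append-only log, not Kafka's actual implementation — shows why every operation is sequential by construction: writes only ever land at the tail, and reads only ever walk forward from an offset.

```python
import os
import tempfile

class AppendOnlyLog:
    """Toy commit log: records are only appended at the tail, and
    reads walk forward from a byte offset — both sequential."""

    def __init__(self, path):
        self.path = path
        # 'ab' guarantees every write lands at the current end of file
        self.f = open(path, "ab")

    def append(self, record: bytes) -> int:
        """Append one length-prefixed record; return its byte offset."""
        offset = self.f.tell()
        self.f.write(len(record).to_bytes(4, "big") + record)
        return offset

    def read_from(self, offset: int):
        """Sequentially yield records starting at a byte offset."""
        with open(self.path, "rb") as r:
            r.seek(offset)
            while header := r.read(4):
                yield r.read(int.from_bytes(header, "big"))

path = os.path.join(tempfile.mkdtemp(), "demo.log")
log = AppendOnlyLog(path)
first = log.append(b"event-1")
log.append(b"event-2")
log.f.flush()
print(list(log.read_from(first)))  # records replay in append order
```

Note what is absent: there is no update or delete path, so the disk head never has to move back to edit anything in place.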

Why the log structure wins on HDD

Kozlovski's Kafka-101 framing:

"It's immutable and has O(1) writes and reads (as long as they're from the tail or head). […] the key benefit of the log and perhaps the chief reason it was chosen for Kafka is because it is optimized for HDDs. HDDs are very efficient with relation to linear reads and writes, and due to the log's structure — linear reads/writes are the main thing you perform on it!" (Source: sources/2024-05-09-highscalability-kafka-101)

Kafka composes this with two OS-level accelerators:

  • Read-ahead — the kernel prefetches large block multiples ahead of the current read position, so the next sequential read hits RAM, not disk.
  • Write-behind — the kernel coalesces small logical writes into bigger physical writes at disk-flush time.
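The kernel applies read-ahead automatically once it detects a sequential pattern, and an application can strengthen the hint explicitly. The Linux-only sketch below uses `posix_fadvise(POSIX_FADV_SEQUENTIAL)` to illustrate the mechanism at the syscall level — Kafka itself (a JVM process) leans on default pagecache behaviour rather than explicit fadvise calls.

```python
import os
import tempfile

# Create a small demo "segment" file to stream.
path = os.path.join(tempfile.mkdtemp(), "segment.log")
with open(path, "wb") as f:
    f.write(os.urandom(4 << 20))  # 4 MiB of demo data

fd = os.open(path, os.O_RDONLY)
# Hint that access will be sequential; on Linux this roughly doubles
# the kernel's read-ahead window for this file descriptor.
if hasattr(os, "posix_fadvise"):  # Linux; absent on macOS/Windows
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

# Stream in large contiguous chunks: each read after the first is
# likely served from pagecache thanks to read-ahead.
CHUNK = 1 << 20  # 1 MiB
total = 0
while chunk := os.read(fd, CHUNK):
    total += len(chunk)
os.close(fd)
print(f"streamed {total} bytes sequentially")
```

Write-behind is the mirror image and needs no application code at all: a plain `write()` returns once the data is in the pagecache, and the kernel batches dirty pages into larger physical writes at flush time.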

Kozlovski: "Kafka does not use fsync, its writes get written to disk asynchronously."
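That claim is visible at the syscall level: `write()` returns once data reaches the pagecache, while `fsync()` blocks until it reaches the platter. A hedged sketch of the difference (the payload size and iteration count are arbitrary; Kafka relies on replication, not per-write fsync, for durability):

```python
import os
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "log.bin")
payload = os.urandom(64 * 1024)  # 64 KiB per record

def write_records(n, do_fsync):
    """Append n records, optionally forcing each to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    t0 = time.perf_counter()
    for _ in range(n):
        os.write(fd, payload)   # returns once data is in pagecache
        if do_fsync:
            os.fsync(fd)        # block until it reaches the disk
    os.close(fd)
    return time.perf_counter() - t0

async_s = write_records(50, do_fsync=False)  # Kafka-style: no fsync
sync_s = write_records(50, do_fsync=True)    # durable per write
print(f"no fsync : {async_s * 1e3:7.1f} ms")
print(f"fsync    : {sync_s * 1e3:7.1f} ms")
```

On a spinning drive the fsync-per-write path is dramatically slower, because each fsync forces a physical write (and often a seek) instead of letting write-behind coalesce the batch.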

HDD economics — why the optimization is still worth making

From the companion 2024-03-06 Kozlovski S3 post (cross-linked by Kafka-101):

"As we covered in our S3 article, HDDs have become 6,000,000,000 times cheaper (inflation-adjusted) per byte since their inception." (Source: sources/2024-05-09-highscalability-kafka-101, citing sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale)

The full S3-side numbers (concepts/hard-drive-physics):

  • 1956 RAMAC: $9k, 3.75MB
  • 2024: 26TB at $15/TB
  • 6 billion× cheaper per byte (inflation-adjusted)
  • 7.2M× more capacity, 5000× smaller, 1235× lighter
  • but still ~120 IOPS/drive, flat for decades
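The consequence of that flat IOPS line can be worked through directly. Taking the 26 TB / ~120 IOPS figures above and an assumed ~250 MB/s sustained sequential rate (a typical modern figure, not from the source), compare how long a full-drive read takes each way:

```python
# How long to read one full drive, random vs sequential?
TB = 10**12
capacity_b = 26 * TB      # 2024 drive capacity, from the S3 post
iops = 120                # random IOPS/drive, flat for decades
random_block = 4096       # 4 KiB per random read
seq_bps = 250 * 10**6     # assumed sequential throughput, B/s

random_secs = capacity_b / (iops * random_block)  # seek-bound
seq_secs = capacity_b / seq_bps                   # bandwidth-bound

print(f"random 4K reads : {random_secs / 86400:6.0f} days")
print(f"sequential scan : {seq_secs / 3600:6.1f} hours")
```

Under these assumptions a random-read drain takes on the order of years, a sequential scan a day or so — which is why the per-byte cost win only exists for workloads that never ask the drive for random IOPS.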

The IOPS ceiling is flat because head-seek physics is flat. The capacity/$ win is only realisable by workloads that don't need random IOPS — i.e., workloads that are sequential by construction. Kafka's architecture is one of the cleanest instances of this economic alignment.

Kozlovski's doctrine statement

"Kafka's architecture is optimized for a cost-efficient on-premise deployment of a system that stores a lot of data while also being very performant!" (Source: sources/2024-05-09-highscalability-kafka-101)

"A lot of data" = capacity/$ win (HDDs are cheap by the byte). "Very performant" = sequential access (HDDs are fast for sequential). The design constraint that unlocks both is never seek.

Trade-offs / when it breaks

  • Historical reads break the invariant. Consumers reading far behind the tail miss pagecache and force the broker to seek back to older segment files on HDD — exhausting IOPS and competing with producers. This is the IOPS wall in concepts/log-recovery-time's four structural walls and the motivation for Tiered Storage: historical reads go to an object store, not to the broker's HDD.
  • SSDs change the equation. The sequential/random gap on SSDs is much smaller than on HDDs; log-structured designs still pay off (write amplification, wear levelling), but the urgency is lower.

Seen in

  • sources/2024-05-09-highscalability-kafka-101 — canonical statement of the log-on-HDD alignment.
  • sources/2024-03-06-highscalability-behind-aws-s3s-massive-scale — the companion economic framing (capacity/$ vs IOPS/drive) the Kafka-101 post cites.
  • sources/2025-03-04-meta-a-case-for-qlc-ssds-in-the-data-center — Meta's 2025 framing is the cousin case to the Kafka-on-HDD story: even the workloads that are sequential by construction (batch IO, read-BW-intensive) are falling off the bottom of the HDD BW/TB band as drive densities scale capacity but not throughput. Kafka's tiered storage pushes historical (random-read) traffic to object storage; Meta's QLC tier catches the sequential-BW-bounded workloads one layer down. The HDD sequential-I/O optimisation continues to work for what remains on HDD, but the set of workloads that fit that tier keeps shrinking.