
Disk block size alignment

On-disk data structures achieve their best performance when the unit of the data structure (node, page, block) matches the I/O unit of the storage device. HDDs and SSDs "read and write data in units called blocks", typically sized at 4 KB, 8 KB, or 16 KB, and "a single disk will have a capacity of many millions or billions of blocks." (Source: sources/2024-09-09-planetscale-b-trees-and-database-indexes.)

A single I/O operation reads or writes one block. Reading half a block costs the same as reading a whole block; reading a block and a half costs two blocks. Consequently, any structure sized smaller than a block wastes bandwidth, and any structure whose size is not an integer multiple of a block requires extra I/Os at the boundaries.
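The boundary cost above can be made concrete with a little block arithmetic (a sketch; the 4-KB block size is an illustrative assumption, not from the source):

```python
BLOCK = 4096  # assumed 4-KB device block

def blocks_touched(offset: int, length: int, block: int = BLOCK) -> int:
    """Number of device blocks a read of `length` bytes at `offset` touches."""
    first = offset // block
    last = (offset + length - 1) // block
    return last - first + 1

print(blocks_touched(0, 2048))     # 1 -> half a block still costs one full block
print(blocks_touched(0, 6144))     # 2 -> a block and a half costs two blocks
print(blocks_touched(1000, 4096))  # 2 -> one block's worth, unaligned, touches two
```

The last case is why alignment matters as much as size: an unaligned access pays the boundary penalty even when its length is exactly one block.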

Why B-trees align with blocks

B-tree nodes are typically sized to exactly one block (or a small integer multiple). The result:

  • One disk seek + one block read → one node. Tree depth becomes the dominant cost, not internal node cost.
  • High fan-out per node. A 16-KB node with 8-byte keys/values/pointers fits 682 entries. "A three level tree could store over 300 million key/value pairs (682 × 682 × 682 = 317,214,568)."
  • Shallow trees. Fewer levels to descend → fewer disk reads per lookup.
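The fan-out arithmetic in the list above, as a quick check (the key/value/pointer entry layout is the source's example):

```python
PAGE = 16 * 1024      # 16-KB node
ENTRY = 8 + 8 + 8     # 8-byte key + 8-byte value + 8-byte pointer per entry

fanout = PAGE // ENTRY
print(fanout)         # 682 entries per node
print(fanout ** 3)    # 317214568 -> a three-level tree indexes >300M pairs
```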

The disk-block-alignment argument is the root reason B-trees dominate on-disk index structures over RAM-optimal structures like red-black trees (one key per node, high depth — fine for memory, pathological for disk).

InnoDB default

MySQL's InnoDB defaults to a 16-KB page size, chosen to match common HDD / SSD block multiples. Alternatives of 4 KB, 8 KB, 32 KB, and 64 KB can be set via innodb_page_size, but only when the MySQL instance is initialized, not per table. The InnoDB page is simultaneously:

  • The B+tree node size in the clustered index + all secondary indexes.
  • The I/O unit from disk.
  • The cache unit in the buffer pool.

These three things being identical is no accident: the entire engine is built around the page as the atomic unit of I/O, caching, and structure.

RAM vs disk addressability asymmetry

RAM on the other hand is typically addressable on a per-byte level.

sources/2024-09-09-planetscale-b-trees-and-database-indexes

RAM reads can be as small as one byte with ~100 ns latency. Disk reads have ~0.1 ms (NVMe) to ~10 ms (HDD) latency per I/O regardless of size — so the amortised cost per byte drops sharply as read size increases, up to some limit. Larger block sizes amortise seek cost better but increase read-amplification for tiny accesses. 16 KB is a compromise point that works for mixed OLTP workloads.
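A toy amortisation model makes the asymmetry concrete (fixed per-I/O latency only, ignoring transfer bandwidth; the latency figures are the rough round numbers from the paragraph above):

```python
# Assumed round-number latencies per single I/O, in nanoseconds.
LATENCY_NS = {"RAM": 100, "NVMe": 100_000, "HDD": 10_000_000}

def ns_per_byte(device: str, read_bytes: int) -> float:
    """Amortised cost per byte when each I/O has a fixed latency."""
    return LATENCY_NS[device] / read_bytes

for size in (512, 4 * 1024, 16 * 1024):
    print(f"{size:6d} B  NVMe {ns_per_byte('NVMe', size):8.1f} ns/B  "
          f"HDD {ns_per_byte('HDD', size):10.1f} ns/B")
```

Under this model, quadrupling the read size quarters the per-byte cost, which is the pressure toward larger blocks; read-amplification on tiny accesses is the opposing pressure.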

Disk-block-sized pages extend beyond B-trees

  • Object storage (S3, GCS) operates on multi-MB objects; range-get semantics let clients align their reads to page boundaries when layering a database on top. See patterns/vfs-range-get-from-object-store.
  • Log-structured merge trees (LSM) also use page-sized blocks inside SSTables for efficient sequential I/O.
  • Postgres uses 8-KB pages (default); its heap and B-tree indexes are both page-aligned.
  • SQLite uses configurable page sizes, default 4 KB.

Modern NVMe nuance

Modern NVMe SSDs can do random 4-KB reads at hundreds of thousands of IOPS — the "sequential I/O is faster than random I/O" argument that drove HDD-era design is quantitatively much weaker. But:

  • Smaller random accesses still pay the full per-I/O command-queue cost.
  • Page-aligned reads play better with OS page cache + file system block size (typically 4 KB).
  • Write amplification on SSD flash-translation layers still rewards aligned, page-sized writes.

So the alignment argument survives the transition from HDD to NVMe, just with weaker constants.
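The aligned-access discipline can be sketched as a minimal page reader using `os.pread` (the file name and 4-KB page size are illustrative; this is not any engine's actual I/O path):

```python
import os
import tempfile

BLOCK = 4096  # assumed 4-KB page, matching common file-system block size

def read_page(fd: int, page_no: int, page_size: int = BLOCK) -> bytes:
    # One aligned I/O: both the offset and the length are
    # multiples of the page size, so no block boundary is straddled.
    return os.pread(fd, page_size, page_no * page_size)

# Build a toy four-page file, each page filled with its page number.
path = os.path.join(tempfile.mkdtemp(), "toy.db")
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * BLOCK)

fd = os.open(path, os.O_RDONLY)
page = read_page(fd, 3)
os.close(fd)
print(len(page), page[0])  # 4096 3
```

Addressing pages by number rather than by byte offset is what keeps every read aligned by construction.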
