Disk block size alignment¶
On-disk data structures achieve their best performance when the unit of the data structure (node, page, block) matches the unit of the storage device's I/O. HDDs and SSDs "read and write data in units called blocks" typically sized at 4 KB, 8 KB, or 16 KB, and "a single disk will have a capacity of many millions or billions of blocks." (Source: sources/2024-09-09-planetscale-b-trees-and-database-indexes.)
A single I/O operation reads or writes one block. Reading half a block costs the same as reading a whole block; reading a block and a half costs two block reads. Consequently, any structure smaller than a block wastes bandwidth, and any structure sized to a non-integer multiple of a block pays extra I/Os at its boundaries.
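The block-granularity cost model above can be sketched in a few lines; the 4-KB `BLOCK_SIZE` is an assumption for illustration, since actual device block sizes vary:

```python
BLOCK_SIZE = 4096  # assumed device block size (4 KB)

def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks a read of `length` bytes at `offset` touches.

    The device transfers whole blocks, so this is the real I/O cost of the read.
    """
    if length == 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1

print(blocks_touched(0, 2048))     # half a block -> 1
print(blocks_touched(0, 4096))     # exactly one block -> 1
print(blocks_touched(0, 6144))     # a block and a half -> 2
print(blocks_touched(2048, 4096))  # one block's worth of data, misaligned -> 2
```

The last call shows why alignment matters as much as size: a read of exactly one block's worth of bytes still costs two I/Os if it straddles a boundary.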
Why B-trees align with blocks¶
B-tree nodes are typically sized to exactly one block (or a small integer multiple). The result:
- One disk seek + one block read → one node. Tree depth becomes the dominant cost, not internal node cost.
- High fan-out per node. A 16-KB node with 8-byte keys/values/pointers fits 682 entries. "A three level tree could store over 300 million key/value pairs (682 × 682 × 682 = 317,214,568)."
- Shallow trees. Fewer levels to descend → fewer disk reads per lookup.
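The fan-out and capacity figures quoted above are plain arithmetic; a quick check, assuming each entry is an 8-byte key plus 8-byte value plus 8-byte child pointer as in the source's worked example:

```python
PAGE_SIZE = 16 * 1024    # 16-KB node, matching the disk block multiple
ENTRY_SIZE = 8 + 8 + 8   # key + value + child pointer, 8 bytes each

fanout = PAGE_SIZE // ENTRY_SIZE  # entries that fit in one node
three_level_capacity = fanout ** 3

print(fanout)                # 682
print(three_level_capacity)  # 317214568 -- over 300 million key/value pairs
```

Real pages carry headers and are rarely packed full, so production fan-out is lower, but the order of magnitude holds: three disk reads reach hundreds of millions of keys.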
The disk-block-alignment argument is the root reason B-trees dominate on-disk index structures over RAM-optimal structures like red-black trees (one key per node, high depth — fine for memory, pathological for disk).
InnoDB default¶
MySQL's InnoDB defaults to a 16-KB page size — chosen to match the historical HDD / SSD block multiples. Alternatives of 4 KB, 8 KB, 32 KB, and 64 KB can be set via innodb_page_size, but only when the MySQL instance is initialized, not per table. The InnoDB page is simultaneously:
- The B+tree node size in the clustered index + all secondary indexes.
- The I/O unit from disk.
- The cache unit in the buffer pool.
These three things being identical is no accident — the entire engine is built around the page as the atomic unit of I/O + cache + structure.
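How the page-size choice feeds back into tree depth (and therefore disk reads per lookup) can be sketched with a simplified model — fully packed pages and a fixed 24-byte entry, both assumptions; real InnoDB pages carry headers and typical fill factors are lower:

```python
import math

def levels_needed(n_keys: int, page_size: int, entry_size: int = 24) -> int:
    """Minimum B+tree depth to index n_keys, assuming fully packed pages
    (a simplification; real pages have headers and partial fill)."""
    fanout = page_size // entry_size
    return max(1, math.ceil(math.log(n_keys, fanout)))

for page_size in (4096, 8192, 16384, 32768, 65536):
    print(page_size, levels_needed(10**9, page_size))
```

For a billion keys, a 4-KB page needs five levels while a 16-KB page needs four — one fewer disk read on every lookup, which is part of why 16 KB is the default.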
RAM vs disk addressability asymmetry¶
RAM on the other hand is typically addressable on a per-byte level.
— sources/2024-09-09-planetscale-b-trees-and-database-indexes
RAM reads can be as small as one byte with ~100 ns latency. Disk reads have ~0.1 ms (NVMe) to ~10 ms (HDD) latency per I/O regardless of size — so the amortised cost per byte drops sharply as read size increases, until transfer time rather than per-I/O latency dominates. Larger block sizes amortise seek cost better but increase read amplification for tiny accesses. 16 KB is a compromise point that works for mixed OLTP workloads.
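The amortisation effect is easy to see numerically. Using the order-of-magnitude latencies from the text (assumptions, not measurements):

```python
# Rough per-I/O latencies in seconds, order of magnitude only
LATENCY_S = {"RAM": 100e-9, "NVMe": 0.1e-3, "HDD": 10e-3}

def cost_per_byte(latency_s: float, read_bytes: int) -> float:
    """Amortised latency per byte when one I/O returns `read_bytes` bytes."""
    return latency_s / read_bytes

for size in (512, 4096, 16384, 65536):
    print(f"HDD read of {size:6d} B -> {cost_per_byte(LATENCY_S['HDD'], size):.2e} s/byte")
```

Quadrupling the read size quarters the per-byte cost — but only if the extra bytes are actually useful, which is the read-amplification caveat above.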
Disk-block-sized pages extend beyond B-trees¶
- Object storage (S3, GCS) operates on multi-MB objects; range-get semantics let clients align their reads to page boundaries when layering a database on top. See patterns/vfs-range-get-from-object-store.
- Log-structured merge trees (LSM) also use page-sized blocks inside SSTables for efficient sequential I/O.
- Postgres uses 8-KB pages (default); its heap and B-tree indexes are both page-aligned.
- SQLite uses configurable page sizes, default 4 KB.
Modern NVMe nuance¶
Modern NVMe SSDs can do random 4-KB reads at hundreds of thousands of IOPS — the "sequential I/O is faster than random I/O" argument that drove HDD-era design is quantitatively much weaker. But:
- Smaller random accesses still pay the full per-I/O command-queue cost.
- Page-aligned reads play better with OS page cache + file system block size (typically 4 KB).
- Write amplification on SSD flash-translation layers still rewards aligned, page-sized writes.
So the alignment argument survives the transition from HDD to NVMe, just with weaker constants.
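A page-aligned reader puts the points above into practice by expanding every byte range to block boundaries before issuing the I/O. A minimal sketch, assuming a 4-KB file system block:

```python
BLOCK = 4096  # assumed file system block size

def aligned_range(offset: int, length: int, block: int = BLOCK) -> tuple[int, int]:
    """Expand a byte range to block boundaries; returns (aligned_offset, aligned_length).

    This is the shape of request an aligned or O_DIRECT-style reader issues,
    so every I/O lands on whole blocks.
    """
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # round up to next boundary
    return start, end - start

print(aligned_range(5000, 100))  # -> (4096, 4096): one aligned block covers it
```

The same rounding logic applies whether the "block" is a 4-KB FS page, a 16-KB InnoDB page, or a multi-MB object-store range-get.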
Seen in¶
- sources/2024-09-09-planetscale-b-trees-and-database-indexes — canonical articulation of why B-tree node size is chosen to match disk block size, with the 682³ three-level worked example.