NAND flash page/block erasure semantics¶
Definition¶
NAND flash is organised in a three-level hierarchy with asymmetric unit sizes for read, write, and erase:
| Level | Typical size | Operations allowed |
|---|---|---|
| Page | 4 KB–16 KB | Read, write (once after erase) |
| Block | 128–512 pages | Erase (all pages in the block, atomically) |
| Target (die / plane) | 16k+ blocks | Services one I/O request at a time; parallelism comes from multiple targets |
The asymmetry is the load-bearing constraint behind every SSD design decision:
- Reads are at page granularity. The smallest chunk of data the drive will hand to the host is one page. Dicken: "SSDs read and write data at the page level, meaning they can only read or write full pages at a time. Even if you only need a subset of the data within, that is the unit that requests to the drive must be made in."
- Writes are at page granularity — but only to erased pages. "After a page is written to, it cannot be overwritten with new data until the old data has been explicitly erased."
- Erases are at block granularity. "The tricky part is, individual pages cannot be erased. When you need to erase data, the entire block must be erased, and afterwards all of the pages within it can be reused."
(Source: sources/2025-03-13-planetscale-io-devices-and-latency)
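The three rules above can be captured in a tiny model. This is a hypothetical sketch for illustration only, not a real flash driver; the class and page count are invented, and the invariants enforced are exactly the ones quoted: page-level read/write, write-once-after-erase, whole-block erase.

```python
# Minimal sketch of NAND page/block semantics (assumed model, not real firmware).
PAGES_PER_BLOCK = 4  # tiny for illustration; real blocks hold 128-512 pages

class Block:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # None = erased, i.e. writable

    def read(self, page_idx):
        return self.pages[page_idx]             # smallest readable unit: one page

    def write(self, page_idx, data):
        if self.pages[page_idx] is not None:    # a page is write-once after erase
            raise ValueError("page already written; erase the whole block first")
        self.pages[page_idx] = data

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK   # only whole-block erase exists

blk = Block()
blk.write(0, "a")
try:
    blk.write(0, "b")                           # in-place overwrite is rejected
except ValueError as e:
    print(e)
blk.erase()                                     # erasing wipes every page in the block
blk.write(0, "b")                               # the page is writable again
print(blk.read(0))
```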
The dirty-page problem¶
When a page's data is no longer needed, the page becomes dirty — it's written-to but not usable for new data until the block is erased. The drive can't just erase the block because other pages in the same block may still be live. Reclaiming the space requires compaction: copy live pages to a fresh location, then erase the source block.
This compaction loop is SSD garbage collection; see concepts/ssd-garbage-collection.
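The compaction step reads as code almost verbatim. A minimal sketch, assuming blocks are plain page lists and the set of live page indices is known (in a real drive the FTL tracks this); the data values here are invented:

```python
# Sketch of compaction: copy live pages out of a partially-dirty block into a
# fresh (erased) block, then erase the source block. Assumed toy representation.
PAGES_PER_BLOCK = 4

def compact(source, live_idx, fresh):
    """Move live pages from `source` into `fresh`, then erase `source`.

    Returns the number of physical page writes incurred by GC alone."""
    out = 0
    for i in live_idx:                       # copy only pages that are still live
        fresh[out] = source[i]
        out += 1
    source[:] = [None] * PAGES_PER_BLOCK     # whole-block erase: every page resets
    return out

source = ["A", "dead", "B", "dead"]          # 2 live pages, 2 dirty pages
fresh = [None] * PAGES_PER_BLOCK
moved = compact(source, [0, 2], fresh)
print(moved, fresh, source)                  # live data rewritten; source fully erased
```

Note the return value: those copied pages are physical writes the host never asked for, which is where write amplification comes from.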
Concrete capacity example (Dicken's numbers)¶
"Say each page holds 4096 bytes of data (4k). Now, say each block stores 16k pages, each target stores 16k blocks, and our device has 8 targets. This comes out to 4k × 16k × 16k × 8 = 8,796,093,022,208 bytes, or 8 terabytes."
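The arithmetic checks out when "4k" is read as 4096 bytes per page:

```python
# Re-running Dicken's capacity example with the stated sizes.
page_bytes = 4096              # 4 KiB per page
pages_per_block = 16 * 1024    # 16k pages
blocks_per_target = 16 * 1024  # 16k blocks
targets = 8

total = page_bytes * pages_per_block * blocks_per_target * targets
print(total)                   # 8,796,093,022,208 bytes, i.e. ~8 TB
```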
Architectural consequences¶
- Write amplification. Moving live pages for GC means writing a page costs more physical writes than the logical one. See concepts/write-amplification.
- Endurance pressure. Every physical write cycles a NAND cell's P/E count; garbage-collected writes age cells faster. See concepts/write-endurance-nand.
- Padding / alignment matters. Writing small objects into 4 KB pages wastes space and multiplies GC pressure. Engine design decisions — LSM vs B-tree, row vs column layout — are partly about matching the page boundary. See concepts/disk-block-size-alignment.
- Block size ≠ filesystem block size. SSD internal block (128+ pages, ~512 KB–8 MB) is much larger than filesystem block (4 KB). The FTL (flash translation layer) hides this — but not its latency consequences.
- Density tiers change the math. QLC (4 bits/cell) has larger blocks and fewer P/E cycles than TLC, which worsens write-amplification cost. See concepts/qlc-read-write-asymmetry + systems/qlc-flash.
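The write-amplification cost in the first bullet reduces to a simple ratio of physical to logical page writes. A back-of-envelope sketch; the workload figures below are assumptions for illustration, not measurements:

```python
# Write-amplification factor (WAF): physical NAND page writes per logical
# (host-issued) page write. Inputs here are assumed example numbers.
def waf(logical_pages_written, gc_pages_copied):
    physical = logical_pages_written + gc_pages_copied
    return physical / logical_pages_written

# e.g. GC copies 1 live page for every 2 pages the host writes:
print(waf(1000, 500))   # 1.5 -> each host write costs 1.5 NAND writes
```

Every unit above 1.0 is wear (P/E cycles) and bandwidth spent on housekeeping rather than host data, which is why the endurance and density bullets compound with this one.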
Why this ends up on a database blog¶
The Dicken post frames it this way: "Many software engineers don't have to think about this on a day-to-day basis, but those designing software like MySQL need to pay careful attention to what structures data is being stored in and how data is laid out on disk." The page/block asymmetry shows up as:
- LSM-tree engines (RocksDB, Cassandra) naturally align with block-erasure — writes are batched into large SSTables that fit NAND blocks cleanly. See concepts/lsm-compaction.
- B+tree engines (systems/innodb) have to fight harder — random page updates scattered across the tree create dirty pages throughout many blocks, each of which must be GC'd independently.
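The contrast between the two engine families can be made concrete with a toy count of how many blocks the same number of page writes dirties. All parameters here are invented round numbers, not measurements from either engine:

```python
# Toy illustration: batched sequential writes (LSM-style) dirty few whole blocks;
# the same number of random in-place page updates (B+tree-style) scatter dirty
# pages across many blocks, each of which GC must later compact.
import random

PAGES_PER_BLOCK = 256
TOTAL_BLOCKS = 1024
N_UPDATES = 1024            # same logical work in both cases

random.seed(0)

# LSM-style: 1024 pages written back-to-back fill whole blocks end to end.
sequential_blocks = {p // PAGES_PER_BLOCK for p in range(N_UPDATES)}

# B+tree-style: 1024 updates land on random pages across the whole device.
random_blocks = {
    random.randrange(TOTAL_BLOCKS * PAGES_PER_BLOCK) // PAGES_PER_BLOCK
    for _ in range(N_UPDATES)
}

print(len(sequential_blocks), len(random_blocks))  # a handful vs. many hundreds
```

Fewer touched blocks means blocks tend to be either fully live or fully dead, so compaction has little or nothing to copy; scattered dirty pages force GC to relocate live data from many blocks independently.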
Seen in¶
- sources/2025-03-13-planetscale-io-devices-and-latency — canonical pedagogical framing with explicit arithmetic example.
- sources/2025-03-04-meta-a-case-for-qlc-ssds-in-the-data-center — complementary framing on P/E cycle pressure (concepts/write-endurance-nand) and how density tiers (TLC vs QLC) change both the block size and the erase-cycle budget.