CONCEPT Cited by 1 source
SSD parallelism via targets¶
Definition¶
SSD parallelism via targets is the hardware-level parallelism an SSD gets from having multiple independent NAND flash targets (dies / planes), each wired to the controller via its own dedicated line. Only one page can be in flight per line at a time, so throughput is gated by how evenly the host spreads its I/Os across targets.
Dicken's framing:
"Typically, each target has a dedicated line going from the control unit to the target. This line is what processes reads and writes, and only one page can be communicated by each line at a time. Pages can be communicated on these lines really fast, but it still does take a small slice of time. The organization of data and sequence of reads and writes has a significant impact on how efficiently these lines can be used." (Source: sources/2025-03-13-planetscale-io-devices-and-latency)
Concrete example from the post¶
Write 8 pages to an SSD with 4 targets:
| Layout | Slices used | Parallelism |
|---|---|---|
| 2 pages to each of 4 targets | 2 | Full (4-way) |
| All 8 pages to the same target | 8 | None (3 lines idle) |
Dicken: "Notice how only one line was used and it needed to write sequentially. All the other lines sat dormant."
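The table's arithmetic can be captured in a toy model: completion time is gated by the busiest target, since each target's line moves one page per time slice. This is a sketch of the post's mental model, not a real device simulator; `slices_needed` is my name for it.

```python
# Toy model of the post's example: the time to write N pages is bounded
# by the busiest target, because each line moves one page per slice.
from collections import Counter

def slices_needed(page_to_target):
    """page_to_target: list mapping each page write to a target index."""
    per_target = Counter(page_to_target)
    return max(per_target.values())  # the busiest line gates completion

# 8 pages spread 2-per-target across 4 targets: done in 2 slices.
spread = [0, 1, 2, 3, 0, 1, 2, 3]
# All 8 pages on target 0: 8 slices, the other 3 lines sit dormant.
clustered = [0] * 8

print(slices_needed(spread))     # 2
print(slices_needed(clustered))  # 8
```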
Takeaway: SSD performance is not just a function of the drive spec — it depends on how the host software lays out writes to spread them across targets. Naive engines that stream a single large write into contiguous LBAs may serialise a chunk of it onto one target.
Where the host can (and can't) see targets¶
- NVMe exposes namespaces, not targets. The host sees logical block addresses (LBAs); the flash translation layer (FTL) on the drive maps LBAs to physical pages on specific targets.
- The FTL spreads LBAs across targets by default, both for wear levelling and for throughput. But large sequential writes can still cluster, because the FTL buffers incoming data into whole NAND program pages before placing it, and a single serial stream gives it few opportunities to stripe.
- Multi-queue NVMe (many host submission/completion queues) lets the host issue many outstanding I/Os in parallel, which gives the FTL more options for spreading them across targets.
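The queue-depth point can be made with a back-of-envelope model: with `qd` I/Os outstanding, a round-robin FTL can keep at most `min(qd, targets)` lines busy per slice. This is a deliberate simplification (it ignores FTL buffering, command overhead, and plane-level quirks), and the function names are mine.

```python
# Simplified model: effective parallelism is capped by both the number
# of outstanding I/Os (queue depth) and the number of targets.
def effective_parallelism(qd, targets):
    return min(qd, targets)

def slices_for(pages, qd, targets):
    # Ceiling-divide total pages by the lines actually kept busy.
    busy = effective_parallelism(qd, targets)
    return -(-pages // busy)

# 64 page writes on a 4-target drive:
print(slices_for(64, qd=1, targets=4))   # 64 -- one line busy at a time
print(slices_for(64, qd=32, targets=4))  # 16 -- all four lines busy
```

The model predicts no benefit beyond `qd == targets`; real drives keep improving past that because deeper queues also hide firmware and bus latency, which is one reason measured saturation points sit well above the die count.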
Architectural consequences¶
- Concurrent workloads benefit naturally. OLTP workloads with many independent transactions spread across LBAs extract most of the available parallelism.
- Bulk loads can underperform. A single-threaded bulk `INSERT … SELECT` writing a long contiguous LBA range may cluster onto a subset of targets. Breaking the load into parallel streams frequently restores throughput.
- Queue depth matters. A drive with 16 targets will be underutilized at a queue depth of 1. Real measurements typically saturate around queue depth 32–64 on consumer NVMe.
- Layout is an engine design axis. "Many software engineers don't have to think about this on a day-to-day basis, but those designing software like MySQL need to pay careful attention to what structures data is being stored in and how data is laid out on disk." — Dicken
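The "break the load into parallel streams" idea can be sketched with POSIX positional writes: several threads write interleaved chunks at different offsets, so the drive sees many outstanding I/Os instead of one serial stream. The file name, chunk size, and stream count here are illustrative choices, not values from the post, and real engines would use direct or asynchronous I/O rather than blocking threads.

```python
# Hedged sketch: split one large sequential write into STREAMS parallel
# writers, each owning an interleaved slice of the file. POSIX-only
# (os.pwrite is unavailable on Windows).
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 16          # 64 KiB per write (illustrative)
STREAMS = 4              # number of parallel writers (illustrative)

def write_stream(fd, stream_id, chunks_per_stream):
    buf = bytes(CHUNK)   # zero-filled payload for the demo
    for i in range(chunks_per_stream):
        # Stride offsets so streams interleave rather than collide.
        offset = (i * STREAMS + stream_id) * CHUNK
        os.pwrite(fd, buf, offset)

fd = os.open("bulk.dat", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    with ThreadPoolExecutor(max_workers=STREAMS) as ex:
        for s in range(STREAMS):
            ex.submit(write_stream, fd, s, 16)
        # Exiting the `with` block waits for all writers to finish.
finally:
    os.close(fd)
```

Whether this helps on a given drive depends on the FTL; the point is only that multiple concurrent streams give it the option to spread pages across targets, which one serial stream does not.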
Relationship to HDD parallelism¶
HDDs have one head per platter surface, but the heads share a single actuator, so a conventional drive services essentially one I/O at a time. Modern drives expose NCQ / TCQ (a queue of pending commands) that lets the firmware reorder requests for seek efficiency; that is smarter scheduling, not true parallelism. SSD target-level parallelism is qualitatively different and can give a 4×–16× throughput edge for well-laid-out workloads.
Seen in¶
- sources/2025-03-13-planetscale-io-devices-and-latency — canonical teaching example (4 targets, 8 writes, spread vs clustered).