CONCEPT Cited by 1 source
SSD parallelism via targets¶
Definition¶
SSD parallelism via targets is the hardware-level parallelism an SSD gets from having multiple independent NAND flash targets (dies / planes), each wired to the controller via its own dedicated line. Only one page can be in flight per line at a time, so throughput is gated by how evenly the host spreads its I/Os across targets.
Dicken's framing:
"Typically, each target has a dedicated line going from the control unit to the target. This line is what processes reads and writes, and only one page can be communicated by each line at a time. Pages can be communicated on these lines really fast, but it still does take a small slice of time. The organization of data and sequence of reads and writes has a significant impact on how efficiently these lines can be used." (Source: sources/2025-03-13-planetscale-io-devices-and-latency)
Concrete example from the post¶
Write 8 pages to an SSD with 4 targets:
| Layout | Slices used | Parallelism |
|---|---|---|
| 2 pages to each of 4 targets | 2 | Full (4-way) |
| All 8 pages to the same target | 8 | None (3 lines idle) |
Dicken: "Notice how only one line was used and it needed to write sequentially. All the other lines sat dormant."
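The table's arithmetic can be captured in a toy model: completion time is gated by the busiest target, since each target's line moves one page per time slice. This is a sketch of the post's mental model, not a real device simulator; `slices_needed` is my name for it.

```python
# Toy model of the post's example: the time to write N pages is bounded
# by the busiest target, because each line moves one page per slice.
from collections import Counter

def slices_needed(page_to_target):
    """page_to_target: list mapping each page write to a target index."""
    per_target = Counter(page_to_target)
    return max(per_target.values())  # the busiest line gates completion

# 8 pages spread 2-per-target across 4 targets: done in 2 slices.
spread = [0, 1, 2, 3, 0, 1, 2, 3]
# All 8 pages on target 0: 8 slices, the other 3 lines sit dormant.
clustered = [0] * 8

print(slices_needed(spread))     # 2
print(slices_needed(clustered))  # 8
```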
Takeaway: SSD performance is not just a function of the drive spec — it depends on how the host software lays out writes to spread them across targets. Naive engines that stream a single large write into contiguous LBAs may serialise a chunk of it onto one target.
Where the host can (and can't) see targets¶
- NVMe exposes namespaces, not targets. The host sees logical block addresses (LBAs); the flash translation layer (FTL) on the drive maps LBAs to physical pages on specific targets.
- The FTL spreads LBAs across targets by default, both for wear levelling and for throughput. But large sequential writes can still cluster, because the FTL buffers incoming data into whole NAND program pages before placing it, and a single serial stream gives it few opportunities to stripe.
- Multi-queue NVMe (many host submission/completion queues) lets the host issue many outstanding I/Os in parallel, which gives the FTL more options for spreading them across targets.
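The queue-depth point can be made with a back-of-envelope model: with `qd` I/Os outstanding, a round-robin FTL can keep at most `min(qd, targets)` lines busy per slice. This is a deliberate simplification (it ignores FTL buffering, command overhead, and plane-level quirks), and the function names are mine.

```python
# Simplified model: effective parallelism is capped by both the number
# of outstanding I/Os (queue depth) and the number of targets.
def effective_parallelism(qd, targets):
    return min(qd, targets)

def slices_for(pages, qd, targets):
    # Ceiling-divide total pages by the lines actually kept busy.
    busy = effective_parallelism(qd, targets)
    return -(-pages // busy)

# 64 page writes on a 4-target drive:
print(slices_for(64, qd=1, targets=4))   # 64 -- one line busy at a time
print(slices_for(64, qd=32, targets=4))  # 16 -- all four lines busy
```

The model predicts no benefit beyond `qd == targets`; real drives keep improving past that because deeper queues also hide firmware and bus latency, which is one reason measured saturation points sit well above the die count.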
Architectural consequences¶
- Concurrent workloads benefit naturally. OLTP workloads with many independent transactions spread across LBAs extract most of the available parallelism.
- Bulk loads can underperform. A single-threaded bulk `INSERT … SELECT` writing a long contiguous LBA range may cluster onto a subset of targets. Breaking the load into parallel streams frequently restores throughput.
- Queue depth matters. A drive with 16 targets will be underutilized at a queue depth of 1. Real measurements typically saturate around queue depth 32–64 on consumer NVMe.
- Layout is an engine design axis. "Many software engineers don't have to think about this on a day-to-day basis, but those designing software like MySQL need to pay careful attention to what structures data is being stored in and how data is laid out on disk." — Dicken
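The "break the load into parallel streams" idea can be sketched with POSIX positional writes: several threads write interleaved chunks at different offsets, so the drive sees many outstanding I/Os instead of one serial stream. The file name, chunk size, and stream count here are illustrative choices, not values from the post, and real engines would use direct or asynchronous I/O rather than blocking threads.

```python
# Hedged sketch: split one large sequential write into STREAMS parallel
# writers, each owning an interleaved slice of the file. POSIX-only
# (os.pwrite is unavailable on Windows).
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 16          # 64 KiB per write (illustrative)
STREAMS = 4              # number of parallel writers (illustrative)

def write_stream(fd, stream_id, chunks_per_stream):
    buf = bytes(CHUNK)   # zero-filled payload for the demo
    for i in range(chunks_per_stream):
        # Stride offsets so streams interleave rather than collide.
        offset = (i * STREAMS + stream_id) * CHUNK
        os.pwrite(fd, buf, offset)

fd = os.open("bulk.dat", os.O_WRONLY | os.O_CREAT, 0o644)
try:
    with ThreadPoolExecutor(max_workers=STREAMS) as ex:
        for s in range(STREAMS):
            ex.submit(write_stream, fd, s, 16)
        # Exiting the `with` block waits for all writers to finish.
finally:
    os.close(fd)
```

Whether this helps on a given drive depends on the FTL; the point is only that multiple concurrent streams give it the option to spread pages across targets, which one serial stream does not.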
Relationship to HDD parallelism¶
HDDs have one head per platter surface, but the heads share a single actuator, so a conventional drive services essentially one I/O at a time. Modern drives expose NCQ / TCQ (a queue of pending commands) that lets the firmware reorder requests for seek efficiency; that is smarter scheduling, not true parallelism. SSD target-level parallelism is qualitatively different and can give a 4×–16× throughput edge for well-laid-out workloads.
Seen in¶
- sources/2025-03-13-planetscale-io-devices-and-latency — canonical teaching example (4 targets, 8 writes, spread vs clustered).