PATTERN

Userspace FTL via io_uring + ublk

Context

The Flash Translation Layer (FTL) in an SSD translates host LBAs to physical NAND addresses, performs wear-leveling, garbage collects, and manages bad blocks. Traditionally the FTL runs inside the drive's firmware — an embedded processor + DRAM on the SSD handles the entire policy. This is simple for the host but opaque: the host has no visibility into GC cycles, wear distribution, or write-coalescing decisions, and cannot influence them.
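The essential mechanism behind every FTL is the out-of-place write: NAND pages cannot be overwritten in place, so each host write lands on a fresh physical page and the LBA's mapping entry is redirected, turning the old page into garbage for later collection. A minimal sketch of a page-level mapping table (all names are illustrative, not from any vendor's stack):

```python
# Minimal page-mapped FTL sketch: out-of-place writes plus an
# LBA -> physical-page mapping table. Illustrative only.

class PageMappedFTL:
    def __init__(self, num_pages):
        self.l2p = {}                  # LBA -> physical page number
        self.valid = [False] * num_pages
        self.next_free = 0             # naive log-structured allocator
        self.num_pages = num_pages

    def write(self, lba):
        """Redirect lba to a fresh page; invalidate the old copy."""
        if self.next_free >= self.num_pages:
            raise RuntimeError("out of free pages: GC must run first")
        old = self.l2p.get(lba)
        if old is not None:
            self.valid[old] = False    # old copy becomes garbage for GC
        ppn = self.next_free
        self.next_free += 1
        self.l2p[lba] = ppn
        self.valid[ppn] = True
        return ppn

    def read(self, lba):
        return self.l2p.get(lba)       # physical page, or None if unwritten

ftl = PageMappedFTL(num_pages=8)
ftl.write(5)       # first write of LBA 5 -> page 0
ftl.write(5)       # overwrite -> page 1; page 0 is now garbage
assert ftl.read(5) == 1
assert ftl.valid == [False, True] + [False] * 6
```

Whether this table lives in drive firmware or in a host daemon is exactly the design axis this pattern moves.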

At the density + asymmetry levels of modern QLC flash (systems/qlc-flash), that opacity turns into real cost for the host:

  • GC running during a latency-sensitive read window stalls the read.
  • Write-coalescing policy affects R/W arbitration — only the host knows whether a pending write is bulk or latency-dependent.
  • A firmware iteration cycle takes months; a software iteration takes days.

The pattern

Move the FTL to userspace on the host. Expose the storage to applications as a regular block device via Linux's ublk (userspace block device driver) framework, which forwards block I/O from the kernel to a userspace daemon. Use io_uring as the zero-copy, high-throughput ring-buffer path between the kernel and the userspace daemon.

┌──────────────────────────────────────────┐
│     Application                          │
├──────────────────────────────────────────┤
│     Kernel block device (regular)        │
├──────────────────────────────────────────┤
│   ublk                                   │ ← syscall-free path
├──────────────────────────────────────────┤
│         io_uring (shared ring buffer)    │
├──────────────────────────────────────────┤
│  Userspace FTL daemon                    │ ← wear leveling, GC, mapping
├──────────────────────────────────────────┤
│  Raw flash device  |  NVMe block device  │
└──────────────────────────────────────────┘
  • ublk gives apps a regular block-device interface. No vendor library needed at the app layer.
  • io_uring is the submit/complete ring-buffer primitive (Linux 5.1+). Sharing pages between kernel and userspace enables zero-copy for DMA-able buffers.
  • The userspace FTL daemon owns wear-leveling, GC scheduling, mapping-table management — policy in userspace, data-path via io_uring.
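Once GC policy lives in the daemon, it can be tuned in ordinary software. A common shape is greedy victim selection with a wear-aware tie-break; this sketch is a hypothetical illustration of that policy, not any vendor's implementation:

```python
# Hypothetical GC victim selection for a host-side FTL daemon:
# prefer the block with the most invalid (garbage) pages, and break
# ties toward the least-worn block so GC also serves wear-leveling.

def pick_gc_victim(blocks):
    """blocks: list of dicts with 'invalid_pages' and 'erase_count'.
    Returns the index of the block to garbage-collect next."""
    return max(
        range(len(blocks)),
        key=lambda i: (blocks[i]["invalid_pages"], -blocks[i]["erase_count"]),
    )

blocks = [
    {"invalid_pages": 10, "erase_count": 500},
    {"invalid_pages": 48, "erase_count": 120},
    {"invalid_pages": 48, "erase_count": 90},   # same garbage, less worn
]
assert pick_gc_victim(blocks) == 2
```

The point is not this particular heuristic but that changing it is a daemon redeploy, not a firmware release.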

Canonical instance

Meta's 2025-03-04 QLC post discloses Pure Storage's DirectFlash Module (DFM) + DirectFlash software using exactly this stack:

"The software stack in Pure Storage's solutions uses Linux userspace block device driver (ublk) devices over io_uring to both expose the storage as a regular block device and enable zero copy for data copy elimination — as well as talk to their userspace FTL (DirectFlash software) in the background. For other vendors, the stack uses io_uring to directly interact with the NVMe block device."

Two deployment shapes in the same Meta server:

  1. DFM path: ublk → io_uring → userspace DirectFlash FTL → DFM (raw flash).
  2. Standard NVMe QLC path: io_uring → NVMe block device (firmware FTL).

Both coexist in Meta's rack design because both fit the U.2-15mm slot.

Why this pattern works for asymmetric media

The R/W-asymmetry problem (concepts/qlc-read-write-asymmetry) is only solvable if the scheduler has full visibility into pending writes. On a firmware-FTL drive, writes may be internally queued and the kernel has no view into that state. The rate controller pattern is effectively blocked.

With host-side FTL, every write is visible; the userspace daemon can throttle, pace, coalesce, or prioritise before dispatching to the media. This is the composition: userspace FTL + rate controller is the software-side answer to QLC's media-level asymmetry.
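A sketch of that composition, assuming the daemon tags each incoming write as bulk or latency-sensitive (the tagging interface and class are hypothetical, not Pure Storage's or Meta's code):

```python
# Hypothetical host-side write pacer: latency-sensitive writes dispatch
# immediately; bulk writes are held back while reads are in flight, so
# write traffic never stalls a latency-sensitive read window.

from collections import deque

class WritePacer:
    def __init__(self):
        self.bulk_queue = deque()
        self.reads_in_flight = 0

    def submit_write(self, req, latency_sensitive):
        """Returns the list of writes safe to dispatch to media now."""
        if latency_sensitive:
            return [req]               # dispatch immediately
        self.bulk_queue.append(req)
        return self._drain()

    def read_started(self):
        self.reads_in_flight += 1

    def read_finished(self):
        self.reads_in_flight -= 1
        return self._drain()           # quiet again: release bulk writes

    def _drain(self):
        if self.reads_in_flight > 0:
            return []                  # hold bulk writes behind reads
        out = list(self.bulk_queue)
        self.bulk_queue.clear()
        return out

p = WritePacer()
p.read_started()
assert p.submit_write("bulk-1", latency_sensitive=False) == []   # held
assert p.submit_write("urgent", latency_sensitive=True) == ["urgent"]
assert p.read_finished() == ["bulk-1"]                           # released
```

A firmware FTL cannot implement this policy at all, because the bulk-vs-latency distinction exists only on the host.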

Trade-offs

  • Complexity on the host. The stack now depends on a vendor daemon + ublk + io_uring; the moving parts that used to live inside the drive now live on the host.
  • Vendor runtime on the host. DirectFlash software ships with Pure Storage; cross-vendor flash swaps are harder.
  • CPU cost. Host cycles go to FTL work that a drive-embedded processor used to do.
  • Harder debugging. Userspace daemons can crash, hang, or leak; firmware FTL failures were rare and usually recoverable by a drive reset.

The trade is accepted when the visibility + policy-control wins exceed the host-complexity costs, which tends to be true at hyperscale, where media asymmetries are large and QoS requirements are tight.
