
META 2025-03-04 Tier 1


Meta — A case for QLC SSDs in the data center

Summary

Meta's Data Center Engineering team makes the architectural case for QLC NAND flash as a new middle storage tier between HDD and TLC flash in hyperscale data centers. The motivating problem: HDD bandwidth-per-TB has been dropping as drive densities grow without proportional I/O improvements, stranding data on HDDs that would better live a tier up; TLC flash is too expensive to scale into that gap. QLC, formerly uneconomical due to low capacity points and limited write endurance, has matured — 2 Tb QLC dies and 32-die stacks are mainstream — making 6× byte density per server the target. Meta is co-deploying Pure Storage's DirectFlash Module (DFM) (custom form factor, software-managed FTL) alongside standard NVMe QLC SSDs from multiple vendors. Form-factor pick is U.2-15mm (scaling to 512 TB) or DFM (scaling to 600 TB), rejecting E1.S (too small for QLC NAND-package count) and E3 (fragmented variants). The storage-software stack adapts via Linux userspace block drivers (ublk) over io_uring to expose the DFM as a regular block device with zero-copy I/O, and rate controllers + I/O schedulers arbitrate the 4×+ read-vs-write bandwidth asymmetry (concepts/qlc-read-write-asymmetry) so latency-sensitive reads don't queue behind writes.

Key takeaways

  • QLC fills a middle tier HDD can no longer serve. "QLC flash occupies a unique space in the performance spectrum in between HDDs and SSDs for servicing workloads that still depend upon performance at 10 MB/s/TB range i.e., where we had 16-20TB HDDs. Additionally there are workloads doing large batch IOs which do not need very high performance but still are in the 15-20 MB/s/TB range and use TLC flash today." Canonical statement of middle-tier media sitting on the BW/TB spectrum between HDD (~5-10 MB/s/TB at today's densities) and TLC (~50-100 MB/s/TB). (Source: original article)

  • HDD BW/TB is dropping as capacity climbs. "Today, HDDs are the go-to storage solution for most data centers because of their lower cost and power footprint compared to other solutions like TLC flash. But while HDDs are growing in size, they haven't been growing in I/O performance. In other words, the bandwidth per TB for HDDs has been dropping." Extends concepts/hard-drive-physics's flat-IOPS observation (from the Warfield / S3 framing) — same underlying physics, measured on the bandwidth axis rather than the random-IOPS axis. Result: "a portion of hot workloads [are] forced to get stranded on HDDs."

  • QLC historical gap closed on three dimensions. "QLC flash as a technology has been around since 2009. Adoption has been slow because it has historically operated at lower drive capacity points — less than 32TB. As well, high cost and limited write endurance didn't make it an attractive alternative to TLC in the datacenter." Mainstreaming of 2 Tb QLC NAND die + 32-die stacks is the density-scaling datum behind the 2025 reconsideration.

  • Target: read-bandwidth-intensive + low-write workloads. "The workloads being targeted are read-bandwidth-intensive with infrequent as well as comparatively low write bandwidth requirements. Since the bulk of power consumption in any NAND flash media comes from writes, we expect our workloads to consume lower power with QLC SSDs." NAND write-power is load-bearing for the power-efficiency argument — write-light workloads are cheap to run on QLC.

  • Meta × Pure Storage co-design. "Meta's storage teams have started working closely with partners like Pure Storage, utilizing their DirectFlash Module (DFM) and DirectFlash software solution to bring reliable QLC storage to Meta. We are also working with other NAND vendors to integrate standard NVMe QLC SSDs into our data centers." Extends Meta's existing OCP co-design lineage to a new flash-media partner (Pure Storage) alongside the existing partner roster (Microsoft, NVIDIA, AMD).

  • Explicit honesty on pricing gap. "While today QLC is lower in cost than TLC, it is not yet price competitive enough for a broader deployment." The deployment is justified today by power-efficiency gains + HDD-cold-ification ("HDDs are continuing to get colder as their density increases"), not by total-cost-per-byte parity. NAND cost structures are expected to improve over time.

  • Form-factor argument — U.2-15mm wins, E1.S rejected. "While E1.S as a form factor has been great for our TLC deployments, it's not an ideal form factor to scale our QLC roadmap because its size limits the number of NAND packages per drive. The Industry standard U.2-15mm is still a prevalent form factor across SSD suppliers and it enables us to potentially scale to 512TB capacity. E3 doesn't bring additional value over U.2 at the moment and the market adoption split between the 4 variants of E3 makes it less attractive. Pure Storage's DFMs can allow scaling up to 600TB with the same NAND package technology." Form factor matters: surface area, volume, NAND-package count, and power envelope all interact. E3's 4-variant fragmentation is the named reason to skip the standard altogether.

  • Server-level CPU/memory/network must move with media. "Within Meta, the byte density target of the QLC-based server is 6x the densest TLC-based server we ship today. Even though the BW/TB expected of QLC is lower than TLC, the QLC server bytes density requires a more performant CPU, faster memory and network subsystem to take advantage of the media capabilities." A 6× density bump raises aggregate server throughput expectations beyond any single Meta server to date, even at lower per-TB BW — the CPU/memory/network subsystem must be sized for the new regime.

  • Storage software: ublk + io_uring + userspace FTL. "The software stack in Pure Storage's solutions uses Linux userspace block device driver (ublk) devices over io_uring to both expose the storage as a regular block device and enable zero copy for data copy elimination — as well as talk to their userspace FTL (DirectFlash software) in the background. For other vendors, the stack uses io_uring to directly interact with the NVMe block device." Canonical userspace-FTL pattern for vendor-specific flash management — ublk gives the appearance of a regular block device while the FTL runs in userspace for control + performance.

  • Read-vs-write asymmetry is load-bearing. "QLC SSDs have a significant delta between read and write throughput. Read throughput in the case of QLC can be as high as 4x or more than write throughput. What's more, typical use cases around reads are latency sensitive so we need to make sure that the I/O delivering this massive read BW is not getting serialized behind the writes. This requires building, and carefully tuning, rate controllers and I/O schedulers." Canonical wiki statement of concepts/qlc-read-write-asymmetry as a media-level architectural constraint the software must counter — analogous to the EBS / ShardStore tail-latency concerns, but rooted in the flash cell's physical write behaviour not a placement problem.
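The rate-controller idea above can be sketched as a toy admission controller: cap write bandwidth per scheduling interval so queued writes can never serialize latency-sensitive reads. This is purely illustrative — the article names the need for rate controllers and I/O schedulers but discloses no design, so the class, budget model, and tick mechanism here are all assumptions:

```python
class WriteRateController:
    """Toy admission controller for asymmetric media: caps write
    bandwidth per scheduling interval so a deep write queue cannot
    form in front of latency-sensitive reads. Illustrative only;
    Meta's actual controllers and schedulers are not described."""

    def __init__(self, write_budget_mb_per_tick: float):
        self.budget = write_budget_mb_per_tick
        self.used = 0.0

    def tick(self) -> None:
        """Refill the write budget once per scheduling interval."""
        self.used = 0.0

    def admit_write(self, size_mb: float) -> bool:
        """Admit the write only if it fits this interval's budget."""
        if self.used + size_mb > self.budget:
            return False  # defer: keep the device write queue shallow
        self.used += size_mb
        return True

rc = WriteRateController(write_budget_mb_per_tick=100)
assert rc.admit_write(60)
assert rc.admit_write(30)
assert not rc.admit_write(20)  # over budget this interval: deferred
rc.tick()
assert rc.admit_write(20)      # admitted after refill
```

A real scheduler would also prioritize reads directly; the budget-on-writes shape is just the simplest way to express "reads must not queue behind writes."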

Systems extracted

  • systems/qlc-flash — Quad-Level Cell NAND flash. 4 bits per cell → higher density, lower endurance, asymmetric read/write BW. 2009 invention; 2025 Meta-scale data-center deployment.
  • systems/tlc-flash — Triple-Level Cell NAND flash. 3 bits per cell. Meta's existing data-center flash tier; QLC is positioned below it in the BW/TB / cost / endurance hierarchy.
  • systems/pure-storage-directflash-module — Pure Storage's DFM, a custom flash module with software-managed FTL (DirectFlash software). Scales to 600 TB per drive; physically fits into a U.2-15mm slot.
  • systems/u2-15mm-form-factor — Industry-standard 15mm U.2 SSD form factor. Meta's chosen QLC slot; scales to 512 TB; also accepts DFMs, enabling vendor-diverse slot reuse.
  • systems/e1s-form-factor — EDSFF E1.S. Meta's current TLC flash form factor. Rejected for QLC scaling because its volume limits NAND-package count per drive.

Concepts extracted

  • concepts/bandwidth-per-terabyte — the BW/TB axis along which Meta positions QLC in the media hierarchy. New wiki concept. Canonical framing: "workloads that still depend upon performance at 10 MB/s/TB range" is HDD territory that HDD is leaving behind; "workloads doing large batch IOs which do not need very high performance but still are in the 15-20 MB/s/TB range" is where QLC beats both HDD (too slow) and TLC (too costly).
  • concepts/qlc-read-write-asymmetry — 4×+ read-vs-write throughput delta in QLC media. Architectural consequence: reads cannot be allowed to queue behind writes → rate controllers + I/O schedulers required.
  • concepts/storage-media-tiering — the general structural concept of positioning a new media tier between two incumbents. QLC is the wiki's canonical instance.
  • concepts/write-endurance-nand — P/E-cycle limit that made QLC historically unsuitable for high-write workloads; mitigated by modern FTL + workload matching.
  • concepts/hard-drive-physics (extended) — BW/TB decline is a new framing of the same flat-physics observation Warfield made on the IOPS axis. Same underlying cause (head seeks don't scale with density); different axis.
  • concepts/hdd-sequential-io-optimization (extended) — the sequential-I/O design stance still works at the log-structured layer, but even sequential-only workloads have a BW/TB floor that current HDDs are falling below.

Patterns extracted

  • patterns/middle-tier-storage-media — introduce a new media tier between two incumbents when the lower tier can no longer cover the BW/TB range + the upper tier costs too much. Discipline: match workload shape (read-dominant, batch-IO, low-write-BW) to the new tier's strengths.
  • patterns/userspace-ftl-via-io-uring — expose flash to applications as a regular block device via ublk while the FTL runs in userspace for vendor-specific control + zero-copy I/O. io_uring is the ring-buffer primitive that makes the path performant.
  • patterns/rate-controller-for-asymmetric-media — when media read/write throughputs differ by multiples, the software stack must arbitrate admission so latency-sensitive reads don't queue behind writes.
  • patterns/co-design-with-ocp-partners (extended) — Meta × Pure Storage is a new co-design partnership on flash media, alongside the existing Meta × Microsoft OCP lineage (SAI → OAM → Mount Diablo).
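The middle-tier pattern's discipline — match workload shape to media — reduces, in its simplest form, to a threshold function on the BW/TB axis. A hypothetical helper, with thresholds taken loosely from the article's ~10 and 15-20 MB/s/TB ranges (real tiering also weighs cost, endurance, and power):

```python
def pick_tier(required_mb_s_per_tb: float) -> str:
    """Map a workload's bandwidth-density requirement to a media tier.
    Thresholds are illustrative, not Meta's actual placement policy."""
    if required_mb_s_per_tb < 10:
        return "HDD"   # cold / low-BW-density workloads
    if required_mb_s_per_tb <= 20:
        return "QLC"   # the new middle tier: batch IO, read-heavy
    return "TLC"       # high-BW-density workloads stay on TLC

assert pick_tier(5) == "HDD"
assert pick_tier(15) == "QLC"
assert pick_tier(60) == "TLC"
```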

Architectural shape

                          Workload BW/TB requirement
                          ─────────────────────────→
                          low                     high
       ┌──────────────────────────────────────────────────┐
       │                                                  │
 HDD   │ ▓▓▓▓▓▓▓▓▓▓  falling as density grows             │
       │                                                  │
 QLC   │      (new tier)  ▓▓▓▓▓▓▓▓▓▓                      │
       │                                                  │
 TLC   │                      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │
       │                                                  │
       └──────────────────────────────────────────────────┘
         ~5 MB/s/TB       ~10-20 MB/s/TB        ~50+ MB/s/TB
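The falling left edge of the HDD band in the diagram is simple arithmetic: sustained transfer rates grow far more slowly than capacities, so MB/s per TB drops each drive generation. A sketch with illustrative drive figures (assumed, not disclosed in the article):

```python
def bw_per_tb(sustained_mb_s: float, capacity_tb: float) -> float:
    """Bandwidth density: sustained throughput per TB of capacity."""
    return sustained_mb_s / capacity_tb

# Assumed figures: capacity nearly doubles, sustained rate barely moves.
gen1 = bw_per_tb(250, 16)   # 16 TB drive -> ~15.6 MB/s/TB
gen2 = bw_per_tb(280, 30)   # 30 TB drive -> ~9.3 MB/s/TB

assert gen2 < gen1  # denser drive, lower bandwidth density
```

The same capacity growth that makes HDDs cheaper per byte pushes their BW/TB below the ~10-20 MB/s/TB band the article assigns to QLC.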

Form-factor evaluation matrix:

Form factor          Meta position      Rationale
────────────────────────────────────────────────────────────────────────
E1.S                 TLC only           Too small for QLC NAND-package count
U.2-15mm             Primary            Industry standard; scales to 512 TB; accepts DFMs
E3                   Skip               4-variant fragmentation; no win over U.2
DFM (Pure Storage)   Primary (partner)  Scales to 600 TB; fits U.2 slots; userspace FTL

Software stack for QLC:

┌──────────────────────────────────────────┐
│     Meta storage application             │
├──────────────────────────────────────────┤
│     Block device (regular semantics)     │
├──────────────────────────────────────────┤
│   ublk (userspace block device driver)   │  ← zero-copy path
├──────────────────────────────────────────┤
│         io_uring ring buffer             │
├──────────────────────────────────────────┤
│  Userspace FTL (DirectFlash) | NVMe dev  │
├──────────────────────────────────────────┤
│   Pure Storage DFM   |   NVMe QLC SSD    │
└──────────────────────────────────────────┘
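The bottom layers of the stack — an FTL running in userspace behind a regular-block-device facade — can be modeled with a minimal logical-to-physical mapping. This toy stands in for the role DirectFlash software plays behind ublk; real FTLs add wear leveling, deferred garbage collection, power-loss safety, and speak to NAND rather than a Python dict, and none of those internals are disclosed:

```python
class ToyUserspaceFTL:
    """Minimal userspace flash translation layer: logical pages map to
    physical pages, and overwrites go to fresh pages (flash cannot
    overwrite in place). Illustrative model only."""

    def __init__(self, num_pages: int):
        self.l2p = {}                       # logical page -> physical page
        self.free = list(range(num_pages))  # free physical pages
        self.media = {}                     # physical page -> data

    def write(self, lpage: int, data: bytes) -> None:
        ppage = self.free.pop(0)            # always write a fresh page
        old = self.l2p.get(lpage)
        if old is not None:
            self.free.append(old)           # toy: reclaim the stale page
                                            # immediately (real FTLs defer
                                            # this to garbage collection)
        self.l2p[lpage] = ppage
        self.media[ppage] = data

    def read(self, lpage: int) -> bytes:
        return self.media[self.l2p[lpage]]  # one indirection per read

ftl = ToyUserspaceFTL(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")   # remapped to a new physical page, not overwritten
assert ftl.read(0) == b"v2"
```

In the real stack, the application above this layer sees only the ublk block device; the remapping happens invisibly in the userspace FTL process.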

Numbers disclosed

  • 2 Tb QLC NAND die + 32-die stacks — mainstream density milestones making QLC competitive.
  • 6× — byte density target of QLC server vs densest TLC server Meta ships today.
  • ~10 MB/s/TB — lower-bound BW/TB range for workloads HDD used to cover (16-20 TB HDDs).
  • 15-20 MB/s/TB — BW/TB range for large-batch-IO workloads currently on TLC.
  • 4×+ — read vs write throughput asymmetry in QLC.
  • 512 TB — maximum drive capacity scalable in U.2-15mm with standard NVMe QLC SSDs.
  • 600 TB — maximum drive capacity in Pure Storage DFM form factor.
  • 32 TB — historic QLC capacity ceiling that made it uncompetitive.
  • ~2009 — year QLC technology was first available.

Caveats

  • Announcement-voice / strategic-case post, not a retrospective. No production-scale deployment numbers (no fleet size, no number of QLC racks deployed, no bytes stored on QLC to date, no migration completion %).
  • No workload-level benchmarks. Target workload categories named abstractly ("read-bandwidth-intensive with infrequent write bandwidth", "large batch IOs") — no named Meta production application identified as the QLC pilot customer.
  • No TCO numbers. Meta is honest that QLC is not price-competitive with TLC yet; the real TCO comparison (QLC vs HDD for stranded-hot-on-HDD data; QLC vs TLC for batch-IO data) is not disclosed. The ambiguity is: is QLC deployed today at scale, or is this 2025 post the pre-deployment announcement?
  • No rate-controller design detail. The "carefully tuning rate controllers and I/O schedulers" claim is gestured at but not specified — what scheduler, what algorithm, what knobs, no before/after latency numbers.
  • No DFM internals disclosed. Pure Storage's DirectFlash software (userspace FTL) is a third-party product; its architecture, wear-leveling strategy, GC behavior not covered.
  • No multi-vendor co-existence detail. "working with other NAND vendors to integrate standard NVMe QLC SSDs" — vendor set not enumerated, integration burden not characterised.
  • Write-endurance workload-match not quantified. QLC's lower endurance is mentioned as a historic blocker now addressed by workload matching; the actual DWPD or TBW endurance figures are not given.
  • Server CPU/memory/network design is said to need scaling for the 6× density, but the specific server design (CPU SKU, memory topology, NIC speed, PCIe lane count) is not disclosed.
  • No NAND-writes-power percentage. The argument that power saves come from writes being rare is gestured at without a "X% of TLC power goes to writes" datum.
  • DRAM-less vs DRAM-based QLC — the architectural choice, known to impact FTL mapping-table caching + cost, is not mentioned.
