

SSD garbage collection

Definition

SSD garbage collection is the firmware-level background process that reclaims dirty pages on an SSD. Because NAND flash can only be erased at block granularity, not page granularity, reclaiming space requires moving a block's remaining live pages elsewhere and then erasing the whole block.

It is the hidden latency tax of SSDs: under sustained write pressure, foreground writes can be stalled by background GC activity inside the drive, producing non-deterministic write latency that is uncorrelated with the workload itself.

Why it's required

NAND-flash semantics force the sequence:

  1. Application issues a write to an already-used page.
  2. The FTL cannot overwrite in place — NAND cells must be erased before being written again, and erasure is block-granular.
  3. The FTL therefore writes the new data to a free page elsewhere, marking the old page dirty.
  4. Dirty pages accumulate. Eventually the drive runs low on free pages and must reclaim dirty ones.
  5. To erase a block of mostly-dirty pages, the FTL first copies the still-live pages elsewhere, then erases the block.

The copy step is the source of write amplification: every logical write can cost roughly 1.1×–5× physical writes, depending on how scattered the dirty pages are.

See concepts/nand-flash-page-block-erasure for the underlying mechanics.
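The five steps above can be sketched as a toy simulation. This is a hypothetical model (not any vendor's firmware): out-of-place writes, a logical-to-physical map, and greedy GC that picks the victim block with the fewest live pages.

```python
PAGES_PER_BLOCK = 4

class ToyFTL:
    """Toy flash translation layer: out-of-place writes plus greedy GC.
    A sketch of the steps above, not any vendor's algorithm. Assumes the
    workload leaves some dirty pages, so GC can always free space."""

    def __init__(self, num_blocks):
        self.blocks = [[] for _ in range(num_blocks)]  # programmed slots per block
        self.map = {}        # logical page -> (block, slot) of its live copy
        self.logical = 0     # writes issued by the "application"
        self.physical = 0    # pages actually programmed (includes GC copies)

    def write(self, lpage):
        self.logical += 1
        self.map.pop(lpage, None)    # step 3: the old copy becomes dirty
        self._program(lpage)

    def _program(self, lpage):
        b = self._block_with_room()
        self.map[lpage] = (b, len(self.blocks[b]))
        self.blocks[b].append(lpage)
        self.physical += 1

    def _block_with_room(self):
        for b, blk in enumerate(self.blocks):
            if len(blk) < PAGES_PER_BLOCK:
                return b
        self._gc()                   # step 4: out of free pages, must reclaim
        return self._block_with_room()

    def _live(self, b):
        return [lp for s, lp in enumerate(self.blocks[b])
                if self.map.get(lp) == (b, s)]

    def _gc(self):
        # Step 5: pick the block with the fewest live pages, copy them out,
        # then erase the whole block (erase is block-granular).
        victim = min(range(len(self.blocks)), key=lambda b: len(self._live(b)))
        movers = self._live(victim)
        self.blocks[victim] = []     # erase
        for lp in movers:            # these copies are the write amplification
            self._program(lp)

ftl = ToyFTL(num_blocks=3)           # 12 physical pages
for lp in range(8):                  # fill two blocks with live data
    ftl.write(lp)
for _ in range(10):                  # keep rewriting two "hot" pages
    ftl.write(0)
    ftl.write(4)
print(f"write amplification: {ftl.physical / ftl.logical:.2f}")  # > 1.0
```

With the hot/cold mix above, GC repeatedly has to relocate live copies of the hot pages before erasing, so physical writes outnumber logical ones; a pure overwrite-everything workload would instead find fully-dirty victim blocks and GC almost for free.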

Dicken's framing

"Each SSD needs to have an internal algorithm for managing which pages are empty, which are in use, and which are dirty. A dirty page is one that has been written to but the data is no longer needed and ready to be erased. Data also sometimes needs to be re-organized to allow for new write traffic. The algorithm that manages this is called the garbage collector."

[…]

"When SSDs have a lot of reads, writes, and deletes, we can end up with SSDs that have degraded performance due to garbage collection. Though you may not be aware, busy SSDs do garbage collection tasks regularly, which can slow down other operations."

(Source: sources/2025-03-13-planetscale-io-devices-and-latency)

Worked example from the post

Two scenarios both write 5 pages to a 4-target SSD:

| Initial state | Writes complete via | Latency |
|---|---|---|
| Plenty of free pages in the first target | Direct writes to free pages | Fast |
| 2 free pages total, several dirty pages scattered | FTL must move live pages off one target → erase → write 5 new pages | Noticeably slower |

The second scenario is the common case under sustained writes on a drive that's been running a while — newly provisioned drives GC rarely; aged drives GC constantly.
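The gap between the two scenarios can be put in rough numbers. The timings below are ballpark NAND figures assumed for illustration (not measurements from the post):

```python
# Ballpark NAND latencies in microseconds, assumed for illustration:
READ_US, PROGRAM_US, ERASE_US = 50, 300, 3000

def fast_path_us(n_writes):
    """Scenario 1: enough free pages, so the FTL just programs n pages."""
    return n_writes * PROGRAM_US

def gc_path_us(n_writes, live_to_move):
    """Scenario 2: relocate live pages, erase the block, then program."""
    relocation = live_to_move * (READ_US + PROGRAM_US)  # copy each live page
    return relocation + ERASE_US + n_writes * PROGRAM_US

print(fast_path_us(5))                # 1500 us
print(gc_path_us(5, live_to_move=2))  # 5200 us, ~3.5x slower
```

The erase dominates: even with only two live pages to move, the GC path is several times slower than the direct path, which is why GC stalls show up as latency spikes rather than a steady slowdown.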

Why it matters for databases

  • Tail latency spikes. A transaction commit that takes 50 μs at median may see 500 μs–5 ms at p99.9 when GC hits the blocks holding its WAL pages. See concepts/tail-latency-at-scale.
  • Over-provisioning mitigates. Leaving 20–30% of the drive unused gives the FTL more free-page headroom before GC kicks in. Enterprise SSDs ship with dedicated over-provisioning regions.
  • Sequential-write workloads GC less. An LSM engine writing large SSTables aligned to block boundaries produces blocks of contiguous live-then-dead pages that GC cheaply ("erase a mostly-empty block"). B+tree engines that update pages in place scatter dirty pages across many blocks, multiplying GC work. See concepts/lsm-compaction.
  • TRIM / DISCARD helps. When the OS explicitly tells the drive a block is free, the FTL can erase early instead of carrying it through the compaction loop. See concepts/trim-discard-integration.
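The over-provisioning point has a back-of-envelope form. The model below is an assumption for illustration, not a formula from the source: if a GC victim block is a fraction u live, reclaiming it copies u pages per (1 − u) pages freed, so each logical write costs about 1 / (1 − u) physical writes in the worst case where victims are as full as the drive overall.

```python
def worst_case_wa(full_fraction):
    """Pessimistic write-amplification bound (an illustrative assumption):
    each reclaim of a block that is `full_fraction` live copies that many
    pages per page freed, giving WA ~ 1 / (1 - full_fraction).
    Real greedy GC usually does better by picking emptier victims."""
    return 1.0 / (1.0 - full_fraction)

for op in (0.07, 0.20, 0.28):        # over-provisioned fraction of the drive
    print(f"{op:.0%} spare -> worst-case WA ~ {worst_case_wa(1 - op):.1f}x")
```

The numbers fall off steeply: going from ~7% spare to 20–28% spare cuts the worst-case amplification by several times, which is the intuition behind leaving 20–30% of the drive unused.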

Relationship to software-level GC

concepts/garbage-collection (in the Magic Pocket / immutable-storage framing) is the software-layer analogue of SSD GC: both identify unreferenced units, then compact the live ones elsewhere to reclaim space. The two compose in a stack:

| Layer | What it reclaims | Unit |
|---|---|---|
| Application (Magic Pocket blob GC) | Dereferenced blobs | Object |
| Storage engine (LSM compaction, InnoDB free-list) | Deleted rows / obsolete SSTables | Page |
| Filesystem (ext4 fstrim) | Freed filesystem blocks | FS block |
| SSD firmware (FTL GC) | Dirty NAND pages | NAND block |

Every layer pays its own compaction cost, and every layer's throughput affects the one below.
