CONCEPT Cited by 1 source

Small-batch NVMe write amplification¶

Definition¶

Small-batch NVMe write amplification is the storage-substrate cost that a streaming broker incurs when producer batches are smaller than the NVMe SSD's 4 KB page-alignment unit. Each sub-4 KB batch still consumes a full 4 KB page on write, so the physical bytes written / logical bytes written ratio — write amplification — climbs as batch size shrinks below the 4 KB floor.

Redpanda's 2024-11-26 batch-tuning post (part 2) states the mechanism directly:

"NVMe storage tends to write out data in 4 KB aligned pages. No problem if your message batch is 4 KB or larger. But what happens if you're sending millions of tiny, single message batches per second? Each message will be written alone in a 4 KB sized write, no matter how small it is: causing a large degree of write amplification and inefficient use of the available disk IO. Small batches also use significant CPU, which may saturate the CPU and drive up end-to-end latency as the backlog of requests starts to pile up." (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2)

Mechanism¶

NVMe SSDs expose a page-aligned block interface with 4 KB as the standard page size (see concepts/disk-block-size-alignment). The storage stack issues page-aligned writes:

Batch ≥ 4 KB: one or more full pages, with any slack confined to the final page, so write amplification approaches 1.0 as the batch grows (and is exactly 1.0 at page multiples).
Batch < 4 KB: a single page is consumed for the batch and the remainder of the page is wasted slack, so write amplification = 4 KB / batch_size (taking 4 KB = 4096 B):
1 KB batch → WA = 4.0
500 B batch → WA ≈ 8.2
100 B batch → WA ≈ 41
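
The per-batch arithmetic can be sketched in a few lines of Python. This is a minimal model, assuming each batch is written on its own and padded up to the next 4096-byte page boundary; the function name and the rounding rule for batches larger than one page are illustrative assumptions, not taken from the Redpanda post. Sizes that do not divide the page evenly give fractional ratios.

```python
import math

PAGE_SIZE = 4096  # bytes; the 4 KB page assumed throughout this page


def write_amplification(batch_bytes: int, page_bytes: int = PAGE_SIZE) -> float:
    """Physical bytes written divided by logical bytes written for one batch."""
    if batch_bytes <= 0:
        raise ValueError("batch_bytes must be positive")
    # Assumption: every batch occupies a whole number of pages.
    pages = math.ceil(batch_bytes / page_bytes)
    return pages * page_bytes / batch_bytes


for size in (100, 500, 1024, 4096, 5000, 16384):
    print(f"{size:>6} B -> WA = {write_amplification(size):.2f}")
```

The 5000-byte case shows why batches above the floor only approach 1.0: the second page is mostly slack, but that slack is spread over a larger logical payload.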

At millions of batches per second on a high-volume topic, small batches consume roughly 4× to 41× the NVMe write bandwidth their logical throughput would suggest, burning disk IOPS and NAND endurance budget (P/E cycles) in the process.
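
To put the bandwidth claim in concrete terms, the sketch below compares logical and physical write rates for a stream of equal-size batches. The workload figures are hypothetical, and the model again assumes one page-padded write per batch with no coalescing by the broker or the drive.

```python
import math

PAGE_SIZE = 4096  # bytes; assumed NVMe page size


def physical_write_rate(batches_per_sec: int, batch_bytes: int,
                        page_bytes: int = PAGE_SIZE) -> tuple[int, int]:
    """Return (logical, physical) bytes per second for equal-size batches."""
    pages_per_batch = math.ceil(batch_bytes / page_bytes)
    logical = batches_per_sec * batch_bytes
    physical = batches_per_sec * pages_per_batch * page_bytes
    return logical, physical


# Hypothetical workload: one million 100-byte single-message batches per second.
logical, physical = physical_write_rate(1_000_000, 100)
print(f"logical  {logical / 1e6:.0f} MB/s")
print(f"physical {physical / 1e6:.0f} MB/s")
```

Under these assumptions, about 100 MB/s of producer traffic turns into roughly 4 GB/s of page writes, which is the gap the paragraph above describes.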

Recommended target¶

Redpanda's guidance is explicit: "We recommend targeting your high-volume workloads for at least a 4 kilobyte (KB) effective batch size and upwards of 16 KB where possible to really unlock performance."

4 KB floor: matches the NVMe page; eliminates the sub-page write amplification.
16 KB sweet spot: matches the default Kafka batch.size=16384 and aligns with 4-page writes, where sequential-write NAND economics kick in.
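
Whether a producer actually reaches the 4 KB floor depends on how many bytes accumulate per partition before a batch is sent. The sketch below is a rough estimate only, assuming a steady arrival rate, no compression, and that a batch is sent either when the linger time expires or when `batch.size` is reached. `batch.size` and `linger.ms` are the standard Kafka producer settings; the function name and the model itself are illustrative assumptions.

```python
def estimated_batch_bytes(msgs_per_sec_per_partition: float,
                          avg_msg_bytes: int,
                          linger_ms: float,
                          batch_size: int = 16384) -> float:
    """Rough effective batch size: bytes accumulated during linger, capped at batch.size."""
    accumulated = msgs_per_sec_per_partition * avg_msg_bytes * (linger_ms / 1000.0)
    # In this model a batch holds at least one message and is capped at batch.size.
    return min(float(batch_size), max(float(avg_msg_bytes), accumulated))


# Hypothetical producer: 2,000 msgs/s per partition, 200-byte messages.
for linger_ms in (0, 5, 10, 50):
    size = estimated_batch_bytes(2000, 200, linger_ms)
    verdict = "meets" if size >= 4096 else "below"
    print(f"linger.ms={linger_ms:>2}: ~{size:.0f} B ({verdict} the 4 KB floor)")
```

In this hypothetical case a 10 ms linger still lands just under 4096 bytes, so the estimate is best treated as a starting point to be checked against measured batch sizes.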

The CPU-saturation compound effect¶

Small batches don't just amplify disk writes — they also multiply CPU-side request cost. Each batch is a separate produce request with fixed-cost work at the broker (decode, replicate, commit). At millions of batches per second, the broker's CPU saturates on fixed-cost work, producing the scheduler queue backlog that the saturation-regime latency inversion describes.

Disk write amplification and CPU saturation compose: small batches are the single producer-side anti-pattern that hits both bottlenecks simultaneously.
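
The request-count side of the problem is simple division: for a fixed logical throughput, the number of batches per second scales inversely with batch size. The sketch below uses a hypothetical 100 MiB/s workload and deliberately assigns no CPU cost per request, since that figure is not given in the source.

```python
def produce_requests_per_sec(logical_bytes_per_sec: int, batch_bytes: int) -> float:
    """Batches per second needed to carry a given logical throughput."""
    return logical_bytes_per_sec / batch_bytes


THROUGHPUT = 100 * 1024 * 1024  # hypothetical 100 MiB/s of logical producer traffic

for batch_bytes in (100, 1024, 4096, 16384):
    rate = produce_requests_per_sec(THROUGHPUT, batch_bytes)
    print(f"{batch_bytes:>6} B batches -> {rate:,.0f} requests/s")
```

Moving from 100-byte batches to 16 KB batches cuts the request count by more than two orders of magnitude for the same payload, while also bringing the write amplification back to 1.0.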

Seen in¶

sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2 — canonical wiki source. Names the 4 KB NVMe page-alignment mechanism verbatim; the production case study where per-topic batches below 4 KB drove CPU saturation + tail-latency blowup is a worked example of the mechanism.

concepts/disk-block-size-alignment — the 4 KB page is the substrate.
concepts/write-amplification — this is one of several write-amplification mechanisms on the wiki (LSM compaction, erasure-coding replication, etc.).
concepts/nand-flash-page-block-erasure — why page-aligned writes matter physically.
concepts/effective-batch-size — the producer-side framework that determines whether the NVMe floor is crossed.
systems/nvme-ssd, systems/redpanda, systems/kafka.