
PATTERN

Broker write caching as client-tuning substitute

Problem

A Kafka-API streaming cluster is CPU-saturated because producer batches are too small, but the producers aren't tunable:

  • The producer fleet is owned by a different team on a different release cadence — coordinating linger.ms / batch.size adjustments is slow or impossible.
  • Producers span multiple languages / clients (Java, librdkafka-based Python, Go, Rust) with inconsistent defaults — a uniform policy can't be applied.
  • Topic / partition layout is locked by a downstream ordering contract — restructuring to reduce partition fan-out isn't an option.
  • Producers are third-party (external customers of a data-platform service) and cannot be reached at all.

The CPU-saturation pain is real, but a producer-side fix isn't feasible.

Solution

Enable broker-side write caching (Redpanda) or rely on Kafka's OS buffer-cache default (legacy Kafka). The broker coalesces many small in-memory writes into larger disk flushes in the background, reclaiming the batching economics the producers aren't providing — at the cost of a bounded durability relaxation (ack-on-memory at quorum, flush to disk in background).
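The batching economics can be made concrete with some illustrative arithmetic (the message size and flush-block size below are assumptions for the sketch, not Redpanda measurements): when producers ship tiny single-record batches, the default persist-before-ack path pays one disk write per request, while background coalescing pays one per large flushed block.

```python
# Illustrative arithmetic (assumed numbers): why broker-side coalescing
# reclaims the batching economics the producers aren't providing.
msgs = 10_000            # produce requests, one 200 B record each
record_bytes = 200
flush_block = 64 * 1024  # assumed background flush block size

# Default path: the broker persists before acking, so the disk-write
# count scales with the number of tiny batches.
default_disk_writes = msgs

# Write-caching path: acks come from memory; the background flusher
# coalesces records into large blocks before touching disk.
cached_disk_writes = -(-msgs * record_bytes // flush_block)  # ceil division

print(default_disk_writes, cached_disk_writes)  # → 10000 31
```

The same request stream drops from 10,000 small disk writes to roughly 31 large sequential ones, which is where the CPU and disk relief comes from.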

The equivalence frame, verbatim from Redpanda's 2024-11-26 part 2 (Source: sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2):

"Write caching is a mechanism Redpanda supports that helps alleviate the broker-side issue from having many tiny batches (or single message batches). This is especially useful for cases where your architecture makes it hard to do client-side tuning, change the producer behavior, or adjust topic and partition design."

And the durability equivalence:

"When write caching is enabled in Redpanda, the data durability guarantees are relaxed but no worse than a legacy Kafka cluster."

Mechanism summary

Default:            Client → broker → disk → ack
With write caching: Client → broker → memory → ack
                                      └─ (background) flush large block → disk

See concepts/broker-write-caching for the full state machine.
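The two ack paths in the diagram can be sketched as a toy state machine (a minimal illustration, not Redpanda's implementation; the flush threshold stands in for the background flusher):

```python
# Toy sketch of the two ack paths: ack-on-disk (default) vs
# ack-on-memory with deferred coalesced flush (write caching).
class Broker:
    def __init__(self, write_caching: bool, flush_threshold: int = 4):
        self.write_caching = write_caching
        self.flush_threshold = flush_threshold  # stand-in for background flusher
        self.memory = []   # acked but not yet durable
        self.disk = []     # durably flushed

    def produce(self, record) -> str:
        self.memory.append(record)
        if not self.write_caching:
            self._flush()  # default path: persist before acking
        elif len(self.memory) >= self.flush_threshold:
            self._flush()  # deferred: many records, one large write
        return "ack"       # with caching, this can precede the flush

    def _flush(self):
        self.disk.extend(self.memory)  # one coalesced sequential write
        self.memory.clear()

b = Broker(write_caching=True)
for i in range(3):
    b.produce(i)
# All three records are acked but still memory-only: a crash at this
# point loses them — the bounded durability relaxation described above.
```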

When to prefer this over client-side tuning

Signal                        Prefer client-side tuning    Prefer broker write caching
Producer ownership            Same team                    Other team / third-party
Client-library diversity      Single library               Multi-language fleet
Partition layout flexibility  Modifiable                   Frozen by contract
Durability strictness         Disk-fsync required          Legacy-Kafka durability acceptable
Tuning iteration speed        Fast (single-team deploys)   Slow (cross-team coordination)

The decision is primarily organisational, not technical. The technical floor (Kafka-legacy-durability equivalence) is acceptable for the vast majority of streaming workloads.

When not to use

  • Workloads with hard synchronous-durability SLAs (financial systems requiring disk-fsync-before-ack, mission-critical append logs). Kafka-legacy durability does not survive simultaneous leader + follower-quorum memory loss.
  • Workloads where the producer fleet is tunable — prefer iterative linger tuning because producer-side fixes reduce network and CPU cost, whereas broker-side caching only addresses the disk-write and commit-latency axes.

Compose with client-side tuning

Write caching and client-side batching are additive, not exclusive. If both are feasible:

  1. Start with client-side tuning (reduces network bandwidth + producer CPU + broker CPU).
  2. Enable write caching for the residual small-batch workloads that can't be tuned (reduces disk-write amplification + commit latency for those).
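The additivity of the two steps can be shown with assumed numbers (the batching and coalescing factors below are illustrative): client-side tuning divides the request count the broker must parse, and write caching then divides the disk-write count per remaining request.

```python
# Hedged sketch of why the two levers compose (assumed factors).
records = 100_000
records_per_client_batch = 50  # e.g. raised via linger.ms / batch.size
requests_per_flush = 8         # assumed broker-side coalescing factor

untuned_requests = records                              # one record per request
tuned_requests = records // records_per_client_batch    # step 1: fewer requests

disk_writes_no_cache = tuned_requests                   # one flush per request
disk_writes_cached = tuned_requests // requests_per_flush  # step 2: fewer flushes

print(tuned_requests, disk_writes_cached)  # → 2000 250
```

Step 1 cuts network and CPU cost by the client batching factor; step 2 cuts residual disk-write amplification by the coalescing factor, which is why applying both beats either alone.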

Seen in
