Redpanda — Need for speed: 9 tips to supercharge Redpanda¶
Summary¶
A numbered checklist of nine performance-tuning tips for Redpanda clusters, framed around the infrastructure → data architecture → application design triad. The post reads as an omnibus digest — a compressed tour of the knobs covered at depth by Redpanda's other engineering posts — but the checklist structure itself, the dedicated-hardware-with-95%-resource-budget deployment rule, the partition-skew-as-Amdahl's-law framing, the consumer throughput-vs-latency parameter table, the auto-commit-frequency = save-button analogy, the compression-plus-compaction CPU-cost interaction, and the tiered-storage-accelerates-decommission-and-recommission claim are the load-bearing contributions that were not yet canonicalised on the wiki.
Tier-3 borderline-on-scope: Redpanda vendor-blog checklist shape is normally a skip signal, but this post explicitly covers distributed-streaming-substrate internals (producer partitioning, consumer fetch tuning, offset-commit cost, compression + compaction interaction, tiered-storage availability benefits, broker deployment isolation) and names several novel substrate trade-offs that don't appear in the prior Kinley / Dhanushka / Gallego ingests. Architecture density ~60% of a ~1,500-word body. Bylined to Redpanda (no individual author).
Key takeaways¶
- Build infrastructure for performance, then give Redpanda 95% of it. The post's opening tenet, verbatim: "Deploy hardware that meets (and preferably exceeds) our minimum hardware requirements. Use local NVME for storage rather than spinning disks or remote storage. Run the brokers on dedicated machines (whether bare metal, VM, or containers in K8s) — no noisy neighbors! Give Redpanda 95% of the available resources (always good practice to leave some room for the OS / Kubernetes host). Monitor your resource usage and be prepared to scale as your app grows." Resource-budget discipline is explicit: leave 5% for OS + k8s host, target 95% for the broker. Canonical hardware tenet for thread-per-core brokers on NVMe. (Source: sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda)
- Write caching is the storage-hardware-shortfall mitigation. Names "SSDs, spinning disks, SAN, remote storage — anything other than locally attached NVME" as the canonical trigger for enabling broker-side write caching. Always pair with `acks=all` to preserve the "in memory on a majority of brokers" guarantee; consider multi-AZ for blast-radius limits. Verbatim positioning: "you're definitely trading durability for performance, so make sure you've considered this." Canonical wiki concept already exists (concepts/broker-write-caching + patterns/broker-write-caching-as-client-tuning-substitute); this post adds the hardware-shortfall-as-trigger framing missing from the prior Kinley case study (which framed it as an organisational substitute, not a hardware substitute).
- Partition skew is Amdahl's Law for streaming. The load-bearing metaphor, verbatim: "Think of this as the data equivalent of Amdahl's Law: data skew is the enemy of parallelization, limiting the benefits of scaling out by using more partitions. If 90% of your data goes through a single partition, then whether you have 10 partitions or 50 won't really make a difference since that single overworked partition is your limiting factor." Canonicalised as concepts/partition-skew-data-skew. The three-pronged mitigation: (a) use the sticky partitioner for unkeyed records; (b) only use keyed partitioning when CDC or ordering genuinely requires it; (c) when keys are required, pick high-cardinality keys — canonicalised as patterns/high-cardinality-partition-key.
- Batch tuning is the producer's free efficiency win. Restates the batching trade-off canonicalised by the 2024-11-19 Kinley part 1: "Batching does mean intentionally introducing latency into the produce pipeline, but it's often a worthy tradeoff and can lead to lower latency overall since the broker is more efficient." The counterintuitive regime — bigger batches lower end-to-end latency under broker saturation — is the canonical framing. Action items: raise `linger.ms` + `batch.size`; monitor average-batch-size and per-topic batch rate (the per-topic diagnosis discipline).
- Consumers have their own performance dial — and it's
defaults-unsafe. Canonicalises a four-parameter consumer
fetch-tuning axis with low-latency-vs-high-throughput
recommendations, verbatim table:
| Parameter | Low Latency | Default | High Throughput |
|---|---|---|---|
| `fetch.min.bytes` | 1 B | 1 B | 1 MB+ |
| `fetch.max.wait.ms` | < 50 ms | 500 ms | > 1000 ms |
| `max.partition.fetch.bytes` | < 100 KB | 1 MB | > 10 MB |
| `max.poll.records` | < 100 | 500 | > 5000 |

Canonicalised as concepts/consumer-fetch-tuning. The consumer-side counterpart to Kinley's producer-side `linger.ms`/`batch.size` axis: the broker serves fetches in the same fixed-vs-variable-cost regime as produce, and the consumer dial chooses its operating point.
- Auto-commit frequency is the consumer's save button — don't
press it per word. Verbatim analogy:
"When consuming a topic, committing your consumer group
offsets is exactly like pressing the save button. You record
where you've read to, and just like that save, each commit
takes time and resources. If you commit too frequently, your
consumer will be less efficient, but it can also start to
impact the broker as your consume workload gradually transforms
(somewhat unknowingly) into a consume AND produce workload,
since each read is accompanied by a commit write."
Recommendations: `auto.commit.interval.ms` ≥ 1 s (default is 5 s; low-ms is "right out"); align manual commits to ≥ 1 s; use RPO to set the minimum commit frequency (consumers committing every 10 s at worst lose 10 s of re-read on restart — usually acceptable); one consumer group per application / micro-service (shared groups burden the group coordinator). Canonicalised as concepts/offset-commit-cost.
- Compression codec choice — ZSTD or LZ4. Verbatim:
"Use ZSTD or LZ4 for a good balance between compression ratio
and CPU time if compression is essential." The space-vs-CPU
trade-off is classic, but the post adds a load-bearing
deployment rule: compress on the client, not the broker
(topic compression setting should be `producer` — pass through the client's chosen codec). Canonicalised as concepts/compression-codec-tradeoff + the pattern patterns/client-side-compression-over-broker-compression. Batching and compression compound: "The compression ratio improves as you compress more messages at once since it can take advantage of the similarities between messages." (Restates Kinley 2024-11-19 takeaway 3 — compression ratio is a function of batch size.)
- Compression + compaction = broker-side CPU on every compaction pass. The single novel-to-this-post substrate claim: the compaction process is "the only use case where the broker reads message-level details from a topic" — usually Redpanda treats records as opaque bytes. Combining compression with compaction forces the broker to decompress + recompress on every compaction pass, which can drive significant CPU utilisation with CPU-heavy codecs. Verbatim recommendation: "Don't compress compacted topics unless you're willing to spend the CPU cycles uncompressing and recompressing." Canonicalised as concepts/compression-compaction-cpu-cost. A compelling argument against ZSTD on log-compacted topics — choose LZ4, or skip compression entirely.
- Tiered storage is an availability / rebalance primitive, not just a storage-cost primitive. The standard pitch for tiered storage is capacity (hold more than your local disks). This post surfaces a load-bearing operational benefit: decommission and recommission are "orders of magnitude" faster because the data already lives in object storage: "When tiered storage is in use, decommissioning and recommissioning can both be sped up by orders of magnitude, since a copy of the data already exists out in the object store. This means only the most recent data (that is yet to be written to tiered storage) needs to be moved to or from a broker." Canonicalised as concepts/tiered-storage-fast-decommission. Related wiki: patterns/tiered-storage-to-object-store already canonicalises the storage-tier pattern from the Kozlovski Kafka-101 post — this adds a distinct operational-availability reason to adopt it.
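The producer-side takeaways (write caching's `acks=all` companion, batch tuning, client-side compression) compose into a single client configuration. A minimal sketch using the standard Kafka producer property names; the specific values are illustrative operating points, not recommendations from the post:

```python
# Illustrative producer properties composing tips 2, 4, and 7.
# Values are example operating points, not numbers from the post.
producer_config = {
    "acks": "all",              # tip 2: mandatory companion of broker write caching
    "linger.ms": 50,            # tip 4: wait up to 50 ms to fill a batch
    "batch.size": 256 * 1024,   # tip 4: larger batches amortise per-request cost
    "compression.type": "lz4",  # tip 7: compress on the client with a cheap codec
}

# Matching broker-side topic setting: pass the client's codec through unchanged,
# per the post's compress-on-the-client rule.
topic_config = {"compression.type": "producer"}
```

Raising `linger.ms` and `batch.size` together is the point: the linger window gives the batch time to fill, and the batch ceiling decides how much it can fill.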
Architectural framing¶
The nine tips as three tiers of concern¶
The post implicitly organises its nine tips across three dependency layers — each layer reusing the primitives of the one below:
- Infrastructure (tips 1–2) — hardware budget + durability compromise knob. You tune these once, at deployment. Applicable primitives: NVMe, dedicated hardware, no noisy neighbors, 95% resource budget, write caching when you can't run on NVMe.
- Data architecture (tips 3, 8, 9) — how records are laid out across the cluster. You tune these at topic-design time. Applicable primitives: partition skew, keyed vs unkeyed partitioning, compaction, tiered storage. Changes here are typically irreversible without a topic-migration workflow.
- Application design (tips 4–7) — how producers and consumers speak to brokers. You tune these at client-config time. Applicable primitives: batching, `acks`, compression codec, consumer fetch tuning, offset-commit frequency, consumer group topology.
Partition skew as Amdahl's Law¶
Amdahl's Law: if α of a workload is serial, maximum speedup is
1 / α regardless of cores.
Applied to streaming: if α of records hash to partition P, the
per-partition throughput of P caps aggregate topic throughput at
throughput(P) / α. Adding partitions scales the non-skewed 1-α
fraction, but never P — more cores on a serial phase don't help.
The three mitigations map to Amdahl's escape routes:
- Sticky partitioner — reduce the effective key-cardinality of the workload to 1 (the sticky window), routing all records through the current sticky partition in batch-sized chunks. Amdahl doesn't apply because the workload is no longer partition-cardinality-constrained.
- Keyed partitioning only when required — don't impose serialisation (α > 0) unless an ordering / CDC contract demands it.
- High-cardinality keys when required — maximise the denominator of the key-to-partition mapping, so the largest key gets ≤ 1/N of rows.
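The high-cardinality rule can be checked empirically by hashing candidate keys into partitions and measuring the hottest partition's share. A hedged sketch using `hashlib` as a stand-in for the client's key hash (real Kafka clients use murmur2; the distribution argument is the same):

```python
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the client's murmur2 key hash; any uniform hash
    # illustrates the cardinality effect.
    digest = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return digest % num_partitions

def hottest_share(keys: list[str], num_partitions: int) -> float:
    """Fraction of records landing on the busiest partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return max(counts.values()) / len(keys)

# Low cardinality: 3 distinct keys over 12 partitions puts at least
# a third of all rows on one partition, regardless of partition count.
low = hottest_share([f"region-{i % 3}" for i in range(12_000)], 12)

# High cardinality: 12,000 distinct keys lands near the ideal 1/12 share.
high = hottest_share([f"user-{i}" for i in range(12_000)], 12)
```

The `region-*` and `user-*` key names are hypothetical; the point is the cardinality of the key space relative to the partition count.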
The consumer fetch-tuning axis¶
The four parameters define a 2D space: amount of data per fetch
request (fetch.min.bytes, max.partition.fetch.bytes,
max.poll.records) × maximum wait before fetch returns
(fetch.max.wait.ms). Low-latency sits at the origin (tiny
fetches, short waits); high-throughput sits at the far corner
(big fetches, long waits); the defaults sit in the middle and are
usually wrong for one or both regimes.
Consumer-side trade-off is dual to the producer-side batching trade-off — both choose how many records travel in one broker request to amortise the fixed per-request cost. The post makes this dual explicit: "the default settings of a consumer are a reasonable starting point, one size doesn't necessarily fit all. Most consumers will have a preference for either low latency or high throughput."
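The two regimes from the post's table, expressed as consumer property sets. The property names are the standard Kafka consumer configs; the exact values here pick one point inside each of the table's order-of-magnitude ranges:

```python
# Consumer fetch-tuning operating points, following the post's table.
low_latency = {
    "fetch.min.bytes": 1,                     # return as soon as any data exists
    "fetch.max.wait.ms": 50,                  # short broker-side wait
    "max.partition.fetch.bytes": 100 * 1024,  # small per-partition chunks
    "max.poll.records": 100,                  # small poll batches
}

high_throughput = {
    "fetch.min.bytes": 1024 * 1024,           # wait until >= 1 MB is available
    "fetch.max.wait.ms": 1500,                # tolerate > 1 s broker-side wait
    "max.partition.fetch.bytes": 10 * 1024 * 1024,
    "max.poll.records": 5000,
}
```

The defaults (1 B / 500 ms / 1 MB / 500) sit between these two points, which is why they are "usually wrong for one or both regimes."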
Offset-commit as implicit produce¶
The save-button analogy crystallises an under-stated cost:
a consumer that auto-commits every N ms is also a producer to
__consumer_offsets at that rate. The broker sees
(fetch-request-rate + commit-write-rate) as the real load.
Tie-in to RPO: for a consumer that re-reads on restart, the commit frequency sets the RPO floor for that consumer's processing. Commit every 1 s ⇒ at most 1 s of re-read on consumer restart. Commit every 10 s ⇒ at most 10 s of re-read. The knob is the RPO dial for the consume side.
Compression + compaction interaction¶
Normal broker I/O is opaque-byte handling: the broker never looks inside record payloads. Compaction violates this — the broker must read the record key to determine which records are superseded.
When compression is enabled, this read-key requirement forces decompress → read-key → rewrite per compaction pass, which compounds with compaction frequency to drive broker CPU. "Combining compression (particularly with CPU-intensive codecs) with compaction can lead to significant CPU utilization. Again, this is a classic trade-off between space utilization and CPU time."
The practical rule: if a topic must be compressed, choose LZ4 (low CPU cost on decompress/recompress); if a topic must be compacted, prefer not compressing at all; if both are required, accept the CPU tax and provision accordingly.
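The practical rule reads as a small decision function. A hedged sketch (the rule is the post's; the encoding as code is mine):

```python
def choose_codec(compacted: bool, compression_needed: bool, cpu_headroom: bool) -> str:
    """Pick a topic codec under the compression + compaction CPU rule."""
    if not compression_needed:
        return "none"
    if not compacted:
        # Broker treats records as opaque bytes: no decompress cost, so
        # the higher-ratio codec is free to use.
        return "zstd"
    # Compacted topic: every compaction pass decompresses and recompresses.
    if cpu_headroom:
        return "lz4"   # cheap decompress/recompress
    return "none"      # don't pay the CPU tax at all
```

The `cpu_headroom` flag stands in for the "provision accordingly" judgment call; the post leaves that threshold to the operator.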
Tiered storage as rebalance accelerator¶
Standard pitch for tiered storage: "store more data than local disk allows" — a capacity-scaling story.
The post's reframe: data in object storage is already replicated away from any single broker, so when a broker leaves or joins the cluster, the cold bulk of its partitions don't need to move at all. Only the hot portion still pending offload (current segments on local disk) crosses broker-to-broker.
For a broker holding, say, 10 TB of partitions where 9.5 TB is in tiered storage and 0.5 TB is hot:
- Without tiered storage: 10 TB must re-replicate to the new broker.
- With tiered storage: 0.5 TB must re-replicate.
At cross-broker replication rates of ~GB/sec, the difference is minutes vs hours. Canonicalised as concepts/tiered-storage-fast-decommission.
Numbers and thresholds named¶
- Resource budget: 95% for Redpanda, 5% for OS / k8s host.
- `auto.commit.interval.ms` minimum sensible value: 1 s (default is 5 s; low-ms "right out").
- `fetch.min.bytes` recommended high-throughput: 1 MB+ (default 1 B).
- `fetch.max.wait.ms` recommended high-throughput: > 1000 ms (default 500 ms).
- `max.partition.fetch.bytes` recommended high-throughput: > 10 MB (default 1 MB).
- `max.poll.records` recommended high-throughput: > 5000 (default 500).
- Compression 5:1 ratio → 80% bandwidth savings ("If you can compress messages at a ratio of 5:1, you can reduce what you would have sent by 80%, which helps every stage of the data lifecycle (ingestion, storage, and retrieval).")
- Tiered-storage decommission/recommission speedup: "orders of magnitude" (qualitative).
No production latency / throughput numbers, no customer case studies, no percentile tables — the post is a checklist, not a retrospective.
Caveats¶
- Vendor-blog checklist voice. No individual author byline; no customer data; no benchmark numbers. The recommendations are broadly sound but uncalibrated — the post says "consider", "try to", "generally" more often than "measure".
- Consumer fetch-tuning table is qualitative only. The low-latency / default / high-throughput columns give order-of-magnitude ranges (< 100 KB / 1 MB / > 10 MB) without workload-specific benchmark validation. Operators must measure their own workloads against the defaults.
- Auto-commit-vs-manual-commit-vs-disable-auto-commit is collapsed. The post discusses `auto.commit.interval.ms` and manual commits but doesn't walk the performance-vs-correctness trade-offs (at-least-once ⇒ commit after processing; commits before process-completion can drop records on consumer crash).
- Keyed partitioning's ordering-contract motivation is asserted, not explained. CDC is named as the canonical justification without mechanism discussion; readers unfamiliar with CDC don't learn why per-key ordering matters.
- Compression + compaction CPU cost is asserted without numbers. No datapoint on how much broker CPU a compacted + ZSTD'd topic burns vs uncompressed-compacted or uncompacted.
- Tiered-storage decommission speedup is asserted as "orders of magnitude" — no numbers, no case study. A real production datum (e.g. "10 TB broker moves in 15 min instead of 3 h") would strengthen the claim.
- Multi-AZ + write caching interaction is noted but not walked. The post says consider multi-AZ when enabling write caching to reduce blast radius, but doesn't explain the failure-mode interaction (memory-loss-on-single-AZ survives if writes replicated to other-AZ memory; memory-loss-on-all-AZs still loses acked writes).
- `min.insync.replicas` not discussed. The write-caching + `acks=all` discussion elides the composition with `min.insync.replicas`, which is the canonical fail-closed ISR-floor composer. Operators enabling write caching should set `min.insync.replicas ≥ 2` to prevent ISR-shrink silently collapsing to `acks=1`-equivalent durability.
- No discussion of monitoring. Most recommendations require corresponding observability (per-topic average batch size, partition-level message rate, consumer-group commit rate, scheduler queue length for saturation detection) which this post doesn't cover — refer to the Kinley part 2 ops manual for the Prometheus cookbook.
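The recommended composition from this caveat, as topic and producer settings. A hedged sketch (the post itself never names `min.insync.replicas`; the numbers below are the caveat's recommendation, not the vendor's):

```python
# Fail-closed durability floor for a write-caching deployment.
# Not from the post: it discusses acks=all but omits the ISR floor.
topic_config = {
    "replication.factor": 3,    # set at topic-creation time
    "min.insync.replicas": 2,   # writes fail rather than silently degrade
}
producer_config = {
    "acks": "all",              # ack only after the in-sync set holds the write
}
```

With `min.insync.replicas=2`, an ISR shrunk to a single broker rejects writes instead of acknowledging them at `acks=1`-equivalent durability, which is the fail-closed behaviour the caveat asks for.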
Cross-source continuity¶
Fourth substantive Redpanda wiki ingest after Kinley 2024-11-19 part 1, Kinley 2024-11-26 part 2, Dhanushka 2025-01-21 Medallion, sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters|2025-02-11 stretch clusters, sources/2025-03-18-redpanda-3-powerful-connectors-for-real-time-change-data-capture|2025-03-18 CDC connectors, and Gallego 2025-04-03 autonomy vision. Where Kinley part 1+2 drilled into producer batching, this post widens the lens to the full producer + consumer + data-layout + deployment surface — effectively a sitemap of the Redpanda tuning universe. Each tip either restates a canonicalised primitive (tips 2, 4, partially 7) or fills a gap (tips 1, 3, 5, 6, 8, 9).
Composes with:
- Kozlovski Kafka 101 — tiered storage's canonical wiki explainer (capacity axis). This post adds the rebalance-acceleration axis to the same primitive.
- Dicken sharding — canonicalised shard-key cardinality in the MySQL sharding context. This post extends it to the Kafka-partition-key context.
- sources/2025-02-11-redpanda-high-availability-deployment-multi-region-stretch-clusters|2025-02-11 stretch clusters — canonicalised `acks=1` as the per-write latency-relief knob on stretch clusters. This post reaffirms `acks=all` as the mandatory companion of write caching.
Source¶
- Original: https://www.redpanda.com/blog/top-performance-considerations-redpanda
- Raw markdown: raw/redpanda/2025-04-23-need-for-speed-9-tips-to-supercharge-redpanda-108d487d.md
Related¶
- systems/redpanda — the system.
- systems/kafka — Kafka-API compatible; recommendations apply.
- companies/redpanda — company page.
- concepts/partition-skew-data-skew — tip 3's canonical concept.
- concepts/consumer-fetch-tuning — tip 5's canonical concept.
- concepts/offset-commit-cost — tip 6's canonical concept.
- concepts/compression-codec-tradeoff — tip 7's canonical concept.
- concepts/compression-compaction-cpu-cost — tip 8's canonical concept.
- concepts/tiered-storage-fast-decommission — tip 9's canonical concept.
- concepts/keyed-partitioner — partition-strategy concept.
- patterns/high-cardinality-partition-key — the operational pattern for tip 3.
- patterns/client-side-compression-over-broker-compression — tip 7's canonical pattern.
- concepts/broker-write-caching — tip 2's concept.
- patterns/broker-write-caching-as-client-tuning-substitute — already-canonicalised pattern; reused here.
- concepts/batching-latency-tradeoff — tip 4's concept.
- concepts/sticky-partitioner — tip 3's mitigation.
- concepts/acks-producer-durability — tip 2's `acks=all` companion.
- concepts/rpo-rto — tip 6's commit-frequency-as-RPO link.
- patterns/tiered-storage-to-object-store — tip 9's broader pattern.
- concepts/noisy-neighbor — tip 1's deployment-isolation rule.