CONCEPT Cited by 5 sources
Compression codec trade-off¶
Definition¶
Compression codec trade-off is the choice a streaming producer makes between space / bandwidth savings (high compression ratio = fewer bytes over the wire, less disk, cheaper cross-region transfer) and CPU time (compression and decompression are CPU-bound operations). No single codec dominates both axes; the operator picks a point on the curve.
Redpanda's verbatim guidance (Source: sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda):
"There are many choices of compression codecs. Some will compress extremely well, but also require a significant amount of CPU time and memory. Others will compress more moderately, but use far fewer resources. A classic tradeoff."
Why compression matters¶
Kafka's data path involves many byte transfers:
"Producers spend their days sending data, which Redpanda dutifully writes to NVMe devices and sends it over the network to other brokers to do the same. Consumers then send requests for data (via the network), so Redpanda retrieves it (from memory or NVMe) and sends it back over the network. Finally, consumers send in their commits. That's a lot of data transfers."
Each medium (network, disk) has fixed capacity. Compression is the lever: "If you can compress messages at a ratio of 5:1, you can reduce what you would have sent by 80%, which helps every stage of the data lifecycle (ingestion, storage, and retrieval)."
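The 5:1 → 80% arithmetic generalises: at ratio r you send 1/r of the original bytes, saving the rest. A minimal sketch (the function name is illustrative, not from the source):

```python
def bandwidth_savings(ratio: float) -> float:
    """Fraction of bytes saved at a given compression ratio (e.g. 5.0 for 5:1)."""
    return 1 - 1 / ratio

print(f"{bandwidth_savings(5.0):.0%}")  # 5:1 ratio → 80% fewer bytes over the wire
print(f"{bandwidth_savings(2.0):.0%}")  # even a modest 2:1 halves transfer
```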
The codecs¶
The four canonical Kafka / Redpanda producer codecs:
| Codec | Compression ratio | CPU cost | Memory cost | Typical use |
|---|---|---|---|---|
| gzip | High | High | High | Legacy; archives where CPU is cheap |
| snappy | Low-medium | Low | Low | Legacy; fast but weak compression |
| LZ4 | Medium | Low | Low | General-purpose default; good balance |
| ZSTD | High | Medium | Medium | Modern default when ratio matters |
The post's bottom-line recommendation:
"Use ZSTD or LZ4 for a good balance between compression ratio and CPU time if compression is essential."
ZSTD (Zstandard, Facebook 2016) is the modern sweet spot — gzip-class ratios at a fraction of the CPU.
LZ4 (2011) is the speed-first choice — lower ratio than ZSTD but substantially lower CPU cost, particularly on decompression. Preferred for CPU-constrained consumers or compaction-heavy topics (see concepts/compression-compaction-cpu-cost).
gzip and snappy are legacy defaults; no reason to prefer them over ZSTD / LZ4 for new workloads.
Compression composes with batching¶
The ratio is a function of batch size — bigger batches compress better because there are more opportunities for dictionary reuse across records. From Kinley 2024-11-19 part 1:
"The compression ratio improves as you compress more messages at once since it can take advantage of the similarities between messages."
Implications:
- Small batches (< 4 KB) compress poorly — there's no dictionary to reuse.
- Larger batches compress asymptotically better, with diminishing returns past ~100 KB for most payloads.
- Any decision to adopt compression should be co-tuned with effective batch size. A well-batched LZ4 workload outperforms a tiny-batch ZSTD one.
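The batch-size effect is easy to demonstrate with stdlib zlib (again as a stand-in codec; the record schema is invented for illustration): compressing records one at a time forfeits the cross-record redundancy that a whole-batch stream can exploit.

```python
import json
import random
import zlib

random.seed(0)
records = [
    json.dumps({"sensor": f"s{random.randrange(8)}",
                "temp": round(random.uniform(15.0, 25.0), 2),
                "unit": "celsius"}).encode()
    for _ in range(1000)
]
raw = sum(len(r) for r in records)

# Per-record compression: each tiny payload pays full codec overhead
# and has no earlier records to share dictionary matches with.
per_record = sum(len(zlib.compress(r)) for r in records)

# Whole-batch compression: one stream over all records lets the codec
# reuse the repeated field names and values across the batch.
batched = len(zlib.compress(b"".join(records)))

print(f"per-record ratio: {raw / per_record:.2f}")
print(f"batched ratio:    {raw / batched:.2f}")  # markedly higher
```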
This interacts with Redpanda's sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2|2024-11-26 part 2 production case study: after linger-tuning, the customer's bandwidth dropped from 1.1 GB/sec to 575 MB/sec for the same 1.2 M msg/sec flow — compression was a major contributor alongside Kafka-metadata-overhead reduction. Larger batches → better compression ratios.
Compress on the client, not the broker¶
Two places compression could happen:
- Client-side: producer compresses batches; consumer decompresses. Broker treats batches as opaque bytes.
- Broker-side: client sends uncompressed; broker compresses before writing; decompresses on consumer read.
Redpanda's rule: client-side, always. Verbatim:
"Compress on the client, not the broker (topic configuration for compression should be set to producer)."
Setting compression.type=producer on the topic means the broker accepts whatever codec the client chose and passes it through unchanged — no broker CPU spent. Canonicalised as patterns/client-side-compression-over-broker-compression.
Clients compress batches, not individual messages: "Clients compress batches, not messages, therefore increasing batching will also make compression more effective."
When compression is wrong¶
- Already-compressed payloads — JPEG images, MP4 video, gzipped logs all compress poorly and waste CPU. Double-compression adds overhead without savings.
- Tiny batches — batches below ~1 KB can't accumulate enough repetition for the codec to exploit. The ratio approaches 1:1 (no savings), and codec framing overhead can even make the compressed batch larger than the original.
- Compacted topics with ZSTD — every compaction pass must decompress + recompress. See concepts/compression-compaction-cpu-cost. If compaction + compression are both required, prefer LZ4.
- Extreme-latency-sensitive paths — compression adds per-batch CPU; where every µs matters (financial HFT), uncompressed may win.
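The first two failure modes above can be shown with stdlib zlib (a stand-in codec; random bytes stand in for an already-compressed payload such as a JPEG):

```python
import os
import zlib

jpeg_like = os.urandom(64 * 1024)  # incompressible, like already-compressed media
text_like = b'{"level":"info","msg":"request ok"}\n' * 2000  # highly repetitive

# Random bytes: zlib falls back to stored blocks, so framing overhead
# makes the output slightly LARGER than the input — pure CPU waste.
print(len(zlib.compress(jpeg_like)) / len(jpeg_like))

# Repetitive text: the same CPU spend yields a dramatic reduction.
print(len(zlib.compress(text_like)) / len(text_like))
```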
Seen in¶
- sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse — extends the trade-off along the per-column-codec axis for columnar storage engines. ClickHouse's CODEC(LZ4) / CODEC(ZSTD) column-level clause lets hot columns use the fast codec and cold columns use the aggressive codec within the same table; the stream-level LZ4-vs-ZSTD trade-off becomes a per-storage-tier choice on the analytical warehouse side. Canonicalised as concepts/hot-cold-tier-compression-codec-split: "Your most recent data can be compressed using a lightweight codec, such as LZ4, and stored on local SSDs, while older data can be aggressively compressed and offloaded to S3-backed storage. Keep in mind that higher compression levels reduce storage footprint but increase data access latency."
- sources/2025-10-02-redpanda-real-time-analytics-redpanda-snowflake-streaming — extends the trade-off along the encoding-format axis: AVRO delivers ~20% throughput uplift over JSON on 1 KB randomised payloads in the 14.5 GB/s Redpanda → Snowflake benchmark. The randomised content was chosen specifically to neutralise codec-level compression effects and isolate the encoding-format contribution. Composes with codec choice; canonicalised as patterns/binary-format-for-broker-throughput.
- sources/2025-04-23-redpanda-need-for-speed-9-tips-to-supercharge-redpanda — canonical wiki source. ZSTD / LZ4 bottom-line recommendation; client-side-compression rule; 5:1-ratio → 80%-savings arithmetic; batches-compress-better framing.
- sources/2024-11-19-redpanda-batch-tuning-in-redpanda-for-optimized-performance-part-1 — batching-compounds-with-compression framing.
- sources/2024-11-26-redpanda-batch-tuning-in-redpanda-to-optimize-performance-part-2 — production case: 1.1 GB/sec → 575 MB/sec bandwidth drop at identical message rate, attributed jointly to compression + Kafka-metadata overhead.
Related¶
- systems/kafka, systems/redpanda — Kafka-API producer codec selection.
- systems/clickhouse — per-column CODEC() clause enables hot-cold tiering on columnar storage.
- concepts/effective-batch-size — bigger batches compress better.
- concepts/batching-latency-tradeoff — same substrate economics as compression.
- concepts/compression-compaction-cpu-cost — adjacent trade-off where compression and compaction interact.
- concepts/fixed-vs-variable-request-cost — compression reduces the variable cost.
- concepts/hot-cold-tier-compression-codec-split — ClickHouse-specific per-tier per-column codec choice.
- patterns/client-side-compression-over-broker-compression — the operational pattern.