CONCEPT Cited by 1 source
Hot-cold tier compression codec split
Definition
A storage-tiering discipline where different compression codecs are applied per tier according to the query-access profile: recent/hot data uses a lightweight codec (decode-fast, modest-ratio) on fast local storage; aged/cold data uses an aggressive codec (decode-slow, high-ratio) on slow bulk storage (e.g. object storage).
The canonical pairing (from the 2025-12-09 Redpanda IoT pipeline post):
- Hot: local SSD, `LZ4` codec → fast decode, moderate compression ratio, OK for the continuous-access query load.
- Cold: S3-backed storage, `ZSTD` codec → small footprint, slower decode, OK because the data is queried infrequently.
Verbatim framing: "Use ClickHouse's S3-based hybrid storage
to optimize performance while keeping costs low. Your most
recent data can be compressed using a lightweight codec, such
as LZ4, and stored on local SSDs, while older data can be
aggressively compressed and offloaded to S3-backed storage.
Keep in mind that higher compression levels reduce storage
footprint but increase data access latency. For this reason,
they're best reserved for long-term archival tiers where
data is infrequently queried."
(Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)
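A minimal sketch of how the pairing above could be expressed in ClickHouse DDL. The table name `sensor_readings`, the storage policy `hot_cold`, the volume name `cold`, the 30-day boundary, and the `ZSTD(17)` level are all illustrative assumptions, not taken from the source post. The sketch also assumes the storage policy is already defined in the server configuration and that the server's default compression method is LZ4 (the default on a self-managed install), so columns without an explicit codec start out LZ4-compressed. The `RECOMPRESS` TTL rule is one possible way to switch codecs at the age boundary; the source does not say which mechanism it has in mind.

```sql
-- Hypothetical names: sensor_readings, hot_cold, cold.
-- Assumes storage policy 'hot_cold' already exists with a local-SSD volume
-- and an S3-backed volume named 'cold'.
CREATE TABLE sensor_readings
(
    ts        DateTime,
    device_id UInt32,
    value     Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (device_id, ts)
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold',                -- move aged parts to the cold volume
    ts + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(17))       -- re-encode aged parts with a heavier codec
SETTINGS storage_policy = 'hot_cold';
```

New inserts land on the hot volume with the lightweight default codec; the heavier encoding is only paid once, in the background, after the data has aged.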
Why tier by codec, not just by medium
Standard media-based tiering (SSD vs HDD vs S3) saves money by moving aged data off fast storage. Codec-based tiering stacks a second optimization on top:
- Storage size is the primary cost driver on cold tiers (S3 charges per-GB-month). Aggressive compression directly reduces the bill.
- CPU cost of decompression is only paid on query. Cold data is queried rarely — you're paying the decode tax only when you actually need the data.
- Hot-data query latency is the primary perf driver. Hot data is hit constantly — you want fastest possible decode, not smallest possible bytes.
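To check whether the codec split is actually paying off, the footprint per tier can be read from `system.parts`. This is a sketch under assumptions: the table name `sensor_readings` is hypothetical, and the disk names returned depend on how the storage policy is configured.

```sql
-- Per-disk footprint of one table's active parts.
-- 'sensor_readings' is a hypothetical table name.
SELECT
    disk_name,
    formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active
  AND database = currentDatabase()
  AND table = 'sensor_readings'
GROUP BY disk_name
ORDER BY disk_name;
```

A cold-tier ratio that is not clearly higher than the hot-tier ratio suggests the aggressive codec is adding decode cost without reducing the per-GB bill much.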
Canonical codec choices
| Codec | Typical use | Ratio | Decode cost |
|---|---|---|---|
| `NONE` | Very fast hot data | 1.0× | near-zero |
| `LZ4` | Default hot codec | 2–3× | very low |
| `LZ4HC` | Moderate hot | 3–4× | low |
| `ZSTD(1)` | Balanced hot | 3–5× | low-moderate |
| `ZSTD(22)` | Aggressive cold | 5–8× | high |
| `Delta` + `ZSTD` | Time-series cold | 8–15× on monotonic columns | moderate |
| `DoubleDelta` + `ZSTD` | Monotonic time cols | 15–25× | moderate |
| `Gorilla` | Float time-series | 2–6× | low |
ClickHouse supports these via column-level `CODEC(...)` clauses, set at table creation or changed later with `ALTER TABLE`.
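A short sketch of what column-level codec clauses can look like. The table name `codec_demo`, the column names, and the specific codec-per-column choices are illustrative assumptions rather than recommendations from the source; the right pairing depends on the data.

```sql
-- Hypothetical table; codecs chosen per column to match the data shape.
CREATE TABLE codec_demo
(
    ts        DateTime CODEC(DoubleDelta, ZSTD(1)),  -- monotonic timestamps
    device_id UInt32   CODEC(LZ4),                   -- cheap decode for a frequently filtered column
    reading   Float64  CODEC(Gorilla, LZ4),          -- slowly changing float series
    payload   String   CODEC(ZSTD(3))                -- free-form text
)
ENGINE = MergeTree
ORDER BY (device_id, ts);

-- Changing the codec of an existing column afterwards.
ALTER TABLE codec_demo
    MODIFY COLUMN payload String CODEC(ZSTD(9));
```

Codecs inside one `CODEC(...)` clause are applied in order, so a preprocessing codec such as `DoubleDelta` or `Gorilla` comes first and the general-purpose compressor comes last.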
Composes with
- concepts/compression-codec-tradeoff: the general ratio-vs-CPU trade-off; this pattern applies it tier by tier.
- concepts/clickhouse-ttl-policy: the mechanism that moves data between tiers at age boundaries (`TO VOLUME 'cold'`).
- concepts/clickhouse-detached-partition-archival: the manual equivalent; detached partitions can be recompressed with a heavier codec before moving off-host.
- concepts/storage-media-tiering: the outer pattern (SSD/HDD/S3); codec-split is the inner refinement.
Caveats
- Recompressing existing data is not free. Changing the codec on an existing column does not shrink data already on disk until those parts are rewritten (re-read, re-encode, re-write), which is expensive on large tables. It is easier to get the hot-cold split right at the tier-move step, when aged data is written out with the new codec.
- Decompression-on-query overhead on cold tiers can dominate small queries. A point lookup on a `ZSTD(22)` column may pay more in decode than in storage lookup.
- Compression ratio depends heavily on data shape. Timestamp/monotonic columns compress 10–25×; random-UUID columns barely 1.2×. Benchmarking on real data is required.
- Codec choice also affects CPU cost at insert time. `ZSTD(22)` on the hot ingest path kills write throughput; this pattern specifically avoids that by applying aggressive codecs only after data ages out.
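For the benchmarking caveat, one low-effort starting point is to load a representative sample into a test table and read the achieved per-column ratios from `system.columns`. This is a sketch; the table name `codec_demo` is hypothetical and should be replaced with the table under test.

```sql
-- Achieved compression ratio per column of one table.
-- 'codec_demo' is a hypothetical table name.
SELECT
    name,
    compression_codec,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / nullIf(data_compressed_bytes, 0), 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'codec_demo'
ORDER BY ratio DESC;
```

This only reports size. Decode cost on cold data still has to be measured separately by timing representative queries against each candidate codec.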
Seen in
- sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse: canonical wiki introduction. The Redpanda IoT-pipeline tutorial post frames the `LZ4`-on-SSD + `ZSTD`-on-S3 pairing as the canonical hot-cold split for ClickHouse hybrid storage, explicitly trading compression ratio against access latency.