
Hot-cold tier compression codec split

Definition

A storage-tiering discipline where different compression codecs are applied per tier according to the query-access profile: recent/hot data uses a lightweight codec (decode-fast, modest-ratio) on fast local storage; aged/cold data uses an aggressive codec (decode-slow, high-ratio) on slow bulk storage (e.g. object storage).

The canonical pairing (from the 2025-12-09 Redpanda IoT pipeline post):

  • Hot — local SSD, LZ4 codec → fast decode, moderate compression ratio, OK for the continuous-access query load.
  • Cold — S3-backed storage, ZSTD codec → small footprint, slower decode, OK because the data is queried infrequently.

Verbatim framing: "Use ClickHouse's S3-based hybrid storage to optimize performance while keeping costs low. Your most recent data can be compressed using a lightweight codec, such as LZ4, and stored on local SSDs, while older data can be aggressively compressed and offloaded to S3-backed storage. Keep in mind that higher compression levels reduce storage footprint but increase data access latency. For this reason, they're best reserved for long-term archival tiers where data is infrequently queried." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

Why tier by codec, not just by medium

Standard media-based tiering (SSD vs HDD vs S3) saves money by moving aged data off fast storage. Codec-based tiering stacks a second optimization on top:

  1. Storage size is the primary cost driver on cold tiers (S3 charges per-GB-month). Aggressive compression directly reduces the bill.
  2. CPU cost of decompression is only paid on query. Cold data is queried rarely — you're paying the decode tax only when you actually need the data.
  3. Hot-data query latency is the primary perf driver. Hot data is hit constantly — you want fastest possible decode, not smallest possible bytes.
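A back-of-envelope sketch of point 1, as a plain ClickHouse SELECT. All numbers are hypothetical: 1 TiB of raw data, S3 at roughly $0.023/GB-month, and illustrative ratios of ~2.5× for LZ4 and ~7× for ZSTD:

SELECT
    1024 AS raw_gb,
    round(1024 / 2.5 * 0.023, 2) AS lz4_monthly_usd,   -- ≈ $9.42 at a ~2.5× ratio
    round(1024 / 7.0 * 0.023, 2) AS zstd_monthly_usd;  -- ≈ $3.36 at a ~7× ratio

Under these assumptions the aggressive codec roughly triples the savings on the cold tier; on the hot tier the same arithmetic is dominated by SSD cost and decode latency instead, which is why the split goes codec-per-tier rather than one codec everywhere.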

Canonical codec choices

Codec                Typical use             Ratio                  Decode cost
NONE                 Very fast hot data      1.0×                   near-zero
LZ4                  Default hot codec       2–3×                   very low
LZ4HC                Moderate hot            3–4×                   low
ZSTD(1)              Balanced hot            3–5×                   low–moderate
ZSTD(22)             Aggressive cold         5–8×                   high
Delta + ZSTD         Time-series cold        8–15× (monotonic)      moderate
DoubleDelta + ZSTD   Monotonic time cols     15–25×                 moderate
Gorilla              Float time-series       2–6×                   low

ClickHouse supports these via column-level CODEC(...) clauses, set at table creation or changed later with ALTER TABLE:

ALTER TABLE telemetry_events
    MODIFY COLUMN value Float64 CODEC(ZSTD);
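ClickHouse can also automate the split itself with part-level TTL rules, so aged parts are both moved and re-encoded in the background. A sketch, assuming a server-side storage policy named hot_cold with an S3-backed volume named 'cold' (both hypothetical names; the 30-day interval and ZSTD level are illustrative):

CREATE TABLE telemetry_events
(
    ts        DateTime CODEC(DoubleDelta, LZ4),  -- hot-tier default: fast decode
    device_id String,
    value     Float64  CODEC(LZ4)
)
ENGINE = MergeTree
ORDER BY (device_id, ts)
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold',            -- move aged parts to S3
    ts + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(17))   -- re-encode them aggressively
SETTINGS storage_policy = 'hot_cold';

Because the recompression happens in background merges as parts age out, the hot ingest path never pays the ZSTD encode cost.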

Caveats

  • Re-compressing existing data is not free. Changing a codec on an existing column requires a mutation (re-read, re-encode, re-write) which is expensive on large tables. Easier to get the hot-cold split right at the tier-move step (write-with-new-codec).
  • Decompression-on-query overhead on cold tiers can dominate small queries. A point-lookup on a ZSTD(22) column may pay more in decode than in storage lookup.
  • Compression ratio depends heavily on data shape. Timestamp/monotonic columns compress 10–25×; random-UUID columns barely 1.2×. Benchmarking on real data is required.
  • Codec choice also affects CPU cost at insert time. ZSTD(22) on hot ingest path kills write throughput; this pattern specifically avoids that by applying aggressive codecs only after data ages out.
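To ground the benchmarking caveat, ClickHouse exposes per-column compressed and uncompressed sizes in system.columns, so the achieved ratio can be read off real data (table name assumed):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS on_disk,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'telemetry_events'
ORDER BY ratio DESC;

Columns with ratios near 1× are candidates to leave on LZ4 (or NONE) even on the cold tier, since aggressive codecs buy little there.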
