
Hot-cold tier compression codec split

Definition

A storage-tiering discipline where different compression codecs are applied per tier according to the query-access profile: recent/hot data uses a lightweight codec (decode-fast, modest-ratio) on fast local storage; aged/cold data uses an aggressive codec (decode-slow, high-ratio) on slow bulk storage (e.g. object storage).

The canonical pairing (from the 2025-12-09 Redpanda IoT pipeline post):

  • Hot — local SSD, LZ4 codec → fast decode, moderate compression ratio, OK for the continuous-access query load.
  • Cold — S3-backed storage, ZSTD codec → small footprint, slower decode, OK because the data is queried infrequently.

Verbatim framing: "Use ClickHouse's S3-based hybrid storage to optimize performance while keeping costs low. Your most recent data can be compressed using a lightweight codec, such as LZ4, and stored on local SSDs, while older data can be aggressively compressed and offloaded to S3-backed storage. Keep in mind that higher compression levels reduce storage footprint but increase data access latency. For this reason, they're best reserved for long-term archival tiers where data is infrequently queried." (Source: sources/2025-12-09-redpanda-streaming-iot-and-event-data-into-snowflake-and-clickhouse)

Why tier by codec, not just by medium

Standard media-based tiering (SSD vs HDD vs S3) saves money by moving aged data off fast storage. Codec-based tiering stacks a second optimization on top:

  1. Storage size is the primary cost driver on cold tiers (S3 charges per-GB-month). Aggressive compression directly reduces the bill.
  2. CPU cost of decompression is only paid on query. Cold data is queried rarely — you're paying the decode tax only when you actually need the data.
  3. Hot-data query latency is the primary perf driver. Hot data is hit constantly — you want fastest possible decode, not smallest possible bytes.
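A back-of-envelope sketch of point 1, as a plain ClickHouse SELECT. All numbers are hypothetical: 1 TiB of raw data, S3 at roughly $0.023/GB-month, and illustrative ratios of ~2.5× for LZ4 and ~7× for ZSTD:

SELECT
    1024 AS raw_gb,
    round(1024 / 2.5 * 0.023, 2) AS lz4_monthly_usd,   -- ≈ $9.42 at a ~2.5× ratio
    round(1024 / 7.0 * 0.023, 2) AS zstd_monthly_usd;  -- ≈ $3.36 at a ~7× ratio

Under these assumptions the aggressive codec roughly triples the savings on the cold tier; on the hot tier the same arithmetic is dominated by SSD cost and decode latency instead, which is why the split goes codec-per-tier rather than one codec everywhere.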

Canonical codec choices

Codec                Typical use             Ratio                  Decode cost
NONE                 Very fast hot data      1.0×                   near-zero
LZ4                  Default hot codec       2–3×                   very low
LZ4HC                Moderate hot            3–4×                   low
ZSTD(1)              Balanced hot            3–5×                   low–moderate
ZSTD(22)             Aggressive cold         5–8×                   high
Delta + ZSTD         Time-series cold        8–15× (monotonic)      moderate
DoubleDelta + ZSTD   Monotonic time cols     15–25×                 moderate
Gorilla              Float time-series       2–6×                   low

ClickHouse supports these via column-level CODEC(...) clauses, set at table creation or changed later with ALTER TABLE:

ALTER TABLE telemetry_events
    MODIFY COLUMN value Float64 CODEC(ZSTD);
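ClickHouse can also automate the split itself with part-level TTL rules, so aged parts are both moved and re-encoded in the background. A sketch, assuming a server-side storage policy named hot_cold with an S3-backed volume named 'cold' (both hypothetical names; the 30-day interval and ZSTD level are illustrative):

CREATE TABLE telemetry_events
(
    ts        DateTime CODEC(DoubleDelta, LZ4),  -- hot-tier default: fast decode
    device_id String,
    value     Float64  CODEC(LZ4)
)
ENGINE = MergeTree
ORDER BY (device_id, ts)
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold',            -- move aged parts to S3
    ts + INTERVAL 30 DAY RECOMPRESS CODEC(ZSTD(17))   -- re-encode them aggressively
SETTINGS storage_policy = 'hot_cold';

Because the recompression happens in background merges as parts age out, the hot ingest path never pays the ZSTD encode cost.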

Caveats

  • Re-compressing existing data is not free. Changing a codec on an existing column requires a mutation (re-read, re-encode, re-write) which is expensive on large tables. Easier to get the hot-cold split right at the tier-move step (write-with-new-codec).
  • Decompression-on-query overhead on cold tiers can dominate small queries. A point-lookup on a ZSTD(22) column may pay more in decode than in storage lookup.
  • Compression ratio depends heavily on data shape. Timestamp/monotonic columns compress 10–25×; random-UUID columns barely 1.2×. Benchmarking on real data is required.
  • Codec choice also affects CPU cost at insert time. ZSTD(22) on hot ingest path kills write throughput; this pattern specifically avoids that by applying aggressive codecs only after data ages out.
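To ground the benchmarking caveat, ClickHouse exposes per-column compressed and uncompressed sizes in system.columns, so the achieved ratio can be read off real data (table name assumed):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS on_disk,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'telemetry_events'
ORDER BY ratio DESC;

Columns with ratios near 1× are candidates to leave on LZ4 (or NONE) even on the cold tier, since aggressive codecs buy little there.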
