
CONCEPT Cited by 1 source

Format-aware compression

Definition

Format-aware compression is lossless compression in which the compressor is given, as an explicit input, the shape of the data it is compressing (rows, columns, enums, numeric ranges, nested records, timeseries or tensor shape). The compressor uses that shape to apply a sequence of reversible transforms that surface patterns before entropy coding, rather than treating the input as an undifferentiated byte stream.

Contrast with generic compressors (zstd, xz, gzip): they see only bytes, so they either apply one-size-fits-all techniques or "spend a lot of their cycles guessing which techniques to use." (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.)

Why the category exists

Meta's framing in the OpenZL launch post: using generic methods on structured data leaves compression gains on the table. "Data isn't just byte soup. It can be columnar, encode enums, be restricted to specific ranges, or carry highly repetitive fields. More importantly, it has predictable shapes. A bespoke compressor that leans into that structure can beat general-purpose tools on both ratio and speed."

Format-awareness is the middle ground between:

  • Generic compression: one binary, universal, lower ratio on structured data.
  • Bespoke per-format compressors: high ratio, but "every bespoke scheme means another compressor and decompressor to create, ship, audit, patch, and trust."

The three design levers

Format-aware compressors vary on three axes (implicit in the OpenZL post, explicit in the compression Plan architecture):

  1. How the shape is declared. OpenZL takes either an SDDL declaration or a registered parser function; specialized compressors typically hard-code the format.
  2. How the transform sequence is chosen. OpenZL's trainer runs a budgeted search offline and produces a Plan; hard-coded compressors bake the sequence into the binary.
  3. How the decoder knows which transforms to invert. OpenZL embeds the resolved graph in the frame so one universal decoder handles everything. Hard-coded format-specific compressors need a matching format-specific decoder.

Typical format-aware transforms

Named in the OpenZL post's worked example on the Silesia sao file:

  • Split header from body: the two have different entropy characteristics, so they are compressed separately.
  • Array-of-struct → Struct-of-arrays: one stream per field; each stream is homogeneous in type and semantics and can be compressed with a field-specific strategy.
  • Delta encoding for mostly-sorted numeric streams: reduces the value range, which makes downstream entropy coding more effective.
  • Transpose for bounded-range multi-byte numbers: the higher bytes group together and become more predictable.
  • Tokenize for low-cardinality streams: produces a dictionary plus an index list, each routed to its own subgraph.
  • Recursive compression of transform outputs: the dictionary from a tokenize step is itself data with a shape, so it is compressed with its own sub-Plan.
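
Several of the transforms listed above are small enough to sketch directly. The following is an illustrative Python sketch, not OpenZL code: the function names are hypothetical, the record layout is passed as a `struct` format string purely for convenience, and each transform is paired with its inverse to show that it is reversible.

```python
import struct

def aos_to_soa(records, record_fmt):
    """Split packed fixed-size records into one list of values per field."""
    size = struct.calcsize(record_fmt)
    rows = [struct.unpack_from(record_fmt, records, off)
            for off in range(0, len(records), size)]
    return [list(column) for column in zip(*rows)]

def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def transpose_bytes(data, width):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(data[i::width] for i in range(width))

def untranspose_bytes(data, width):
    n = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * n:(i + 1) * n]
    return bytes(out)

def tokenize(values):
    """Return (dictionary, indices) for a low-cardinality stream."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

def detokenize(dictionary, indices):
    return [dictionary[i] for i in indices]

# A mostly-sorted stream becomes a stream of small numbers:
# delta_encode([100, 104, 109, 109]) -> [100, 4, 5, 0]
```

In a full pipeline each output stream (deltas, transposed bytes, dictionary, indices) would then be handed to an entropy coder or to further transforms, which is the recursive step the last bullet describes.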

When format-awareness doesn't pay

"When there is no structure, there is no advantage. This is typically the case in pure text documents, such as enwik or dickens." (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.)

In that regime OpenZL falls back to zstd (fallback-to-zstd). Format-awareness also has a parse-cost ceiling: on text-delimited formats such as CSV, the cost of parsing limits throughput. CSV compression in OpenZL is capped at ~64 MB/s, and the post notes that "this strategy will likely never approach Zstd's speeds of 1 GB/s."
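
The fallback idea can be sketched as a compressor that tries a structure-aware path and keeps the generic result when the input does not parse or the structured path does not win. This is an assumption-laden illustration, not OpenZL's mechanism: zlib stands in for zstd, the one-byte tags and the function names are hypothetical, and the "format" is simply a sequence of little-endian uint32 values.

```python
import struct
import zlib

def structured_compress(data):
    """Hypothetical format-aware path: delta-code little-endian uint32s."""
    if len(data) % 4:
        raise ValueError("not a whole number of uint32 values")
    n = len(data) // 4
    values = struct.unpack(f"<{n}I", data)
    deltas = [(v - p) % 2**32 for v, p in zip(values, (0,) + values[:-1])]
    return zlib.compress(struct.pack(f"<{n}I", *deltas))

def compress_with_fallback(data):
    """Keep the structured result only if it parses and is smaller."""
    generic = b"G" + zlib.compress(data)
    try:
        candidate = b"S" + structured_compress(data)
    except ValueError:
        return generic  # no usable structure: generic path
    return candidate if len(candidate) < len(generic) else generic

def decompress(frame):
    tag, body = frame[:1], zlib.decompress(frame[1:])
    if tag == b"G":
        return body
    n = len(body) // 4
    out, total = [], 0
    for d in struct.unpack(f"<{n}I", body):
        total = (total + d) % 2**32
        out.append(total)
    return struct.pack(f"<{n}I", *out)
```

The tag byte plays the same role as an embedded plan: the decoder never has to guess which path the encoder took.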

Canonical wiki instance

  • systems/openzl (Meta): the canonical instance, announced 2025-10-06.

Prior-art compressors (Parquet-native compression, specialized timeseries compressors, column-store-internal compression) are format-aware in the sense defined here, but the OpenZL post is the wiki's first primary-source treatment of format-aware compression as an architectural category distinct from generic compression.
