
CONCEPT Cited by 1 source

Reversible transform sequence

Definition

A reversible transform sequence is the pipeline of operations a format-aware compressor applies to input data before entropy coding. Each transform is lossless and invertible — the decoder runs the inverse sequence — and each transform is chosen to surface compression-friendly patterns that generic byte-level compression can't see.

"Compression can then focus on a sequence of reversible steps that surface patterns before coding." (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.)

The architectural role

Generic compressors (zstd, gzip, xz) operate directly on byte streams — their entropy coder + matching engine sees whatever the input happens to look like. Format-aware compressors split the job:

raw bytes → [transform 1] → [transform 2] → ... → [transform N] → entropy coding

The transforms surface patterns (sorted → delta-encodable, low-cardinality → tokenizable, multi-byte-numeric → transposable). The entropy coder then compresses the transformed stream, which is much more predictable than the raw input.

The decoder does the inverse:

compressed bytes → entropy decode → [inverse N] → ... → [inverse 1] → raw bytes
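The two diagrams compose into a roundtrip. A minimal sketch of that shape, with hypothetical names and delta as the stand-in transform (the entropy stage is omitted; this is not OpenZL's API):

```python
import itertools

def encode(data, transforms):
    """Apply each forward transform in order (entropy coding omitted)."""
    for fwd, _ in transforms:
        data = fwd(data)
    return data

def decode(data, transforms):
    """Run the recorded inverses in reverse order."""
    for _, inv in reversed(transforms):
        data = inv(data)
    return data

def delta(xs):
    # x[0], then successive differences x[i] - x[i-1]
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])] if xs else []

def undelta(ds):
    # prefix sums invert the differences exactly
    return list(itertools.accumulate(ds))

pipeline = [(delta, undelta)]
xs = [10, 12, 15, 15, 20]
assert encode(xs, pipeline) == [10, 2, 3, 0, 5]
assert decode(encode(xs, pipeline), pipeline) == xs
```

The decoder needs only the ordered list of inverses, which is exactly what the serialized sequence in the frame provides.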

Canonical transforms in OpenZL

From the Silesia sao worked example (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):

  • Split header from body — header has different entropy characteristics than record-table body; separating them lets each get its own downstream strategy.
  • AoS → SoA — array-of-records transposed into one stream per field; each stream is homogeneous.
  • Delta — for mostly-sorted numeric streams; encodes x[i] - x[i-1] instead of x[i]. Reduces range, narrows entropy.
  • Transpose — for bounded-range multi-byte numbers; groups higher bytes (often predictable) together for the entropy coder.
  • Tokenize — for low-cardinality streams; emits dictionary + index list.
  • Recursive compression of transform outputs — the dictionary from a tokenize step is itself data, and gets its own sub-Plan.
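Transpose and tokenize can be sketched in a few lines each; the names and signatures below are illustrative, not OpenZL's actual interface:

```python
def transpose(rows):
    """Group byte i of every fixed-width value together (byte-plane split)."""
    return [bytes(r[i] for r in rows) for i in range(len(rows[0]))]

def untranspose(planes):
    return [bytes(p[j] for p in planes) for j in range(len(planes[0]))]

def tokenize(values):
    """Low-cardinality stream -> (dictionary, index list)."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

def untokenize(dictionary, indices):
    return [dictionary[i] for i in indices]

# Nearby 16-bit little-endian values share a constant high byte; transposing
# collects those bytes into one near-constant, highly compressible plane.
rows = [n.to_bytes(2, "little") for n in (300, 301, 302)]
planes = transpose(rows)
assert planes[1] == b"\x01\x01\x01"      # all high bytes together
assert untranspose(planes) == rows

d, idx = tokenize(["GET", "GET", "POST", "GET"])
assert untokenize(d, idx) == ["GET", "GET", "POST", "GET"]
```

Note that the tokenize output is two streams — the dictionary and the index list — which is why the recursive step above exists: each output can get its own sub-Plan.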

Trained, not hand-coded

The compression Plan specifies which transforms, in what order, with what parameters. OpenZL's trainer runs a budgeted search over transform choices + parameters against sample data; the output Plan is reproducible as a Resolved Graph embedded in every frame, which the universal decoder executes in reverse.
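A toy version of that search, under loud assumptions: the candidate set, the `train` function, and the use of zlib as a stand-in for the real back-end are all invented for illustration — only the idea of a budgeted search over transform choices comes from the source:

```python
import itertools, json, zlib

def delta(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])] if xs else []

# Hypothetical candidate transforms the trainer may choose between.
CANDIDATES = {"identity": lambda xs: xs, "delta": delta}

def train(sample, budget=10):
    """Try up to `budget` candidates on sample data; keep the one whose
    output the back-end compressor shrinks the most."""
    best = None
    for name, fwd in itertools.islice(CANDIDATES.items(), budget):
        size = len(zlib.compress(json.dumps(fwd(sample)).encode()))
        if best is None or size < best[1]:
            best = (name, size)
    return best[0]   # the "Plan": which transform to record in the frame

sample = list(range(0, 10_000, 3))   # sorted arithmetic data: delta-friendly
assert train(sample) == "delta"
```

The real trainer searches over sequences and parameters, not single transforms, but the shape is the same: measure compressed size on samples, keep the winner, serialize the choice so the decoder can replay it.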

Key property: each transform is reversible

This is what distinguishes the pattern from lossy or ML-based approaches:

  • Every transform is a pure function with a well-defined inverse.
  • The decoder does not need to know the reason a transform was chosen — it just runs the recorded inverse.
  • The sequence is serialized (as the Resolved Graph) and shipped in the frame.

Loss of reversibility on any single transform breaks lossless guarantees for the whole sequence. Meta's post is explicit that OpenZL is lossless; the entire transform library is chosen on that basis.
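Composing reversible transforms stays reversible, while a single lossy step poisons the whole chain. A small sketch with illustrative names (AoS → SoA as the reversible example):

```python
def aos_to_soa(records):
    """Array-of-structs -> struct-of-arrays: one stream per field."""
    return [list(col) for col in zip(*records)]

def soa_to_aos(columns):
    return [tuple(row) for row in zip(*columns)]

records = [(1, "a"), (2, "b"), (3, "c")]
assert soa_to_aos(aos_to_soa(records)) == records   # exact roundtrip

# A lossy "transform" (truncating to 8 bits) maps distinct inputs to the
# same output, so no inverse exists and no downstream step can recover them.
lossy = [x & 0xFF for x in (256, 257)]
assert lossy == [0, 1]   # 256 lost its high byte; 0 and 256 now collide
```

This is why pure-function invertibility is a per-transform admission criterion for the library, not a property checked after the fact.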
