OpenZL¶
OpenZL is Meta's open-source format-aware lossless compression framework, publicly released 2025-10-06. It targets structured data — tabular, columnar, numeric arrays, timeseries, ML tensors, database pages — and claims ratios "comparable to specialized compressors" while keeping a single universal decoder for every file ever produced, regardless of which compression configuration generated it.
Project¶
- Site: openzl.org
- GitHub: facebook/openzl
- Whitepaper: arXiv:2510.03203
- Reproducibility artifacts: openzl-sample-artifacts release
- Quick Start: openzl getting-started guide
- Predecessor lineage: Zstandard (2016) → OpenZL (2025). Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.
The architectural pitch¶
Meta's own framing (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):
"OpenZL is our answer to the tension between the performance of format-specific compressors and the maintenance simplicity of a single executable binary."
Generic compressors leave ratio on the table for structured data. Bespoke per-format compressors win on ratio but multiply the compressor and decompressor binaries you have to create, ship, audit, patch, and trust. OpenZL's resolution: push format-awareness into an input parameter plus a learned Plan that resolves to a decode recipe embedded in the frame. The decoder reads the recipe from the frame and executes it. One decoder binary, arbitrarily many formats.
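The "decode recipe embedded in the frame" idea can be sketched in a few lines. This is a conceptual illustration, not the OpenZL API: the transform names, the frame dict, and the `encode`/`decode` helpers are all hypothetical. The point is that the frame carries its own recipe, so a single decoder can handle every configuration ever produced by running the recorded transforms in reverse.

```python
# Conceptual sketch (not the OpenZL API): each frame embeds the recipe
# that produced it, so one universal decoder serves every configuration.

# Hypothetical reversible transforms, keyed by name: (forward, inverse).
TRANSFORMS = {
    "delta": (
        lambda xs: [xs[0]] + [b - a for a, b in zip(xs, xs[1:])],
        lambda ds: [sum(ds[: i + 1]) for i in range(len(ds))],
    ),
    "negate": (lambda xs: [-x for x in xs], lambda xs: [-x for x in xs]),
}

def encode(values, recipe):
    """Apply each transform in order; embed the recipe in the frame."""
    for name in recipe:
        values = TRANSFORMS[name][0](values)
    return {"recipe": recipe, "payload": values}  # recipe travels with the data

def decode(frame):
    """Universal decoder: read the recipe, run the inverses in reverse order."""
    values = frame["payload"]
    for name in reversed(frame["recipe"]):
        values = TRANSFORMS[name][1](values)
    return values
```

The decoder needs no knowledge of which Plan produced the frame; it only needs the (audited, bounded) transform library and the recorded recipe.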
Headline numbers¶
Silesia corpus sao file (astronomical catalog, array-of-records), M1 CPU, clang-17 (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):
| Compressor | Size (B) | Ratio | Comp speed | Decomp speed |
|---|---|---|---|---|
| zstd -3 | 5,531,935 | x1.31 | 220 MB/s | 850 MB/s |
| xz -9 | 4,414,351 | x1.64 | 3.5 MB/s | 45 MB/s |
| OpenZL | 3,516,649 | x2.06 | 340 MB/s | 1,200 MB/s |
OpenZL beats xz on ratio and on both compression and decompression speed, and beats zstd on ratio at comparable-or-better speed. This "higher ratio while preserving or even improving speed" property is what Meta foregrounds as critical for data-center processing pipelines.
Architecture¶
Components¶
- SDDL — Simple Data Description Language. Declarative byte-layout description (rows / columns / enums / nested records). Parser-only: describes shape, does not carry logic. Meta is actively extending SDDL for more-flexible nested-format description.
- Parser function (alternative to SDDL) — embedder-authored logic in a supported language, registered with OpenZL, used when SDDL isn't expressive enough for the format.
- Trainer — offline optimization component. Consumes sample data plus a preset / parser-function / SDDL description. Runs a budgeted search over transform choices and parameters. Internal subsystems:
- Cluster finder — groups fields that behave alike so one sub-plan serves multiple columns.
- Graph explorer — tries candidate subgraphs, scores them.
- Output: a Plan, optionally with control points. Can emit a full speed/ratio Pareto set or directly target the best config under a speed constraint.
- Encoder — consumes the Plan and the current data frame; resolves the Plan into a concrete Resolved Graph; if the Plan has control points, reads lightweight statistics from the frame and picks a branch; records the resolved graph into the frame.
- Universal decoder — reads the Resolved Graph from the frame, enforces limits, executes the transform sequence in reverse order. Same binary for every Plan and every historic frame.
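The trainer's "budgeted search over transform choices" can be sketched as a tiny exhaustive search. Everything here is a hypothetical stand-in: the candidate transforms, the budget, and the use of zlib as the scoring backend (the real trainer scores candidate subgraphs against OpenZL's own codecs, under a speed constraint as well as a size one).

```python
# Hypothetical sketch of the trainer's budgeted search: try candidate
# transform chains on sample data, score each by compressed size, and
# keep the best chain found within the evaluation budget.
import itertools
import struct
import zlib

def delta(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

CANDIDATES = {"identity": lambda xs: xs, "delta": delta}

def score(values, chain):
    """Compressed size of the sample after applying the chain."""
    for name in chain:
        values = CANDIDATES[name](values)
    raw = struct.pack(f"{len(values)}q", *values)  # serialize as int64
    return len(zlib.compress(raw, 6))  # zlib stands in for the real backend

def train(samples, budget=8):
    """Return the cheapest transform chain tried within the budget."""
    chains = itertools.chain.from_iterable(
        itertools.product(CANDIDATES, repeat=n) for n in (1, 2))
    tried = itertools.islice(chains, budget)
    return min(tried, key=lambda c: score(samples, c))
```

On mostly-sorted numeric samples, a chain containing `delta` wins because it collapses the value range before the generic backend runs; that is the shape of decision the real trainer automates at scale.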
The "structure-of-arrays" example¶
From the post's worked example on sao:
- Split header from the record table.
- Array-of-struct → struct-of-array. Each field (SRA0, SDEC0, IS, MAG, XRPM, XDPM) becomes its own stream. (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework). This is the core concepts/structure-of-arrays-decomposition step — after it, each stream is "homogeneous data of the same type and semantic meaning" and can be optimized independently.
- Per-field transforms chosen by the trainer:
- SRA0 (X-axis position; mostly sorted) → delta to reduce value range.
- SDEC0 (Y-axis position; bounded range, unsorted) → transpose so the more-predictable higher-order bytes group together.
- IS / MAG / XRPM / XDPM (low cardinality, no inter-value relation) → tokenize to produce a dictionary + index list, each routed to its own subgraph.
- Each subgraph is compressed independently with strategies optimized for its data shape. "The main work is to group data into homogeneous streams. After that, one can count on openzl to take care of the rest."
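The decomposition above can be made concrete with a small sketch. The field names follow the sao example, but the records, values, and helper functions are simplified stand-ins, not OpenZL's transforms: split the array-of-records into one homogeneous stream per field, then delta-encode the mostly-sorted position field and tokenize a low-cardinality field into a dictionary plus index list.

```python
# Illustrative sketch of the structure-of-arrays step and two of the
# per-field transforms the trainer picked for sao (simplified stand-ins).

def to_streams(records, fields):
    """Array-of-structs -> struct-of-arrays: one homogeneous stream per field."""
    return {f: [rec[i] for rec in records] for i, f in enumerate(fields)}

def delta(xs):
    """Shrink the value range of a mostly-sorted stream (e.g. SRA0)."""
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def tokenize(xs):
    """Low-cardinality stream (e.g. MAG) -> dictionary + index list."""
    alphabet = sorted(set(xs))
    index = {v: i for i, v in enumerate(alphabet)}
    return alphabet, [index[v] for v in xs]

records = [(100, 7.5), (103, 7.5), (109, 2.0), (110, 7.5)]
streams = to_streams(records, ["SRA0", "MAG"])
sra0 = delta(streams["SRA0"])                 # small residuals, easy to pack
mag_dict, mag_idx = tokenize(streams["MAG"])  # dictionary + repetitive indices
```

After this step each stream really is "homogeneous data of the same type and semantic meaning", and each one can be routed to a subgraph tuned for its shape.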
Runtime adaptation¶
A Plan may contain control points: per-frame branch points that read lightweight statistics (string repetition, run-length, histogram skew, delta variance) and pick a subgraph. Exploration is bounded to preserve speed targets. Chosen branches are recorded in the frame; the decoder just executes the recorded path. "Dynamic behavior at compression time to handle variations and exceptions — without turning compression into an unbounded search problem — and with zero complexity added to the decoder."
Two timescales of adaptation (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):
- Plan regeneration (via Managed Compression) — Meta periodically re-samples production data, re-runs the trainer, rolls out new Plans like any config change. Handles schema drift + seasonal shifts that outgrow the existing Plan's branches.
- Control-point branching (per frame) — the Plan itself encodes branching on runtime statistics so it adapts per-frame without regeneration. Handles bursts + outliers within the Plan's configured space.
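A control point can be sketched as a cheap statistic plus a recorded branch. The threshold, statistic, and branch names below are hypothetical: the key property is that the decision is bounded (one pass over the stream, no search) and the chosen branch is written into the frame, so the decoder only dispatches on what was recorded.

```python
# Hypothetical sketch of a control point: a lightweight per-frame
# statistic (distinct-value ratio here) picks one of two subgraphs,
# and the choice is recorded in the frame for the decoder to replay.

def choose_branch(stream, threshold=0.1):
    """Low distinct-value ratio -> tokenize branch, else delta branch."""
    ratio = len(set(stream)) / max(len(stream), 1)
    return "tokenize" if ratio < threshold else "delta"

def compress_frame(stream):
    branch = choose_branch(stream)                # bounded, per-frame decision
    return {"branch": branch, "payload": stream}  # branch travels in the frame

def decompress_frame(frame):
    # The decoder never re-derives statistics; it executes the recorded path.
    return frame["branch"], frame["payload"]
```

This is the "zero complexity added to the decoder" claim in miniature: all adaptivity lives on the compression side, and the frame is self-describing.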
Deployment properties (what the universal decoder gives you)¶
Enumerated in the post (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):
- One audited surface. Security + correctness reviews focus on a single binary with consistent invariants, fuzzing, and hardening. No per-format decompressor to drift.
- Fleet-wide improvements benefit every frame. SIMD kernels, memory-bounds fixes, scheduling — update the decoder, every compressed file in history gets faster or safer. This is the monoversion-decoder property.
- Operational clarity. Same binary, same CLI, same metrics across datasets; patching + rollout are uneventful by design.
- Continuous training. New Plan → roll out → old frames continue to decode unchanged, new frames benefit.
Use cases OpenZL targets¶
- Vector, tabular, tree-structured data.
- Numeric / string / binary inside structured formats.
- Timeseries datasets.
- ML tensors.
- Database tables.
When OpenZL is not useful¶
"When there is no structure, there is no advantage. This is typically the case in pure text documents, such as enwik or dickens. In these cases, OpenZL falls back to zstd, offering essentially the same level of performance."
This is the fallback-to-zstd safety net: OpenZL's worst case is zstd-equivalent, not worse, because the trainer can always select a Plan that reduces to "just run zstd."
Also: CSV parsing caps compression speed at about 64 MB/s (PPMF Unit dataset). "An improved parser will speed that up, however this strategy will likely never approach Zstd's speeds of 1 GB/s." CSV's text-delimited parse cost is inherent to format-awareness on that format.
Future direction (post-launch roadmap)¶
- Extending the transform library for timeseries + grid-shaped data.
- Speeding up codec kernels.
- Letting the trainer find better Plans faster.
- Extending SDDL for more-flexible nested-format description.
- Making the automated compressor explorer better at proposing safe, testable changes within a specified budget.
Seen in¶
- sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework — the launch post; canonical architectural source for OpenZL on the wiki.
Related¶
- systems/zstandard-zstd — OpenZL's predecessor + fallback + speed baseline.
- systems/openzl-sddl — the shape-declaration language.
- systems/managed-compression-meta — the runtime that drives OpenZL Plan lifecycle at Meta.
- concepts/format-aware-compression · concepts/universal-decoder · concepts/compression-plan · concepts/reversible-transform-sequence · concepts/structure-of-arrays-decomposition · concepts/runtime-control-point-compression
- patterns/offline-train-online-resolve-compression · patterns/embedded-decode-recipe-in-frame · patterns/fallback-to-general-purpose-compressor · patterns/graceful-upgrade-via-monoversion-decoder
- companies/meta