# Meta — Introducing OpenZL: An Open Source Format-Aware Compression Framework

## Summary
Meta announces OpenZL, a new open-source lossless compression framework that targets structured data (tabular, columnar, numeric arrays, timeseries, ML tensors, database pages) and claims performance "comparable to specialized compressors" while keeping a single universal decoder binary for every file it produces. The post is Meta's architectural answer to the tension they hit after a decade with Zstandard (announced 2016): generic compressors leave ratio on the table for structured data, but hand-rolled per-format compressors mean another binary to ship, audit, patch, and trust for every format. OpenZL resolves this by pushing format-awareness into an input parameter + learned compression Plan that resolves to a decode recipe embedded in the frame, which the universal decoder executes.
## Key takeaways
- Structure-as-explicit-input is the core architectural bet. General compressors either try one-size-fits-all or "spend a lot of their cycles guessing which techniques to use." OpenZL makes the user declare the shape (via a preset, SDDL description, or a registered parser function); the compressor then spends its cycles on a sequence of reversible transforms that surface patterns before entropy coding rather than on guessing.
- Headline numbers on Silesia's `sao` file (M1 CPU, clang-17): zstd -3: 5,531,935 B / x1.31 / 220 MB/s comp / 850 MB/s decomp. xz -9: 4,414,351 B / x1.64 / 3.5 MB/s comp / 45 MB/s decomp. OpenZL: 3,516,649 B / x2.06 / 340 MB/s comp / 1200 MB/s decomp. OpenZL is both higher ratio and faster than xz on this format; higher ratio "while preserving or even improving speed" is the claim Meta foregrounds as critical for data-center pipelines.
- Universal decoder is the deployment-critical property, not a nicety. "Even when the compression configuration changes, the decoder does not." Consequences Meta enumerates: one audited surface for security + fuzzing; fleet-wide improvements (SIMD, bounds, scheduling) benefit every compressed file, including older frames; same binary + CLI + metrics + dashboards across datasets; continuous training — train a plan offline, try it on a slice, roll it out like any config change, and old frames keep decoding.
- The decomposition moves structure forward via reversible transforms (source: this post). The `sao` worked example: (a) separate the header from the record table; (b) split the array-of-structs into its per-field columns (AoS → SoA); (c) pick a transform per column based on its shape — delta for mostly-sorted X-axis positions (SRA0), transpose for bounded-range Y-axis values (SDEC0) whose higher bytes are predictable, tokenize for low-cardinality fields (IS/MAG/XRPM/XDPM), producing a dictionary + index list, each routed to its own subgraph. "The main work is to group data into homogeneous streams. After that, one can count on openzl to take care of the rest."
- The trainer is the offline optimization component that produces a Plan from a budgeted search over transform choices and parameters. Internally it uses a cluster finder (groups fields that behave alike) + a graph explorer (tries candidate subgraphs, keeps score). It can emit a full speed/ratio Pareto set or directly target the best config under a speed constraint. Output = Plan; the encoder resolves the Plan into a concrete Resolved Graph at encode time, picks branches at control points if any, and records the choice into the frame; the single decoder reads the Resolved Graph from the frame and executes it.
- Runtime control points + in-flight adaptation. A Plan may include control points that read lightweight statistics at compression time (string-repetition stats, run-length, histogram skew, delta variance) and pick a branch. Meta names "textbook classifiers" as sufficient; exploration is bounded to preserve speed targets. The taken branch is recorded; the decoder executes the recorded path without re-running classification. "Best of both worlds: dynamic behavior at compression time… with zero complexity added to the decoder."
- Managed Compression is the operational runtime OpenZL plugs into. Originally built to automate dictionary compression with Zstandard (2018); OpenZL brings the same loop — registered use cases monitored, sampled, periodically re-trained, new configs shipped when beneficial — to format-aware compression. "The decompression side continues to decode both old and new data without any change."
- Pareto-frontier benchmarking across four dataset shapes (Figures 1-4): (a) Silesia `sao` (astronomical catalog, array-of-structs); (b) ERA5 Flux (single 64-bit numeric array — columnar numeric); (c) Binance + NYC TLC Green Trip (uncompressed Parquet — OpenZL parses the Parquet format + learns the schema to tune per file); (d) PPMF Unit (CSV — OpenZL is capped "at about 64 MB/s" by CSV parsing cost; "an improved parser will speed that up, however this strategy will likely never approach Zstd's speeds of 1 GB/s").
- Honest about the failure mode: "When there is no structure, there is no advantage. This is typically the case in pure text documents, such as enwik or dickens." In that regime OpenZL falls back to zstd and delivers essentially zstd-equivalent performance. This is the fallback-to-zstd safety net — OpenZL is never worse than zstd because it can always choose the "just run zstd" plan.
- Two iteration paths for evolving data: re-training, where Managed Compression samples fresh data and generates an updated Plan rolled out like a config change; and in-flight control points, where the Plan already encodes branching on runtime statistics so it adapts per frame without regeneration. The two mechanisms operate at different timescales — Plan regeneration handles schema drift or seasonal shifts that outgrow the existing Plan's branches; control points handle bursts, outliers, and within-Plan variation.
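The `sao` decomposition described above can be sketched in a few lines. This is an illustrative toy, not the OpenZL API: the 12-byte record layout, the field names, and the 0.9/0.1 thresholds are all assumptions chosen to mirror the post's narrative (AoS → SoA split, then a per-column heuristic choosing delta for mostly-sorted streams and tokenize for low-cardinality ones).

```python
import struct

# Hypothetical 12-byte record standing in for sao's array-of-structs:
# a 4-byte int "x" (mostly sorted), a 4-byte int "y" (bounded range),
# and a 4-byte low-cardinality code "mag". Layout is illustrative only.
RECORD = struct.Struct("<iii")

def aos_to_soa(buf):
    """Split an array-of-structs byte buffer into one stream per field (AoS -> SoA)."""
    records = [RECORD.unpack_from(buf, off) for off in range(0, len(buf), RECORD.size)]
    xs, ys, mags = ([r[f] for r in records] for f in range(3))
    return xs, ys, mags

def choose_transform(column):
    """Pick a transform from lightweight stats, mirroring the post's heuristics:
    delta for mostly-sorted columns, tokenize for low-cardinality ones.
    Thresholds (0.9, 0.1) are arbitrary placeholders."""
    sortedness = sum(b >= a for a, b in zip(column, column[1:])) / max(len(column) - 1, 1)
    cardinality = len(set(column)) / len(column)
    if sortedness > 0.9:
        return "delta"
    if cardinality < 0.1:
        return "tokenize"
    return "store"
```

The point of the sketch is the post's claim that "the main work is to group data into homogeneous streams": once each field lives in its own stream, per-stream statistics make the transform choice nearly mechanical.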
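The control-point mechanism can likewise be sketched: a "textbook classifier" reads cheap statistics at encode time, picks a branch, and the taken branch is recorded in the frame so the decoder only replays it. Branch names, thresholds, and the frame dict below are assumptions for illustration, not OpenZL identifiers or its frame format.

```python
def control_point(column):
    """Pick a branch from lightweight per-frame statistics (run count,
    distinct-value ratio). Returns the branch name to record."""
    n = len(column)
    distinct = len(set(column))
    runs = 1 + sum(a != b for a, b in zip(column, column[1:]))
    if runs / n < 0.05:        # dominated by long runs -> run-length branch
        return "rle"
    if distinct / n < 0.1:     # low cardinality -> tokenize branch
        return "tokenize"
    return "generic"

def encode_frame(columns):
    """The frame records each taken branch; the decoder executes the recorded
    path without re-running the classifier (payload compression elided)."""
    recorded = [control_point(col) for col in columns]
    return {"recipe": recorded, "payload": columns}
```

This is the "zero complexity added to the decoder" property: classification cost is paid once, on the compression side, and the decode path stays a deterministic replay.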
## Architectural primitives

### Core primitives
- concepts/format-aware-compression — the category this framework defines. Generic compressors work byte-by-byte; OpenZL works over user-declared structure (rows / columns / enums / ranges / nested records). Canonical wiki instance.
- concepts/universal-decoder — one binary decodes everything OpenZL has ever produced regardless of which plan produced it, because the resolved decode recipe is embedded in the frame. Canonical wiki instance.
- concepts/compression-plan — the learned output of the trainer. A plan contains transform choices, parameters, and (optionally) control points; encoders resolve it into a Resolved Graph per frame.
- concepts/reversible-transform-sequence — the mechanism by which format-aware compression works. Apply a sequence of reversible operations (split header, AoS→SoA, delta, transpose, tokenize) to surface patterns, then entropy-code. The decoder runs the inverse sequence.
- concepts/structure-of-arrays-decomposition — array-of-struct records transformed into struct-of-arrays (one stream per field) so that each field stream is "homogeneous data of the same type and semantic meaning" and can be compressed with a field-specific strategy.
- concepts/runtime-control-point-compression — per-frame branch points that read lightweight statistics at encode time and pick a subgraph, recording the choice into the frame. Adapts to per-frame variation without re-training the Plan.
- concepts/delta-encoding — the transform OpenZL picks for mostly-sorted numeric streams (SRA0 X-axis case).
- concepts/tokenize-transform — the transform OpenZL picks for low-cardinality streams; produces dictionary + index list, each routed to a dedicated subgraph.
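The delta and tokenize primitives above are the two simplest reversible transforms to demonstrate. The sketch below shows only the invertibility property; OpenZL's actual codecs (SRA0 etc.) are more involved, and these helper names are mine, not the library's.

```python
from itertools import accumulate

def delta_encode(xs):
    """Store the first value, then successive differences."""
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    """Prefix sums exactly invert delta encoding."""
    return list(accumulate(ds))

def tokenize(xs):
    """Split a low-cardinality stream into a dictionary + index list,
    each of which would be routed to its own subgraph."""
    alphabet = sorted(set(xs))
    index = {v: i for i, v in enumerate(alphabet)}
    return alphabet, [index[v] for v in xs]

def detokenize(alphabet, indices):
    return [alphabet[i] for i in indices]
```

Neither transform shrinks the data by itself; each reshapes it (small deltas, dense indices) so the entropy coder that follows has an easier target.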
### Systems
- systems/openzl — the framework itself (open-sourced 2025-10-06, openzl.org, github.com/facebook/openzl).
- systems/zstandard-zstd — Meta's 2016 general-purpose compressor. OpenZL's fallback when no structure is known, and the baseline against which OpenZL's ratio gains + speed preservation are measured. Also the original integration of Managed Compression.
- systems/openzl-sddl — Simple Data Description Language, the declarative way to describe byte layouts (rows / columns / enums / nested records) without writing a parser. Parser-equivalent alternative is a registered parser function.
- systems/managed-compression-meta — Meta's runtime for monitoring registered use cases, sampling data, periodically re-training, and rolling out new plans. Originally built for zstd dictionaries (2018); extended to OpenZL plans.
### Patterns
- patterns/offline-train-online-resolve-compression — canonical wiki pattern: offline trainer learns a Plan against sample data → encoder resolves Plan to concrete Resolved Graph per frame → universal decoder reads Resolved Graph from frame and executes. Separates learning from hot path.
- patterns/embedded-decode-recipe-in-frame — the frame itself carries enough to decode; no out-of-band config, no "which decoder version?" matrix. Enables universal-decoder + graceful upgrades.
- patterns/fallback-to-general-purpose-compressor — safety net when structure-awareness doesn't pay (pure text, unknown format): trainer selects a Plan that reduces to "just run zstd", so OpenZL's worst case is zstd-equivalent, not worse.
- patterns/graceful-upgrade-via-monoversion-decoder — ship a new Plan, ingest new frames immediately; old frames continue to decode unchanged with the same binary. A decoder-binary update improves every frame (old + new). Eliminates format-version coordination between producers and consumers.
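The embedded-recipe and fallback patterns combine into one small loop, sketched below under stated assumptions: the `frame` dict, the plan-as-list representation, and the single `"delta"` step are placeholders, not the OpenZL frame format, and entropy coding is elided. An empty plan models the "just run zstd" fallback, where the payload passes through untransformed.

```python
from itertools import accumulate

def delta_encode(xs):
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def compress(xs, plan):
    """Encoder side: resolve the plan into a concrete recipe and embed
    that recipe in the frame alongside the transformed payload."""
    recipe, data = [], xs
    for step in plan:              # e.g. ["delta"], or [] for the fallback plan
        if step == "delta":
            data = delta_encode(data)
            recipe.append("delta")
    return {"recipe": recipe, "payload": data}

def decompress(frame):
    """Universal decoder: execute whatever recipe the frame carries, in
    reverse. It never needs to know which plan (old or new) produced it."""
    data = frame["payload"]
    for step in reversed(frame["recipe"]):
        if step == "delta":
            data = list(accumulate(data))
    return data
```

Because `decompress` is driven entirely by the frame, shipping a new plan changes only the producer; old and new frames decode with the same binary, which is the graceful-upgrade property the patterns above describe.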
## Operational numbers disclosed
| Metric | Value |
|---|---|
| `sao` uncompressed size (implicit baseline) | larger than all three compressed outputs |
| zstd -3 on `sao` | 5,531,935 B · x1.31 · 220 MB/s comp · 850 MB/s decomp |
| xz -9 on `sao` | 4,414,351 B · x1.64 · 3.5 MB/s comp · 45 MB/s decomp |
| OpenZL on `sao` | 3,516,649 B · x2.06 · 340 MB/s comp · 1200 MB/s decomp |
| CSV parse-bound compression speed cap | "about 64 MB/s" (PPMF Unit) |
| Zstd speed reference point for parsed CSV comparison | "~1 GB/s" |
| CPU / compiler used for Silesia `sao` measurements | M1 CPU, clang-17 |
## Caveats
- Single disclosed file for the compressor comparison. The M1 / `sao` numbers are for one file of the Silesia corpus, chosen because it has a "well-defined format featuring an array of records." The broader Pareto-curve plots (ERA5 / Parquet / CSV) are referenced to Meta's whitepaper (arXiv:2510.03203) and the OpenZL reproducibility scripts; individual datapoint numbers aren't transcribed in the post body.
- No fleet-scale deployment numbers. The post is a framework + vision piece, not a production retrospective. Meta states OpenZL plugs into Managed Compression alongside its existing zstd-dictionary deployments but gives no fleet size, bytes-compressed/day, or latency figures for internal use.
- No Plan-size or frame-overhead numbers. The Resolved Graph embedded in each frame has cost — not disclosed. Implied to be small relative to the compressed payload, because the CSV parse-bound cap is dominated by parsing, not frame overhead.
- Trainer cost not characterized. "Budgeted search" is mentioned but no wall-clock, search-budget, or sample-size guidance. Managed Compression's periodic re-training cadence is not disclosed.
- SDDL's expressiveness ceiling is acknowledged. "We also are actively working to extend SDDL to describe nested data formats more flexibly." Complex nested formats today require a registered parser function rather than SDDL.
- CSV is a structural ceiling, not just a current one. Meta is explicit: "this strategy will likely never approach Zstd's speeds of 1 GB/s" on CSV — the parse cost is inherent to format-awareness on a text-delimited format. For those cases, the fallback-to-zstd plan is always available.
- No disclosed benchmarks vs. other format-aware compressors (e.g., Parquet-native compression, specialized timeseries compressors). The post compares against zstd / xz / gzip — all general-purpose compressors — not against specialized format-aware alternatives.
## Source
- Original: https://engineering.fb.com/2025/10/06/openzl-open-source-format-aware-compression-framework/
- Raw markdown: raw/meta/2025-10-06-openzl-an-open-source-format-aware-compression-framework-2fe37600.md
- Companion whitepaper: arXiv:2510.03203
- Project site: openzl.org
- GitHub: facebook/openzl
- Sample artifacts + reproducibility: openzl-sample-artifacts release
## Related
- companies/meta — Meta Engineering; this is the 14th Meta post on the wiki and the first compression-framework post on the wiki broadly (prior compression content was audio codecs and MongoDB WiredTiger page compression).
- systems/zstandard-zstd — the 2016 predecessor whose architectural limits motivated OpenZL.
- concepts/format-aware-compression · concepts/universal-decoder · concepts/compression-plan · concepts/reversible-transform-sequence
- patterns/offline-train-online-resolve-compression · patterns/embedded-decode-recipe-in-frame · patterns/fallback-to-general-purpose-compressor · patterns/graceful-upgrade-via-monoversion-decoder