# Tokenize transform

## Definition
The tokenize transform replaces a stream of repeated values with a pair:
- A dictionary — the set of distinct values, stored once.
- An index list — one integer per position, pointing into the dictionary.
It's reversible (the decoder looks up each index in the dictionary) and lossless. In the format-aware compression setting, it's a pre-entropy-coding transform applied to streams with cardinality much smaller than their length.
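The transform pair can be sketched in a few lines. This is a minimal illustration of the dictionary-plus-index-list idea, not OpenZL's actual implementation; the function names are hypothetical.

```python
def tokenize(values):
    """Replace a stream with (dictionary, indices).

    The dictionary stores each distinct value once, in first-seen
    order; the index list holds one integer per input position.
    """
    dictionary = []
    position = {}   # value -> its slot in the dictionary
    indices = []
    for v in values:
        if v not in position:
            position[v] = len(dictionary)
            dictionary.append(v)
        indices.append(position[v])
    return dictionary, indices


def detokenize(dictionary, indices):
    """Invert tokenize: look up each index in the dictionary."""
    return [dictionary[i] for i in indices]
```

The round trip `detokenize(*tokenize(values)) == values` holds for any input, which is what makes the transform lossless.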
## When it pays
From the OpenZL post's sao example (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):
"The other fields (IS, MAG, XRPM, XDPM) share a common property: their cardinality is much lower than their quantities, and there is no relation between 2 consecutive values. This makes them a good target for tokenize, which will convert the stream into a dictionary and an index list."
The trigger signal is low cardinality + no inter-value relation:
- Low cardinality → the dictionary is small.
- No inter-value relation → delta encoding won't help; tokenize, which exploits repetition rather than ordering, is the right choice.
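That trigger signal can be expressed as a simple heuristic: compare the stream's cardinality to its length. This is a hedged sketch of the kind of check a trainer's search might apply; the threshold and function name are assumptions, not OpenZL's actual selection logic.

```python
def looks_tokenizable(values, max_ratio=0.1):
    """Heuristic trigger for tokenize: cardinality much lower than length.

    Returns True when the number of distinct values is at most
    `max_ratio` of the stream length, so the dictionary stays small
    relative to the index list. (Threshold is illustrative.)
    """
    if not values:
        return False
    return len(set(values)) <= max_ratio * len(values)
```

For a field like MAG with a few hundred distinct values over millions of rows the ratio is tiny and the check fires; for a dense ID column it does not.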
## The downstream split
A key architectural consequence: the dictionary and the index list have very different shapes, so they benefit from different downstream compression strategies:
"The resulting dictionaries and index lists are very different. They benefit from completely different compression strategies. So they are sent to dedicated processing graphs." (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.)
This is structure-creating — the tokenize transform produces two streams where there was one, and each gets its own sub-plan in the compression Plan. The result is recursive: OpenZL compresses the transform outputs, which may themselves get transformed.
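One concrete way the two streams diverge: the index list can be narrowed to the smallest fixed width the dictionary size allows, a cheap win that makes no sense for the dictionary stream itself. A minimal sketch of that idea, with an assumed helper name (not an OpenZL API):

```python
import struct


def pack_indices(indices, dict_size):
    """Pack an index list at the narrowest fixed width that can
    address `dict_size` dictionary slots: 1, 2, or 4 bytes each."""
    if dict_size <= 1 << 8:
        fmt = "B"   # unsigned 8-bit
    elif dict_size <= 1 << 16:
        fmt = "H"   # unsigned 16-bit
    else:
        fmt = "I"   # unsigned 32-bit
    return struct.pack(f"<{len(indices)}{fmt}", *indices)
```

A dictionary of 200 entries lets every index fit in one byte, so a million-row field costs one megabyte before entropy coding, regardless of how wide the original values were; the dictionary stream, by contrast, keeps the full-width values and goes down its own processing path.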
## In the trained Plan
The OpenZL trainer picks tokenize when its search identifies a stream with the right cardinality properties. The choice is per-field, not global — the same file can have some fields tokenized, some delta-encoded, some transposed. That heterogeneity is exactly what the SoA decomposition makes possible.
## Related to dictionary-based compression
- Zstandard dictionary compression (the 2018 precursor to OpenZL's Managed Compression) trains a single shared dictionary across many small messages to amortize the dictionary cost. That's a between-message dictionary.
- Tokenize in OpenZL builds a per-stream dictionary, inline, for one field of one frame. That's a within-stream dictionary.
Both are dictionary-based compression techniques; they operate at different scales.
## Seen in
- sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework — canonical wiki source for tokenize as one of OpenZL's trained transform choices.
## Related
- systems/openzl — where tokenize is one of the transforms the trainer can pick.
- concepts/reversible-transform-sequence — the pattern tokenize fits into.
- concepts/format-aware-compression — the parent category.
- concepts/structure-of-arrays-decomposition — typically precedes tokenize so that a field's cardinality is visible as a single stream.
- concepts/delta-encoding — sibling pre-entropy transform chosen for different stream shapes (sorted numeric rather than low-cardinality).