
CONCEPT Cited by 1 source

Tokenize transform

Definition

The tokenize transform replaces a stream of repeated values with a pair:

  1. A dictionary — the set of distinct values, stored once.
  2. An index list — one integer per position, pointing into the dictionary.

It's reversible (the decoder looks up each index in the dictionary) and lossless. In the format-aware compression setting, it's a pre-entropy-coding transform applied to streams with cardinality much smaller than their length.
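The pair-of-streams idea can be sketched in a few lines. This is an illustrative round-trip, not OpenZL's implementation; the function names are made up for the example:

```python
def tokenize(stream):
    """Split a stream into (dictionary, index list)."""
    dictionary = []
    positions = {}          # value -> its index in the dictionary
    indices = []
    for value in stream:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

def detokenize(dictionary, indices):
    """Inverse transform: look each index up in the dictionary."""
    return [dictionary[i] for i in indices]

stream = [300, 17, 300, 300, 42, 17]
dictionary, indices = tokenize(stream)
# dictionary -> [300, 17, 42]; indices -> [0, 1, 0, 0, 2, 1]
assert detokenize(dictionary, indices) == stream
```

The round-trip assert is the losslessness claim: every value is recoverable from the pair.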

When it pays

From the OpenZL post's sao example (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):

"The other fields (IS, MAG, XRPM, XDPM) share a common property: their cardinality is much lower than their quantities, and there is no relation between 2 consecutive values. This makes them a good target for tokenize, which will convert the stream into a dictionary and an index list."

The trigger signal is low cardinality + no inter-value relation:

  • Low cardinality → the dictionary is small.
  • No inter-value relation → delta coding won't help; tokenize is the better fit.
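The two trigger conditions can be checked mechanically. This is a hypothetical heuristic, not the OpenZL trainer's actual logic; the function name and the 10% cardinality threshold are assumptions for illustration:

```python
def suggests_tokenize(values, max_card_ratio=0.1):
    """Crude trigger check: low cardinality, and deltas no simpler
    than the values themselves (i.e. no inter-value relation)."""
    n = len(values)
    if n < 2:
        return False
    low_cardinality = len(set(values)) / n <= max_card_ratio
    # If consecutive values were related, the deltas would collapse to
    # far fewer distinct values than the originals (e.g. sorted data
    # deltas to mostly small constants).
    deltas = [b - a for a, b in zip(values, values[1:])]
    delta_helps = len(set(deltas)) < len(set(values))
    return low_cardinality and not delta_helps

suggests_tokenize([5, 9, 2] * 40)      # few values, unrelated -> True
suggests_tokenize(list(range(120)))    # monotonic, delta-friendly -> False
```

A sorted ID column fails the second test (its deltas are mostly 1), which is why such fields go to delta instead.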

The downstream split

A key architectural consequence: the dictionary and the index list have very different shapes, so they benefit from different downstream compression strategies:

"The resulting dictionaries and index lists are very different. They benefit from completely different compression strategies. So they are sent to dedicated processing graphs." (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.)

This is structure-creating — the tokenize transform produces two streams where there was one, and each gets its own sub-plan in the compression Plan. The result is recursive: OpenZL compresses the transform outputs, which may themselves get transformed.
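The shape difference is concrete: the index list is long, repetitive, and small-range, while the dictionary is short and holds raw values. A minimal sketch of the split, using `zlib` as a stand-in for whatever each dedicated graph would apply (an assumption; OpenZL's actual downstream codecs differ):

```python
import zlib

dictionary = [1001, 2002, 3003]          # few distinct values, stored once
indices = [0, 1, 0, 0, 2, 1] * 50        # long, low-range integer stream

# Index list: every index fits in one byte here, so pack tightly
# before handing it to an entropy coder.
index_bytes = bytes(indices)
packed_indices = zlib.compress(index_bytes)

# Dictionary: raw fixed-width values, routed to its own compressor.
dict_bytes = b"".join(v.to_bytes(4, "little") for v in dictionary)
packed_dict = zlib.compress(dict_bytes)

# The repetitive index stream compresses far below its raw size.
assert len(packed_indices) < len(index_bytes)
```

Each stream then becomes an input to the recursion: the index list might get entropy coding, the dictionary might get transposed or stored raw.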

In the trained Plan

The OpenZL trainer picks tokenize when its search identifies a stream with the right cardinality properties. The choice is per-field, not global — the same file can have some fields tokenized, some delta-encoded, some transposed. That heterogeneity is exactly what the SoA decomposition makes possible.

  • Zstandard dictionary compression (the 2018 precursor to OpenZL's Managed Compression) trains a single shared dictionary across many small messages to amortize the dictionary cost. That's a between-message dictionary.
  • Tokenize in OpenZL builds a per-stream dictionary, inline, for one field of one frame. That's a within-stream dictionary.

Both are dictionary-based compression techniques; they operate at different scales.
