
CONCEPT Cited by 1 source

Tokenize transform

Definition

The tokenize transform replaces a stream of repeated values with a pair:

  1. A dictionary — the set of distinct values, stored once.
  2. An index list — one integer per position, pointing into the dictionary.

It's reversible (the decoder looks up each index in the dictionary) and lossless. In the format-aware compression setting, it's a pre-entropy-coding transform applied to streams with cardinality much smaller than their length.
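The pair-of-streams idea can be sketched in a few lines. This is an illustrative round-trip, not OpenZL's implementation; the function names are made up for the example:

```python
def tokenize(stream):
    """Split a stream into (dictionary, index list)."""
    dictionary = []
    positions = {}          # value -> its index in the dictionary
    indices = []
    for value in stream:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

def detokenize(dictionary, indices):
    """Inverse transform: look each index up in the dictionary."""
    return [dictionary[i] for i in indices]

stream = [300, 17, 300, 300, 42, 17]
dictionary, indices = tokenize(stream)
# dictionary -> [300, 17, 42]; indices -> [0, 1, 0, 0, 2, 1]
assert detokenize(dictionary, indices) == stream
```

The round-trip assert is the losslessness claim: every value is recoverable from the pair.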

When it pays

From the OpenZL post's sao example (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):

"The other fields (IS, MAG, XRPM, XDPM) share a common property: their cardinality is much lower than their quantities, and there is no relation between 2 consecutive values. This makes them a good target for tokenize, which will convert the stream into a dictionary and an index list."

The trigger signal is low cardinality + no inter-value relation:

  • Low cardinality → the dictionary is small.
  • No inter-value relation → delta coding won't help; tokenize is the better fit.
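The two trigger conditions can be checked mechanically. This is a hypothetical heuristic, not the OpenZL trainer's actual logic; the function name and the 10% cardinality threshold are assumptions for illustration:

```python
def suggests_tokenize(values, max_card_ratio=0.1):
    """Crude trigger check: low cardinality, and deltas no simpler
    than the values themselves (i.e. no inter-value relation)."""
    n = len(values)
    if n < 2:
        return False
    low_cardinality = len(set(values)) / n <= max_card_ratio
    # If consecutive values were related, the deltas would collapse to
    # far fewer distinct values than the originals (e.g. sorted data
    # deltas to mostly small constants).
    deltas = [b - a for a, b in zip(values, values[1:])]
    delta_helps = len(set(deltas)) < len(set(values))
    return low_cardinality and not delta_helps

suggests_tokenize([5, 9, 2] * 40)      # few values, unrelated -> True
suggests_tokenize(list(range(120)))    # monotonic, delta-friendly -> False
```

A sorted ID column fails the second test (its deltas are mostly 1), which is why such fields go to delta instead.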

The downstream split

A key architectural consequence: the dictionary and the index list have very different shapes, so they benefit from different downstream compression strategies:

"The resulting dictionaries and index lists are very different. They benefit from completely different compression strategies. So they are sent to dedicated processing graphs." (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework.)

This is structure-creating — the tokenize transform produces two streams where there was one, and each gets its own sub-plan in the compression Plan. The result is recursive: OpenZL compresses the transform outputs, which may themselves get transformed.
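The shape difference is concrete: the index list is long, repetitive, and small-range, while the dictionary is short and holds raw values. A minimal sketch of the split, using `zlib` as a stand-in for whatever each dedicated graph would apply (an assumption; OpenZL's actual downstream codecs differ):

```python
import zlib

dictionary = [1001, 2002, 3003]          # few distinct values, stored once
indices = [0, 1, 0, 0, 2, 1] * 50        # long, low-range integer stream

# Index list: every index fits in one byte here, so pack tightly
# before handing it to an entropy coder.
index_bytes = bytes(indices)
packed_indices = zlib.compress(index_bytes)

# Dictionary: raw fixed-width values, routed to its own compressor.
dict_bytes = b"".join(v.to_bytes(4, "little") for v in dictionary)
packed_dict = zlib.compress(dict_bytes)

# The repetitive index stream compresses far below its raw size.
assert len(packed_indices) < len(index_bytes)
```

Each stream then becomes an input to the recursion: the index list might get entropy coding, the dictionary might get transposed or stored raw.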

In the trained Plan

The OpenZL trainer picks tokenize when its search identifies a stream with the right cardinality properties. The choice is per-field, not global — the same file can have some fields tokenized, some delta-encoded, some transposed. That heterogeneity is exactly what the SoA decomposition makes possible.

  • Zstandard dictionary compression (the 2018 precursor to OpenZL's Managed Compression) trains a single shared dictionary across many small messages to amortize the dictionary cost. That's a between-message dictionary.
  • Tokenize in OpenZL builds a per-stream dictionary, inline, for one field of one frame. That's a within-stream dictionary.

Both are dictionary-based compression techniques; they operate at different scales.
