
Structure-of-arrays decomposition

Definition

Structure-of-arrays (SoA) decomposition is the reversible transform that turns an array of records (AoS — each record a struct of fields) into a struct of arrays (SoA — one array per field, each array containing that field's value for every record).
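The transform and its inverse can be sketched in a few lines (a minimal illustration; the record fields here are invented, not taken from any OpenZL format):

```python
def aos_to_soa(records):
    """Turn a list of records (array of structs) into a dict of
    lists (struct of arrays), one list per field."""
    fields = records[0].keys()
    return {f: [r[f] for r in records] for f in fields}

def soa_to_aos(columns):
    """Inverse transform: rebuild the original records losslessly."""
    fields = list(columns)
    n = len(columns[fields[0]])
    return [{f: columns[f][i] for f in fields} for i in range(n)]

# Hypothetical records with two fields.
records = [
    {"id": 1, "mag": 9.1},
    {"id": 2, "mag": 8.7},
    {"id": 3, "mag": 8.9},
]

soa = aos_to_soa(records)
print(soa)                          # {'id': [1, 2, 3], 'mag': [9.1, 8.7, 8.9]}
assert soa_to_aos(soa) == records   # round-trips exactly: the transform is reversible
```

The round-trip assertion is the point: nothing is lost, only rearranged, which is what makes SoA usable as a pre-compression transform.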

In the context of compression, it's a reversible pre-entropy-coding transform that takes heterogeneous records and produces a set of homogeneous per-field streams, so each stream can be compressed with a field-specific strategy.

From the OpenZL post

Meta's worked example on the Silesia sao file (Source: sources/2025-10-06-meta-openzl-an-open-source-format-aware-compression-framework):

"We start by separating the header from the rest, a large table of structures. Then each field gets extracted into its own stream: the array of structures becomes a structure of arrays. After that point, we expect that each stream contains homogeneous data of the same type and semantic meaning. We can now focus on finding an optimal compression strategy for each one."

The point is architectural: AoS → SoA isn't a compression technique by itself — it's the decomposition that makes per-field strategies possible. After it, the compressor picks a different strategy per stream (delta for SRA0, transpose for SDEC0, tokenize for IS/MAG/XRPM/XDPM in the sao example).
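Two of the named per-stream strategies can be sketched in miniature (the stream contents below are invented for illustration, not values from the sao file):

```python
def delta_encode(values):
    """Delta: store the first value plus successive differences --
    effective when a stream is sorted or mostly sorted."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def tokenize(values):
    """Tokenize: replace a low-cardinality stream with a small
    alphabet (dictionary) plus per-record indices into it."""
    alphabet = sorted(set(values))
    index = {v: i for i, v in enumerate(alphabet)}
    return alphabet, [index[v] for v in values]

# A mostly-sorted numeric stream: deltas are small and repetitive.
sorted_stream = [1000, 1001, 1003, 1006, 1010]
print(delta_encode(sorted_stream))  # [1000, 1, 2, 3, 4]

# A low-cardinality enum-like stream: a tiny alphabet plus indices.
enum_stream = ["G2", "K0", "G2", "G2", "K0"]
print(tokenize(enum_stream))        # (['G2', 'K0'], [0, 1, 0, 0, 1])
```

Neither output is smaller than its input on its own; the win is that deltas and token indices are far more skewed than the raw values, so the entropy coder that follows does much better.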

Why this is compression-relevant

Byte-level compressors can't easily exploit per-field regularities when fields are interleaved record-by-record:

  • Field N's values recur at a stride of record-size bytes, separated by record-size − sizeof(field N) bytes of other data.
  • A sorted or mostly-sorted field (good for delta-encoding) looks random to a byte-level matcher because of the interleaving.
  • A low-cardinality enum field (good for tokenization) has its values scattered through the stream.

SoA decomposition de-interleaves — each field becomes its own contiguous stream, and the per-stream structure becomes visible to downstream transforms + entropy coders.
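The effect is easy to demonstrate with a byte-level compressor (a self-contained sketch, not OpenZL code; the record layout, field sizes, and data are invented): compress the same records once interleaved (AoS) and once de-interleaved with a delta transform applied to the sorted stream (SoA).

```python
import random
import struct
import zlib

random.seed(0)

# Hypothetical records: a sorted 4-byte id field plus 4 bytes of
# opaque payload per record.
N = 1000
ids = sorted(random.sample(range(1_000_000), N))
payloads = [random.randbytes(4) for _ in range(N)]

# AoS: fields interleaved record by record.
aos = b"".join(struct.pack("<I", i) + p for i, p in zip(ids, payloads))

# SoA: one contiguous stream per field; the sorted stream gets a
# field-specific transform (delta) before entropy coding.
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
id_stream = b"".join(struct.pack("<I", d) for d in deltas)
payload_stream = b"".join(payloads)

z_aos = zlib.compress(aos, 9)
z_soa = zlib.compress(id_stream, 9) + zlib.compress(payload_stream, 9)
print(len(z_aos), len(z_soa))
assert len(z_soa) < len(z_aos)  # de-interleaving + per-field delta wins
```

Interleaved, the sorted ids look random to zlib's match finder; contiguous and delta-encoded, they collapse to a stream of small near-constant values, while the incompressible payload is at least no longer breaking up the other field's matches.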

Database + analytics precedent

The columnar storage format (Parquet, ORC, Arrow, DuckDB, etc.) is the database equivalent — store on disk in SoA rather than AoS so that analytic queries can scan whole columns without touching non-needed fields. OpenZL's on-the-fly AoS → SoA transform brings that same decomposition to general-purpose compression of array-of-records files that aren't already stored columnarly. For inputs that are already columnar (Parquet), OpenZL parses the format and tunes per-column compression directly.
