CONCEPT Cited by 2 sources
Columnar storage format¶
A columnar storage format lays out a tabular dataset on disk by
column rather than by row: all values of column A are
contiguous, then all values of column B, etc. Compared to the
row-oriented layout that OLTP engines use, columnar formats win on
analytical / scan-heavy / compression-sensitive workloads
(concepts/oltp-vs-olap).
Why columnar¶
- Selective I/O. A query reading 3 of 200 columns reads roughly 3/200 of the file's bytes (weighted by per-column size), not the whole thing.
- Per-column compression / encoding. Values in one column are the same type and often low-cardinality; run-length, dictionary, delta, and bitpacking encodings exploit that locality. Each column can pick the encoding that fits.
- Vectorised execution. Scanners process a column's values in tight loops over primitive arrays — SIMD, branch-predictable.
- Column-level metadata for pruning. Min/max, null count, and (as Husky shows) richer pruning structures can be stored per column per row group, enabling the engine to skip row groups or whole files without touching payload data. See concepts/fragment-pruning.
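The encoding and pruning ideas above are simple enough to sketch. Below is a toy dictionary + run-length encoder for a low-cardinality string column, plus per-row-group min/max stats — a minimal illustration, not any real format's wire layout:

```python
# Toy sketch (not a real format): dictionary- and run-length-encode one
# low-cardinality column, and keep min/max stats for row-group pruning.

def dictionary_encode(values):
    """Map each value to a small integer code; pays off on low-cardinality columns."""
    dictionary = {}
    codes = []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), codes

def run_length_encode(codes):
    """Collapse runs of identical codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

status = ["ok", "ok", "ok", "error", "error", "ok"]
dictionary, codes = dictionary_encode(status)
runs = run_length_encode(codes)
# dictionary → ['ok', 'error']; runs → [[0, 3], [1, 2], [0, 1]]

# Per-column min/max lets a reader skip row groups without decoding payloads.
latencies = [12, 7, 30]          # one row group's worth of a numeric column
stats = {"min": min(latencies), "max": max(latencies)}
# A predicate like `latency > 100` prunes this row group: stats["max"] < 100.
```

Note that the two encodings compose: dictionary codes are small integers, which run-length and bitpacking then compress further.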
Row groups / pages¶
A columnar file is typically split into row groups (Parquet's term; ORC calls them stripes), each containing a bounded number of rows across all columns, stored column-at-a-time within the group and further divided into per-column pages. Row groups bound memory for both writers and readers — a reader only needs to materialise one row group per column at a time.
Husky uses the same shape but adapts row-group size to the
heaviest column: if any column's values are large (Husky's message
column for logs allows up to 75 KiB per event), a large fragment
with millions of events would need more than 1 GiB of memory per
column without row groups, so the compactor sizes row groups against
the heaviest input column and restarts with smaller groups if it
overshoots.
(Source: sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale)
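That sizing logic can be sketched as follows. This is a hypothetical reconstruction, not Husky's code: the function names and the 256 MiB budget are invented for illustration; only the 75 KiB-per-event message limit comes from the source.

```python
# Hypothetical sketch: pick a row-group row count from a memory budget
# divided by the heaviest column's estimated bytes-per-row, and restart
# with smaller groups if the written group overshoots the budget.

def size_row_groups(bytes_per_row_by_column, total_rows, budget_bytes):
    """Bound the row-group size by the heaviest column's footprint."""
    heaviest = max(bytes_per_row_by_column.values())
    rows_per_group = max(1, budget_bytes // heaviest)
    return min(rows_per_group, total_rows)

def write_with_retry(bytes_per_row_by_column, total_rows, budget_bytes,
                     actual_group_bytes):
    """If an estimate-based group overshoots the budget, restart smaller."""
    rows = size_row_groups(bytes_per_row_by_column, total_rows, budget_bytes)
    while actual_group_bytes(rows) > budget_bytes and rows > 1:
        rows //= 2                  # restart with smaller row groups
    return rows

# A log table whose `message` column dominates (up to 75 KiB per event):
est = {"timestamp": 8, "level": 1, "message": 75 * 1024}
rows = size_row_groups(est, total_rows=5_000_000, budget_bytes=256 * 2**20)
# 256 MiB budget / 75 KiB per row ≈ 3,495 rows per group, not millions.
```

The point of the sketch: without row groups the divisor is the whole fragment, so one heavy column forces gigabytes of buffering; with them, memory stays near the budget regardless of fragment size.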
Canonical column-format contract¶
Across formats, the column-at-a-time-within-each-row-group layout gives compaction and merging a useful property:
Each column may be read and written in its entirety before moving on to the next column … the amount of memory used during the merge is bounded to one row group per input fragment. — (Source: Husky compaction post)
This is the property behind the patterns/streaming-k-way-merge compaction pattern: 1 GET per input, streamed in parallel, one column × one row group at a time.
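A toy version of that contract, with data structures invented for illustration (real compaction streams encoded bytes from object storage, not Python lists):

```python
# Minimal sketch of column-at-a-time merging: each input fragment is a
# list of row groups, each row group a dict of column -> values. Memory
# stays bounded to one row group per input, because we concatenate
# column by column, row group by row group.

def merge_fragments(fragments, columns):
    """Concatenate inputs into one output, one column × one row group at a time."""
    merged = {col: [] for col in columns}
    for col in columns:
        for fragment in fragments:        # conceptually: one GET per input
            for row_group in fragment:    # stream one row group at a time
                merged[col].extend(row_group[col])
    return merged

frag_a = [{"ts": [1, 2], "msg": ["a", "b"]}, {"ts": [3], "msg": ["c"]}]
frag_b = [{"ts": [4, 5], "msg": ["d", "e"]}]
out = merge_fragments([frag_a, frag_b], columns=["ts", "msg"])
# out["ts"] → [1, 2, 3, 4, 5]
```

Because each column is finished before the next begins, the working set is one row group per input fragment — the bound quoted above.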
Trade-offs vs row-oriented¶
- Point-reads and small-row writes are worse. Fetching a single row means touching every column. OLTP stores stay row-oriented.
- Schema breadth costs. Each column has its own header / encoding; wide schemas (hundreds of thousands of columns — common in multi-tenant observability tables) make the header itself a design concern (Husky embeds a column-position skip list to mitigate this).
- Mutations are expensive. Whole-column rewrites, or append-and-compact via LSM (concepts/lsm-compaction), are the usual answers.
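The point-read cost is easy to see in a toy column-major layout (structure invented for illustration):

```python
# With column-major storage, materialising one logical row means touching
# every column's array — k lookups (or k I/Os on disk) for k columns.

columns = {"ts": [1, 2, 3], "level": ["i", "w", "e"], "msg": ["a", "b", "c"]}

def fetch_row(columns, i):
    """Reassemble row i by indexing into every column."""
    return {name: values[i] for name, values in columns.items()}

row = fetch_row(columns, 1)
# → {"ts": 2, "level": "w", "msg": "b"}
```

In a row-oriented layout the same row is one contiguous read, which is why OLTP stores keep it.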
Seen in¶
- systems/apache-parquet — de-facto columnar format for data lakes on object storage.
- systems/husky — custom Parquet-like columnar format tuned for observability (ultra-wide schemas, adaptive row-group size, header-embedded pruning predicates).
- systems/apache-iceberg, systems/s3-tables — open table formats layered over columnar Parquet data files.
- sources/2025-03-14-allthingsdistributed-s3-simplicity-is-table-stakes — "S3 stores exabytes of Parquet data and serves hundreds of PB/day".
- sources/2025-01-29-datadog-husky-efficient-compaction-at-datadog-scale — columnar layout as the enabler of streaming compaction + fragment pruning.