PATTERN Cited by 1 source

Columnar time-partitioned feature storage¶

Pattern¶

Store ML features (especially user event sequences) in a columnar, time-partitioned layout that behaves like a set of tables: each enrichment / feature lives in its own column, data is partitioned by time bucket, and engineers can query the substrate with familiar table abstractions. The pattern replaces "large, consolidated 'enriched event' blobs", where every read pulls the whole payload regardless of which features the consumer actually uses (Source: sources/2026-05-21-pinterest-making-user-sequence-data-more-cost-efficient-faster-and-easier-to-use).

Shape¶

               ┌────── one column per enrichment/feature ──────┐
time bucket    │                                                │
   2026-05-23  │ event_id │ ts │ surface │ embedding │ derived  │
   2026-05-24  │ event_id │ ts │ surface │ embedding │ derived  │
   2026-05-25  │ event_id │ ts │ surface │ embedding │ derived  │
               └────────────────────────────────────────────────┘
                            ▲
                            │
              read selects  │ only required columns
                            │ for the query
                            │
                            ▼
              I/O bounded to (time partition × column subset)

Two axes of optimisation:

Column-wise selection — readers project only the columns their model or query actually needs.
Time-bucket partitioning — writes + scans are constrained to relevant partitions; long-history accumulation doesn't bloat scan cost.
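
The two axes can be sketched with a toy in-memory store. This is an illustration only, not Pinterest's implementation: the `ColumnarStore` class, its method names, and the "cells touched" counter are hypothetical, and a real system would read column chunks from files rather than Python lists.

```python
from collections import defaultdict


class ColumnarStore:
    """Toy layout: partition (day) -> column name -> list of values."""

    def __init__(self):
        self._parts = defaultdict(lambda: defaultdict(list))

    def append(self, day, row):
        # Each field of the row lands in its own column inside the day's partition.
        for name, value in row.items():
            self._parts[day][name].append(value)

    def read(self, days, columns):
        """Return (result, cells_touched) for the requested days and columns only."""
        cells = 0
        out = {}
        for day in days:
            if day not in self._parts:
                continue  # partition pruning: unknown days cost nothing
            part = self._parts[day]
            out[day] = {c: part[c] for c in columns if c in part}
            cells += sum(len(v) for v in out[day].values())
        return out, cells


store = ColumnarStore()
all_days = ["2026-05-23", "2026-05-24", "2026-05-25"]
all_cols = ["event_id", "ts", "surface", "embedding", "derived"]
for day in all_days:
    for i in range(4):
        store.append(day, {"event_id": i, "ts": i, "surface": "home",
                           "embedding": [0.0] * 8, "derived": i * 2})

# One day, two columns: 4 rows x 2 columns = 8 cells.
_, narrow = store.read(["2026-05-25"], ["event_id", "surface"])
# Everything: 3 days x 4 rows x 5 columns = 60 cells.
_, full = store.read(all_days, all_cols)
```

The narrow read touches 8 cells against 60 for the full read, which is the "time partition × column subset" bound from the diagram in miniature.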

Why it matters¶

Pinterest names two axes of payoff:

Efficiency — "Columnar storage improves compression and reduces network bandwidth by avoiding wide 'enriched event' blobs when only a few features are needed. Time partitioning keeps I/O bounded even as the system accumulates long histories."

Operability — "Having clear table semantics makes it much easier to inspect anomalous days or event types, validate new enrichments, and compare old and new pipelines side by side."

The structural insight: by moving each enrichment into its own column rather than into a fat blob, adding a new enrichment becomes a column-add rather than a payload-shape change. Schema evolution is cheap. Old consumers ignore new columns; new consumers select what they need.
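
A minimal sketch of that column-add behaviour, under stated assumptions: partitions are modelled as plain dicts of column lists, `add_column` and `project` are hypothetical helpers, and the choice to return `None` for a column that an older partition lacks mirrors how table formats commonly fill missing columns with nulls rather than anything the source describes.

```python
def add_column(partition, name, values):
    """Add a new enrichment as its own column; existing columns are untouched."""
    n = len(next(iter(partition.values())))
    if len(values) != n:
        raise ValueError("new column must have one value per existing row")
    partition[name] = list(values)


def project(partition, columns):
    """Read only the named columns; a column absent from this partition reads as None."""
    n = len(next(iter(partition.values())))
    return {c: partition.get(c, [None] * n) for c in columns}


old_day = {"event_id": [1, 2], "surface": ["home", "search"]}
new_day = {"event_id": [3, 4], "surface": ["home", "home"]}

# An existing consumer reads the same two columns before and after the change.
before = project(new_day, ["event_id", "surface"])
add_column(new_day, "topic_score", [0.7, 0.1])  # hypothetical new enrichment
after = project(new_day, ["event_id", "surface"])

# A new consumer can select the new column, including over older partitions.
backward = project(old_day, ["event_id", "topic_score"])
```

The existing consumer's result is unchanged by the new column, and the new consumer sees nulls for days written before the enrichment existed.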

What changes vs the consolidated-blob predecessor¶

Property	Consolidated-blob storage	Columnar time-partitioned storage
Read I/O	Pulls full payload regardless of need	Pulls only requested columns
Network cost on serving	High; payload size = blob size	Low; payload size = sum of selected columns
Compression	Per-record; mixed types limit ratio	Per-column; uniform types compress well
Adding a new feature	Reshape blob; backfill; bump consumers	Add a column; old consumers ignore it
Inspecting a bad day	Pull blobs, parse, filter	`SELECT … WHERE day = …` in any SQL engine
Long-history scans	Touches all blobs	Time partitions prune the irrelevant majority
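
The compression row can be illustrated with a toy comparison. This is a rough sketch, not a benchmark: the records are synthetic, JSON plus `zlib` stands in for whatever encodings a real system uses, and compressing every blob individually is an assumption chosen to make the per-record case explicit.

```python
import json
import random
import zlib

random.seed(0)
surfaces = ["home", "search", "board"]
records = [
    {"event_id": i, "ts": 1_700_000_000 + i, "surface": random.choice(surfaces)}
    for i in range(1000)
]

# Blob layout: each record is serialized and compressed on its own.
blob_bytes = sum(len(zlib.compress(json.dumps(r).encode())) for r in records)

# Columnar layout: all values of one field are stored and compressed together.
columns = {k: [r[k] for r in records] for k in records[0]}
columnar_bytes = sum(
    len(zlib.compress(json.dumps(v).encode())) for v in columns.values()
)

print(f"per-record blobs: {blob_bytes} bytes, per-column: {columnar_bytes} bytes")
```

On data like this the per-column total comes out smaller because each column holds values of one type with shared structure; the size of the gap on real data depends on the format and on how blobs are batched.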

Multi-workload payoff¶

The same columnar / time-partitioned substrate serves three distinct workloads from one storage layout:

Workload	Read pattern
Online inference	Latest partitions; select feature columns the served model needs
Training	Long history windows; select feature columns the model trains on
Offline analysis	Arbitrary time + column slices

This is what "behaves like a set of tables" unlocks — engineers can use familiar SQL / DataFrame abstractions without per-workload bespoke storage interfaces.
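
The three read patterns can be sketched as plain SQL. The example below uses Python's built-in `sqlite3` purely as a stand-in for table semantics: SQLite is a row store, so it shows the query shapes and not the I/O savings, and the `user_events` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_events (day TEXT, event_id INTEGER, surface TEXT, derived REAL)"
)
rows = [
    ("2026-05-23", 1, "home", 0.2),
    ("2026-05-23", 2, "search", 0.4),
    ("2026-05-24", 3, "home", 0.1),
    ("2026-05-25", 4, "home", 0.9),
    ("2026-05-25", 5, "search", 0.3),
]
conn.executemany("INSERT INTO user_events VALUES (?, ?, ?, ?)", rows)

# Online inference: latest partition, only the feature columns the model needs.
online = conn.execute(
    "SELECT event_id, derived FROM user_events "
    "WHERE day = (SELECT MAX(day) FROM user_events) ORDER BY event_id"
).fetchall()

# Training: a longer history window, again with an explicit column list.
training = conn.execute(
    "SELECT surface, derived FROM user_events "
    "WHERE day BETWEEN ? AND ? ORDER BY event_id",
    ("2026-05-23", "2026-05-25"),
).fetchall()

# Offline analysis: per-day row counts, for example to spot an anomalous day.
per_day = conn.execute(
    "SELECT day, COUNT(*) FROM user_events GROUP BY day ORDER BY day"
).fetchall()
```

All three queries differ only in their time predicate and column list, which is the sense in which one layout serves every workload.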

Where this pattern fits¶

ML feature substrates with many enrichments and many tenants (Pinterest's user-sequence platform has many event types × many enrichments × many models).
Time-partitionable workloads — anything event-shaped with a meaningful timestamp column.
Workloads where column projection is a meaningful access pattern — most ML workloads use a small fraction of the available features per request.
Workloads that need both online and offline reads from the same substrate.

Where it doesn't fit¶

Row-update-heavy OLTP (columnar layouts handle frequent point updates poorly, since changing one row typically means rewriting column chunks or whole files).
Sub-millisecond P99 lookups by primary key — purpose-built KV or in-memory feature stores beat columnar at this niche.
Ultra-low-cardinality storage (very few records, or very few features per record), where per-column overhead exceeds blob overhead.

Required ingredients¶

Columnar file format — Parquet, ORC, or proprietary. Pinterest doesn't disclose their choice.
Time-partitioning convention — daily / hourly / event-time vs ingestion-time. Has to match read patterns.
Table semantics layer — Iceberg, Delta, Hudi, or in-house equivalent. Provides schema evolution, snapshot isolation, time travel.
Selective-column query engine — Spark, Trino, Athena, in-house — anything that supports column projection pushdown.

Sibling patterns¶

concepts/columnar-storage-format — the underlying primitive; this pattern applies it to per-event ML feature substrates with table semantics.
patterns/sort-by-request-id-for-columnar-compression — Pinterest's request-level deduplication pattern; complementary at the same substrate (sort key on top of columnar compression for an additional 10–50× compression on user-heavy columns).
concepts/clickhouse-mergetree-partition-by-time — sibling at observability / time-series altitude using ClickHouse MergeTree.
patterns/time-partitioned-mergetree-for-time-series — generalisation of the time-partition discipline to time-series.

Caveats¶

Partition granularity is a trade-off. Too-fine partitions multiply file overhead; too-coarse partitions defeat the I/O bound. Pinterest doesn't disclose their grain choice.
Schema evolution still has limits. Adding columns is cheap; renaming or changing types isn't. Conventions for schema evolution (additive-only, deprecation cycles) matter.
Online-read latency. Columnar formats are optimised for scan throughput, not point lookup. Pinterest's online serving API likely materialises hot partitions in a faster substrate or uses cached column projections; the post doesn't disclose specifics.
Compaction cost. Streaming writes create many small files; periodic compaction has to be paid for somewhere.
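
The partition-granularity trade-off can be made concrete with some back-of-the-envelope arithmetic. This sketch assumes fixed-width partitions aligned to the Unix epoch and a half-open query window; the function names and the three-hour example window are hypothetical, and Pinterest's actual grain is not disclosed.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def partitions_touched(start, end, grain):
    """Number of `grain`-wide partitions overlapping the half-open window [start, end)."""
    first = (start - EPOCH) // grain
    last = (end - EPOCH - timedelta(microseconds=1)) // grain
    return last - first + 1


def read_amplification(start, end, grain):
    """Time span scanned divided by time span actually requested."""
    return partitions_touched(start, end, grain) * grain / (end - start)


start = datetime(2026, 5, 24, 9)
end = datetime(2026, 5, 24, 12)
hourly, daily = timedelta(hours=1), timedelta(days=1)

hourly_touched = partitions_touched(start, end, hourly)  # 3 partitions
daily_touched = partitions_touched(start, end, daily)    # 1 partition
hourly_amp = read_amplification(start, end, hourly)      # 1.0: scans exactly 3 hours
daily_amp = read_amplification(start, end, daily)        # 8.0: scans 24 hours for 3

# Partitions accumulated over a year of history at each grain.
hourly_per_year = timedelta(days=365) // hourly          # 8760
daily_per_year = timedelta(days=365) // daily            # 365
```

Hourly partitions keep the three-hour read tight but leave 24 times as many partitions (and at least that many files) to track and compact; daily partitions invert the trade.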

Seen in¶

sources/2026-05-21-pinterest-making-user-sequence-data-more-cost-efficient-faster-and-easier-to-use — first canonical wiki instance applied to user-sequence ML substrates with the explicit "column per enrichment + time partition + table semantics" triangle.