PATTERN Cited by 1 source

Columnar time-partitioned feature storage

Pattern

Store ML features (especially user event sequences) in a columnar, time-partitioned layout that behaves like a set of tables: each enrichment or feature lives in its own column, data is partitioned by time bucket, and engineers can query the substrate with familiar table abstractions. This layout replaces "large, consolidated 'enriched event' blobs", where every read pulls the whole payload regardless of which features the consumer actually uses (Source: sources/2026-05-21-pinterest-making-user-sequence-data-more-cost-efficient-faster-and-easier-to-use).

Shape

               ┌────── one column per enrichment/feature ──────┐
time bucket    │                                                │
   2026-05-23  │ event_id │ ts │ surface │ embedding │ derived  │
   2026-05-24  │ event_id │ ts │ surface │ embedding │ derived  │
   2026-05-25  │ event_id │ ts │ surface │ embedding │ derived  │
               └────────────────────────────────────────────────┘
              read selects  │ only required columns
                            │ for the query
              I/O bounded to (time partition × column subset)

Two axes of optimisation:

  • Column-wise selection — readers project only the columns their model or query actually needs.
  • Time-bucket partitioning — writes + scans are constrained to relevant partitions; long-history accumulation doesn't bloat scan cost.
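
The two axes can be sketched with a toy in-memory model. This is a minimal illustration, not Pinterest's implementation: the `STORE` layout, the `read` helper, and every column name and value are assumptions made up for the example. It counts "cells" touched as a rough stand-in for I/O.

```python
from datetime import date

# Toy layout: {partition_day: {column_name: [one value per event]}}.
# Column names and values are illustrative, not Pinterest's actual schema.
STORE = {
    date(2026, 5, 23): {
        "event_id": [1, 2],
        "ts": [100, 160],
        "surface": ["home", "search"],
        "embedding": [[0.1, 0.2], [0.3, 0.4]],
    },
    date(2026, 5, 24): {
        "event_id": [3, 4],
        "ts": [220, 280],
        "surface": ["home", "home"],
        "embedding": [[0.5, 0.6], [0.7, 0.8]],
    },
    date(2026, 5, 25): {
        "event_id": [5, 6],
        "ts": [340, 400],
        "surface": ["search", "home"],
        "embedding": [[0.9, 1.0], [1.1, 1.2]],
    },
}


def read(store, start, end, columns):
    """Return rows from partitions in [start, end], touching only `columns`."""
    rows, cells_read = [], 0
    for day in sorted(store):
        if not (start <= day <= end):
            continue  # partition pruning: this day's data is never touched
        partition = store[day]
        selected = [partition[name] for name in columns]  # column projection
        for values in zip(*selected):
            rows.append(dict(zip(columns, values)))
            cells_read += len(values)
    return rows, cells_read


rows, cells_read = read(
    STORE, date(2026, 5, 24), date(2026, 5, 25), ["event_id", "surface"]
)
total_cells = sum(len(col) for part in STORE.values() for col in part.values())
print(rows)
print(f"cells read: {cells_read} of {total_cells}")
```

In this toy case the read touches 8 of the 24 stored cells: two of three partitions, and two of four columns. The wide `embedding` column is never loaded.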

Why it matters

Pinterest names two axes of payoff:

Efficiency: "Columnar storage improves compression and reduces network bandwidth by avoiding wide 'enriched event' blobs when only a few features are needed. Time partitioning keeps I/O bounded even as the system accumulates long histories."

Operability: "Having clear table semantics makes it much easier to inspect anomalous days or event types, validate new enrichments, and compare old and new pipelines side by side."

The structural insight: by moving each enrichment into its own column rather than into a fat blob, adding a new enrichment becomes a column-add rather than a payload-shape change. Schema evolution is cheap. Old consumers ignore new columns; new consumers select what they need.
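
A minimal sketch of why a column-add leaves existing readers alone, assuming a toy partition held as one Python list per column. The helpers `project` and `add_column` and the `topic_score` enrichment are hypothetical names for illustration only.

```python
# Toy partition: one list per column, all the same length.
# Column names are illustrative, not taken from the source.
partition = {
    "event_id": [1, 2, 3],
    "ts": [100, 160, 220],
    "surface": ["home", "search", "home"],
}


def project(partition, columns):
    """Return rows containing only the requested columns."""
    selected = [partition[name] for name in columns]
    return [dict(zip(columns, values)) for values in zip(*selected)]


def add_column(partition, name, values):
    """Add a new enrichment as its own column; existing columns are untouched."""
    n_rows = len(next(iter(partition.values())))
    if name in partition:
        raise ValueError(f"column {name!r} already exists")
    if len(values) != n_rows:
        raise ValueError(f"expected {n_rows} values, got {len(values)}")
    partition[name] = list(values)


old_consumer_columns = ["event_id", "surface"]
before = project(partition, old_consumer_columns)

# Hypothetical new enrichment, added without reshaping any existing data.
add_column(partition, "topic_score", [0.9, 0.1, 0.5])

after = project(partition, old_consumer_columns)            # old consumer
new_rows = project(partition, ["event_id", "topic_score"])  # new consumer

print(before == after)  # the old consumer sees exactly what it saw before
print(new_rows)
```

The old consumer's projection is identical before and after the add because it never names the new column; only readers that ask for `topic_score` see it.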

What changes vs the consolidated-blob predecessor

Property                | Consolidated-blob storage              | Columnar time-partitioned storage
------------------------|----------------------------------------|----------------------------------------------
Read I/O                | Pulls full payload regardless of need  | Pulls only requested columns
Network cost on serving | High; payload size = blob size         | Low; payload size = sum of selected columns
Compression             | Per-record; mixed types limit ratio    | Per-column; uniform types compress well
Adding a new feature    | Reshape blob; backfill; bump consumers | Add a column; old consumers ignore it
Inspecting a bad day    | Pull blobs, parse, filter              | SELECT … WHERE day = … in any SQL engine
Long-history scans      | Touches all blobs                      | Time partitions prune the irrelevant majority

Multi-workload payoff

The same columnar / time-partitioned substrate serves three distinct workloads from one storage layout:

Workload         | Read pattern
-----------------|------------------------------------------------------------------
Online inference | Latest partitions; select feature columns the served model needs
Training         | Long history windows; select feature columns the model trains on
Offline analysis | Arbitrary time + column slices

This is what "behaves like a set of tables" unlocks — engineers can use familiar SQL / DataFrame abstractions without per-workload bespoke storage interfaces.
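
As a rough illustration of the table-shaped access described above, the sketch below inspects one day's event types with plain SQL. It uses Python's built-in `sqlite3` purely as a stand-in SQL engine; SQLite is row-oriented, so this shows only the query shape, not columnar I/O behaviour. The `user_events` table, its columns, and the `event_type_counts` helper are assumptions for the example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_events ("
    "day TEXT, event_id INTEGER, surface TEXT, event_type TEXT)"
)
conn.executemany(
    "INSERT INTO user_events VALUES (?, ?, ?, ?)",
    [
        ("2026-05-23", 1, "home", "click"),
        ("2026-05-24", 2, "home", "click"),
        ("2026-05-24", 3, "search", "view"),
        ("2026-05-24", 4, "search", "click"),
        ("2026-05-25", 5, "home", "view"),
    ],
)


def event_type_counts(conn, day):
    """Count events per type for a single day (the 'inspect a bad day' query)."""
    cursor = conn.execute(
        "SELECT event_type, COUNT(*) FROM user_events "
        "WHERE day = ? GROUP BY event_type ORDER BY event_type",
        (day,),
    )
    return cursor.fetchall()


counts = event_type_counts(conn, "2026-05-24")
print(counts)
```

On a real columnar, time-partitioned table, the same query would let the engine prune to the `2026-05-24` partition and read only the `event_type` column.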

Where this pattern fits

  • ML feature substrates with many enrichments and many tenants (Pinterest's user-sequence platform has many event types × many enrichments × many models).
  • Time-partitionable workloads — anything event-shaped with a meaningful timestamp column.
  • Workloads where column projection is a meaningful access pattern — most ML workloads use a small fraction of the available features per request.
  • Workloads that need both online and offline reads from the same substrate.

Where it doesn't fit

  • Row-update-heavy OLTP (columnar layouts are built for append and scan, and generally handle frequent point updates poorly).
  • Sub-millisecond P99 lookups by primary key — purpose-built KV or in-memory feature stores beat columnar at this niche.
  • Ultra-low-cardinality storage where column overhead exceeds blob overhead.

Required ingredients

  • Columnar file format — Parquet, ORC, or proprietary. Pinterest doesn't disclose their choice.
  • Time-partitioning convention — daily / hourly / event-time vs ingestion-time. Has to match read patterns.
  • Table semantics layer — Iceberg, Delta, Hudi, or in-house equivalent. Provides schema evolution, snapshot isolation, time travel.
  • Selective-column query engine — Spark, Trino, Athena, in-house — anything that supports column projection pushdown.
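
To make the time-partitioning choice concrete, here is a small sketch that maps a timestamp to a partition directory. It assumes Hive-style `key=value` directory names, a convention used by engines such as Hive and Spark; the source does not say which convention Pinterest uses, and the `partition_path` helper is hypothetical.

```python
from datetime import datetime, timezone


def partition_path(when: datetime, grain: str = "daily") -> str:
    """Map a timezone-aware timestamp to a Hive-style partition directory."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    t = when.astimezone(timezone.utc)  # partition on UTC to avoid local-time drift
    if grain == "daily":
        return f"day={t:%Y-%m-%d}"
    if grain == "hourly":
        return f"day={t:%Y-%m-%d}/hour={t:%H}"
    raise ValueError(f"unknown grain: {grain!r}")


# A late-arriving event: it happened just before midnight UTC
# but was ingested just after.
event_time = datetime(2026, 5, 23, 23, 30, tzinfo=timezone.utc)
ingest_time = datetime(2026, 5, 24, 0, 5, tzinfo=timezone.utc)

print(partition_path(event_time, "daily"))   # partitioning by event time
print(partition_path(ingest_time, "daily"))  # partitioning by ingestion time
print(partition_path(event_time, "hourly"))
```

The same event lands in different daily partitions depending on whether the table is partitioned by event time or by ingestion time, which is why the convention has to match how readers filter.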

Caveats

  • Partition granularity is a trade-off. Too-fine partitions multiply file overhead; too-coarse partitions defeat the I/O bound. Pinterest doesn't disclose their grain choice.
  • Schema evolution still has limits. Adding columns is cheap; renaming or changing types isn't. Conventions for schema evolution (additive-only, deprecation cycles) matter.
  • Online-read latency. Columnar formats are optimised for scan throughput, not point lookup. Pinterest's online serving API likely materialises hot partitions in a faster substrate or uses cached column projections; the post doesn't disclose specifics.
  • Compaction cost. Streaming writes create many small files; periodic compaction has to be paid for somewhere.
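
The compaction caveat can be sketched as a planning step: group a partition's small files into batches that will each be rewritten as one larger file. This is a simple greedy grouping under assumed numbers (the file sizes, the 128 MB target, and the `plan_compaction` helper are all hypothetical); real table formats ship their own compaction procedures.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Greedily group files, in order, into batches of at most target_mb.

    A single file larger than target_mb ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)  # close the batch before it overflows
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical file sizes (MB) written by a streaming job into one partition.
small_files = [8, 16, 4, 120, 30, 30, 30, 60]
batches = plan_compaction(small_files, target_mb=128)

print(batches)
print(f"{len(small_files)} files -> {len(batches)} files after compaction")
```

Each batch still has to be read and rewritten in full, which is the cost the caveat refers to; the plan only decides how that work is grouped.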
