PATTERN Cited by 1 source

Lambda architecture for fresh and complete sequences

Pattern

Run two cooperating execution paths over the same logical signal definition:

  • Streaming path — processes events as they arrive, maintains a near-real-time view of user event sequences for online inference. Optimises for freshness.
  • Batch path — periodically recomputes enriched events + sequences from raw historical data, produces long sequences and reusable datasets for backfills + offline analysis. Optimises for completeness and correctness.

The two paths cooperate rather than compete: the streaming path owns "now", the batch path owns "fixing history" (Source: sources/2026-05-21-pinterest-making-user-sequence-data-more-cost-efficient-faster-and-easier-to-use).

Why two paths

Sequence consumers want two things that pull in opposite directions:

"On one hand, they need freshness: 'I want this morning's actions reflected in ranking now.' On the other hand, they care about completeness and correctness: 'If late events show up tomorrow, I still want my sequences and training data to be right.'"

Real-world data is messy:

"Events arrive late. Enrichment sources are recomputed or corrected. Backfills introduce new historical coverage months after the fact."

A single path can't optimise for both. A streaming-only path never sees late-arriving events and corrections; a batch-only path is too slow for online inference. The standard answer is two paths, each tuned to one objective, joined by a defined merge policy. See sequence quality dimensions for why these need to be tracked independently.

Shape

              ┌───────────── shared signal definition ─────────────┐
              │  configuration-as-code → portable JSON → engine    │
              └─────────────────────┬──────────────────────────────┘
                    ┌───────────────┴───────────────┐
                    ▼                               ▼
         ┌──────────────────┐               ┌──────────────────┐
         │  Streaming path  │               │   Batch path     │
         │  ──────────────  │               │  ──────────────  │
         │  source: Kafka   │               │  source: DW /    │
         │  cadence: real-  │               │  log archives /  │
         │  time, low-      │               │  snapshots       │
         │  latency writes  │               │  cadence: daily  │
         │                  │               │  / hourly        │
         │                  │               │  recompute       │
         │  output: "now"   │               │  output: "fixed  │
         │  view of seq     │               │  history" of seq │
         └────────┬─────────┘               └────────┬─────────┘
                  │                                  │
                  └──────────────┬───────────────────┘
                  ┌──────────────────────────────────┐
                  │   Columnar time-partitioned      │
                  │   storage — both paths write     │
                  │   into time-bucket partitions    │
                  │   defined merge policy resolves  │
                  │   late arrivals + corrections    │
                  └──────────────────────────────────┘
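
The top box of the diagram (configuration-as-code → portable JSON → engine) can be illustrated with a minimal Python sketch. The SignalDefinition class, its fields, and the helper names are hypothetical; the source does not disclose Pinterest's actual schema or serialisation format. The sketch only shows the idea that both paths load the same serialised definition, so they cannot drift apart.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class SignalDefinition:
    """Hypothetical signal definition; field names are illustrative only."""
    name: str
    event_types: Tuple[str, ...]
    max_sequence_length: int


def to_portable_json(defn: SignalDefinition) -> str:
    # sort_keys gives a deterministic payload that both paths can compare or hash.
    return json.dumps(asdict(defn), sort_keys=True)


def load_definition(payload: str) -> SignalDefinition:
    # Each runtime (streaming or batch) rebuilds the same object from the JSON.
    raw = json.loads(payload)
    return SignalDefinition(
        name=raw["name"],
        event_types=tuple(raw["event_types"]),
        max_sequence_length=raw["max_sequence_length"],
    )


definition = SignalDefinition("user_engagement_seq", ("click", "save"), 100)
payload = to_portable_json(definition)

# Both paths start from the identical payload.
streaming_view = load_definition(payload)
batch_view = load_definition(payload)
```

Because the definition is frozen and round-trips through JSON unchanged, a mismatch between the two paths would have to come from the engines themselves rather than from the definition they were given.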

Cost-control mechanism: shared executor logic

Classical critiques of the Lambda architecture (Kreps, "Questioning the Lambda Architecture", 2014) targeted the dual-code-path tax: two implementations of the same logic, two reconciliation flows, two operational surfaces. Pinterest mitigates this with a shared execution engine and pluggable executors, and frames the two paths as cooperating:

"The two paths cooperate instead of competing. The streaming path maintains the 'now' view of the world, while the batch path focuses on 'fixing history' and ensuring that training and long-term analytics see consistent, corrected data."

The same executors run in both runtimes. Only the scheduling shape differs:

  • Streaming → record-at-a-time, low-latency writes, narrow time windows.
  • Batch → bounded-table reads, rewrite of larger time windows, includes corrections.

This collapses the "two code paths" objection to "two scheduling shapes of one code path" — far cheaper to maintain.
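
A minimal Python sketch of "two scheduling shapes of one code path" follows. The enrich executor, the event fields, and both runner functions are hypothetical stand-ins, not Pinterest's engine; the only point is that the same executor function is passed to a record-at-a-time runner and to a bounded-table runner.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Event = Dict[str, Any]
Executor = Callable[[Event], Event]


def enrich(event: Event) -> Event:
    """Hypothetical executor: tags each event with a coarse action class."""
    action_class = "engagement" if event["action"] in {"click", "save"} else "other"
    return {**event, "action_class": action_class}


def run_streaming(source: Iterable[Event], executor: Executor) -> Iterator[Event]:
    # Record-at-a-time: each event is enriched and emitted in arrival order.
    for event in source:
        yield executor(event)


def run_batch(table: List[Event], executor: Executor) -> List[Event]:
    # Bounded read: the whole window is recomputed and ordered by event time,
    # so late arrivals end up in their correct position.
    return [executor(e) for e in sorted(table, key=lambda e: e["ts"])]


events = [
    {"user": "u1", "action": "click", "ts": 3},
    {"user": "u1", "action": "view", "ts": 1},  # arrived late
]

streamed = list(run_streaming(events, enrich))
recomputed = run_batch(events, enrich)
```

Both runners produce identically enriched events; they differ only in ordering and in how much data each invocation touches.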

Cooperation, not competition

The two paths address complementary slices of the quality contract:

Quality dimension     | Streaming path        | Batch path
Freshness             | ✅ primary            | secondary
Completeness          | secondary             | ✅ primary
Consistent enrichment | shared executor logic | shared executor logic
Stable schemas        | shared definition     | shared definition

Online inference reads predominantly from the streaming path's outputs (or from the most recent batch overlap). Training datasets and offline analysis read predominantly from the batch path's outputs, which cover longer history and incorporate late events.
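
The read preferences above can be sketched as a small routing function. This is an illustrative assumption, not a documented Pinterest API: the Consumer enum, the choose_source function, and the fallback rule (online inference may fall back to batch output, while completeness-sensitive consumers never fall back to streaming) are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Consumer(Enum):
    ONLINE_INFERENCE = "online_inference"
    TRAINING = "training"
    OFFLINE_ANALYSIS = "offline_analysis"


def choose_source(
    consumer: Consumer, streaming_ready: bool, batch_ready: bool
) -> Optional[str]:
    """Return which path's output a consumer should read, or None if nothing fits."""
    if consumer is Consumer.ONLINE_INFERENCE:
        # Freshness first; fall back to the most recent batch output if needed.
        if streaming_ready:
            return "streaming"
        return "batch" if batch_ready else None
    # Training and offline analysis need completeness, so they only read batch.
    return "batch" if batch_ready else None
```

For example, choose_source(Consumer.TRAINING, streaming_ready=True, batch_ready=False) returns None rather than silently serving an incomplete streaming view.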

Where this pattern fits

  • ML feature platforms, especially for sequence features and user-state features.
  • Real-time + historical analytics over the same event substrate.
  • Workloads where late-arriving / corrected data is structural (advertising attribution, fraud, financial reconciliation).
  • Multi-tenant data platforms supporting both online inference and training pipelines on the same underlying definitions.

Where it doesn't fit

  • Append-only event logs with no late arrivals + no corrections — single streaming path is enough.
  • Pure batch workloads — no need for a streaming path's complexity.
  • Substrates without a shared execution engine — the dual-code-path tax dominates and Kreps's objection applies.

Required ingredients

Caveats

  • Merge policy is hard to specify — last-write-wins, batch-overrides-after-T, per-column reconciliation each have failure modes. Pinterest names "a clear policy for how the two paths merge" as a goal but doesn't disclose the specifics.
  • Streaming path can't see all late events. The streaming path's view is necessarily incomplete; consumers who require completeness must read from a recent batch partition, not from streaming output.
  • Schema rollouts are doubly tricky. A schema change has to land in both paths simultaneously, including across in-flight streaming jobs and queued batch jobs.
  • Dashboards must distinguish the two paths. Freshness dashboards measure streaming; completeness dashboards measure batch. Mixing them masks both.
  • Cost ratio between paths is workload-dependent. The batch path can dominate cost (long history rewrites) or be cheap (small late-arrival deltas via CDF-style incremental processing). Pinterest doesn't disclose their split.
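
To make the merge-policy caveat concrete, here is a minimal sketch of one of the options named above, batch-overrides-after-T, applied per time-bucket partition. Pinterest does not disclose its policy, so everything here is an assumption: the PartitionWrite record, the merge function, and the settle_after threshold are hypothetical. Before a bucket settles, the latest write wins; once it is older than settle_after seconds, only a batch write may replace it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartitionWrite:
    path: str          # "streaming" or "batch"
    bucket_start: int  # start of the time bucket, epoch seconds
    written_at: int    # when the write happened, epoch seconds
    rows: int


def merge(
    current: Optional[PartitionWrite],
    incoming: PartitionWrite,
    now: int,
    settle_after: int,
) -> PartitionWrite:
    """Hypothetical batch-overrides-after-T policy for a single time bucket."""
    if current is None:
        return incoming
    settled = now - incoming.bucket_start >= settle_after
    if settled:
        # Settled history: only the batch path may rewrite it.
        newer_batch = current.path != "batch" or incoming.written_at >= current.written_at
        if incoming.path == "batch" and newer_batch:
            return incoming
        return current
    # Unsettled bucket: last write wins, regardless of path.
    return incoming if incoming.written_at >= current.written_at else current


stream_write = PartitionWrite("streaming", bucket_start=0, written_at=10, rows=90)
batch_write = PartitionWrite("batch", bucket_start=0, written_at=5000, rows=100)
late_stream = PartitionWrite("streaming", bucket_start=0, written_at=6000, rows=95)

after_batch = merge(stream_write, batch_write, now=5000, settle_after=3600)
after_late = merge(after_batch, late_stream, now=6000, settle_after=3600)
```

This sketch has its own failure mode: a late streaming write that carries a genuine correction is dropped until the next batch run picks it up. That is the kind of trade-off the caveat refers to.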

Sibling patterns

  • Kappa architecture: a single streaming path with replayability instead of two paths. Trades lower operational complexity (one path) for higher substrate complexity (replayable log + reprocessing). Pinterest explicitly chose Lambda for this workload.
  • CDF-based incremental processing (patterns/cdf-incremental-replacing-full-rescan): sibling at the batch path. Rather than recomputing via a full rescan, process only what changed since the last batch run.
  • patterns/dual-tier-observability-tsdb-plus-lakehouse: sibling at the observability layer, with a hot tier for fresh queries and a lakehouse tier for completeness + analytics.

Seen in
