PATTERN Cited by 1 source
Hybrid batch + streaming + direct-write ingestion¶
Definition¶
Hybrid batch + streaming + direct-write ingestion is the architectural pattern of splitting feature (or more generally data) ingestion into three complementary lanes — each matching a different freshness/cost/complexity trade-off — instead of forcing all features through a single pipeline.
The three lanes:
- Batch: periodic, heavy joins/aggregations over historical windows, typically Spark-backed; hours-to-minutes freshness. Absorbs the high-volume transformations that can't run online.
- Streaming: unbounded near-real-time processing of events; seconds-to-minutes freshness. Handles signals that must reflect what users are doing right now.
- Direct writes: application-level writes straight into the online store, bypassing the ingestion pipeline entirely; seconds freshness. Escape hatch for lightweight or precomputed features.
Why three lanes and not one¶
Every single-path design eventually fails somewhere:
- Batch-only misses the freshness SLO for interaction signals ("user opened doc 2s ago should surface in next search").
- Streaming-only makes heavy joins/aggregations over historical windows expensive or impractical at scale.
- Direct-write-only requires the application to compute every feature itself, which defeats the purpose of a shared feature store.
The hybrid approach lets each feature hit the lane that matches its shape:
- Signals that need hours of history but minutes of freshness → batch with change detection.
- Signals that need seconds of freshness → streaming.
- Signals computed by an adjacent pipeline (e.g. an LLM evaluation service) that just need to land in the store → direct writes.
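The per-feature assignment rule above can be sketched as a small routing function. This is a minimal illustration, not the Dash implementation: `FeatureSpec`, its fields, and the 60-second threshold are all hypothetical names and values chosen to make the rule concrete.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    BATCH = "batch"
    STREAMING = "streaming"
    DIRECT_WRITE = "direct_write"


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical per-feature descriptor; field names are illustrative."""
    name: str
    max_staleness_s: float       # freshness target for the feature, in seconds
    precomputed_upstream: bool   # True if an adjacent pipeline already produces the value


# Assumed cut-off between "seconds of freshness" and "minutes of freshness".
STREAMING_STALENESS_S = 60.0


def assign_lane(spec: FeatureSpec) -> Lane:
    """Pick the ingestion lane that matches the feature's shape."""
    # Values produced elsewhere only need a place to land.
    if spec.precomputed_upstream:
        return Lane.DIRECT_WRITE
    # Seconds-level freshness requires the streaming lane.
    if spec.max_staleness_s < STREAMING_STALENESS_S:
        return Lane.STREAMING
    # Everything else tolerates batch cadence (ideally with change detection).
    return Lane.BATCH
```

A real assignment rule would likely also weigh the cost of the transformation and the history window it needs; the sketch only encodes the three cases listed above.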
Dropbox Dash realization¶
Canonical instance from the 2025-12-18 Dash feature-store post:
- Batch ingestion on a medallion architecture (raw → refined layers), with intelligent change detection — see patterns/change-detection-ingestion. "Reduced write volumes from hundreds of millions to under one million records per run and cut update times from more than an hour to under five minutes."
- Streaming ingestion "captures fast-moving signals such as collaboration activity or content interactions. By processing unbounded datasets in near-real time, it ensures features stay aligned with what users are doing in the moment."
- Direct writes "handle lightweight or precomputed features by bypassing batch pipelines entirely. For example, relevance scores produced by a separate LLM evaluation pipeline can be written directly to the online store in seconds instead of waiting for the next batch cycle."
(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)
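The quoted post does not describe how its change detection works, so the following is only one plausible shape for the idea: fingerprint each feature record, compare against the fingerprints kept from the previous run, and write only the records that differ. The function names and the choice of SHA-256 over canonical JSON are assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict, Tuple

Record = Dict[str, Any]


def fingerprint(record: Record) -> str:
    """Stable content hash of a record; key order does not affect the result."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_records(
    current: Dict[str, Record],
    previous_fingerprints: Dict[str, str],
) -> Tuple[Dict[str, Record], Dict[str, str]]:
    """Return (records to write, fingerprints to persist for the next run)."""
    to_write: Dict[str, Record] = {}
    new_fingerprints: Dict[str, str] = {}
    for key, record in current.items():
        fp = fingerprint(record)
        new_fingerprints[key] = fp
        # Write only new keys or keys whose content changed since the last run.
        if previous_fingerprints.get(key) != fp:
            to_write[key] = record
    return to_write, new_fingerprints


# Two consecutive batch runs: only the record that changed is written again.
run1 = {"doc:1": {"views_7d": 10}, "doc:2": {"views_7d": 3}}
writes1, fps1 = changed_records(run1, {})
run2 = {"doc:1": {"views_7d": 10}, "doc:2": {"views_7d": 4}}
writes2, fps2 = changed_records(run2, fps1)
```

The sketch ignores deletions (keys present in the previous run but missing from the current one); a production pipeline would need an explicit rule for those.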
Relation to classic Lambda / Kappa architectures¶
- Lambda architecture is the original "hot path + cold path" pattern (real-time + batch) reconciling to the same serving layer. Hybrid batch + streaming is the direct descendant.
- Kappa architecture argued the batch path was unnecessary: everything goes through the stream. In practice, for heavy historical aggregations at the data volumes of a ranking feature store, the batch path returns because running those transformations in the stream is uneconomical.
- The direct-write lane is the Dropbox-named third lane — not a hot-vs-cold distinction but a "skip the pipeline entirely for features whose producer is already downstream of it" escape.
When to use it¶
Apply this pattern when:
- Feature / data freshness requirements vary across features on the same substrate.
- Heavy historical joins/aggregations can't fit in the streaming lane's cost/latency envelope.
- An adjacent pipeline already produces some features and just needs a place to land them.
Don't apply it when:
- All features have uniform freshness requirements — pick one lane.
- The online store can't absorb direct writes from multiple producers without coordination (contention, quota, consistency issues).
- The ingestion complexity tax (three pipelines + assignment rule per feature) isn't justified by the freshness/cost diversity.
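The coordination concern for multi-producer direct writes can be made concrete with a small sketch. One common mitigation (an assumption here, not something the source describes) is a conditional write keyed on event time, so that a slower lane cannot overwrite a fresher value. `InMemoryOnlineStore` and `put_if_newer` are hypothetical names for a toy stand-in, not any real store's API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class StoredValue:
    value: Any
    event_time: float  # producer-supplied timestamp, seconds since epoch
    producer: str      # which lane or service wrote the value


class InMemoryOnlineStore:
    """Toy stand-in for an online store, used only to illustrate the guard."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], StoredValue] = {}

    def put_if_newer(
        self, entity_id: str, feature: str, value: Any,
        event_time: float, producer: str,
    ) -> bool:
        """Apply the write only if it is strictly newer than the stored value."""
        key = (entity_id, feature)
        existing = self._rows.get(key)
        if existing is not None and existing.event_time >= event_time:
            return False  # stale or duplicate write: keep the fresher value
        self._rows[key] = StoredValue(value, event_time, producer)
        return True

    def get(self, entity_id: str, feature: str) -> Optional[StoredValue]:
        return self._rows.get((entity_id, feature))


store = InMemoryOnlineStore()
# A direct write lands first, then a batch run tries to write an older value.
accepted_direct = store.put_if_newer("doc:1", "relevance", 0.91, 100.0, "llm_eval")
accepted_stale = store.put_if_newer("doc:1", "relevance", 0.40, 90.0, "batch")
accepted_fresh = store.put_if_newer("doc:1", "relevance", 0.95, 110.0, "llm_eval")
```

If the online store offers no conditional-write primitive of this kind, that is a signal the direct-write lane may cost more in coordination than it saves in latency.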
Related¶
- concepts/feature-freshness — what the three lanes optimize for.
- patterns/change-detection-ingestion — the optimization that makes the batch lane viable at cost.
- concepts/feature-store — the context in which this pattern lives.
Seen in¶
- sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — canonical Dropbox Dash feature-store realization (batch + streaming + direct writes).