PATTERN
Change-detection ingestion¶
Definition¶
Change-detection ingestion is an optimization that compares each newly computed record against its current value in the online store before writing, and suppresses the write when the value is unchanged. It exploits the empirical observation that update distributions in many production workloads are sparse over the keyspace: only a small fraction of records actually change per batch cycle.
Why it works¶
In a batch ingestion pipeline over a large feature table, the naïve approach rewrites every record on every run. But:
- Most features are slow-moving. User-level aggregates, document embeddings, corpus statistics — they don't change most of the time.
- Online stores (DynamoDB-class key-value stores, Dynovault) charge and rate-limit on write volume, not compute volume.
- Rewriting unchanged values is pure write amplification: if only 1% of records change per cycle, the naïve approach does roughly 100× the necessary writes, with matching multiples of replication lag, cache-invalidation churn, and wake-ups for downstream consumers watching for changes.
Detecting unchanged records and dropping them collapses write volume to the actual change rate.
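The core loop can be sketched in a few lines. This is an illustrative sketch only; `store` stands in for a dict-like online store (a real DynamoDB-class store would need a per-key read, or a cached prior value, to make the comparison):

```python
def ingest(batch, store):
    """Compare-before-write: suppress writes for unchanged values."""
    written = skipped = 0
    for key, new_value in batch:
        if key in store and store[key] == new_value:
            skipped += 1              # unchanged: drop the write
        else:
            store[key] = new_value    # new or changed: write through
            written += 1
    return written, skipped
```

With a sparse update distribution, `skipped` dominates `written`, and write volume collapses to the actual change rate.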
Dropbox Dash numbers¶
Canonical measurement from the 2025-12-18 post:
"By recognizing that only 1–5% of feature values change in a typical 15-minute window, we were able to dramatically reduce write volumes and ingestion time. This shift turned hour-long batch cycles into five-minute updates, improving freshness without increasing system load."
- Change rate: 1–5% per 15-minute window.
- Write volume: hundreds of millions → under 1 million records per run (~100× reduction).
- Run time: >1 hour → <5 minutes (~12× reduction).
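The quoted write-volume reduction is consistent with back-of-the-envelope arithmetic. The table size below is an assumed round number at the low end of the stated change rate; Dropbox does not disclose the exact figures:

```python
# Assumed figures: 100M-record table, 1% change rate per window.
total_records = 100_000_000
change_rate = 0.01                                 # low end of 1-5%
writes_per_run = int(total_records * change_rate)  # 1,000,000 writes
reduction = total_records // writes_per_run        # 100x fewer writes
```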
(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)
Implementation shapes¶
The post doesn't spell out the detection mechanism, but common implementations include:
- Content hash diff — hash the newly-computed record, hash the current-stored value, write only if hashes differ. Cheap per record; requires a current-value read per record or a cached prior hash.
- Incremental computation — the compute pipeline itself tracks which inputs changed and only recomputes downstream records whose upstream changed (DAG-level).
- Upstream CDC filter — use a change-data-capture stream (concepts/change-data-capture) on the source systems to determine which records can possibly have changed, and only touch those. Adjacent lineage to this pattern.
- Timestamp/version columns — if records carry a `last_updated` timestamp or version that can be compared pre-write.
Each shape makes a different trade among read cost, compute cost, and correctness guarantees. The Dropbox post doesn't disclose the specific choice.
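As one concrete shape, a content-hash diff with a cached side table of prior hashes avoids the per-record read entirely. A hypothetical sketch, not Dropbox's disclosed design (`prior_hashes` and the canonical-JSON hashing are assumptions):

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    # Canonical JSON so logically equal records hash identically.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

def diff_batch(batch: dict, prior_hashes: dict) -> list:
    """Return only the (key, record) pairs whose hash differs from the
    cached hash of the last written value."""
    to_write = []
    for key, record in batch.items():
        h = content_hash(record)
        if prior_hashes.get(key) != h:     # new key or changed content
            to_write.append((key, record))
            prior_hashes[key] = h          # cache the hash we just wrote
    return to_write
```

Canonical serialization (sorted keys, fixed separators) matters: without it, two logically equal records can hash differently and produce spurious writes.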
Freshness / cost / complexity trade-off¶
Applying change detection:
- ✅ Collapses write amplification (cost win).
- ✅ Shortens batch-cycle run time (freshness win).
- ✅ Reduces downstream churn (cache invalidations, watcher wake-ups).
- ❌ Adds a read or hash-compare cost per record per run.
- ❌ Introduces a correctness risk if the detection logic misses a change (an under-write). The detector must be strictly safe: false positives (unnecessary writes) are acceptable, but false negatives (missed changes) never are.
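That safety rule can be made structural: suppress a write only when the comparison provably succeeded, and resolve any uncertainty to a write. A minimal sketch (`read_current` is a hypothetical callable for fetching the stored value):

```python
def should_write(key, new_value, read_current) -> bool:
    """Strictly safe change detector: any uncertainty resolves to a
    write (a harmless false positive), so no change is silently lost."""
    try:
        current = read_current(key)
    except Exception:
        return True                  # read failed: cannot prove unchanged
    if current is None:
        return True                  # never stored: must write
    return current != new_value      # provably equal -> suppress the write
```

The fail-open default is the design point: a store outage degrades the optimization back to the naïve full rewrite instead of corrupting the online store.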
Related¶
- patterns/hybrid-batch-streaming-ingestion — change detection lives in the batch lane of that pattern.
- concepts/feature-freshness — the freshness dimension this optimization improves.
- concepts/change-data-capture — adjacent upstream pattern; a CDC stream is one way to produce the "which records might have changed" signal.
- systems/dash-feature-store — the canonical instance.
Seen in¶
- sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — Dropbox's 15-minute / 1–5% change-rate observation collapsing batch write volume ~100× and run time ~12× for systems/dash-feature-store.