Dash Feature Store¶
Definition¶
Dash Feature Store is Dropbox's internal ML concepts/feature-store powering ranking in Dropbox Dash. A single Dash query fans out across many candidate documents, each requiring dozens of behavioural and contextual features — thousands of parallel feature lookups under a sub-100ms budget. The system is an explicit hybrid of open-source and internal components, built because off-the-shelf feature stores couldn't bridge Dropbox's on-prem-serving + cloud-Spark split. (Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)
Architecture¶
Three-layer composition:
- Feast — orchestration layer: feature definitions and serving APIs; ML engineers write PySpark transformations while the framework handles storage/serving plumbing. The adapter ecosystem was the load-bearing reason for the choice (the DynamoDB adapter plumbs into Dynovault); see the sketch below this list.
- Spark + cloud storage — offline indexing, feature computation, batch pipelines.
- Dynovault — online serving tier; Dropbox-internal DynamoDB-compatible store; co-located with inference workloads so feature lookups avoid public-internet round-trips; ~20ms client-side latency.
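To make the Feast layer concrete, here is a hypothetical feature-view definition of the kind ML engineers write in this setup. The entity, view, and column names are invented, and the online store (Dynovault behind Feast's DynamoDB adapter) is wired up in feature_store.yaml rather than in Python.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# feature_store.yaml (not shown) points Feast's DynamoDB online-store adapter
# at Dynovault's DynamoDB-compatible API; that wiring is config, not code.

# Hypothetical entity: a candidate document in Dash ranking.
document = Entity(name="document", join_keys=["document_id"])

# Hypothetical behavioural features; in the real pipeline the source would be
# a Spark-produced table, so a local parquet file stands in here.
doc_engagement = FeatureView(
    name="doc_engagement",
    entities=[document],
    ttl=timedelta(days=1),
    schema=[
        Field(name="clicks_7d", dtype=Int64),
        Field(name="ctr_30d", dtype=Float32),
    ],
    source=FileSource(
        path="data/doc_engagement.parquet",
        timestamp_field="event_timestamp",
    ),
)
```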
On top of this stack, the feature serving layer was rewritten from Python (Feast SDK) to Go — the public-facing HTTP / RPC endpoint that ranker servers call for feature lookups.
Feast-Python → Go serving rewrite¶
Initial serving on Feast's Python SDK used parallelism, but profiling showed CPU-bound JSON parsing + Python's GIL as the dominant bottleneck at high concurrency. Multi-process Python temporarily improved latency but "introduced coordination overhead that limited scalability."
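Not Dropbox's code, and the post doesn't name the exact parallelism primitive, but a thread-pool fan-out of roughly this shape illustrates the ceiling: the store reads can overlap, while the CPU-bound json.loads calls all contend for the GIL.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def online_store_get(doc_id: str) -> str:
    """Stand-in for the online-store read; returns a JSON feature blob."""
    return '{"doc_id": "%s", "clicks_7d": 12, "last_opened_age_s": 840}' % doc_id

def fetch_features(doc_id: str) -> dict:
    # The network read can release the GIL, but json.loads is pure CPU work
    # that holds it, so decoding serializes across threads at high fan-out.
    return json.loads(online_store_get(doc_id))

def fetch_all(doc_ids: list[str]) -> list[dict]:
    with ThreadPoolExecutor(max_workers=64) as pool:
        return list(pool.map(fetch_features, doc_ids))
```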
Go rewrite:
- Goroutines give true concurrency without GIL serialisation.
- Shared memory and faster JSON parsing reduce CPU overhead.
- Outcome: thousands of requests per second, ~5–10ms overhead on top of Dynovault's ~20ms client latency, p95 ~25–35ms end-to-end.
Canonical instance of patterns/language-rewrite-for-concurrency, the second in this wiki after Aurora DSQL's JVM → Rust journey; different language, same pattern (escaping the concurrency ceiling of the starting language).
(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)
Three-lane ingestion¶
Dropbox split feature ingestion into three complementary lanes — the canonical realisation of patterns/hybrid-batch-streaming-ingestion:
1. Batch ingestion¶
Complex high-volume transformations atop a medallion architecture (raw → refined layers). The lane where change detection lives:
Only 1–5% of feature values change in a typical 15-minute window. Detecting unchanged records before the online-store write reduced write volume from hundreds of millions → under 1 million records per run and update time from >1 hour → <5 minutes.
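The post reports the outcome but not the mechanism (see Caveats). One common diff-and-suppress shape is a fingerprint anti-join; a PySpark sketch under that assumption, with hypothetical paths and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-change-detection").getOrCreate()

# Hypothetical locations: the refined-layer snapshot for this run and the
# snapshot last materialized to the online store (schemas assumed identical).
current = spark.read.parquet("s3://example-bucket/refined/doc_features/")
previous = spark.read.parquet("s3://example-bucket/materialized/doc_features/")

feature_cols = [c for c in current.columns if c != "document_id"]

# Fingerprint the feature payload so unchanged rows can be dropped.
fingerprint = F.sha2(
    F.concat_ws("||", *[F.col(c).cast("string") for c in feature_cols]), 256
)

# Keep only rows that are new or whose values changed since the last run;
# per the post, that is typically 1-5% of the snapshot in a 15-minute window.
changed = (
    current.withColumn("fp", fingerprint)
    .join(
        previous.withColumn("fp", fingerprint).select("document_id", "fp"),
        on=["document_id", "fp"],
        how="left_anti",
    )
    .drop("fp")
)

changed.write.mode("overwrite").parquet("s3://example-bucket/to_materialize/doc_features/")
```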
2. Streaming ingestion¶
Near-real-time signals (collaboration activity, content interactions), so that "features stay aligned with what users are doing in the moment": opening a doc or joining a Slack channel should surface in the next search within seconds.
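How the streaming lane is implemented isn't stated; assuming it goes through a Feast push source (hypothetical name below), the hot path from event to online feature could look roughly like this.

```python
import pandas as pd
from feast import FeatureStore
from feast.data_source import PushMode

store = FeatureStore(repo_path=".")

# Hypothetical event: a user just opened a document; column names are invented.
event = pd.DataFrame(
    {
        "document_id": ["doc_123"],
        "opens_last_hour": [4],
        "event_timestamp": [pd.Timestamp.now(tz="UTC")],
    }
)

# Push straight to the online store so the next Dash query sees the fresh value.
store.push("doc_activity_push_source", event, to=PushMode.ONLINE)
```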
3. Direct writes¶
Escape hatch for lightweight or precomputed features. Example: relevance scores from a separate LLM evaluation pipeline written directly to Dynovault in seconds, bypassing the batch cycle entirely.
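Whether these direct writes go through Feast's API or straight to Dynovault isn't specified; a sketch of the Feast-mediated variant, with a hypothetical feature view holding the LLM relevance scores:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Hypothetical precomputed scores from a separate LLM evaluation pipeline.
scores = pd.DataFrame(
    {
        "document_id": ["doc_123", "doc_456"],
        "llm_relevance_score": [0.87, 0.42],
        "event_timestamp": [pd.Timestamp.now(tz="UTC")] * 2,
    }
)

# Bypass the batch cycle entirely: write rows straight into the online store
# (Dynovault behind the DynamoDB adapter).
store.write_to_online_store(feature_view_name="doc_llm_relevance", df=scores)
```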
(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)
Operational numbers¶
| Dimension | Value |
|---|---|
| Overall latency budget | sub-100ms |
| Dynovault client latency | ~20ms |
| Go service added overhead | ~5–10ms |
| End-to-end p95 | ~25–35ms |
| Go service throughput | thousands of req/s |
| Typical feature change rate (15-min window) | 1–5% |
| Batch write volume (before → after change detection) | hundreds of millions → <1M records/run |
| Batch run time (before → after change detection) | >1 hour → <5 minutes |
Design rationale¶
Why not off-the-shelf?¶
- Infrastructure is split: low-latency service-to-service serving on-prem vs. Spark-native feature engineering in the cloud. Standard cloud-native feature stores didn't bridge both.
- The sub-100ms budget with thousands of parallel reads per query exceeded what vendor offerings could meet at Dropbox's workload shape and cost.
- Dynovault already existed as an internal primitive — the DynamoDB adapter in Feast made composition straightforward.
Why not fully in-house?¶
- Feast's clean separation of feature definitions from infrastructure concerns meant ML engineers could focus on PySpark transforms, not serving plumbing — worth inheriting.
- Feast's adapter ecosystem was the cheapest path to plug Dynovault in as the online tier.
"Middle path between building everything from scratch and adopting off-the-shelf systems wholesale" — the same framing Dropbox uses for Nucleus (Rust on top of upstream libraries), Robinhood (PID on top of xDS/Envoy/gRPC), and the 7th-gen hardware rollout (supplier co- development on standard chips).
Relationship to the Dash stack¶
- Upstream: signals ingested from content + collaboration events, plus relevance scores from LLM evaluation pipelines (via the direct-write lane).
- Lateral: Dash Search Index is the unified retrieval substrate; the feature store supplies the ranking features layered over retrieval results.
- Downstream: Dash's ranker consumes features during candidate-document scoring; LLM answer generation also depends on the retrieved + ranked context.
- Compute co-location: Dynovault sits near the inference tier (including the Gumby / Godzilla GPU tiers) — avoiding public-internet latency is the design assumption.
Caveats¶
- Latency numbers are ranges, not distributions. No p99/p99.9 reported.
- Dynovault internals (consistency, sharding, replication, fleet size) not described in this post.
- Change detection is presented through its outcome (the write-volume collapse); the exact diff-and-suppress mechanism isn't described.
- Batch/streaming/direct-write assignment rule per feature isn't worked through.
- No before/after ranking-quality numbers for the freshness axis (the post asserts freshness matters, doesn't quantify).
Seen in¶
- sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — the canonical post describing the feature-store architecture, Python→Go rewrite, Dynovault co-location, and three-lane ingestion.