

Dash Feature Store

Definition

Dash Feature Store is Dropbox's internal concepts/feature-store for ML, powering ranking in Dropbox Dash. A single Dash query fans out across many candidate documents, each requiring dozens of behavioural and contextual features — thousands of parallel feature lookups under a sub-100ms budget. The system is an explicit hybrid of open-source and internal components, built because off-the-shelf feature stores couldn't bridge Dropbox's on-prem-serving + cloud-Spark split. (Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)

Architecture

Three-layer composition:

  1. Feast — orchestration layer: feature definitions and serving APIs. ML engineers write PySpark transformations; the framework handles the storage/serving plumbing. The adapter ecosystem was the load-bearing reason for the choice (the DynamoDB adapter plumbs into Dynovault).
  2. Spark + cloud storage — offline indexing, feature computation, batch pipelines.
  3. Dynovault — online serving tier; Dropbox-internal DynamoDB-compatible store; co-located with inference workloads so feature lookups avoid public-internet round-trips; ~20ms client-side latency.

On top of this stack, the feature serving layer was rewritten from Python (Feast SDK) to Go — the public-facing HTTP / RPC endpoint that ranker servers call for feature lookups.

Feast-Python → Go serving rewrite

Initial serving ran on Feast's Python SDK with in-process parallelism, but profiling showed CPU-bound JSON parsing, serialised by Python's GIL, as the dominant bottleneck at high concurrency. Multi-process Python temporarily improved latency but "introduced coordination overhead that limited scalability."

Go rewrite:

  • Goroutines give true concurrency without GIL serialisation.
  • Shared memory and faster JSON parsing reduce CPU overhead.
  • Outcome: thousands of requests per second, ~5–10ms overhead on top of Dynovault's ~20ms client latency, p95 ~25–35ms end-to-end.

Canonical instance of patterns/language-rewrite-for-concurrency, second in this wiki after Aurora DSQL's JVM → Rust journey; different language, same pattern (escape the concurrency ceiling of the starting language).

(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)

Three-lane ingestion

Dropbox split feature ingestion into three complementary lanes — the canonical realisation of patterns/hybrid-batch-streaming-ingestion:

1. Batch ingestion

Complex high-volume transformations atop a medallion architecture (raw → refined layers). The lane where change detection lives:

Only 1–5% of feature values change in a typical 15-minute window. Detecting unchanged records before the online-store write reduced write volume from hundreds of millions → under 1 million records per run and update time from >1 hour → <5 minutes.

2. Streaming ingestion

Near-real-time signals (collaboration activity, content interactions). "Features stay aligned with what users are doing in the moment" — opening a doc, joining a Slack channel, etc. should surface in the next search within seconds.

3. Direct writes

Escape hatch for lightweight or precomputed features. Example: relevance scores from a separate LLM evaluation pipeline written directly to Dynovault in seconds, bypassing the batch cycle entirely.

(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)

Operational numbers

Dimension                                   Value
Overall latency budget                      sub-100ms
Dynovault client latency                    ~20ms
Go service added overhead                   ~5–10ms
End-to-end p95                              ~25–35ms
Go service throughput                       thousands of req/s
Feature change rate per 15-min window       1–5%
Batch write volume after change detection   hundreds of millions → <1M records/run
Batch run time after change detection       >1 hour → <5 minutes

Design rationale

Why not off-the-shelf?

  • Infrastructure is split: on-prem low-latency service-to-service serving plus Spark-native cloud feature engineering. Standard cloud-native feature stores didn't bridge both.
  • Sub-100ms budget with thousands of parallel reads per query exceeded vendor offerings at Dropbox's shape/cost.
  • Dynovault already existed as an internal primitive — the DynamoDB adapter in Feast made composition straightforward.

Why not fully in-house?

  • Feast's clean separation of feature definitions from infrastructure concerns meant ML engineers could focus on PySpark transforms, not serving plumbing — worth inheriting.
  • Feast's adapter ecosystem was the cheapest path to plug Dynovault in as the online tier.

"Middle path between building everything from scratch and adopting off-the-shelf systems wholesale" — the same framing Dropbox uses for Nucleus (Rust on top of upstream libraries), Robinhood (PID on top of xDS/Envoy/gRPC), and the 7th-gen hardware rollout (supplier co-development on standard chips).

Relationship to the Dash stack

  • Upstream: signals ingested from content + collaboration events and from LLM evaluation pipelines.
  • Lateral: Dash Search Index is the unified retrieval substrate; the feature store supplies the ranking features layered over retrieval results.
  • Downstream: Dash's ranker consumes features during candidate-document scoring; LLM answer generation also depends on the retrieved + ranked context.
  • Compute co-location: Dynovault sits near the inference tier (including the Gumby / Godzilla GPU tiers) — avoiding public-internet latency is the design assumption.

Caveats

  • Latency numbers are ranges, not distributions. No p99/p99.9 reported.
  • Dynovault internals (consistency, sharding, replication, fleet size) not described in this post.
  • Change detection is presented only as an outcome (the collapse in write volume); the exact diff-and-suppress mechanism isn't described.
  • Batch/streaming/direct-write assignment rule per feature isn't worked through.
  • No before/after ranking-quality numbers for the freshness axis (the post asserts freshness matters, doesn't quantify).
