

Dash Feature Store

Definition

Dash Feature Store is Dropbox's internal concepts/feature-store for ML, powering ranking in Dropbox Dash. A single Dash query fans out across many candidate documents, each requiring dozens of behavioural and contextual features — thousands of parallel feature lookups under a sub-100ms budget. The system is an explicit hybrid of open-source and internal components, built because off-the-shelf feature stores couldn't bridge Dropbox's on-prem-serving + cloud-Spark split. (Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)

Architecture

Three-layer composition:

  1. Feast — orchestration layer: feature definitions and serving APIs. ML engineers write PySpark transformations; the framework handles the storage/serving plumbing. The adapter ecosystem was the load-bearing reason for the choice (the DynamoDB adapter plumbs into Dynovault).
  2. Spark + cloud storage — offline indexing, feature computation, batch pipelines.
  3. Dynovault — online serving tier; Dropbox-internal DynamoDB-compatible store; co-located with inference workloads so feature lookups avoid public-internet round-trips; ~20ms client-side latency.

On top of this stack, the feature serving layer was rewritten from Python (Feast SDK) to Go — the public-facing HTTP / RPC endpoint that ranker servers call for feature lookups.

Feast-Python → Go serving rewrite

Initial serving ran on Feast's Python SDK with in-process parallelism, but profiling showed CPU-bound JSON parsing, serialised by Python's GIL, as the dominant bottleneck at high concurrency. Multi-process Python temporarily improved latency but "introduced coordination overhead that limited scalability."

Go rewrite:

  • Goroutines give true concurrency without GIL serialisation.
  • Shared memory and faster JSON parsing reduce CPU overhead.
  • Outcome: thousands of requests per second, ~5–10ms overhead on top of Dynovault's ~20ms client latency, p95 ~25–35ms end-to-end.

Canonical instance of patterns/language-rewrite-for-concurrency, second in this wiki after Aurora DSQL's JVM → Rust journey; different language, same pattern (escape the concurrency ceiling of the starting language).

(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)

Three-lane ingestion

Dropbox split feature ingestion into three complementary lanes — the canonical realisation of patterns/hybrid-batch-streaming-ingestion:

1. Batch ingestion

Complex high-volume transformations atop a medallion architecture (raw → refined layers). The lane where change detection lives:

Only 1–5% of feature values change in a typical 15-minute window. Detecting unchanged records before the online-store write reduced write volume from hundreds of millions → under 1 million records per run and update time from >1 hour → <5 minutes.

2. Streaming ingestion

Near-real-time signals (collaboration activity, content interactions). "Features stay aligned with what users are doing in the moment" — opening a doc, joining a Slack channel, etc. should surface in the next search within seconds.

3. Direct writes

Escape hatch for lightweight or precomputed features. Example: relevance scores from a separate LLM evaluation pipeline written directly to Dynovault in seconds, bypassing the batch cycle entirely.

(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)

Operational numbers

Dimension                                   Value
Overall latency budget                      sub-100ms
Dynovault client latency                    ~20ms
Go service added overhead                   ~5–10ms
End-to-end p95                              ~25–35ms
Go service throughput                       thousands of req/s
Feature change rate per 15-min window       1–5%
Batch write volume after change detection   hundreds of millions → <1M records/run
Batch run time after change detection       >1 hour → <5 minutes

Design rationale

Why not off-the-shelf?

  • Infrastructure is split: on-prem low-latency service-to-service serving plus Spark-native cloud feature engineering. Standard cloud-native feature stores didn't bridge both.
  • Sub-100ms budget with thousands of parallel reads per query exceeded vendor offerings at Dropbox's shape/cost.
  • Dynovault already existed as an internal primitive — the DynamoDB adapter in Feast made composition straightforward.

Why not fully in-house?

  • Feast's clean separation of feature definitions from infrastructure concerns meant ML engineers could focus on PySpark transforms, not serving plumbing — worth inheriting.
  • Feast's adapter ecosystem was the cheapest path to plug Dynovault in as the online tier.

"Middle path between building everything from scratch and adopting off-the-shelf systems wholesale" — the same framing Dropbox uses for Nucleus (Rust on top of upstream libraries), Robinhood (PID on top of xDS/Envoy/gRPC), and the 7th-gen hardware rollout (supplier co-development on standard chips).

Relationship to the Dash stack

  • Upstream: signals ingested from content + collaboration events and from LLM evaluation pipelines.
  • Lateral: Dash Search Index is the unified retrieval substrate; the feature store supplies the ranking features layered over retrieval results.
  • Downstream: Dash's ranker consumes features during candidate-document scoring; LLM answer generation also depends on the retrieved + ranked context.
  • Compute co-location: Dynovault sits near the inference tier (including the Gumby / Godzilla GPU tiers) — avoiding public-internet latency is the design assumption.

Caveats

  • Latency numbers are ranges, not distributions. No p99/p99.9 reported.
  • Dynovault internals (consistency, sharding, replication, fleet size) not described in this post.
  • Change detection is presented only as an outcome (the collapse in write volume); the exact diff-and-suppress mechanism isn't described.
  • Batch/streaming/direct-write assignment rule per feature isn't worked through.
  • No before/after ranking-quality numbers for the freshness axis (the post asserts freshness matters, doesn't quantify).
