Lyft — Lyft's Feature Store: Architecture, Optimization, and Evolution¶
Summary¶
Rohan Varshney (Lyft Engineering, 2026-01-06) describes Lyft's Feature Store — the shared ML substrate every consuming service and model at Lyft pulls features from — as a "platform of platforms" composed of three ingest/serve lanes (batch, online, streaming) and one unifying online-serving layer (dsfeatures, short for "data science features"). Batch features are defined with Spark SQL + a JSON config; a Python cron service auto-generates an Astronomer-hosted Airflow DAG per config that executes the query, writes to both offline (Hive) and online (dsfeatures) paths, and runs data-quality checks. dsfeatures is an optimized wrapper over three AWS stores: DynamoDB as persistent backing (primary key composed of metadata fields, plus a GSI for GDPR-deletion efficiency); a ValKey write-through LRU cache for ultra-low-latency hot reads; and OpenSearch for embedding features. Streaming features go through Apache Flink reading from Kafka (or sometimes Kinesis) into a dedicated ingest Flink application (spfeaturesingest) that writes to dsfeatures via its WRITE API — guaranteeing uniform metadata and strongly consistent reads regardless of ingestion path. The user-facing API is two SDKs (go-lyft-features, lyft-dsp-features) exposing full CRUD. Governance is metadata-driven: ownership, urgency tiers, carryover/rollup logic, explicit naming and typing, versioning and lineage. Discoverability is outsourced to Amundsen — generated DAGs tag feature metadata automatically so engineers can find existing features before creating duplicates.
Key takeaways¶
- Feature Store as "platform of platforms" with three lanes.
The system is not one pipeline but three complementary lanes —
batch (Hive-table-defined features, SparkSQL-computed,
typically daily cadence), online (dsfeatures serving + application-level CRUD), and streaming (Flink from Kafka / Kinesis) — all converging on the same online serving surface. This is Lyft's concrete realization of hybrid batch + streaming ingestion applied to a feature store, with the novel axis that an on-demand / direct-CRUD lane is exposed via the same SDK, not bolted on behind the pipeline. (Source: sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution)
- Config-driven DAG generation as the dominant feature-onboarding UX. Customers declare a feature with two files — a Spark SQL query + a JSON metadata config — and a Python cron service reads the configs and auto-generates an Astronomer-hosted Airflow DAG. Generated DAGs are "production-ready out-of-the-box" and handle (1) SparkSQL execution, (2) writes to both offline + online paths, (3) integrated data-quality checks, (4) Amundsen metadata tagging for discoverability. This is config-driven DAG generation — the platform owns the pipeline boilerplate; customers own the feature-specific SQL + metadata only.
- dsfeatures is a wrapper over three AWS stores chosen for access-pattern fit, not a single database. The online serving layer deliberately composes:
  - DynamoDB as persistent backing store — metadata fields as primary key, with a Global Secondary Index (GSI) for GDPR-deletion efficiency (so "delete all features for user X" is bounded-cost).
  - ValKey write-through LRU cache on top — ultra-low-latency reads for hot/frequent (meta)data with a "generous TTL." The write-through shape keeps DynamoDB authoritative: writes land in both in order, so a cache miss always has a correct DynamoDB read to fall back to.
  - OpenSearch for embedding features specifically, because kNN / vector-similarity retrieval is OpenSearch's native shape and not DynamoDB's.
  This is the wrapper-over-heterogeneous-stores pattern: expose one SDK, route to the right store internally by feature type.
- Streaming lane has an explicit ingest-service choke point (spfeaturesingest). Customer Flink applications don't write to the online store directly — they emit feature payloads into the central spfeaturesingest Flink application ("Streaming Platform feature ingest"), which owns (de)serialization and dsfeatures WRITE API interaction. The uniform-metadata invariant — "regardless of the ingestion method (batch, streaming, or on-demand), the Feature Store maintains uniform metadata and strongly consistent reads" — is what motivates the choke point: you can't have uniform metadata if each producer writes its own way.
- SDK-first access with full CRUD, not read-only. Two SDKs (go-lyft-features in Go, lyft-dsp-features in Python) wrap the dsfeatures API. Most calls are Get/BatchGet, but the SDKs expose full Create/Read/Update/Delete, which lets internal Airflow DAGs write as part of the batch path and lets customers manage real-time features ad hoc without going through the ingestion pipeline. This is the third lane — direct SDK writes — formalised as a first-class entry point.
- Governance is metadata-first; discoverability is outsourced to Amundsen. Every feature's JSON config carries ownership + urgency tier + carryover/rollup logic + explicit naming and data-typing + versioning semantics + lineage. The versioning rule is named: "If the SQL or expected feature behavior undergoes business logic changes, a version bump is expected." The generated DAGs automatically tag Amundsen with this metadata, so discoverability is a pipeline side-effect, not a separate ceremony. The explicit goal is preventing duplicate-feature work.
- Kyte accelerates the feature-engineering loop. Lyft's homegrown Kyte environment ("Airflow local development at Lyft", originally presented at Airflow Summit 2022) provides a CLI that lets developers validate configurations, test SQL runs locally, execute DAG runs in a local environment, and confidently backfill previous dates against their DAGs. Feature development is primarily a SQL + JSON loop; Kyte optimises that loop rather than inventing a DSL.
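The config-to-DAG step above can be sketched as a small codegen function. This is a minimal sketch, assuming a hypothetical config schema and DAG body — the post does not publish either, so every field name (`feature_name`, `urgency_tier`, `sql_file`, …) and the generated DAG's shape are illustrative:

```python
import json

# Hypothetical feature config. Field names are illustrative,
# not Lyft's actual schema.
CONFIG = json.loads("""
{
  "feature_name": "rider_7d_trip_count",
  "version": 2,
  "owner": "pricing-team",
  "urgency_tier": "tier_1",
  "schedule": "@daily",
  "sql_file": "rider_7d_trip_count.sql"
}
""")

# Skeleton of a generated Airflow DAG file. The real generated DAGs also
# wire up SparkSQL execution, dual offline/online writes, data-quality
# checks, and Amundsen tagging; here those are placeholder comments.
DAG_TEMPLATE = '''\
from airflow import DAG
from datetime import datetime

with DAG(
    dag_id="features__{name}_v{version}",
    schedule="{schedule}",
    start_date=datetime(2026, 1, 1),
    tags=["feature-store", "owner:{owner}", "{tier}"],
) as dag:
    # 1. execute SparkSQL from {sql_file}
    # 2. write offline (Hive) + online (dsfeatures)
    # 3. run data-quality checks
    # 4. tag Amundsen with feature metadata
    ...
'''

def generate_dag(config: dict) -> str:
    """Render one Airflow DAG source file from one feature config."""
    return DAG_TEMPLATE.format(
        name=config["feature_name"],
        version=config["version"],
        schedule=config["schedule"],
        owner=config["owner"],
        tier=config["urgency_tier"],
        sql_file=config["sql_file"],
    )

dag_source = generate_dag(CONFIG)
print(dag_source)
```

A cron loop over a directory of such configs, writing one file per feature, is all the "DAG factory" needs to be — the customer never touches Airflow code directly.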
Architecture¶
The three lanes¶
┌─────────────────────────────────┐
│ Customers │
│ (services / Airflow / apps) │
└────────┬────────────────┬───────┘
│ SDK calls │ SDK CRUD
▼ ▼
┌────────────────────────────────┐
│ dsfeatures │
│ (online serving + metadata) │
└────────┬─────────┬──────────┬──┘
│ │ │
┌──────────▼──┐ ┌───▼────┐ ┌──▼───────────┐
│ DynamoDB │ │ValKey │ │ OpenSearch │
│ (backing) │ │(cache) │ │ (embeddings) │
└─────────────┘ └────────┘ └──────────────┘
▲
┌────────────────┼───────────────────────┐
│ │ │
│ ┌──────┴───────┐ ┌──────┴──────────┐
│ │ Generated │ │spfeaturesingest │
│ │ Airflow DAG │ │ (Flink) │
│ │ (SparkSQL) │ └────────┬────────┘
│ └──────┬───────┘ │
│ │ │ payload
│ ┌──────┴───────┐ ┌──────┴──────────┐
│ │ Hive offline │ │ customer Flink │
│ │ tables │ │ apps (K/K) │
│ └──────────────┘ └─────────────────┘
│ ▲ ▲
│ │ │
│ Customers define feature with SQL query + JSON config
│ │
│ ┌─────────┴─────────┐
│ │ Kafka / Kinesis │
│ └───────────────────┘
Batch lane¶
- Customer writes a SparkSQL query + a JSON config (feature metadata).
- Python cron service reads configs → generates an Astronomer-hosted Airflow DAG.
- DAG executes query on Spark against Hive tables, generating a dataframe.
- Output written to both paths:
- Offline: Hive tables (for training + analysis).
- Online: translated and sent to dsfeatures.
- Integrated data-quality checks and Amundsen tagging happen in the same DAG.
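For concreteness, the two customer-owned files might look like the following — both the query and the config fields are hypothetical, since the post does not publish the actual config schema:

```sql
-- rider_7d_trip_count.sql: the SparkSQL the customer owns (illustrative)
SELECT
    rider_id,
    COUNT(*) AS trip_count_7d
FROM core.trips
WHERE ds >= date_sub(current_date(), 7)
GROUP BY rider_id
```

```json
{
  "feature_name": "rider_7d_trip_count",
  "version": 2,
  "owner": "pricing-team",
  "urgency_tier": "tier_1",
  "schedule": "@daily",
  "sql_file": "rider_7d_trip_count.sql"
}
```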
Online serving layer (dsfeatures)¶
- DynamoDB — persistent source; composite metadata PK; GSI for GDPR-deletion efficiency.
- ValKey write-through LRU cache — hot-data low-latency path; "generous TTL".
- OpenSearch — embedding features only (requires specialized indexing/retrieval).
SDKs (go-lyft-features, lyft-dsp-features) expose full CRUD;
most calls are Get / BatchGet.
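The DynamoDB-plus-ValKey layering above can be sketched as a write-through cache with TTL and bounded capacity. This is an illustrative model, not Lyft's implementation — dicts stand in for both stores, and the eviction here is a FIFO approximation of LRU. The invariant it demonstrates is the one the post relies on: every write lands in the backing store, so a cache miss always has a correct fallback read.

```python
import time

class WriteThroughStore:
    """Illustrative write-through cache over an authoritative backing store."""

    def __init__(self, cache_capacity: int = 2, ttl_seconds: float = 3600.0):
        self.backing = {}           # stands in for DynamoDB (authoritative)
        self.cache = {}             # stands in for ValKey: key -> (value, expiry)
        self.capacity = cache_capacity
        self.ttl = ttl_seconds      # the post's "generous TTL" (no number given)

    def put(self, key, value):
        self.backing[key] = value   # authoritative write first...
        self._cache_set(key, value) # ...then write through to the cache

    def get(self, key):
        hit = self.cache.get(key)
        if hit is not None and hit[1] > time.monotonic():
            return hit[0]           # hot path: fresh cache hit
        value = self.backing.get(key)  # miss/expired: fall back to backing store
        if value is not None:
            self._cache_set(key, value)
        return value

    def _cache_set(self, key, value):
        if key not in self.cache and len(self.cache) >= self.capacity:
            # LRU eviction approximated by insertion order (dicts preserve it)
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = (value, time.monotonic() + self.ttl)

store = WriteThroughStore(cache_capacity=2)
store.put("rider:42/trip_count_7d", 13)
print(store.get("rider:42/trip_count_7d"))  # prints 13, served from cache
```

The write ordering (backing store first, cache second) is what makes the cache safely lossy: evictions and expiries only cost latency, never correctness.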
Streaming lane¶
- Customer Flink apps read events from Kafka (sometimes Kinesis).
- Customer Flink performs initial transformations + metadata creation + value formatting.
- Customer sinks feature payloads into spfeaturesingest — Lyft's central Flink ingest app.
- spfeaturesingest handles (de)serialization and calls the dsfeatures WRITE API.
Uniform-metadata invariant: "Regardless of the ingestion method (batch, streaming, or on-demand), the Feature Store maintains uniform metadata and strongly consistent reads."
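One way to read the invariant: every lane funnels through a single payload shape before the WRITE API is called. The sketch below is an assumption — the actual envelope schema and field names are not published — but it shows the mechanism: a shared normalizer that rejects writes missing required metadata, regardless of which lane produced them.

```python
import json
import time

# Hypothetical required-metadata set; the real schema is not disclosed.
REQUIRED_METADATA = {"feature_name", "version", "owner", "entity_id", "ingested_via"}

def to_write_payload(feature_name: str, version: int, owner: str,
                     entity_id: str, value, lane: str) -> str:
    """Normalize any producer's feature write into one uniform envelope."""
    envelope = {
        "feature_name": feature_name,
        "version": version,
        "owner": owner,
        "entity_id": entity_id,
        "ingested_via": lane,        # "batch" | "streaming" | "on_demand"
        "ingested_at": time.time(),
        "value": value,
    }
    missing = REQUIRED_METADATA - envelope.keys()
    if missing:
        # Uniformity is enforced at the choke point, not trusted per-producer.
        raise ValueError(f"incomplete metadata: {sorted(missing)}")
    return json.dumps(envelope)

payload = to_write_payload("rider_7d_trip_count", 2, "pricing-team",
                           "rider:42", 13, lane="streaming")
```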
Direct-CRUD lane¶
- Internal DAGs write via SDK.
- Customers manage real-time / ad-hoc features by invoking SDK CRUD directly.
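A sketch of the CRUD surface such an SDK might expose. The real go-lyft-features / lyft-dsp-features APIs are not public, so every method name and signature here is illustrative; an in-memory dict stands in for dsfeatures.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureClient:
    """Hypothetical client mirroring the Get/BatchGet + full-CRUD surface."""
    _store: dict = field(default_factory=dict)

    def create(self, entity_id: str, feature: str, value, version: int = 1):
        self._store[(entity_id, feature, version)] = value

    def get(self, entity_id: str, feature: str, version: int = 1):
        return self._store.get((entity_id, feature, version))

    def batch_get(self, entity_id: str, features: list, version: int = 1):
        # The common read path: many features for one entity in one call.
        return {f: self.get(entity_id, f, version) for f in features}

    def update(self, entity_id: str, feature: str, value, version: int = 1):
        self._store[(entity_id, feature, version)] = value

    def delete(self, entity_id: str, feature: str, version: int = 1):
        self._store.pop((entity_id, feature, version), None)

client = FeatureClient()
client.create("rider:42", "trip_count_7d", 13)
client.create("rider:42", "cancel_rate_30d", 0.04)
print(client.batch_get("rider:42", ["trip_count_7d", "cancel_rate_30d"]))
```

The point of exposing the write half is exactly the third lane: a batch DAG and an ad-hoc service write go through the same client, so they pick up the same metadata handling.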
Operational details worth noting¶
- Primary key design: metadata fields composed as PK; explicit GSI for GDPR deletion. Bounded-cost user-data deletion is the motivating requirement — named explicitly in the post, a concrete ergonomics-for-the-ops-team decision baked into the schema.
- "Generous TTL" on the ValKey cache — the post does not give a number but emphasizes that hot metadata benefits disproportionately from long-lived cache.
- OpenSearch-for-embeddings isolation: embedding features do not live in DynamoDB. This is an early shape-fit decision — not every KV is a KV at the access pattern level.
- SparkSQL + JSON deliberately chosen over DSL: "We learned early on that our core personas are particularly proficient in SQL and place a high value on quick iteration." User-interface design choice optimising for adoption over expressiveness.
- No custom feature DSL: the config file is plain JSON; the transformation language is SparkSQL. Complexity is absorbed by the platform's codegen step (Python cron → Airflow DAG).
- Versioning rule named: "If the SQL or expected feature behavior undergoes business logic changes, a version bump is expected." Schema vs. semantic change is the cut.
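The PK + GSI point can be made concrete with a toy model. The actual key schema is not published, so the composite key and index shape below are assumptions; what the sketch shows is why the GSI makes GDPR deletion bounded-cost: "delete all features for user X" becomes one index lookup plus targeted deletes, proportional to that user's items rather than to table size.

```python
class FeatureTable:
    """Toy model of a table with a composite PK and a user-keyed GSI."""

    def __init__(self):
        # PK: (feature_name, version, user_id) -> value  (fields assumed)
        self.items = {}
        # GSI: user_id -> set of PKs, maintained on every write
        self.user_gsi = {}

    def put(self, feature_name: str, version: int, user_id: str, value):
        pk = (feature_name, version, user_id)
        self.items[pk] = value
        self.user_gsi.setdefault(user_id, set()).add(pk)

    def delete_user(self, user_id: str):
        """GDPR deletion: cost proportional to the user's items, not a scan."""
        for pk in self.user_gsi.pop(user_id, set()):
            self.items.pop(pk, None)

table = FeatureTable()
table.put("trip_count_7d", 1, "user_a", 10)
table.put("cancel_rate_30d", 1, "user_a", 0.1)
table.put("trip_count_7d", 1, "user_b", 3)
table.delete_user("user_a")
```

Without the index, the same deletion would require scanning every item to find the user's rows — the "bounded-cost" framing in the post is precisely this asymmetry.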
Caveats¶
- No absolute performance numbers. The post frequently uses "ultra-low-latency", "low-latency", "real-time", but gives no p50/p99, QPS, cache hit-rate, or cost numbers. This is a narrative post, not a performance retrospective.
- No failure modes disclosed. The strongly-consistent-reads guarantee is asserted ("uniform metadata and strongly consistent reads" across lanes) without walking through how it's achieved under partial failure (e.g. DynamoDB write succeeds, ValKey cache invalidation fails).
- Raw file truncated after the Amundsen section. The post evidently continues into operational workflow improvements / case studies, but the markdown body ends at the "Building Real-time Machine Learning Foundations at Lyft" image caption — additional content beyond Amundsen is not available on the wiki from this ingest.
- spfeaturesingest internals not disclosed. We know it reads payloads, (de)serializes, and writes via the dsfeatures WRITE API — but not its parallelism model, ordering guarantees under rebalance, or DLQ behaviour.
- ValKey + DynamoDB consistency under write failure isn't spelled out. "Write-through LRU" describes the steady-state path; the post doesn't discuss what happens if the cache write fails after the DynamoDB write succeeds (stale cache? rollback?).
Source¶
- Original: https://eng.lyft.com/lyfts-feature-store-architecture-optimization-and-evolution-7835f8962b99?source=rss----25cd379abb8---4
- Raw markdown:
raw/lyft/2026-01-06-lyfts-feature-store-architecture-optimization-and-evolution-615733d0.md
Related¶
- companies/lyft
- systems/lyft-feature-store — the platform described here.
- systems/lyft-dsfeatures — the online serving layer.
- systems/valkey
- systems/apache-flink
- systems/apache-airflow
- systems/amundsen
- systems/dynamodb
- systems/opensearch
- systems/kafka
- systems/amazon-kinesis-data-streams
- systems/apache-hive
- systems/apache-spark
- concepts/feature-store
- concepts/feature-freshness
- concepts/training-serving-boundary — feature-store shape as a boundary-crossing primitive (unifies values across training + serving fleets even when compute is split).
- concepts/write-through-cache
- concepts/feature-discoverability
- patterns/hybrid-batch-streaming-ingestion — Lyft is a second canonical instance (Dropbox Dash being the first).
- patterns/config-driven-dag-generation
- patterns/batch-plus-streaming-plus-ondemand-feature-serving
- patterns/wrapper-over-heterogeneous-stores-as-serving-layer