
LYFT 2026-01-06 Tier 2


Lyft — Lyft's Feature Store: Architecture, Optimization, and Evolution

Summary

Rohan Varshney (Lyft Engineering, 2026-01-06) describes Lyft's Feature Store — the shared ML substrate every consuming service and model at Lyft pulls features from — as a "platform of platforms" composed of three ingest/serve lanes (batch, online, streaming) and one unifying online-serving layer (dsfeatures, short for "data science features"). Batch features are defined with Spark SQL + a JSON config; a Python cron service auto-generates an Astronomer-hosted Airflow DAG per config that executes the query, writes to both offline (Hive) and online (dsfeatures) paths, and runs data-quality checks. dsfeatures is an optimized wrapper over three AWS stores: DynamoDB as persistent backing (primary key composed of metadata fields + GSI for GDPR-deletion efficiency); a ValKey write-through LRU cache for ultra-low-latency hot reads; and OpenSearch for embedding features. Streaming features go through Apache Flink reading from Kafka (or sometimes Kinesis) into a dedicated ingest-Flink application (spfeaturesingest) that writes to dsfeatures via its WRITE API — guaranteeing uniform metadata and strongly consistent reads regardless of ingestion path. The user-facing API is two SDKs (go-lyft-features, lyft-dsp-features) exposing full CRUD. Governance is metadata-driven: ownership, urgency tiers, carryover/rollup logic, explicit naming + typing, versioning + lineage. Discoverability is outsourced to Amundsen — generated DAGs tag feature metadata automatically so engineers can find existing features before creating duplicates.

Key takeaways

  1. Feature Store as "platform of platforms" with three lanes. The system is not one pipeline but three complementary lanes — batch (Hive-table-defined features, SparkSQL-computed, typically daily cadence), online (dsfeatures serving + application-level CRUD), and streaming (Flink from Kafka / Kinesis) — all converging on the same online serving surface. This is Lyft's concrete realization of hybrid batch + streaming ingestion applied to a feature store, with the novel axis that an on-demand / direct-CRUD lane is exposed via the same SDK, not bolted on behind the pipeline. (Source: sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution)
  2. Config-driven DAG generation as the dominant feature-onboarding UX. Customers declare a feature with two files — a Spark SQL query + a JSON metadata config — and a Python cron service reads the configs and auto-generates an Astronomer-hosted Airflow DAG. Generated DAGs are "production-ready out-of-the-box" and handle (1) SparkSQL execution, (2) write to both offline + online paths, (3) integrated data-quality checks, (4) Amundsen metadata tagging for discoverability. This is config-driven DAG generation — the platform owns the pipeline boilerplate; customers own the feature-specific SQL + metadata only.
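A concrete feel for the two-file contract, as a minimal sketch: the post says a feature is "a Spark SQL query + a JSON metadata config" but does not publish the actual schema, so every field name below (name, version, owner, urgency_tier, value_type, entity_key, schedule) is an illustrative assumption, not Lyft's real config format.

```python
import json

# Hypothetical SparkSQL definition for a batch feature (illustrative only).
FEATURE_SQL = """
SELECT user_id, COUNT(*) AS rides_last_7d
FROM default.rides
WHERE ds >= date_sub(current_date(), 7)
GROUP BY user_id
"""

# Hypothetical JSON metadata config mirroring the governance fields the post
# lists: ownership, urgency tier, explicit naming + typing, versioning.
feature_config = {
    "name": "rides_last_7d",        # explicit naming
    "version": 1,                   # bumped on business-logic changes
    "owner": "growth-ml@example",   # ownership (placeholder address)
    "urgency_tier": "T2",           # urgency tier
    "value_type": "int",            # explicit typing
    "entity_key": "user_id",        # entity the feature is keyed on
    "schedule": "daily",            # typical batch cadence
}

print(json.dumps(feature_config, indent=2))
```

The point of the shape is that the customer never touches Airflow: these two artifacts are the entire onboarding surface, and the platform's codegen owns everything downstream.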
  3. dsfeatures is a wrapper over three AWS stores chosen for access-pattern fit, not a single database. The online serving layer deliberately composes:
     • DynamoDB as the persistent backing store — metadata fields compose the primary key, with a Global Secondary Index (GSI) for GDPR-deletion efficiency (so "delete all features for user X" is bounded-cost).
     • ValKey write-through LRU cache on top — ultra-low-latency reads for hot/frequent (meta)data with a "generous TTL." The write-through shape keeps DynamoDB authoritative: writes land in both in order, so a cache miss always has a correct DynamoDB read to fall back to.
     • OpenSearch for embedding features specifically, because kNN / vector-similarity retrieval is OpenSearch's native shape and not DynamoDB's. This is the wrapper-over-heterogeneous-stores pattern: expose one SDK, route to the right store internally by feature type.
  4. Streaming lane has an explicit ingest-service choke point (spfeaturesingest). Customer Flink applications don't write to the online store directly — they emit feature payloads into the central spfeaturesingest Flink application ("Streaming Platform feature ingest"), which owns (de)serialization and dsfeatures WRITE API interaction. The uniform-metadata invariant — "regardless of the ingestion method (batch, streaming, or on-demand), the Feature Store maintains uniform metadata and strongly consistent reads" — is what motivates the choke point: you can't have uniform metadata if each producer writes its own way.
  5. SDK-first access with full CRUD, not read-only. Two SDKs (go-lyft-features in Go, lyft-dsp-features in Python) wrap the dsfeatures API. Most calls are Get / BatchGet, but the SDKs expose full Create/Read/Update/Delete, which lets internal Airflow DAGs write as part of the batch path and lets customers manage real-time features ad hoc without going through the ingestion pipeline. This is the third lane — direct SDK writes — formalised as a first-class entry point.
  6. Governance is metadata-first; discoverability is outsourced to Amundsen. Every feature's JSON config carries ownership + urgency tier + carryover/rollup logic + explicit naming and data-typing + versioning semantics + lineage. The versioning rule is named: "If the SQL or expected feature behavior undergoes business logic changes, a version bump is expected." The generated DAGs automatically tag Amundsen with this metadata, so discoverability is a pipeline side-effect, not a separate ceremony. The explicit goal is preventing duplicate-feature work.
  7. Kyte accelerates the feature-engineering loop. Lyft's homegrown Kyte environment ("Airflow local development at Lyft", originally presented at Airflow Summit 2022) provides a CLI that lets developers validate configurations, test SQL runs locally, execute DAG runs in a local environment, and confidently backfill previous dates against their DAGs. Feature development is primarily a SQL + JSON loop; Kyte optimises that loop rather than inventing a DSL.

Architecture

The three lanes

                 ┌─────────────────────────────────┐
                 │           Customers             │
                 │  (services / Airflow / apps)    │
                 └────────┬────────────────┬───────┘
                          │ SDK calls       │ SDK CRUD
                          ▼                 ▼
                   ┌────────────────────────────────┐
                   │          dsfeatures            │
                   │  (online serving + metadata)   │
                   └────────┬─────────┬──────────┬──┘
                            │         │          │
                 ┌──────────▼──┐  ┌───▼────┐  ┌──▼───────────┐
                 │  DynamoDB   │  │ValKey  │  │ OpenSearch   │
                 │ (backing)   │  │(cache) │  │ (embeddings) │
                 └─────────────┘  └────────┘  └──────────────┘
       ┌────────────────┼───────────────────────┐
       │                │                       │
       │         ┌──────┴───────┐        ┌──────┴──────────┐
       │         │ Generated    │        │spfeaturesingest │
       │         │ Airflow DAG  │        │    (Flink)      │
       │         │  (SparkSQL)  │        └────────┬────────┘
       │         └──────┬───────┘                 │
       │                │                         │ payload
       │         ┌──────┴───────┐          ┌──────┴──────────┐
       │         │ Hive offline │          │ customer Flink  │
       │         │   tables     │          │  apps (K/K)     │
       │         └──────────────┘          └─────────────────┘
       │                ▲                         ▲
       │                │                         │
       │ Customers define feature with SQL query + JSON config
       │                                          │
       │                                ┌─────────┴─────────┐
       │                                │ Kafka / Kinesis   │
       │                                └───────────────────┘

Batch lane

  1. Customer writes a SparkSQL query + a JSON config (feature metadata).
  2. A Python cron service reads the configs → generates an Astronomer-hosted Airflow DAG per config.
  3. The DAG executes the query on Spark against Hive tables, producing a dataframe.
  4. Output is written to both paths:
     • Offline: Hive tables (for training + analysis).
     • Online: translated and sent to dsfeatures.
  5. Integrated data-quality checks and Amundsen tagging happen in the same DAG.
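The codegen step above can be sketched in a few lines. This is a minimal stand-in, not Lyft's generator: the function names, the `*.json` config layout, and the rendered DAG content are all assumptions; the real DAGs execute SparkSQL, dual-write, run quality checks, and tag Amundsen.

```python
import json
import pathlib
import textwrap

def generate_dag(config: dict) -> str:
    """Render Airflow DAG source for one feature config (illustrative stub)."""
    return textwrap.dedent(f'''
        # auto-generated: do not edit by hand
        # feature: {config["name"]} v{config["version"]}  owner: {config["owner"]}
        # tasks: run SparkSQL -> write Hive (offline) -> write dsfeatures (online)
        #        -> data-quality checks -> tag Amundsen metadata
    ''').strip()

def cron_tick(config_dir: pathlib.Path, out_dir: pathlib.Path) -> int:
    """One pass of the cron service: emit one generated DAG file per JSON config."""
    count = 0
    for path in sorted(config_dir.glob("*.json")):
        config = json.loads(path.read_text())
        (out_dir / f"{config['name']}_dag.py").write_text(generate_dag(config))
        count += 1
    return count
```

The design choice worth noticing: customers never author DAGs, so pipeline boilerplate (dual writes, quality checks, Amundsen tagging) is upgraded platform-wide by changing the generator, not N customer repos.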

Online serving layer (dsfeatures)

  • DynamoDB — persistent source; composite metadata PK; GSI for GDPR-deletion efficiency.
  • ValKey write-through LRU cache — hot-data low-latency path; "generous TTL".
  • OpenSearch — embedding features only (requires specialized indexing/retrieval).

SDKs (go-lyft-features, lyft-dsp-features) expose full CRUD; most calls are Get / BatchGet.
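The routing and caching behaviour can be made concrete with a small in-memory sketch. The dicts standing in for DynamoDB/OpenSearch, the class names, and the TTL-less LRU are all simplifications; the intent is only to show the wrapper-over-heterogeneous-stores shape: one facade, routed by feature type, with a write-through cache keeping the persistent store authoritative.

```python
from collections import OrderedDict

class WriteThroughLRU:
    """Tiny LRU standing in for the ValKey cache; the real one adds a generous TTL."""
    def __init__(self, capacity: int = 1024):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)   # mark as recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least-recently-used

class DSFeatures:
    """Facade sketch: route by feature type, cache in front of the backing store."""
    def __init__(self):
        self.dynamo = {}        # stands in for DynamoDB (authoritative)
        self.cache = WriteThroughLRU()
        self.opensearch = {}    # stands in for OpenSearch (embeddings only)

    def write(self, key, value, feature_type="scalar"):
        if feature_type == "embedding":
            self.opensearch[key] = value   # embeddings routed to the kNN store
            return
        self.dynamo[key] = value           # write-through: persistent store first...
        self.cache.put(key, value)         # ...then cache, so DynamoDB stays authoritative

    def read(self, key, feature_type="scalar"):
        if feature_type == "embedding":
            return self.opensearch[key]
        cached = self.cache.get(key)
        if cached is not None:
            return cached                  # hot path: ultra-low-latency cache hit
        value = self.dynamo[key]           # cache miss falls back to DynamoDB
        self.cache.put(key, value)
        return value
```

Because writes land in DynamoDB before the cache, a cold or evicted key always resolves to a correct read — the property the post leans on when calling the cache write-through.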

Streaming lane

  1. Customer Flink apps read events from Kafka (sometimes Kinesis).
  2. The customer app performs initial transformations, metadata creation, and value formatting.
  3. It sinks feature payloads into spfeaturesingest — Lyft's central Flink ingest app.
  4. spfeaturesingest handles (de)serialization and calls the dsfeatures WRITE API.

Uniform-metadata invariant: "Regardless of the ingestion method (batch, streaming, or on-demand), the Feature Store maintains uniform metadata and strongly consistent reads."
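Why the choke point buys the invariant, in sketch form: every producer's payload passes one validation gate before the dsfeatures WRITE API is called, so no producer can write with divergent metadata. The envelope fields and function shape below are assumptions; the post does not publish spfeaturesingest's payload schema.

```python
# Metadata every payload must carry (hypothetical field set).
REQUIRED_METADATA = {"feature_name", "version", "owner", "entity_key", "value_type"}

def ingest(payload: dict, write_api) -> None:
    """spfeaturesingest-style gate: validate the envelope, then forward.

    `write_api` stands in for the dsfeatures WRITE API call.
    """
    missing = REQUIRED_METADATA - payload.keys()
    if missing:
        raise ValueError(f"non-uniform metadata, rejected: {sorted(missing)}")
    write_api(payload)   # only well-formed payloads ever reach dsfeatures
```

Without the choke point, each customer Flink app would call the store its own way and the uniform-metadata guarantee would have to be re-verified per producer.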

Direct-CRUD lane

  • Internal DAGs write via SDK.
  • Customers manage real-time / ad-hoc features by invoking SDK CRUD directly.
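What the direct-CRUD lane looks like from the caller's side, as a hypothetical sketch: the post names the SDKs (go-lyft-features, lyft-dsp-features) but not their method signatures, so the `FeatureClient` class, its in-memory backing, and the put/get/batch_get/delete names are all invented for illustration.

```python
class FeatureClient:
    """Hypothetical stand-in for the Python SDK surface over dsfeatures."""
    def __init__(self):
        self._store = {}   # in-memory stand-in for the dsfeatures API

    def put(self, entity_id: str, name: str, value) -> None:
        self._store[(entity_id, name)] = value

    def get(self, entity_id: str, name: str):
        return self._store.get((entity_id, name))

    def batch_get(self, entity_ids: list, name: str) -> dict:
        return {e: self._store.get((e, name)) for e in entity_ids}

    def delete(self, entity_id: str, name: str) -> None:
        self._store.pop((entity_id, name), None)

client = FeatureClient()
client.put("user_42", "rides_last_7d", 4)   # ad-hoc real-time write, no pipeline
value = client.get("user_42", "rides_last_7d")  # most traffic: Get / BatchGet
```

The same surface serves both callers the post describes: generated Airflow DAGs writing the batch output, and services managing real-time features directly.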

Operational details worth noting

  • Primary key design: metadata fields composed as PK; explicit GSI for GDPR deletion. Bounded-cost user-data deletion is the motivating requirement — named explicitly in the post, a concrete ergonomics-for-the-ops-team decision baked into the schema.
  • "Generous TTL" on the ValKey cache — the post does not give a number but emphasizes that hot metadata benefits disproportionately from long-lived cache.
  • OpenSearch-for-embeddings isolation: embedding features do not live in DynamoDB. This is an early shape-fit decision — not every KV is a KV at the access pattern level.
  • SparkSQL + JSON deliberately chosen over DSL: "We learned early on that our core personas are particularly proficient in SQL and place a high value on quick iteration." User-interface design choice optimising for adoption over expressiveness.
  • No custom feature DSL: the config file is plain JSON; the transformation language is SparkSQL. Complexity is absorbed by the platform's codegen step (Python cron → Airflow DAG).
  • Versioning rule named: "If the SQL or expected feature behavior undergoes business logic changes, a version bump is expected." Schema vs. semantic change is the cut.
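The GDPR-deletion point above can be sketched concretely. This is not DynamoDB code — plain dicts stand in for the table and its GSI, and the key-composition format is an assumption — but it shows why a user-keyed secondary index turns "delete all features for user X" into a bounded lookup instead of a full-table scan.

```python
def primary_key(user_id: str, feature_name: str, version: int) -> str:
    """Compose the PK from metadata fields (delimiter and order are assumptions)."""
    return f"{user_id}#{feature_name}#v{version}"

table = {}          # stands in for the DynamoDB table: pk -> item
gsi_by_user = {}    # stands in for the GSI: user_id -> set of pks

def put_item(user_id: str, feature_name: str, version: int, value) -> None:
    pk = primary_key(user_id, feature_name, version)
    table[pk] = {"user_id": user_id, "value": value}
    gsi_by_user.setdefault(user_id, set()).add(pk)   # GSI maintained on write

def gdpr_delete(user_id: str) -> int:
    """Bounded-cost deletion: the GSI yields exactly the user's keys."""
    pks = gsi_by_user.pop(user_id, set())
    for pk in pks:
        del table[pk]
    return len(pks)
```

The cost of `gdpr_delete` is proportional to how many features the user has, not to the size of the table — the "bounded-cost" property the post names as the motivating requirement.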

Caveats

  • No absolute performance numbers. The post frequently uses "ultra-low-latency", "low-latency", "real-time", but gives no p50/p99, QPS, cache hit-rate, or cost numbers. This is a narrative post, not a performance retrospective.
  • No failure modes disclosed. The strongly-consistent-reads guarantee is asserted ("uniform metadata and strongly consistent reads" across lanes) without walking through how it's achieved under partial failure (e.g. DynamoDB write succeeds, ValKey cache invalidation fails).
  • Raw file truncated after the Amundsen section. The post evidently continues into operational workflow improvements / case studies, but the markdown body ends at the "Building Real-time Machine Learning Foundations at Lyft" image caption — additional content beyond Amundsen is not available on the wiki from this ingest.
  • spfeaturesingest internals not disclosed. We know it reads payloads, (de)serializes, writes via dsfeatures WRITE API — but not its parallelism model, ordering guarantees under rebalance, or DLQ behaviour.
  • ValKey + DynamoDB consistency under write failure isn't spelled out. "Write-through LRU" describes the steady-state path; the post doesn't discuss what happens if the cache write fails after the DynamoDB write succeeds (stale cache? rollback?).
