Lyft¶
The Lyft Engineering blog is a Tier-2 source on the sysdesign-wiki. Lyft runs a large Envoy-fronted ridesharing platform (Envoy itself originated at Lyft), a heavily polyglot backend (Python + Go + Java) with iOS + Android mobile clients, and the usual infra surface: service meshes, feature flags, rider/driver matching, trip lifecycle state. The blog's historical strength has been systems-at-scale content (Envoy, Flyte, service-mesh operations) plus mobile networking and protocol design for mobile-to-server + server-to-server communication. The 2020 post by Michael Rebello on Lyft's mobile networking journey documents the company-wide adoption of protobuf for mobile traffic; the 2024-09-16 Lyft Media post, the first Lyft source ingested here, is a direct descendant of that adoption.
The 2025-11-18 and 2026-01-06 posts opened a second Lyft theme on
the wiki: ML platform + data infra. LyftLearn 2.0 is the ML
platform (split compute from serving — SageMaker for training,
Kubernetes/EKS for inference). The Lyft Feature Store is the data
substrate underneath both — a "platform of platforms" with three
ingestion lanes (batch / streaming / direct-CRUD) converging on
dsfeatures, a wrapper over DynamoDB +
ValKey + OpenSearch.
The 2026-04-23 post opened a third Lyft theme: Mapping + pickup UX. Lyft's Mapping team covers map data, pickup-spot recommendations, routing, and the rider/driver app surfaces — four pieces that the "Smarter Pickup Experience for Gated Communities" project weaves together into a single named playbook for encoding real-world physical constraints into the map. Gated communities make up 25–30% of rides in selected markets; the fix generalises explicitly to road closures, unsafe curbs, parades, and marathons — same four-step pattern applied to a different spatial constraint.
Key systems¶
Metric governance (MSL, 2026-06-10 post)¶
- systems/lyft-metric-semantic-layer — Metric Semantic Layer; a centralized, versioned Python package serving as the single authoritative repository for every "Golden Metric" definition. YAML configs + Jinja-templated SQL + Python API + MCP server. Integrated with Amundsen for discoverability and a self-service Metric UI for no-code SQL generation.
Mapping / pickup UX (gated-community pickup, 2026-04-23 post)¶
- systems/lyft-gate-area-generator — Map Data team algorithm that generates gate-area polygons for gated communities from OpenStreetMap + driver feedback. Handles single-entrance apartment complexes through multi-gate developments with internal road networks. Feeds the rider app's "gates mode" auto-detect on app open.
- systems/lyft-pickup-routing — Routing team's pickup-routing subsystem. Inserts the gate as an invisible intermediate waypoint for gated-community pickups, giving the driver app a precise UX timing anchor for surfacing gate instructions.
- systems/lyft-rider-app — rider-facing mobile app; host of the "gates mode" dual inside/outside-gate pickup-spot selection UI and the intercom-style numpad + plain-language list for gate-instruction sharing.
- systems/lyft-driver-app — driver-facing mobile app; host of the scannable gate-instruction banner timed off the routing waypoint, with screenshot prevention for code exfiltration control.
Protocol + schema design (mobile + backend)¶
- systems/protobuf — Lyft Media canonicalises design practices for the shared-schema case (mobile + backend). Used extensively across both mobile-to-server and server-to-server traffic per Rebello's 2020 post.
- systems/protoc-gen-validate (PGV) / protovalidate — Lyft Media's declarative validation layer over protobuf schemas. PGV began as a Lyft open-source project and is now maintained by Buf, which also develops its successor protovalidate; Lyft Media uses it as the standard validator.
- systems/envoy — Lyft-originated L7 proxy; pre-existing wiki reference.
ML platform¶
- systems/lyftlearn / systems/lyftlearn-serving / systems/lyftlearn-compute — Lyft's ML platform, split after LyftLearn 2.0 into Kubernetes/EKS serving (lyftlearn-serving) and SageMaker-based training/batch/notebooks (lyftlearn-compute).
ML data infra (feature store)¶
- systems/lyft-feature-store — "platform of platforms": three ingestion lanes (batch / streaming / direct-CRUD) with strongly-consistent reads + uniform metadata across lanes. Canonical second-major-co feature-store on the wiki after Dropbox Dash.
- systems/lyft-dsfeatures — the unified online serving layer; wraps DynamoDB (persistent, GSI for GDPR deletion) + ValKey (write-through LRU cache) + OpenSearch (embeddings only). Exposes full CRUD via go-lyft-features + lyft-dsp-features SDKs.
- systems/amundsen — Lyft's own open-source data-discovery platform; Feature Store DAGs automatically tag feature metadata here so engineers can find existing features before creating duplicates.
- systems/apache-airflow — Astronomer-hosted; runs the auto-generated feature DAGs.
- systems/apache-flink — streaming-feature lane, reading from Kafka (or sometimes Kinesis) and writing through the central spfeaturesingest Flink choke-point app.
- systems/apache-spark + systems/apache-hive — SparkSQL as the batch-feature transformation language; Hive as the offline feature store.
Key patterns / concepts¶
Metric governance (2026-06-10 MSL post)¶
- concepts/metric-definition-drift — the anti-pattern MSL solves: different teams maintain different SQL for the same metric
- concepts/metric-definition-as-code — treating metric SQL as versioned, package-distributed code
- concepts/golden-metric-selection-criteria — the ≥2 use-case threshold for onboarding metrics into MSL
- concepts/dual-owner-metric-governance — Business Owner + Operational Owner model; teams, never individuals
- patterns/yaml-config-driven-metric-definitions — YAML + Jinja templates for DRY metric SQL generation
- patterns/jinja-templated-sql-generation — parameterized SQL via Jinja for time granularity + dimensions
- patterns/dual-owner-approval-for-metric-changes — mandatory dual sign-off as governance gate
- patterns/metric-semantic-layer-as-ai-knowledge-base — structured YAML definitions as grounding context for MCP-based AI agents
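The YAML-config-plus-templated-SQL idea above can be sketched in a few lines. This is a minimal illustration and not MSL's actual code: the metric definition is a plain dict standing in for a YAML config, the standard-library `string.Template` stands in for Jinja, and every name here (`METRIC`, `render_metric_sql`, the `rides` table and its columns) is hypothetical.

```python
from string import Template

# Hypothetical metric definition; in a real semantic layer this would be
# loaded from a versioned YAML config rather than written inline.
METRIC = {
    "name": "completed_rides",
    "table": "rides",
    "time_column": "completed_at",
    "expression": "COUNT(*)",
}

# One template per metric shape; granularity and dimensions are parameters.
SQL_TEMPLATE = Template(
    "SELECT DATE_TRUNC('$granularity', $time_column) AS period$dim_select, "
    "$expression AS $name "
    "FROM $table "
    "GROUP BY 1$dim_group"
)

ALLOWED_GRANULARITIES = {"day", "week", "month"}


def render_metric_sql(metric, granularity, dimensions=()):
    """Render one SQL string for a metric at a given time granularity."""
    if granularity not in ALLOWED_GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    for dim in dimensions:
        # Crude guard for the sketch: only plain identifiers are accepted.
        if not dim.isidentifier():
            raise ValueError(f"invalid dimension name: {dim}")
    dim_select = "".join(f", {dim}" for dim in dimensions)
    # Dimensions occupy SELECT positions 2..n+1, after the period column.
    dim_group = "".join(f", {pos}" for pos in range(2, 2 + len(dimensions)))
    return SQL_TEMPLATE.substitute(
        metric,
        granularity=granularity,
        dim_select=dim_select,
        dim_group=dim_group,
    )


if __name__ == "__main__":
    print(render_metric_sql(METRIC, "week", ["region"]))
```

Because every consumer calls the same render function against the same definition, two teams asking for the metric at different granularities still get SQL derived from one source, which is the drift the semantic layer is meant to prevent.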
Mapping / pickup UX (2026-04-23 gated-community post)¶
- concepts/gated-community-pickup — the problem shape (25–30% of rides in some markets; dual root-cause of inflexible spot selection + communication black hole).
- concepts/pickup-spot-recommendation — the broader ride-share problem of picking the right spot to meet.
- concepts/virtual-waypoint-routing — using an invisible intermediate stop on a driver's route as a hook for surfacing context at the right moment, not as a real stop.
- concepts/historical-pickup-heatmap — using where past riders actually met drivers as the recommendation signal over topology.
- concepts/timing-based-information-surfacing — show information at the moment it's actionable, not earlier; driver-safety-motivated.
- concepts/ephemeral-sensitive-data — gate codes as the canonical instance: never stored between trips, audience of one, screenshot-blocked.
- concepts/ride-cancellation-rate — primary business metric for the gated-community pickup fix.
- concepts/rider-driver-communication-black-hole — anti-pattern of last-mile coordination falling back to ad-hoc text/call because the platform offers no proactive context channel.
- patterns/intermediate-waypoint-for-context-surfacing — insert a virtual routing waypoint so the UX has a timing anchor; routing is the enabler, UX timing is the purpose.
- patterns/historical-usage-for-pickup-spot-suggestion — surface meet-point recommendations from historical successful-match data.
- patterns/map-encoded-real-world-constraint — the four-step playbook Lyft explicitly names as reusable (encode in map → surface in recs → thread through routing → surface at right moment). Generalises from gates to road closures to unsafe curbs.
- patterns/ephemeral-per-trip-sensitive-input — the per-trip privacy contract for the gate-code channel.
- patterns/controlled-experiment-before-shipping — A/B test primary metric should be upstream-funnel completion (ride requests), not the feature's own metric.
- patterns/familiar-ui-borrowed-from-adjacent-flow — intercom-numpad borrows from physical intercoms + Lyft's existing Venues pickup flow to drop learning cost.
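The virtual-waypoint idea from this list can be sketched as follows. This is an assumption-laden toy and not Lyft's routing code: `Waypoint`, `insert_gate_waypoint`, `visible_stops`, and `note_on_reaching` are hypothetical names, and a route is reduced to an ordered list of named points.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Waypoint:
    name: str
    visible: bool = True          # rendered to the driver as a stop?
    note: Optional[str] = None    # context to surface when this point is reached


def insert_gate_waypoint(route: List[Waypoint], gate_name: str,
                         gate_note: str) -> List[Waypoint]:
    """Return a new route with an invisible gate waypoint before the pickup.

    The last element of `route` is assumed to be the pickup spot.
    """
    if not route:
        raise ValueError("route must contain at least the pickup spot")
    gate = Waypoint(gate_name, visible=False, note=gate_note)
    return route[:-1] + [gate, route[-1]]


def visible_stops(route: List[Waypoint]) -> List[str]:
    """Names of the stops the driver actually sees as stops."""
    return [wp.name for wp in route if wp.visible]


def note_on_reaching(route: List[Waypoint], index: int) -> Optional[str]:
    """Context to show when the driver reaches route[index], if any."""
    return route[index].note


if __name__ == "__main__":
    base = [Waypoint("driver_start"), Waypoint("pickup")]
    routed = insert_gate_waypoint(base, "north_gate", "Gate ahead: see rider instructions")
    print(visible_stops(routed))
    print(note_on_reaching(routed, 1))
```

The driver's stop list is unchanged, but the route now carries a point whose only job is to trigger the gate banner at the right moment. In this sketch the note is a fixed string; an actual gate code would be attached per trip and discarded afterwards, per the ephemeral-data contract above.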
Protobuf / protocol design¶
- concepts/clarity-over-efficiency-in-protocol-design — first of Lyft Media's two named principles for protobuf design
- concepts/extensibility-protocol-design — second principle; prefer structures (oneof, well-known types, string IDs) that admit future additions
- concepts/unknown-zero-enum-value — reserve 0 as UNKNOWN on every enum
- concepts/unit-suffix-field-naming — payload_size_bytes, timestamp_ms_utc, not raw primitives
- concepts/proto3-explicit-optional — use optional label (proto3 ≥ 3.15) or google.protobuf.*Value wrappers for presence semantics on primitives
- patterns/oneof-over-enum-plus-field — model variant messages with oneof, not with discriminator-enum + sibling optional fields
- patterns/protobuf-validation-rules — declarative validation inline in the .proto; generated validators must be invoked explicitly
- patterns/protobuf-cross-entity-constants — custom EnumValueOptions extensions to share literal constants between mobile and backend
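Why reserving 0 as UNKNOWN matters can be shown without any protobuf tooling. In proto3 an unset enum field reads back as the value numbered 0, so whatever sits at 0 is what a reader sees when the sender never set the field. The sketch below imitates that behaviour with a plain dict as the "decoded message"; `RideType` and `decode_ride_type` are hypothetical, and mapping unrecognised numbers to UNKNOWN is a choice made for this illustration, not something protobuf does on its own.

```python
from enum import IntEnum


class RideType(IntEnum):
    UNKNOWN = 0    # reserved: what an unset field decodes to
    STANDARD = 1
    SHARED = 2


def decode_ride_type(message: dict) -> RideType:
    """Read the ride type from a decoded message (modelled as a dict)."""
    # Mimic proto3 semantics: an absent enum field reads as 0.
    raw = message.get("ride_type", 0)
    try:
        return RideType(raw)
    except ValueError:
        # A number added by a newer schema version; treat it as UNKNOWN here.
        return RideType.UNKNOWN


if __name__ == "__main__":
    print(decode_ride_type({}).name)
    print(decode_ride_type({"ride_type": 2}).name)
```

If STANDARD were numbered 0 instead, a message that omitted the field would be indistinguishable from one that explicitly requested a standard ride.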
ML feature store (2026-01-06 Feature Store post)¶
- concepts/feature-store — Lyft is the second canonical major-tech-co instance.
- concepts/feature-freshness — batch / streaming / on-demand lanes map to different freshness tiers; "ultra-low-latency" cache + "near-real-time" streaming path.
- concepts/write-through-cache — canonical example: ValKey over DynamoDB inside dsfeatures.
- concepts/feature-discoverability — Amundsen as the Feature-Store discovery layer; DAGs tag metadata automatically.
- concepts/training-serving-boundary — feature-store shape as a boundary-crossing discipline (unifies feature values across the training + serving fleets).
- patterns/hybrid-batch-streaming-ingestion — second canonical instance of the pattern (after Dropbox Dash).
- patterns/config-driven-dag-generation — canonical instance: SparkSQL + JSON config → auto-generated Airflow DAG with production-ready data-quality + Amundsen tagging baked in.
- patterns/batch-plus-streaming-plus-ondemand-feature-serving — the "platform of platforms" three-lane serving shape.
- patterns/wrapper-over-heterogeneous-stores-as-serving-layer — canonical instance: dsfeatures wraps DynamoDB + ValKey + OpenSearch behind one SDK.
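The write-through LRU cache shape named in this list can be sketched with the standard library alone. This is not the dsfeatures implementation: a plain dict stands in for the persistent store (DynamoDB in the source), an `OrderedDict` stands in for the cache (ValKey in the source), and `WriteThroughLRU` is a hypothetical name.

```python
from collections import OrderedDict


class WriteThroughLRU:
    """Cache in front of a persistent store; every write hits the store first."""

    def __init__(self, store, capacity):
        self.store = store            # stand-in for the persistent store
        self.capacity = capacity
        self.cache = OrderedDict()    # least recently used entry is first

    def put(self, key, value):
        self.store[key] = value       # write through: persist before caching
        self._remember(key, value)

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            return self.cache[key]
        value = self.store[key]           # miss: read the store (KeyError if absent)
        self._remember(key, value)
        return value

    def delete(self, key):
        self.store.pop(key, None)
        self.cache.pop(key, None)

    def _remember(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used


if __name__ == "__main__":
    store = {}
    features = WriteThroughLRU(store, capacity=2)
    features.put("rider:1:trips_7d", 4)
    features.put("rider:2:trips_7d", 9)
    features.put("rider:3:trips_7d", 1)
    print(sorted(store))           # all three keys are persisted
    print(list(features.cache))    # only the two most recent are cached
```

Writing to the store before the cache means the cache never holds a value the store lacks, so eviction is always safe: a later read simply falls through to the store.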
LyftLearn 2.0 (2025-11-18 LyftLearn-evolution post)¶
- concepts/hybrid-ml-platform-architecture — compute-serving split (SageMaker for training, EKS/Kubernetes for serving).
- concepts/zero-code-change-migration + patterns/zero-code-change-platform-migration
- concepts/environmental-parity
- concepts/container-entrypoint-compat-layer + patterns/cross-platform-base-image
- patterns/runtime-fetched-credentials-and-config
- patterns/warm-pool-zero-create-path
- patterns/decoupled-compute-and-serving-stacks
- patterns/model-registry-and-object-store-as-hybrid-glue
- concepts/cross-cluster-networking
- concepts/lazy-container-image-loading (systems/amazon-soci)
Recent articles¶
- 2026-06-10 — sources/2026-06-10-lyft-metric-semantic-layer (Rohit Channe & Simran Mirchandani, Lyft Engineering — Lyft's internal Metric Semantic Layer (MSL): a centralized, versioned Python package serving as the single source of truth for every "Golden Metric" definition. YAML configs with Jinja-templated SQL, exposed via Python API, Amundsen integration, self-service UI, and MCP server for AI agents. Governance via dual-owner model (Business Owner + Operational Owner, always teams) with mandatory dual approval for changes. Only metrics with ≥2 use cases qualify. Fifth Lyft source on the wiki and the first focused on metric governance architecture.)
- 2026-04-23 — sources/2026-04-23-lyft-smarter-pickup-experience-for-gated-communities (Lyft Mapping team — an end-to-end rebuild of the pickup flow for gated communities, which make up 25–30% of Lyft rides in selected markets. Four-piece architecture: (1) gate-area shape generation from OSM + driver feedback; (2) dual inside/outside-gate pickup-spot selection UI with outside-gate spots sourced from historical pickup heatmaps; (3) routing inserts the gate as an invisible intermediate waypoint that doubles as a UX timing anchor; (4) intercom-style numpad for gate-code sharing + scannable banner on driver approach, with gate codes treated as ephemeral sensitive data (never stored between trips, audience of one, screenshot-blocked). ~95% positive rider survey response post-launch; lower rider + driver cancellation rates; less walking, shorter waits, fewer course changes. Named as the first instance of a repeatable playbook for physical-world constraints — generalises to road closures, unsafe curbs, etc. Fourth Lyft source on the wiki and the first with a mapping/routing focus.)
- 2026-01-06 — sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution (Rohan Varshney, Lyft Engineering — Lyft's Feature Store as a "platform of platforms" with three ingest lanes (batch / streaming / direct-CRUD) converging on a unified online-serving layer dsfeatures that wraps DynamoDB + ValKey write-through LRU cache + OpenSearch embedding store behind two CRUD SDKs. Batch lane: SparkSQL + JSON config → auto-generated Astronomer-hosted Airflow DAGs with built-in data-quality checks and Amundsen metadata tagging. Streaming lane: customer Flink apps read Kafka/Kinesis → central spfeaturesingest Flink ingest app writes to dsfeatures. Uniform metadata + strongly consistent reads invariant across lanes. GSI-for-GDPR on DynamoDB. Second major-tech-co feature-store instance on the wiki after Dropbox Dash.)
- 2025-11-18 — sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture (Lyft ML Platform team — LyftLearn 2.0: compute-serving split moving training / batch / HPO / JupyterLab off Kubernetes LyftLearn onto SageMaker-based LyftLearn Compute, while real-time model serving stays on EKS LyftLearn Serving. Zero-ML-code-change migration as the hard constraint; achieved via cross-platform Docker base image compat layer replicating the Kubernetes environment on SageMaker.)
- 2024-09-16 — sources/2024-09-16-lyft-protocol-buffer-design-principles-and-practices (Roman Kotenko, Lyft Media — distilled two principles + five practices for proto3 protobuf design: clarity + extensibility; reserve 0 as UNKNOWN; prefer oneof over enum-plus-field; name fields with their unit; use optional label / wrapper types for presence; declare validation inline with protoc-gen-validate; cross-entity constants via custom EnumValueOptions extensions. Validators must be invoked manually — they don't run on parse. First Lyft source on the wiki.)