Lyft¶

The Lyft Engineering blog is a Tier-2 source on the sysdesign-wiki. Lyft runs a large Envoy-fronted ridesharing platform (Envoy itself originated at Lyft), a heavily polyglot backend (Python + Go + Java) with iOS + Android mobile clients, and the usual infra surface: service meshes, feature flags, rider/driver matching, trip lifecycle state. The blog's historical strength has been systems-at-scale content (Envoy, Flyte, service-mesh operations) plus mobile networking and protocol design for mobile-to-server + server-to-server communication. The 2020 post by Michael Rebello on Lyft's mobile networking journey documents the company-wide adoption of protobuf for mobile traffic; the 2024-09-16 Lyft Media post, the first Lyft source ingested here, is a direct descendant of that adoption.

The 2025-11-18 and 2026-01-06 posts opened a second Lyft theme on the wiki: ML platform + data infra. LyftLearn 2.0 is the ML platform, with compute split from serving (SageMaker for training, Kubernetes/EKS for inference). The Lyft Feature Store is the data substrate underneath both training and serving: a "platform of platforms" with three ingestion lanes (batch / streaming / direct-CRUD) converging on dsfeatures, a wrapper over DynamoDB + ValKey + OpenSearch.

The 2026-04-23 post opened a third Lyft theme: Mapping + pickup UX. Lyft's Mapping team covers map data, pickup-spot recommendations, routing, and the rider/driver app surfaces. The "Smarter Pickup Experience for Gated Communities" project weaves these four pieces together into a single named playbook for encoding real-world physical constraints into the map. Gated communities make up 25–30% of rides in selected markets; the fix generalises explicitly to road closures, unsafe curbs, parades, and marathons, each being the same four-step pattern applied to a different spatial constraint.

Key systems¶

Metric governance (MSL, 2026-06-10 post)¶

systems/lyft-metric-semantic-layer — Metric Semantic Layer; a centralized, versioned Python package serving as the single authoritative repository for every "Golden Metric" definition. YAML configs + Jinja-templated SQL + Python API + MCP server. Integrated with Amundsen for discoverability and a self-service Metric UI for no-code SQL generation.
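
The "YAML configs + Jinja-templated SQL" shape can be illustrated with a minimal sketch. This is not MSL's code: a plain dict stands in for a parsed YAML config, the standard library's `string.Template` stands in for Jinja, and the metric, table, and column names are all hypothetical.

```python
from string import Template

# Hypothetical metric definition; in MSL this would live in a YAML config.
# Metric, table, and column names are invented for illustration.
METRIC = {
    "name": "completed_rides",
    "version": "1.2.0",
    "dimensions": {"region", "ride_type"},
    "sql_template": (
        "SELECT DATE_TRUNC('$granularity', completed_at) AS period$dim_select, "
        "COUNT(*) AS completed_rides "
        "FROM rides WHERE status = 'completed' "
        "GROUP BY 1$dim_group"
    ),
}

ALLOWED_GRANULARITIES = {"day", "week", "month"}


def render_metric_sql(metric, granularity, dimensions=()):
    """Render one metric's SQL for a time granularity and optional dimensions."""
    if granularity not in ALLOWED_GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    dims = list(dimensions)
    unknown = set(dims) - metric["dimensions"]
    if unknown:
        # Only dimensions declared in the definition may reach the SQL text.
        raise ValueError(f"undeclared dimensions: {sorted(unknown)}")
    dim_select = "".join(f", {d}" for d in dims)
    # Dimensions occupy GROUP BY positions 2..n+1, after the period column.
    dim_group = "".join(f", {i}" for i in range(2, len(dims) + 2))
    return Template(metric["sql_template"]).substitute(
        granularity=granularity, dim_select=dim_select, dim_group=dim_group
    )
```

Every consumer calling `render_metric_sql(METRIC, "week", ["region"])` gets the same SQL text from the same versioned definition, which is the property that makes a single package a plausible single source of truth.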

Mapping / pickup UX (gated-community pickup, 2026-04-23 post)¶

systems/lyft-gate-area-generator — Map Data team algorithm that generates gate-area polygons for gated communities from OpenStreetMap + driver feedback. Handles single-entrance apartment complexes through multi-gate developments with internal road networks. Feeds the rider app's "gates mode" auto-detect on app open.
systems/lyft-pickup-routing — Routing team's pickup-routing subsystem. Inserts the gate as an invisible intermediate waypoint for gated-community pickups, giving the driver app a precise UX timing anchor for surfacing gate instructions.
systems/lyft-rider-app — rider-facing mobile app; host of the "gates mode" dual inside/outside-gate pickup-spot selection UI and the intercom-style numpad + plain-language list for gate-instruction sharing.
systems/lyft-driver-app — driver-facing mobile app; host of the scannable gate-instruction banner timed off the routing waypoint, with screenshot prevention for code exfiltration control.
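
The interplay between the routing waypoint and the driver-app banner can be sketched as follows. This is an illustrative model only: the `Waypoint` type, the function names, and the 150 m trigger radius are assumptions, not details from the post.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Waypoint:
    name: str
    lat: float
    lng: float
    visible: bool = True  # False = routing-only anchor, never shown as a stop


def haversine_m(a, b):
    """Great-circle distance between two waypoints in metres."""
    dlat = radians(b.lat - a.lat)
    dlng = radians(b.lng - a.lng)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlng / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def build_pickup_route(driver, pickup, gate=None):
    """Thread the gate between driver and pickup as an invisible waypoint."""
    route = [driver]
    if gate is not None:
        route.append(Waypoint(gate.name, gate.lat, gate.lng, visible=False))
    route.append(pickup)
    return route


def should_show_gate_banner(driver_pos, route, radius_m=150.0):
    """True once the driver is within radius_m of any invisible waypoint."""
    # radius_m is a hypothetical threshold chosen for this sketch.
    return any(
        not wp.visible and haversine_m(driver_pos, wp) <= radius_m
        for wp in route[1:]
    )
```

The driver never sees the gate as a stop (`visible=False`), yet the route passes through it, so the UI has a well-defined moment at which to surface the gate instructions.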

Protocol + schema design (mobile + backend)¶

systems/protobuf — Lyft Media canonicalises design practices for the shared-schema case (mobile + backend). Used extensively across both mobile-to-server and server-to-server traffic per Rebello's 2020 post.
systems/protoc-gen-validate (PGV) / protovalidate — Lyft Media's declarative validation layer over protobuf schemas. PGV began as a Lyft open-source project and is now maintained by Buf alongside its successor, protovalidate; Lyft Media uses it as the standard validator.
systems/envoy — Lyft-originated L7 proxy; pre-existing wiki reference.

ML platform¶

systems/lyftlearn / systems/lyftlearn-serving / systems/lyftlearn-compute — Lyft's ML platform, split after LyftLearn 2.0 into Kubernetes/EKS serving (lyftlearn-serving) and SageMaker-based training/batch/notebooks (lyftlearn-compute).

ML data infra (feature store)¶

systems/lyft-feature-store — "platform of platforms": three ingestion lanes (batch / streaming / direct-CRUD) with strongly-consistent reads + uniform metadata across lanes. The wiki's second canonical major-tech-co feature store, after Dropbox Dash.
systems/lyft-dsfeatures — the unified online serving layer; wraps DynamoDB (persistent, GSI for GDPR deletion) + ValKey (write-through LRU cache) + OpenSearch (embeddings only). Exposes full CRUD via go-lyft-features + lyft-dsp-features SDKs.
systems/amundsen — Lyft's own open-source data-discovery platform; Feature Store DAGs automatically tag feature metadata here so engineers can find existing features before creating duplicates.
systems/apache-airflow — Astronomer-hosted; runs the auto-generated feature DAGs.
systems/apache-flink — streaming-feature lane, reading from Kafka (or sometimes Kinesis) and writing through the central spfeaturesingest Flink choke-point app.
systems/apache-spark + systems/apache-hive — SparkSQL as the batch-feature transformation language; Hive as the offline feature store.
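
The write-through LRU arrangement described for dsfeatures can be sketched in a few lines. This is a toy model, not Lyft's implementation: a dict stands in for the persistent store, an `OrderedDict` stands in for the cache, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class WriteThroughFeatureStore:
    """Toy write-through cache: every write hits the durable store and the cache."""

    def __init__(self, cache_capacity=2):
        self.persistent = {}        # stand-in for the durable store
        self.cache = OrderedDict()  # stand-in for the LRU cache
        self.capacity = cache_capacity

    def _cache_put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def put(self, key, value):
        self.persistent[key] = value  # durable write first
        self._cache_put(key, value)   # then refresh the cache in the same call

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # a hit counts as recent use
            return self.cache[key]
        value = self.persistent.get(key)
        if value is not None:
            self._cache_put(key, value)  # repopulate the cache on a miss
        return value

    def delete(self, key):
        self.persistent.pop(key, None)
        self.cache.pop(key, None)
```

Because `put` updates both layers together, a read that follows a write never sees a stale cached value, which is the property a write-through design buys at the cost of slower writes.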

Key patterns / concepts¶

Metric governance (2026-06-10 MSL post)¶

concepts/metric-definition-drift — the anti-pattern MSL solves: different teams maintain different SQL for the same metric
concepts/metric-definition-as-code — treating metric SQL as versioned, package-distributed code
concepts/golden-metric-selection-criteria — the ≥2 use-case threshold for onboarding metrics into MSL
concepts/dual-owner-metric-governance — Business Owner + Operational Owner model; teams, never individuals
patterns/yaml-config-driven-metric-definitions — YAML + Jinja templates for DRY metric SQL generation
patterns/jinja-templated-sql-generation — parameterized SQL via Jinja for time granularity + dimensions
patterns/dual-owner-approval-for-metric-changes — mandatory dual sign-off as governance gate
patterns/metric-semantic-layer-as-ai-knowledge-base — structured YAML definitions as grounding context for MCP-based AI agents
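
The two governance gates listed above (the ≥2 use-case threshold and mandatory dual sign-off by owning teams) can be expressed as a small check. The `MetricProposal` type, its fields, and the team names used in the test are hypothetical; the post describes the policy, not this code.

```python
from dataclasses import dataclass, field


@dataclass
class MetricProposal:
    name: str
    use_cases: list
    business_owner_team: str     # owners are teams, never individuals
    operational_owner_team: str
    approvals: set = field(default_factory=set)  # teams that have signed off


def qualifies_as_golden(proposal, min_use_cases=2):
    """A metric is onboarded only if it has at least two distinct use cases."""
    return len(set(proposal.use_cases)) >= min_use_cases


def can_merge_change(proposal):
    """A change merges only when both owner teams have approved it."""
    required = {proposal.business_owner_team, proposal.operational_owner_team}
    return qualifies_as_golden(proposal) and required <= proposal.approvals
```

In practice a check like this would more plausibly run as a review or CI gate on the definitions repository than as application code; the sketch only makes the rule explicit.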

Mapping / pickup UX (2026-04-23 gated-community post)¶

concepts/gated-community-pickup — the problem shape (25–30% of rides in some markets; dual root-cause of inflexible spot selection + communication black hole).
concepts/pickup-spot-recommendation — the broader ride-share problem of picking the right spot to meet.
concepts/virtual-waypoint-routing — using an invisible intermediate stop on a driver's route as a hook for surfacing context at the right moment, not as a real stop.
concepts/historical-pickup-heatmap — using where past riders actually met drivers as the recommendation signal over topology.
concepts/timing-based-information-surfacing — show information at the moment it's actionable, not earlier; driver-safety-motivated.
concepts/ephemeral-sensitive-data — gate codes as the canonical instance: never stored between trips, audience of one, screenshot-blocked.
concepts/ride-cancellation-rate — primary business metric for the gated-community pickup fix.
concepts/rider-driver-communication-black-hole — anti-pattern of last-mile coordination falling back to ad-hoc text/call because the platform offers no proactive context channel.
patterns/intermediate-waypoint-for-context-surfacing — insert a virtual routing waypoint so the UX has a timing anchor; routing is the enabler, UX timing is the purpose.
patterns/historical-usage-for-pickup-spot-suggestion — surface meet-point recommendations from historical successful-match data.
patterns/map-encoded-real-world-constraint — the four-step playbook Lyft explicitly names as reusable (encode in map → surface in recs → thread through routing → surface at right moment). Generalises from gates to road closures to unsafe curbs.
patterns/ephemeral-per-trip-sensitive-input — the per-trip privacy contract for the gate-code channel.
patterns/controlled-experiment-before-shipping — A/B test primary metric should be upstream-funnel completion (ride requests), not the feature's own metric.
patterns/familiar-ui-borrowed-from-adjacent-flow — intercom-numpad borrows from physical intercoms + Lyft's existing Venues pickup flow to drop learning cost.
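
The per-trip privacy contract for gate codes (never stored between trips, audience of one) can be sketched as an in-memory channel. The class and method names are hypothetical, and screenshot blocking is a client-side concern that this sketch does not model.

```python
class GateCodeChannel:
    """In-memory, per-trip holder for a gate code (illustrative only)."""

    def __init__(self):
        # trip_id -> (driver_id, code); deliberately never persisted.
        self._codes = {}

    def share(self, trip_id, driver_id, code):
        """Rider shares a code for exactly one trip and one driver."""
        self._codes[trip_id] = (driver_id, code)

    def read(self, trip_id, requester_id):
        """Return the code only to the driver assigned to this trip."""
        entry = self._codes.get(trip_id)
        if entry is None or entry[0] != requester_id:
            return None  # audience of one
        return entry[1]

    def end_trip(self, trip_id):
        """Drop the code when the trip ends so nothing carries over."""
        self._codes.pop(trip_id, None)
```

The important design choice is that deletion is tied to the trip lifecycle rather than to a retention timer, so a later trip to the same address starts with no code on file.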

Protobuf / protocol design¶

concepts/clarity-over-efficiency-in-protocol-design — first of Lyft Media's two named principles for protobuf design
concepts/extensibility-protocol-design — second principle; prefer structures (oneof, well-known types, string IDs) that admit future additions
concepts/unknown-zero-enum-value — reserve 0 as UNKNOWN on every enum
concepts/unit-suffix-field-naming — payload_size_bytes, timestamp_ms_utc, not raw primitives
concepts/proto3-explicit-optional — use the optional label (proto3, protoc ≥ 3.15) or google.protobuf.*Value wrappers for presence semantics on primitives
patterns/oneof-over-enum-plus-field — model variant messages with oneof, not with discriminator-enum + sibling optional fields
patterns/protobuf-validation-rules — declarative validation inline in the .proto; generated validators must be invoked explicitly
patterns/protobuf-cross-entity-constants — custom EnumValueOptions extensions to share literal constants between mobile and backend
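
Several of the practices above can be mirrored in plain Python to show their intent. This is an analogue, not generated protobuf code and not the PGV API: the `MediaType` enum, the `UploadRequest` message, and the validation limits are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class MediaType(IntEnum):
    MEDIA_TYPE_UNKNOWN = 0  # zero is reserved, so "unset" never looks like a real value
    MEDIA_TYPE_IMAGE = 1
    MEDIA_TYPE_VIDEO = 2


def decode_media_type(raw):
    """Consumer-side choice in this sketch: values this build does not know become UNKNOWN."""
    try:
        return MediaType(raw)
    except ValueError:
        return MediaType.MEDIA_TYPE_UNKNOWN


@dataclass
class UploadRequest:
    media_type: MediaType = MediaType.MEDIA_TYPE_UNKNOWN
    payload_size_bytes: int = 0        # the unit lives in the field name
    timestamp_ms_utc: int = 0
    retry_limit: Optional[int] = None  # None = not set, distinct from an explicit 0


def validate(req, max_payload_size_bytes=10_000_000):
    """Return a list of rule violations; an empty list means the message is valid."""
    errors = []
    if req.media_type == MediaType.MEDIA_TYPE_UNKNOWN:
        errors.append("media_type must be set")
    if not 0 < req.payload_size_bytes <= max_payload_size_bytes:
        errors.append("payload_size_bytes out of range")
    if req.retry_limit is not None and req.retry_limit < 0:
        errors.append("retry_limit must be >= 0 when set")
    return errors
```

Constructing `UploadRequest()` succeeds even though it is invalid; nothing is checked until `validate` is called. That mirrors the point that generated validators must be invoked explicitly rather than running on parse.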

ML feature store (2026-01-06 Feature Store post)¶

concepts/feature-store — Lyft is the second canonical major-tech-co instance.
concepts/feature-freshness — batch / streaming / on-demand lanes map to different freshness tiers; "ultra-low-latency" cache + "near-real-time" streaming path.
concepts/write-through-cache — canonical example: ValKey over DynamoDB inside dsfeatures.
concepts/feature-discoverability — Amundsen as the Feature-Store discovery layer; DAGs tag metadata automatically.
concepts/training-serving-boundary — feature-store shape as a boundary-crossing discipline (unifies feature values across the training + serving fleets).
patterns/hybrid-batch-streaming-ingestion — second canonical instance of the pattern (after Dropbox Dash).
patterns/config-driven-dag-generation — canonical instance: SparkSQL + JSON config → auto-generated Airflow DAG with production-ready data-quality + Amundsen tagging baked in.
patterns/batch-plus-streaming-plus-ondemand-feature-serving — the "platform of platforms" three-lane serving shape.
patterns/wrapper-over-heterogeneous-stores-as-serving-layer — canonical instance: dsfeatures wraps DynamoDB + ValKey + OpenSearch behind one SDK.
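
Config-driven DAG generation can be sketched as a pure function from a JSON config to a DAG description. The config keys, task names, and task ordering here are assumptions made for illustration; the post states only that SparkSQL plus a JSON config yields an Airflow DAG with data-quality checks and Amundsen tagging built in.

```python
import json

REQUIRED_KEYS = {"feature_name", "sql", "schedule"}  # hypothetical config schema


def generate_feature_dag(config_json):
    """Turn a JSON feature config into a linear DAG description."""
    cfg = json.loads(config_json)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing keys: {sorted(missing)}")
    name = cfg["feature_name"]
    # Every generated DAG gets the same steps, so quality checks and
    # metadata tagging cannot be forgotten by an individual feature author.
    tasks = [
        f"{name}.run_sql",
        f"{name}.data_quality_check",
        f"{name}.write_offline",
        f"{name}.write_online",
        f"{name}.tag_metadata",
    ]
    edges = list(zip(tasks, tasks[1:]))  # each task depends on the previous one
    return {
        "dag_id": f"feature_{name}",
        "schedule": cfg["schedule"],
        "tasks": tasks,
        "edges": edges,
    }
```

A real generator would emit Airflow operators rather than a dict, but the shape is the same: the feature author supplies only SQL and a small config, and the platform supplies the rest of the pipeline.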

LyftLearn 2.0 (2025-11-18 LyftLearn-evolution post)¶

concepts/hybrid-ml-platform-architecture — compute-serving split (SageMaker for training, EKS/Kubernetes for serving).
concepts/zero-code-change-migration + patterns/zero-code-change-platform-migration
concepts/environmental-parity
concepts/container-entrypoint-compat-layer + patterns/cross-platform-base-image
patterns/runtime-fetched-credentials-and-config
patterns/warm-pool-zero-create-path
patterns/decoupled-compute-and-serving-stacks
patterns/model-registry-and-object-store-as-hybrid-glue
concepts/cross-cluster-networking
concepts/lazy-container-image-loading (systems/amazon-soci)

Recent articles¶

2026-06-10 — sources/2026-06-10-lyft-metric-semantic-layer (Rohit Channe & Simran Mirchandani, Lyft Engineering — Lyft's internal Metric Semantic Layer (MSL): a centralized, versioned Python package serving as the single source of truth for every "Golden Metric" definition. YAML configs with Jinja-templated SQL, exposed via Python API, Amundsen integration, self-service UI, and MCP server for AI agents. Governance via dual-owner model (Business Owner + Operational Owner, always teams) with mandatory dual approval for changes. Only metrics with ≥2 use cases qualify. Fifth Lyft source on the wiki and the first focused on metric governance architecture.)
2026-04-23 — sources/2026-04-23-lyft-smarter-pickup-experience-for-gated-communities (Lyft Mapping team — an end-to-end rebuild of the pickup flow for gated communities, which make up 25–30% of Lyft rides in selected markets. Four-piece architecture: (1) gate-area shape generation from OSM + driver feedback; (2) dual inside/outside-gate pickup-spot selection UI with outside-gate spots sourced from historical pickup heatmaps; (3) routing inserts the gate as an invisible intermediate waypoint that doubles as a UX timing anchor; (4) intercom-style numpad for gate-code sharing + scannable banner on driver approach, with gate codes treated as ephemeral sensitive data (never stored between trips, audience of one, screenshot-blocked). ~95% positive rider survey response post-launch; lower rider + driver cancellation rates; less walking, shorter waits, fewer course changes. Named as the first instance of a repeatable playbook for physical-world constraints — generalises to road closures, unsafe curbs, etc. Fourth Lyft source on the wiki and the first with a mapping/routing focus.)
2026-01-06 — sources/2026-01-06-lyft-feature-store-architecture-optimization-and-evolution (Rohan Varshney, Lyft Engineering — Lyft's Feature Store as a "platform of platforms" with three ingest lanes (batch / streaming / direct-CRUD) converging on a unified online-serving layer dsfeatures that wraps DynamoDB + ValKey write-through LRU cache + OpenSearch embedding store behind two CRUD SDKs. Batch lane: SparkSQL + JSON config → auto-generated Astronomer-hosted Airflow DAGs with built-in data-quality checks and Amundsen metadata tagging. Streaming lane: customer Flink apps read Kafka/Kinesis → central spfeaturesingest Flink ingest app writes to dsfeatures. Uniform metadata + strongly consistent reads invariant across lanes. GSI-for-GDPR on DynamoDB. Second major-tech-co feature-store instance on the wiki after Dropbox Dash.)
2025-11-18 — sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture (Lyft ML Platform team — LyftLearn 2.0: compute-serving split moving training / batch / HPO / JupyterLab off Kubernetes LyftLearn onto SageMaker-based LyftLearn Compute, while real-time model serving stays on EKS LyftLearn Serving. Zero-ML-code-change migration as the hard constraint; achieved via cross-platform Docker base image compat layer replicating the Kubernetes environment on SageMaker.)
2024-09-16 — sources/2024-09-16-lyft-protocol-buffer-design-principles-and-practices (Roman Kotenko, Lyft Media — distilled two principles + five practices for proto3 protobuf design: clarity + extensibility; reserve 0 as UNKNOWN; prefer oneof over enum-plus-field; name fields with their unit; use optional label / wrapper types for presence; declare validation inline with protoc-gen-validate; cross-entity constants via custom EnumValueOptions extensions. Validators must be invoked manually — they don't run on parse. First Lyft source on the wiki.)