
HIGHSCALABILITY 2024-03-14 Tier 1


High Scalability — A Brief History of Scaling Uber

Summary

Repost of a LinkedIn article by Josh Clemm, Senior Director of Engineering at Uber Eats, chronicling ~15 years of Uber's backend-architecture evolution from a 2009 LAMP prototype through a 2021+ NewSQL fulfillment rewrite and ongoing cloud migration. The post is a tour rather than a deep-dive — it's a narrative index of the systems Uber built, open-sourced, and ultimately migrated off, which makes it uniquely useful as an anchor for the wiki's coverage of Uber technology (TChannel, Hyperbahn, Jaeger, M3, Schemaless, Ringpop, H3, RIBs, Cadence, Michelangelo, Hudi). Today's claims: 70+ countries, 10,500 cities, 130M+ customers, billions of DB transactions, millions of concurrent users, dozens of apps, thousands of backend services, 250,000+ on-prem servers, 4,000+ stateless microservices made cloud-portable via Project Crane.

Key takeaways

  1. Original (2009) stack was LAMP with Spanish-language source code — built by contractors, launched in SF. Suffered concurrency bugs at ride-dispatch time (two cars dispatched to one rider; one driver matched to two riders), forcing the 2011 rewrite (Source: this article).
  2. The 2011 rewrite settled on a two-monolith architecture: dispatch (Node.js + MongoDB, later Redis) for real-time driver/rider state, and API (Python + PostgreSQL) for business logic (auth, promos, fare calc). A layer called "Object Node" (ON) sat between dispatch and API to shield dispatch from API outages — an early resilience boundary. Uber was "one of the first major adopters of Node.js in production" (Source: this article, citing OpenJS's Nodejs-at-Uber paper).
  3. 2013: Python API split into ~100 microservices by 2014 (eventually thousands by 2018). To counter the operational complexity of microservices, Uber built an in-house platform: Clay (Python/Flask wrapper for standardized services), TChannel over Hyperbahn (in-house bidirectional RPC protocol and routing mesh, built for performance under Node/Python), Apache Thrift (RPC contracts), Flipr (feature flags / config), M3 (metrics, visualized in Grafana), Nagios (alerting), Merckx → Jaeger (distributed tracing, Zipkin-inspired; systems/jaeger still in use today). Later migrated to gRPC + Protobuf; languages converged to Go and Java (Source: this article).
  4. 2014 Halloween crisis: single PostgreSQL was the 2014 scaling bottleneck. Trip data dominated storage and grew fastest. Uber built Schemaless — an append-only sparse 3D persistent hash map built on top of MySQL, Bigtable-shaped, horizontally sharded — and migrated off Postgres to MySQL in time for Halloween peak. Cassandra adopted as general-purpose store (incl. marketplace matching/dispatch) (Source: this article, citing Uber engineering posts + Rene Schmidt's Schemaless talk).
  5. 2014: Dispatch split into RTAPI + a new dispatch service. RTAPI (still Node.js) became the real-time mobile gateway — a single codebase deployed as multiple deployment groups. The new dispatch service added vehicle/need matching as a traveling-salesman-problem variant over a geospatial index — originally Google S2 cells as the shard key, later H3 (systems/h3-geo, open-sourced). Because dispatch was stateful and on Node.js, Uber built systems/ringpop (gossip-protocol-based sharding + membership) to scale it horizontally (Source: this article).
  6. 2016+: RIB architecture for mobile. RIBs (Router, Interactor, Builder) are the mobile equivalent of microservices — single-responsibility, core-vs-optional separation, team ownership. First rewrite: the main rider app; now all Uber apps (Driver, Eats, Freight) are on RIBs (Source: this article).
  7. 2017: Uber Eats (launched in Toronto in 2015, reusing existing Uber tech) begins scaling in earnest. Eats today supports 130M+ users, 45+ countries, and flexible marketplace shapes — 0–N consumers, N merchants, 0–N driver-partners (guest checkout, group orders, multi-restaurant orders, restaurants delivering with their own fleet). Early Eats ops teams did things that don't scale (scripts to tune store radius per driver supply) until the platform caught up (Source: this article, citing Jason Droege's Building Uber Eats podcast).
  8. 2018: Project Ark — the platform-consolidation pendulum swing. By 2018, Uber had thousands of microservices, 12,000 code repos, 5–6 systems for incentives alone (per former CTO Thuan Pham), many messaging queue / DB / language choices. Ark elevated Java + Go as official backend languages, deprecated Python and JavaScript for backend, consolidated to a handful of monorepos (Java, Go, iOS, Android, web), and introduced architectural layers (client / presentation / product / domain) plus service domains as microservice groupings (Source: this article, citing Uber's microservice-architecture blog post).
  9. 2020: Edge Gateway replaced RTAPI with four formal tiers: Edge Layer (API lifecycle only), Presentation Layer (view-generation + aggregation), Product Layer (reusable product APIs), Domain Layer (single-functionality leaf services). Clean-up of RTAPI's accumulated ad-hoc view/business logic (Source: this article, citing Uber's gatewayuberapi post).
  10. 2021: Fulfillment Platform rewrite onto NewSQL / Google Cloud Spanner. The old dispatch/fulfillment stack couldn't cleanly model reservations, virtual airport queues, batching (multiple trips offered simultaneously), Uber-Direct packages, or the three-sided Eats marketplace. Requirements: transactional consistency + horizontal scalability + low operational overhead → Spanner. Quote from lead engineer Ankit Srivastava: "Prior to integrating Spanner, our data management framework demanded a lot of oversight and operational effort, escalating both complexity and expenditure." (Source: this article, citing Uber's building-ubers-fulfillment-platform post + GCP price-performance announcement).
  11. Supporting infrastructure named in passing: Cadence (fault-tolerant long-running workflows), Apache Hudi (open-sourced at Uber, powers business-critical data pipelines at low latency/high efficiency), Presto on Kafka, Spark at scale, Michelangelo (ML platform), Gairos (real-time data intelligence), various Redis caches, server-performance work (one cited: "saved 70k cores across 30 mission-critical services") (Source: this article).
  12. Next phase: on-prem → cloud. Uber ran >250,000 on-prem servers from the earliest days across multiple on-prem DCs. Tooling and fleet-sizing struggled to keep up; geographic expansion was painful. Project Crane made 4,000+ stateless microservices portable across on-prem and cloud. Uber plans to migrate "a larger portion of our online and offline server fleet to the Cloud over the next few years" (Google + Oracle, per the Forbes 2023 piece cited) (Source: this article).
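The Schemaless data model in takeaway 4 can be sketched in a few lines. This is a toy illustration of the described shape only — an append-only sparse 3D map of (row key, column, version) cells routed to shards by row-key hash — not Uber's implementation; all names and the cell payloads are invented:

```python
import hashlib
from collections import defaultdict

N_SHARDS = 4  # illustrative; production Schemaless spreads rows across many MySQL shards

def shard_for(row_key: str) -> int:
    """Hash a row key to a shard index (stand-in for Schemaless's shard routing)."""
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % N_SHARDS

class SchemalessSketch:
    """Append-only sparse 3D map: (row key, column, version) -> cell body."""
    def __init__(self):
        # each dict stands in for one sharded MySQL table
        self.shards = [defaultdict(dict) for _ in range(N_SHARDS)]

    def put(self, row_key: str, column: str, body: dict) -> int:
        cells = self.shards[shard_for(row_key)][(row_key, column)]
        version = len(cells) + 1  # cells are immutable: new versions appended, never updated
        cells[version] = body
        return version

    def get_latest(self, row_key: str, column: str):
        cells = self.shards[shard_for(row_key)][(row_key, column)]
        return cells[max(cells)] if cells else None

store = SchemalessSketch()
store.put("trip:123", "BASE", {"status": "requested"})
store.put("trip:123", "BASE", {"status": "completed"})
print(store.get_latest("trip:123", "BASE"))  # {'status': 'completed'}
```

The append-only discipline is what made the store safe for trip data: writers can only add versions, so readers never observe a half-updated row.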
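Ringpop's sharding idea (takeaway 5) rests on consistent hashing: every node can locally compute which peer owns a key and forward the request there, and membership changes move only a fraction of keys. A toy ring under those assumptions — real Ringpop pairs the ring with a SWIM-style gossip protocol so nodes converge on the same membership:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Stable hash onto the ring's key space."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring; node and key names are illustrative."""
    def __init__(self, nodes, vnodes: int = 64):
        # several virtual points per node smooth out the key distribution
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def owner(self, key: str) -> str:
        # the first ring point clockwise of the key's hash owns the key
        idx = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["dispatch-1", "dispatch-2", "dispatch-3"])
# any node computes the same owner and forwards the request to it
print(ring.owner("driver:42"))
```

When a node leaves, only keys it owned are reassigned — the property that let stateful Node.js dispatch workers scale out without reshuffling all driver/rider state.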
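The RIB triad from takeaway 6 can be shown structurally. RIBs are written in Swift/Kotlin at Uber; this Python sketch only illustrates the described division of labor (Router owns the tree, Interactor owns business logic, Builder assembles one RIB) and every class body here is invented:

```python
class Interactor:
    """Business logic for a single-responsibility RIB."""
    def __init__(self):
        self.active = False

    def activate(self):
        self.active = True

class Router:
    """Owns child RIBs: app state is the tree of attached routers,
    not a stack of screens."""
    def __init__(self, interactor: Interactor):
        self.interactor = interactor
        self.children = []

    def attach_child(self, child: "Router"):
        self.children.append(child)
        child.interactor.activate()

class Builder:
    """Assembles one RIB (router + interactor) with its dependencies."""
    @staticmethod
    def build() -> Router:
        return Router(Interactor())

# core-vs-optional separation: the root attaches only the RIBs a flow needs,
# and a team owns its RIB subtree end to end
root = Builder.build()
root.attach_child(Builder.build())  # e.g. a hypothetical "request ride" RIB
```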
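The four Edge Gateway tiers in takeaway 9 can be sketched as a chain of handlers. The tier names and responsibilities follow the post; the request shape and handler bodies are invented for illustration:

```python
def domain_layer(user_id: str) -> dict:
    """Domain tier: leaf service exposing a single functionality."""
    return {"user_id": user_id, "trips": 12}

def product_layer(user_id: str) -> dict:
    """Product tier: reusable product API composed from domain services."""
    return {"profile": domain_layer(user_id)}

def presentation_layer(user_id: str) -> dict:
    """Presentation tier: view generation + aggregation for one screen."""
    profile = product_layer(user_id)["profile"]
    return {"title": f"{profile['trips']} trips"}

def edge_layer(request: dict) -> dict:
    """Edge tier: API lifecycle only (auth, routing, limits) -- no view or business logic."""
    if not request.get("token"):
        raise PermissionError("unauthenticated")
    return presentation_layer(request["user_id"])

print(edge_layer({"token": "t", "user_id": "u1"}))  # {'title': '12 trips'}
```

The point of the split was exactly this layering: RTAPI had accumulated view and business logic at the edge, and the rewrite pushed each concern down to its proper tier.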

Operational numbers

What | Number | Era | Cited
--- | --- | --- | ---
Cities (mobility) | 10,500 | today | 2022 Investor Day deck
Countries (mobility) | 70+ | today | 2022 Investor Day deck
Customers | 130M+ | today | 2022 Investor Day deck
Countries (Eats) | 45+ | today | 2022 Investor Day deck
Cities | 600 | 2018 | Project Ark framing
Microservices | ~100 | 2014 | SOA migration
Microservices | "thousands" | 2018 | Project Ark framing
Code repositories | 12,000 | pre-Ark (2018) | Thuan Pham era
Incentive systems | 5–6 | pre-Ark | former CTO quote
On-prem servers | 250,000+ | pre-Crane | Project Crane framing
Stateless microservices made cloud-portable | 4,000+ | Project Crane | InfoQ 2023-10 piece
Cores saved via memory/system tuning | 70,000 across 30 services | | Uber blog citation
Uber Eats users | 130M+ | today | Q3 2020 earnings + this piece

Architectural diagrams / images referenced

The post includes fourteen architecture diagrams and screenshots in the High Scalability rendering:

  1. Original LAMP stack (2009).
  2. Two-monolith architecture (dispatch Node+MongoDB/Redis ↔ ON ↔ API Python+Postgres).
  3. API monolith → microservices (2013–14).
  4. Single Postgres DB data mix (2014) — trip data dominates.
  5. Schemaless migration "operational room" photo.
  6. Splitting monolithic dispatch into real-time API gateway (RTAPI) + actual dispatch (2014).
  7. RTAPI multi-deployment-group layout.
  8. Dispatch stack overview (incl. S2/H3 geospatial index, Ringpop gossip).
  9. RIB architecture diagram (Router-Interactor-Builder).
  10. Early Uber Eats architecture reusing original Uber systems.
  11. Uber Eats Canada "radius-tuning" ops script screenshot.
  12. Fulfillment platform use cases and rewrite diagram.
  13. Modern Edge Gateway four-layer stack (Edge / Presentation / Product / Domain).
  14. Uber-GCP network infrastructure (for Spanner).

Caveats

  • Narrative, not internals. This is a retrospective overview by a senior leader; it names many systems but describes few of them in depth. Almost every bullet above links to a dedicated Uber engineering post — the wiki should treat those as the authoritative sources for each system, and the current post as the map / anchor index.
  • Two different 2014 migrations get conflated in shorter retellings: (a) Postgres → MySQL for trip data via Schemaless, and (b) Cassandra replacing Postgres for marketplace-matching. They happened simultaneously but are distinct migrations to distinct stores.
  • Schemaless is on MySQL, not a replacement for MySQL. It's an application-level sparse 3D hash-map indexed by row-key, column, and version, stored in sharded MySQL tables.
  • Ringpop is Node.js-era infrastructure. Later Go services use different sharding mechanisms; Ringpop is mentioned in archival/historical context in most current Uber posts.
  • Project Ark framing is post-hoc. The article presents Ark as a single program; in practice standardization spanned years and many sub-efforts. The Stanford GSB case study cited ("Uber: Repaying Technical and Organizational Debt") is the authoritative reference.
  • Uber operates hybrid on-prem + cloud today, not pure cloud — the Forbes piece cited explicitly covers the debate around cloud vs on-prem.
  • No depth on Eats marketplace matching, money stack, geospatial platform, or Michelangelo internals — Josh Clemm explicitly flags Eats scaling as "probably deserves its own scaling story" for future work.
