High Scalability: Brief History of Scaling Uber
Summary
Retrospective by Josh Clemm (Senior Director of Engineering, Uber Eats; originally posted to LinkedIn, republished on High Scalability, 2024-03-14) walking the full architectural timeline of Uber from the 2009 LAMP stack prototype to the 2023+ cloud migration. Canonical tier-1 primary-lineage source for the wiki's Uber entity pages: systems/tchannel, systems/ringpop, Schemaless, M3, Jaeger, H3, RIBs, Cadence, Hudi, Michelangelo, Fulfillment Platform, Project Crane. Walks ten named architectural eras: 2009 LAMP, 2011 two-monolith dispatch/API split, 2013 SOA microservice explosion, 2014 Halloween Schemaless-on-MySQL rewrite, 2014 dispatch split into RTAPI gateway + geospatial-sharded dispatch service, 2016+ RIB mobile architecture, 2018 Project Ark language/repo consolidation, 2020 Edge Gateway four-layer stack, 2021+ Fulfillment Platform on Google Cloud Spanner, 2023+ Project Crane cloud-portability. The canonical real-world instance on this wiki of the monolith-to-microservices-and-back arc and the cautionary distributed monolith failure mode (thousands of microservices, 12,000 code repos, and, in former CTO Thuan Pham's words, "five or six systems that do incentives that are 75 percent similar to one another").
Key takeaways
- 2024 Uber scale (framing). Largest mobility platform in the world: 70+ countries, 10,500 cities, 130M+ customers; Uber Eats is the largest food-delivery platform ex-China at 45+ countries. Runtime footprint: "billions of database transactions and millions of concurrent users across dozens of apps and thousands of backend services" on ~250,000 servers before the cloud migration. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2009 LAMP era, code written in Spanish. First version built by contractors on a classic LAMP stack; source code literally in Spanish. Couldn't scale — hit concurrency bugs like double-dispatch (two cars sent to one rider) and double-match (one driver matched to two riders). The prototype validated the product, not the architecture. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2011 two-monolith architecture — dispatch + API. Re-architected into dispatch (Node.js + MongoDB, later Redis) for real-time driver-location + matching, and API (Python/Tornado + PostgreSQL) for business logic (authentication, promotions, fare calculation). An Object Node ("ON") resilience layer sat between them to insulate dispatch from API failures. Uber was "one of the first major adopters of Node.js in production" — the non-blocking, single-threaded event loop + V8 performance made Node the right tool for real-time data-heavy dispatch. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2013 SOA: ~100 microservices by 2014. API monolith split along domain lines (rider billing, driver payouts, fraud detection, analytics, city management). To prevent the split from producing a distributed monolith, Uber built a supporting platform in-house: Clay (a Flask wrapper for consistent monitoring/logging/deployments) for service-framework standardization, TChannel over Hyperbahn as a bi-directional RPC protocol with fault-tolerance/rate-limiting/circuit-breaking built in, Apache Thrift for strong RPC interface contracts, Flipr for feature flags/rollout control, M3 for metrics observability (with Nagios for alerting), and Merckx → Jaeger for distributed tracing. Later migrated to gRPC + Protobuf and adopted Golang/Java. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2014 Halloween rewrite: the single-PostgreSQL bottleneck. By early 2014 Uber's single PostgreSQL instance was an existential risk: performance, scalability, and availability were all capped by "how much memory and CPUs you could throw at it," and adding tables/indices/rows for new features was "problematic." Six months before Halloween (Uber's biggest traffic night of the year), the majority of the DB was trip data, which was also the fastest-growing portion. Uber built Schemaless, an append-only sparse three-dimensional persistent hash map similar to Google's Bigtable, built on top of MySQL, with rows partitioned horizontally across shards. All services that touched trip data were migrated to Schemaless in time for the Halloween peak. Other workloads moved to Cassandra (notably the marketplace matching/dispatching DB). (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2014 dispatch split — RTAPI gateway + dispatch rewrite. The original dispatch monolith was doing two unrelated jobs: matching logic and proxying all mobile-app traffic to other microservices. Split surgically: RTAPI (Real-Time API) became a new Node.js gateway layer for mobile request routing — a single repository with multiple specialized deployment groups; the first Uber Eats was "completely developed within the gateway" before maturing out. Dispatch proper was rewritten as a series of services to handle vehicle-type variety and rider-need variety, including advanced matching as a traveling-salesman-variant optimization over current + near-future available drivers. A geospatial index was built using Google's S2 library with S2 cell IDs as the shard key — Uber later open-sourced H3 as the hexagonal successor. Because these services remained Node.js and stateful, Uber built Ringpop, a gossip-protocol-based library to share geospatial + supply positioning across dispatch nodes. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2016+ RIB mobile architecture. Mobile apps hit the same monolith bottleneck as backend: many features, many engineers, single releasable code base. Uber introduced the RIB (Router-Interactor-Builder) architecture starting with the main rider app rewrite. Like microservices, each RIB has clear separation of responsibilities, is ownable by separate teams, and can be feature-flagged independently. Core vs. optional RIBs have different review stringency — the core flows get stricter review for app-stability protection. RIBs scaled the mobile codebase to "hundreds of engineers" and have been adopted across the Driver app, Uber Eats apps, and Uber Freight. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- Uber Eats: a three-sided marketplace reuse story (2017). Launched in Toronto in late 2015, with rapid growth that mirrored UberX. Reused as much of Uber's tech stack as possible, adding only the e-commerce-specific services (carts, menus, search, browsing). Operations teams famously "did things that didn't scale" (manually tuning each city's marketplace, running scripts to toggle stores active/inactive, tuning delivery radius per driver-partner availability) until the appropriate tech was built. Today (2024): 130M+ users, dozens of countries, and ordering modes spanning 0-N consumers × N merchants × 0-N driver-partners (guest checkout, group ordering, multi-restaurant, merchant-supplied delivery fleet). (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2018 Project Ark: language + repo consolidation. After years of Uber's "Let Builders Build" cultural value and decentralized city-launch engineering, the org had thousands of microservices, 12,000 code repos, multiple solutions to common problems, and multiple programming languages for backend services. Former CTO Thuan Pham's canonical quote: "You've got five or six systems that do incentives that are 75 percent similar to one another." Developer productivity was hurting. Project Ark responded with: Java + Go as official backend languages (type-safety + performance), Python/JavaScript deprecated for backend, 12,000 repos consolidated to monorepos per language (Java, Go, iOS, Android, web), standardized architectural layers (client, presentation, product, business logic), and a new service-"domain" abstraction to group related services. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2020 Edge Gateway: four-layer stack. RTAPI was showing its age (still Node.js + deprecated JavaScript, with ad-hoc view generation and business logic intermixed). Replaced with a clean four-layer architecture: Edge Layer (pure API lifecycle management, no business logic) → Presentation Layer (microservices for view generation + data aggregation) → Product Layer (reusable functional APIs per product team) → Domain Layer (leaf-node microservices providing refined single-function capabilities). This restructured the work of thousands of engineers without slowing product velocity. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2021+ Fulfillment Platform on Google Cloud Spanner. By 2021 Uber's dispatch/fulfillment stack was showing its age: the original monolithic matching-of-available-drivers design couldn't naturally support reservations (driver confirmed upfront), batching (multiple offered trips simultaneously), virtual airport queues, three-sided Uber Eats marketplace, or Uber Direct package delivery. Uber rewrote the Fulfillment Platform ground-up on a NewSQL architecture using Google Cloud Spanner — the first major Uber public bet on a cloud-native primary storage engine. Lead engineer Ankit Srivastava's framing: "as we scale and expand our global footprint, Spanner's scalability & low operational cost is invaluable. Prior to integrating Spanner, our data management framework demanded a lot of oversight and operational effort, escalating both complexity and expenditure." (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- Broader systems-lineage retrospective. Uber built and open-sourced many of the wiki's canonical primitives: H3 geospatial index (from S2 successor work), Cadence long-running workflow engine (predecessor to Temporal), Apache Hudi (low-latency CDC data pipelines), Michelangelo (ML platform), Presto-on-Kafka and scaled Spark for data infra, various Redis cache tiers. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
- 2023+ Project Crane — cloud migration. With 250,000+ servers across multiple on-prem data centers, Uber's tooling/teams couldn't keep up with fleet growth and geographic expansion. Uber spent years making ~4000 stateless microservices portable (per InfoQ 2023-10 summary) and built Project Crane as the next-gen infrastructure stack that works equally well on-prem or cloud. Plan: migrate a large portion of online + offline server fleet to cloud (Google + Oracle selected) over the next few years. (Source: sources/2024-03-14-highscalability-brief-history-of-scaling-uber)
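The Schemaless description in the 2014 Halloween takeaway (an append-only, sparse, three-dimensional hash map with rows partitioned across shards) can be illustrated with a minimal in-memory sketch. This is not Uber's implementation: the dimension names `row_key`, `column`, and `ref_key`, the class `SchemalessSketch`, the SHA-256 shard function, and the use of Python dicts in place of MySQL shards are all assumptions made for illustration.

```python
import hashlib


class SchemalessSketch:
    """Toy append-only cell store; each shard is a dict standing in for MySQL."""

    def __init__(self, num_shards=4):
        self.num_shards = num_shards
        # One dict per shard: (row_key, column) -> {ref_key: body}
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, row_key):
        # Hypothetical shard function: stable hash of the row key.
        digest = hashlib.sha256(row_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_shards

    def put_cell(self, row_key, column, ref_key, body):
        # Append-only: an existing (row_key, column, ref_key) is never overwritten.
        shard = self.shards[self._shard_for(row_key)]
        versions = shard.setdefault((row_key, column), {})
        if ref_key in versions:
            raise ValueError("cells are immutable; write a higher ref_key instead")
        versions[ref_key] = body

    def get_cell(self, row_key, column, ref_key):
        shard = self.shards[self._shard_for(row_key)]
        return shard.get((row_key, column), {}).get(ref_key)

    def get_latest(self, row_key, column):
        # Sparse: a row only has the columns that were actually written.
        shard = self.shards[self._shard_for(row_key)]
        versions = shard.get((row_key, column))
        if not versions:
            return None
        latest = max(versions)
        return latest, versions[latest]
```

In this sketch an update to a trip is a new cell with a higher `ref_key` rather than an in-place write, for example `store.put_cell("trip-1", "BASE", 2, {"status": "completed"})` after an earlier version 1, and `store.get_latest("trip-1", "BASE")` returns the newest version. All cells for one row land on the same shard because only `row_key` feeds the shard function.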
Operational numbers
- 2024 scale: 70+ countries, 10,500 cities, 130M+ customers, 45+ Uber Eats countries.
- ~250,000 servers across on-prem DCs before cloud migration.
- Billions of database transactions / millions of concurrent users.
- 2014 turning point: 6 months from Halloween → need a new trip-data store. Schemaless-on-MySQL shipped + migrated in time.
- 2014 microservice count: ~100 backend services.
- 2018 microservice count: thousands; 12,000 code repos; "5 or 6 systems that do incentives that are 75% similar".
- 2019+ cloud prep: 4000+ stateless microservices made cloud-portable.
Architectural eras timeline
| Era | Architecture | Primary data store |
|---|---|---|
| 2009 LAMP | Single PHP/MySQL app, code in Spanish | MySQL |
| 2011 two-monolith | dispatch (Node.js) ↔ ON ↔ API (Python) | MongoDB → Redis / PostgreSQL |
| 2013 SOA | API split into ~100 Python/Tornado microservices | PostgreSQL |
| 2014 Halloween | Schemaless on MySQL; Cassandra for dispatch | MySQL (Schemaless) + Cassandra |
| 2014 dispatch split | RTAPI Node.js gateway + new dispatch | + Ringpop gossip, S2 shard key |
| 2016+ RIB | RIBs across rider/driver/Eats/Freight | — |
| 2018 Project Ark | Java + Go official; Python/JS deprecated for backend; monorepos | — |
| 2020 Edge Gateway | Four-layer Edge/Presentation/Product/Domain stack | — |
| 2021+ Fulfillment rewrite | NewSQL rewrite on Spanner | Spanner |
| 2023+ Project Crane | 4000+ stateless microservices made cloud-portable | Google + Oracle cloud |
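The 2014 dispatch split row pairs an S2 cell ID shard key with Ringpop for spreading state across dispatch nodes. A common way to map such keys onto a changing set of nodes is a consistent hash ring, sketched below. This is a generic illustration and not Ringpop's code: the `HashRing` class, the `dispatch-*` node names, the integer cell IDs, and the virtual-node count are hypothetical, no real S2 library call is made, and the gossip-based membership that Ringpop layers on top is omitted.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Map a string to a 64-bit integer that is stable across processes."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring mapping geospatial cell IDs to dispatch nodes."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._points = []  # sorted list of (hash, node) ring positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node gets several ring positions to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._points, (stable_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def owner(self, cell_id):
        if not self._points:
            raise LookupError("ring has no nodes")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_left(self._points, (stable_hash(str(cell_id)),))
        return self._points[idx % len(self._points)][1]
```

With `ring = HashRing(["dispatch-a", "dispatch-b", "dispatch-c"])`, `ring.owner(cell_id)` picks the node responsible for a cell. The property that matters for a stateful dispatch tier is that removing one node only reassigns the cells that node owned; cells owned by the surviving nodes stay where they were.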
Caveats
- Secondary telling. The original post is a LinkedIn article by Josh Clemm (Sr. Director of Engineering, Uber Eats), republished on High Scalability with the author's permission. The retrospective is framed from a personal vantage point and is not a formal Uber engineering blog post, so it lacks the caveats a primary source would carry. Individual claims can be cross-checked against Uber's own blog posts, most of which Josh links (and which this source page cites as secondary landing points).
- No formal architecture diagrams from Uber internal sources. The post uses simplified block diagrams from Josh's LinkedIn deck — internal service-graph shapes are not disclosed.
- Cloud migration status as of the post (2024-03) is aspirational. "We now have plans to migrate a larger portion of our online and offline server fleet to the Cloud over the next few years." Treat as direction of travel, not completed state.
Source
- Original: https://highscalability.com/brief-history-of-scaling-uber/
- Raw markdown: raw/highscalability/2024-03-14-brief-history-of-scaling-uber-a9268d37.md
Related
- companies/uber — Uber entity page, now with this ingest as its foundational source.
- companies/highscalability — aggregator of this article.
- concepts/monolith-vs-microservices-pendulum — the canonical arc; Uber's 2011→2018→2020 trajectory is the textbook instance.
- concepts/distributed-monolith — 2018 Project Ark's "5 or 6 systems doing 75% the same thing" is the canonical symptom description.
- patterns/two-monolith-architecture — 2011 dispatch+API split.
- patterns/geospatial-gossip-sharding — 2014 dispatch rewrite shape.
- patterns/layered-gateway-architecture — 2020 Edge Gateway.