CONCEPT Cited by 1 source
Microservices migration¶
A microservices migration is the organisational and architectural transition from a monolithic codebase + shared database deployment model to many independently-deployed services owned by small teams. The migration is rarely a purely technical exercise — it cascades into team structure, on-call, release cadence, and observability requirements.
The cascading costs¶
Zalando's 2016 retrospective names the second-order costs directly. The move to the cloud and microservices meant:
- Teams become end-to-end responsible — design, develop, test, deploy, and operate. The majority of engineers were not experienced in the "operate" half. (Source: )
- Service count grows faster than team headcount. The existing on-call rotation (5 teams for the monolith fleet) could not scale.
- Low standardisation across services — monitoring, alerting, deployment, and runtime conventions diverge per team. Responders now cover services they've never touched.
- Reliability gaps materialise — timeout defaults, circuit-breaker policy, retry strategy, overload handling all become per-team decisions made (often badly) under time pressure.
Why it drives SRE adoption¶
These second-order costs are what make microservices migrations a typical driver of SRE organizational evolution. Pre-migration, reliability can live implicitly in the monolith's shape + a small ops team. Post-migration, reliability is distributed across N teams and needs explicit investment — either as embedded SRE, shared primitives, or a dedicated department.
Seen in¶
- — 2016 Zalando migration as backdrop for the first SRE attempt.
- sources/2024-03-14-highscalability-brief-history-of-scaling-uber — the canonical large-scale migration case study on the wiki. Uber's 2013 API monolith split grew to ~100 microservices by 2014 and thousands by 2018. Platform primitives built to support the split (TChannel, Clay, M3, Jaeger) were the explicit counter-measure to the distributed-monolith drift that still materialised by 2018 — forcing Project Ark language consolidation.
- — Zalando Returns team extraction as the canonical bounded- context-from-monolith case on the wiki. Names the migration constraints that rule out test-driven-alone approaches (untested code, "dark corners", zero-downtime requirement) and canonicalises parallel run as the behavioural-equivalence discipline for such cases. Per-endpoint readiness threshold (concepts/consistency-threshold-per-endpoint) + Skipper proxy rule swap (patterns/gradual-per-endpoint-cutover) + async comparison sidecar (patterns/async-consistency-checker-sidecar) combine into the migration mechanic; cleanup removes ~700 LOC + ~1.3k LOC of tests post-cutover.
Related¶
- concepts/you-build-it-you-run-it
- concepts/on-call-rotation
- concepts/sre-organizational-evolution
- concepts/distributed-monolith — the failure mode microservices migrations risk producing.
- concepts/service-oriented-architecture — the pattern the migration implements.
- patterns/language-consolidation-for-fleet-standardization — the corrective that Uber applied when the migration had over-sprawled.
- companies/zalando