
ZALANDO 2021-11-03 Tier 2


Zalando — Parallel Run Pattern: A Migration Technique in Microservices Architecture

Summary

Zalando's Returns team walks through extracting returns logic from a legacy monolithic application into a new Returns microservice using the Parallel Run pattern (Sam Newman, Monolith to Microservices). The pattern: for every incoming request, call both the old and new implementations, compare their outputs, and keep the old system authoritative until verified consistency makes the new system trustworthy. Zalando's implementation keeps the monolith on the hot path (responds to the client first) and fires an async POST to a /consistency-checks endpoint on the new service; the new service then re-issues the request against its own real endpoint and compares its response against the monolith's response along three axes (HTTP status, headers, body). Metrics (Matched / Unmatched / Failed) are emitted to systems/prometheus and displayed in systems/grafana per operation_id, with each endpoint getting its own consistency threshold. Cutover happens one endpoint at a time via Skipper proxy rule changes — no redeploys, so rollback is a proxy-config revert. Cleanup removed ~700 lines of code + ~1.3k lines of tests. The post enumerates six advantages (live-data testing, gradual rollout, incremental dev, easy rollback, bug-finding in the legacy, load characterisation) and six concerns (request doubling, comparison refinements, GDPR, non-trivial comparisons, non-idempotent endpoints, not-a-quick-win).

Key takeaways

  • Parallel Run is the canonical migration pattern when tests alone cannot guarantee behavioural equivalence. Untested legacy code, dark-corner behaviour, and intolerance for downtime pushed the Returns team beyond what testing alone could cover. "In scenarios like decomposing a monolith or replacing a legacy component with a newer one, testing might not be enough. Furthermore, there are always dark corners in our systems that we have never tested or we don't know their behavior." (Source: sources/2021-11-03-zalando-parallel-run-pattern-a-migration-technique-in-microservices)

  • Only one system is the source of truth at any given time. Sam Newman quoted verbatim: "Despite calling both implementations, only one is considered the source of truth at any given time. Typically, the old implementation is considered the source of truth until the ongoing verification reveals that we can trust our new implementation." The monolith keeps authority; the microservice is the candidate.

  • Async /consistency-checks endpoint decouples comparison cost from client latency. After the monolith responds (steps 1–2), it POSTs to /consistency-checks on the Returns microservice, which immediately returns 202 Accepted and processes the comparison in the background. Client never waits; monolith resources free up.
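The described flow can be sketched in a few lines of Python. This is a hypothetical illustration, not Zalando's code: the names `handle_consistency_check` and `COMPARISON_QUEUE` are invented, and in practice the handler would sit behind an HTTP framework with `comparison_worker` running on a daemon thread.

```python
import queue

# Sketch of the async /consistency-checks flow: acknowledge immediately
# with 202 Accepted and defer the actual comparison, so the client path
# is never slowed by the parallel run.

COMPARISON_QUEUE = queue.Queue()

def handle_consistency_check(payload: dict) -> tuple[int, str]:
    """Accept the monolith's request/response pair and enqueue the diff.

    The client has already been answered by the monolith, so this
    endpoint only records work; a background worker does the rest.
    """
    COMPARISON_QUEUE.put(payload)
    return 202, "Accepted"

def comparison_worker(compare) -> None:
    """Background loop: re-issue each recorded request against the
    service's own endpoint and diff the two responses."""
    while True:
        payload = COMPARISON_QUEUE.get()
        if payload is None:   # sentinel: stop the worker
            break
        compare(payload)
        COMPARISON_QUEUE.task_done()
```

The key property is that `handle_consistency_check` does constant-time work before returning 202; all comparison cost is paid off the hot path.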

  • The comparison payload carries the full original request + the monolith's response. JSON schema: request.url.path with query params, request.headers, request.method, request.body — plus response.status, response.headers, response.body. The microservice re-issues against localhost and diffs the two responses along three axes.
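A payload following that schema might look like the following Python dict. Field names match those listed in the post; the nesting and all concrete values are invented for illustration.

```python
# Illustrative consistency-check payload: the full original request plus
# the monolith's authoritative response. Values are invented examples.
payload = {
    "request": {
        "url": {"path": "/returns?order_id=123"},   # path incl. query params
        "headers": {"Content-Type": "application/json"},
        "method": "GET",
        "body": None,
    },
    "response": {                                   # the monolith's answer
        "status": 200,
        "headers": {"Content-Type": "application/json"},
        "body": {"return_allowed": True},
    },
}
```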

  • Three-axis response diff — HTTP status, headers, body. Unmatched cases: "Different HttpStatuses: such as 2xx and 4xx or even 201 and 200. Different Headers set: a missing header in one of the two responses or different values for the same header. Different Body responses: missing fields/attributes in the responses or different values for the same field/attribute."
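A minimal sketch of the three-axis diff, assuming responses in the dict shape above. `diff_responses` is an invented name; real comparators (per the post's later caveats) would also ignore irrelevant headers and normalise bodies.

```python
def diff_responses(old: dict, new: dict) -> list[str]:
    """Compare two responses along the post's three axes: HTTP status,
    headers, body. Returns human-readable mismatch notes; an empty list
    means the responses Matched."""
    mismatches = []
    if old["status"] != new["status"]:
        mismatches.append(f"status: {old['status']} != {new['status']}")
    # A header missing on one side or differing in value is a mismatch.
    for name in set(old["headers"]) | set(new["headers"]):
        if old["headers"].get(name) != new["headers"].get(name):
            mismatches.append(f"header {name!r} differs")
    if old["body"] != new["body"]:
        mismatches.append("body differs")
    return mismatches
```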

  • Matched / Unmatched / Failed counter triad (emitted to Prometheus, displayed in Grafana per operation_id). Failed covers requests where either call returned a 5xx before a meaningful comparison could be made: "even if they matched it would not be a valuable information given that the request couldn't be properly fulfilled due to a transient server-side issue."

  • Per-endpoint consistency threshold — each endpoint gets its own target percentage. "Each endpoint, defined by an operation_id, had its own metric and its own tolerance. This was done because, as usual, fixing those last few percentages has a cost higher than the value it brings; given that each endpoint is completely separated from one another, each endpoint had its own target percentage to consider it consistent (enough)."
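The triad plus per-endpoint thresholds can be sketched as below. In the post these are Prometheus counters keyed by operation_id; a plain `Counter` stands in here, and the threshold values and function names are invented.

```python
from collections import Counter

counters = Counter()   # keys: (operation_id, outcome)

def record(operation_id: str, old_status: int, new_status: int,
           matched: bool) -> None:
    """Classify one parallel-run comparison into the Matched /
    Unmatched / Failed triad."""
    if old_status >= 500 or new_status >= 500:
        outcome = "Failed"      # transient server error: no signal either way
    elif matched:
        outcome = "Matched"
    else:
        outcome = "Unmatched"
    counters[(operation_id, outcome)] += 1

def consistent_enough(operation_id: str, threshold: float) -> bool:
    """Per-endpoint check: Matched / (Matched + Unmatched) >= threshold.
    Failed requests are excluded, since they carry no comparison value."""
    matched = counters[(operation_id, "Matched")]
    unmatched = counters[(operation_id, "Unmatched")]
    total = matched + unmatched
    return total > 0 and matched / total >= threshold
```

Because each endpoint keeps its own counters and its own threshold, one hard-to-fix endpoint never blocks cutting over the others.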

  • Rollout is proxy-driven + per-endpoint. Zalando fronts the new Returns service with Skipper (Zalando's open-source HTTP proxy) and cuts over one endpoint at a time. "By using a proxy to do the traffic switch, rolling back just requires a change to the proxy to migrate the endpoint back to use the previous host instead of the microservice one; this avoids the need of redeploying, making the whole process faster."
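As a rough sketch, a per-endpoint cutover in Skipper's eskip route syntax might look like the following. Route names, predicates, and backend hosts are invented; the post does not show its actual routes.

```
// Before cutover: the endpoint is still served by the monolith.
getReturn: Path("/returns/:id") -> "https://monolith.internal.example";

// After cutover: same predicate, backend swapped to the microservice.
// Rollback is reverting this one backend change, with no redeploy.
getReturn: Path("/returns/:id") -> "https://returns-service.internal.example";
```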

  • Cleanup scope is substantial. The parallel-run scaffolding (consistency-check handler, localhost gateway, domain model for consistency logic, feature toggle, tests) amounted to ~700 lines of production code plus ~1.3k lines of unit and component tests, all removed once migration completed. This is real investment, not free instrumentation.

  • Request doubling is the headline cost. Load across all components "increases, potentially doubling." Allocations for the new service must plan for full production traffic plus the async-comparison overhead. Zalando treats this as an expected operational expense, not a surprise.

  • Non-idempotent endpoints break the pattern unless carefully handled. Verbatim: "For example this approach can be used for POSTs that are idempotent but not when the idempotency of the endpoint cannot be guaranteed. When doing this investigation always consider idempotency of each operation and possible side effects (for example calling another POST api, updating a database, or publishing an event)."
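One way to honour that caveat is a gate that decides per operation whether a request may be shadowed at all. This sketch is not from the article; the allow-list contents and names are invented.

```python
# Gate before shadowing a request to the new service: only replay
# operations whose side effects are safe to repeat (no duplicate DB
# writes, double-published events, or downstream POSTs).
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}          # safe by HTTP semantics
IDEMPOTENT_OPERATIONS = {"create_return"}          # POSTs proven idempotent

def should_shadow(method: str, operation_id: str) -> bool:
    """True only when re-issuing the request cannot duplicate effects."""
    return method.upper() in SAFE_METHODS \
        or operation_id in IDEMPOTENT_OPERATIONS
```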

  • GDPR-sensitive comparisons require explicit handling. "While collecting the data for the comparison we need to keep into account that sensitive information should either not be stored or cleaned afterwards." Non-storage or post-comparison cleanup. Downside: investigating inconsistencies on fields containing personal data becomes difficult.

  • Some comparison axes are ignored intentionally. Not all headers are relevant to the outcome (e.g., Date, X-Request-Id, transfer-encoding). "In the comparison check not everything needs to match 100%." The comparator is a tuning surface, not a strict equality gate.

  • Non-trivial comparisons are common. Zalando names: comparing PDFs (metadata-sensitive), HTTP-framework change (different default response headers), collection ordering differences. These become per-endpoint comparator customisations.
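The tuning the two bullets above describe amounts to normalising both responses before diffing them. A sketch, assuming the dict response shape used earlier; the ignore-list entries echo the article's header examples, and the function name is invented.

```python
# Per-endpoint comparator tuning: drop headers that never affect the
# outcome and normalise collection ordering before comparing bodies.
IGNORED_HEADERS = {"date", "x-request-id", "transfer-encoding"}

def normalise(response: dict) -> dict:
    """Return a comparison-ready copy of a response: irrelevant headers
    removed, header names lower-cased, list bodies sorted so ordering
    differences between the two stacks do not count as mismatches."""
    headers = {k.lower(): v for k, v in response["headers"].items()
               if k.lower() not in IGNORED_HEADERS}
    body = response["body"]
    if isinstance(body, list):        # order-insensitive collections
        body = sorted(body, key=repr)
    return {"status": response["status"], "headers": headers, "body": body}
```

Feeding `normalise(old)` and `normalise(new)` into a strict diff turns the comparator into the tuning surface the post describes, rather than a 100%-equality gate.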

  • Six advantages canonicalised. (1) Live data for testing — real production use cases exercise the new service. (2) Gradual rollout — per-endpoint. (3) Incremental development — can develop per-endpoint to match the rollout granularity. (4) Easy rollback — proxy-rule revert, no redeploy. (5) Finding bugs in the legacy — real-data comparison sometimes reveals that the monolith was wrong. (6) Load testing — performance characteristics of the new stack under realistic traffic emerge naturally, helping set realistic SLOs before going live.

  • Verdict — not for every migration. Sam Newman quoted again: "Implementing a parallel run is rarely a trivial affair, and is typically reserved for those cases where the functionality being changed is considered to be high risk." Zalando's Returns team agrees: "The parallel run pattern is a powerful technique to overcome the complexities and stress of migration projects, but not every migration project is a match to use this pattern."

Systems named

  • systems/zalando-returns-monolith — the legacy monolithic application the Returns team was extracting from (unnamed internally; referred to in the post as "a soon-to-be legacy monolithic application").
  • systems/zalando-returns-service — the new Returns microservice, with three named architectural layers: use-cases layer (consistency-check handler), gateway layer (localhost gateway for re-issuing the request), entities layer (domain model for consistency logic).
  • Skipper — the Zalando-developed open-source HTTP proxy used to do the per-endpoint cutover.
  • systems/prometheus — emitted metrics (Matched, Unmatched, Failed counters per operation_id).
  • systems/grafana — dashboards per-endpoint displaying the match-rate percentages.


Operational numbers

  • Production code removed at cleanup: ~700 lines
  • Test code removed at cleanup: ~1,300 lines (unit + component)
  • Request volume multiplier during parallel run: ~2× ("potentially doubling")
  • Cutover primitive: one Skipper route rule per endpoint
  • Async ack code: HTTP 202 Accepted
  • Comparison axes: 3 (status, headers, body)
  • Metric triad: Matched / Unmatched / Failed per operation_id
  • Metrics backend: Prometheus
  • Dashboard backend: Grafana

No p99 latency figures, no match-percentage targets, and no absolute request-volume numbers are disclosed. Consistency-check processing time is not quoted.

Caveats

  • No absolute request volume given. The post doesn't disclose Returns-service RPS, so the "~2×" cost is qualitative.
  • No disclosed match-rate targets. Each endpoint's consistency threshold is acknowledged to differ, but no examples are given (e.g., "95% for endpoint A, 99.9% for endpoint B").
  • No timeline given. How long the parallel run ran before cutover, and how long cutover-then-cleanup took, are not quantified.
  • Comparator code not shown. The custom logic for ignoring irrelevant headers, handling PDF metadata, and normalising collection order is described at category level only.
  • GDPR handling is mentioned but not mechanised. "Should either not be stored or cleaned afterwards" — the concrete redaction pipeline isn't shown.
  • Non-idempotency mitigation is only named, not solved. The post flags the constraint but does not offer a pattern for handling non-idempotent endpoints under parallel run (e.g., shadow mode with side-effect stubs, request replay with idempotency keys, or endpoint-level opt-out).
  • Post is advisory retrospective, not a deep mechanism disclosure. Architecture density is ~70% — the rest is narrative framing + Sam Newman quotes + advantages/considerations lists. Tier-2 scope rationale: the pattern articulation + the /consistency-checks async mechanism + per-endpoint Skipper cutover story + concrete cleanup numbers are the wiki-load-bearing contributions.
