
PATTERN Cited by 1 source

Blue/green service-mesh migration

Blue/green service-mesh migration is the pattern for cutting traffic between two disjoint service-mesh environments that cannot share networking — typically because each mesh has its own service-discovery implementation, its own proxy data plane, and no cross-mesh protocol for routing one request between them. You stand up the new mesh ("green") as a parallel copy of the production application alongside the old mesh ("blue"), then shift traffic at the edge (DNS / CDN / ALB) from blue to green over time.

The canonical driver in this wiki is the 2025-01-18 AWS guide for migrating AWS App Mesh to Amazon ECS Service Connect. The AWS constraint is explicit: "An Amazon ECS Service can't simultaneously be part of both an App Mesh Mesh and a Service Connect Namespace." So in-place migration of a running Service is impossible — recreate under the new mesh, then cut traffic.

Shape

  1. Stand up green. Recreate every microservice (or the subset being cut over today) inside the new mesh. Each microservice now exists twice: once in blue (old mesh), once in green (new mesh).
  2. Do NOT connect the two meshes. Each mesh has its own service discovery + data plane; there is no shared-networking plane. A user session that enters blue stays in blue; one that enters green stays in green. This is load-bearing.
  3. Shift traffic at the edge. The three canonical mechanisms (from the AWS guide):
     • DNS weighted records — Route 53 multi-record weighted responses; adjust integer weights to shift resolution proportion. Slow propagation (TTL-bounded) but works across any environment.
     • CDN continuous deployment — CloudFront continuous-deployment primary/staging distributions; faster propagation than DNS; only applies to CDN-fronted traffic.
     • ALB multi-target-group routing — forward-action rules with weighted target groups on a single Application Load Balancer; tightest per-request control; scope limited to one ALB.
  4. Monitor using the new mesh's observability. Service Connect publishes free CloudWatch app-level metrics — use those to detect misconfigurations as load ramps, before 100% cutover.
  5. Ramp weights, then flip. Typical progression 1% → 5% → 25% → 50% → 100%. Keep blue live throughout for fast rollback.
  6. Decommission blue. Once green is stable at 100% and any stickiness horizon (long-running sessions, async jobs) has drained through blue, tear down the old mesh.
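The edge-weight ramp above can be sketched with Route 53-style integer weights, where each resolution picks an environment with probability weight / total. A minimal simulation (the ramp schedule mirrors the typical progression; function and record names are illustrative, not from the AWS guide):

```python
import random

def resolve(weights: dict[str, int], rng: random.Random) -> str:
    """Pick an environment with probability weight / sum(weights),
    mimicking Route 53 weighted-record resolution."""
    total = sum(weights.values())
    roll = rng.randrange(total)
    for env, w in weights.items():
        if roll < w:
            return env
        roll -= w
    raise AssertionError("unreachable")

def ramp_weights(green_pct: int) -> dict[str, int]:
    # Blue's record stays in place even at weight 0, so rollback is
    # just reversing the weight change.
    return {"green": green_pct, "blue": 100 - green_pct}

if __name__ == "__main__":
    rng = random.Random(42)
    for pct in [1, 5, 25, 50, 100]:  # typical ramp progression
        hits = sum(resolve(ramp_weights(pct), rng) == "green"
                   for _ in range(10_000))
        print(f"green at {pct:3d}%: observed {hits / 100:.1f}% of resolutions")
```

Note that real DNS shifting lags the weight change by up to the record TTL, so observed traffic proportions trail the configured weights.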

Why not in-mesh traffic splitting

If both environments could share service discovery (e.g. same Cloud Map namespace, same service names), you could use the new mesh's traffic-splitting primitives (App Mesh's Virtual-Router-+-Virtual-Node weighting, Istio's VirtualService weight:, etc.) to split between old and new backend versions. That's the simpler, in-request-path pattern.

Blue/green is forced when you can't do that: the two meshes own different service-discovery implementations (App Mesh's Virtual Services + Cloud Map ↔ Service Connect's flat Cloud Map namespace), the data planes can't route to each other, and the atomic unit of mesh membership (the ECS Service) can only belong to one mesh at a time. The edge is the only place where both environments are addressable, so the edge is where the split lives.
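The disjoint-discovery constraint can be modeled concretely: each mesh resolves only names registered in its own plane, and only the edge holds entry points for both sides. A toy sketch (service names and addresses are invented for illustration):

```python
# Two disjoint service-discovery planes. The same logical service
# ("orders") exists twice -- once per mesh -- because the atomic unit
# of membership (the ECS Service) belongs to exactly one mesh.
blue_registry = {"orders": "10.0.1.10", "cart": "10.0.1.11"}   # App Mesh + Cloud Map
green_registry = {"orders": "10.1.1.10", "cart": "10.1.1.11"}  # Service Connect namespace

def resolve_in_mesh(registry: dict[str, str], name: str) -> str:
    """A request that entered one mesh only sees that mesh's plane."""
    if name not in registry:
        raise LookupError(f"{name!r} not addressable from this mesh")
    return registry[name]

# A session entering blue resolves every hop inside blue; it can
# never land on a green address, and vice versa.
assert resolve_in_mesh(blue_registry, "orders") != resolve_in_mesh(green_registry, "orders")

# The edge (DNS / CDN / ALB) is the only layer that knows both
# environments' entry points, so the split has to live there.
edge_targets = {"blue": "blue-alb.example.internal",
                "green": "green-alb.example.internal"}
```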

Contrast with shadow migration

  • patterns/shadow-migration (dual-run with reconciliation) runs the new engine in parallel, fed the same inputs, and compares outputs before any consumer sees the new engine's output. Used when correctness equivalence is the bar (data pipelines, batch processing).
  • Blue/green mesh migration serves live user traffic from both environments at once during the ramp; users hitting green see green's output directly. The bar is "does green work as well as blue under live load?", not "do the outputs reconcile byte-for-byte?"

Shadow fits batch/data; blue/green fits live traffic.

Contrast with subscriber switchover

  • patterns/subscriber-switchover is per-consumer cutover on a shared producer — each downstream consumer switches independently between old and new inputs. Works when consumers have the freedom to reverse course independently and there's one producer serving both.
  • Blue/green mesh migration is the inverse: one shared consumer population (end users), two disjoint producer environments, traffic split at the front door. The user population doesn't choose — the edge router does.

Caveats and tradeoffs

  • You temporarily double the cost of the entire application tier during the ramp. Budget for it.
  • No cross-environment state. If users have sessions / carts / WebSocket connections, a user resolved to green cannot reach blue-side backends and vice versa. For long-lived sessions either use sticky routing (keep each user on one side for the session lifetime) or drain on cutover (disallow new blue sessions once the ramp starts, wait for existing blue sessions to expire).
  • Rollback is cheap: reverse the weight ramp. This is the primary reason to prefer blue/green over in-place cutover even when in-place is feasible.
  • DB / stateful layer is shared or forked, not blue/green. Blue and green typically share the database tier (to avoid data divergence) or dual-write (patterns/dual-write-migration) if the DB itself is also migrating. The mesh migration is only above the persistence layer.
  • Async / background jobs are out-of-band. Edge traffic shifting only covers synchronous request flows. Async workers, cron jobs, queue consumers need their own cutover story — typically flip them last once the user-facing cutover is stable.

Seen in

  • sources/2025-01-18-aws-app-mesh-discontinuation-service-connect-migration — canonical instance. AWS forces the pattern because an ECS Service can't be in both meshes; prescribes three edge-level weighted traffic shifters (Route 53 / CloudFront continuous deployment / ALB multi-target-group); explicit "no cross-environment networking" constraint; Service Connect's free CloudWatch metrics as the ramp monitoring surface.