PATTERN
Five-phase managed-service migration playbook¶
Intent¶
Move a large self-managed production deployment (database, search cluster, message broker, etc.) onto the vendor's managed equivalent without downtime, with a named, repeatable phase ordering that front-loads the surprises you can't catch from a data copy alone.
MongoDB Professional Services uses this playbook for Community-Edition → Atlas migrations; it generalises to any self-managed → managed-service transition where the source and target run the same core engine but sit on different sides of the shared-responsibility line.
The five phases¶
- Design — define scope + strategy. Timeline, resources, dependencies named up front. Analyse data volume, data structure, and source-vs-target compatibility (storage engines, server versions, available features).
- De-risk — assess + mitigate. Validate application compatibility against the managed service; check driver versions; catalogue breaking changes. "Understanding compatibility challenges early on helped us eliminate surprises during production." (BharatPE's Sumit Malik)
- Test — validate in a mirrored lower environment. Stand up a fully mirrored managed-service test environment; integrate existing applications; run sanity + compatibility checks. Adding a test server (beyond the replica-set minimum) lets the team simulate real-world cutover scenarios without touching prod.
- Migrate — transition data with security in the loop. Use the vendor's continuous-replication tool (MongoDB's mongosync, AWS DMS, Google Cloud's Database Migration Service for Cloud SQL, MSK's migration facilities) with in-transit encryption for regulated workloads. Terabyte-scale data moves run alongside the source cluster; reads can be served from source while writes fan out to both during the overlap.
- Validate — confirm integrity + optimise. Automated scripts compare source/target record-by-record; real-time alerting catches drift the moment it appears; post-cutover monitoring gives the team time to see operational properties they couldn't simulate in the Test phase.
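The Validate phase's record-by-record comparison can be sketched as below. This is a minimal, hypothetical version: `compare_collections` and `record_digest` are illustrative names, and a real run streams cursors from both clusters rather than holding full `{_id: document}` mappings in memory.

```python
import hashlib
import json

def record_digest(doc: dict) -> str:
    """Stable digest of one record; key order is normalised before hashing."""
    return hashlib.sha256(json.dumps(doc, sort_keys=True).encode()).hexdigest()

def compare_collections(source: dict, target: dict):
    """Compare two {_id: document} mappings record-by-record.

    Returns (missing_ids, drifted_ids): ids absent from the target, and
    ids present in both but whose content digests differ.
    """
    missing, drifted = [], []
    for _id, doc in source.items():
        if _id not in target:
            missing.append(_id)
        elif record_digest(doc) != record_digest(target[_id]):
            drifted.append(_id)
    return missing, drifted
```

Hashing normalised JSON rather than comparing raw documents keeps the check cheap at terabyte scale and makes drift a yes/no signal that real-time alerting can fire on.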
Why the ordering matters¶
- De-risk before Test. Driver-version and application-compat surprises discovered during Test are expensive — the mirrored environment has already been built. De-risk pulls them forward to desk-research cost.
- Test before Migrate. The Test phase is where the shadow-migration shape lives — a mirror of production (the managed service plus integrated application stack) where behavioural parity is confirmed before any bulk data copy runs. Without it, Validate becomes the first time you run the new topology under real load.
- Validate separate from Migrate. Cutover ends with data moved, not with data proven identical. The Validate phase is a dedicated step (automated integrity scripts + monitoring) not a checkbox inside Migrate. This is what makes the migration non-disruptive per patterns/nondisruptive-migration — you can rewind if Validate surfaces drift.
Structural guarantees¶
- Data in-flight during Migrate is never the authoritative source for a read in the new system until Validate signs off. The vendor replication tool handles staging; the application still talks to the source cluster.
- The Test-phase environment is built to match production topology + integrations, not just "a fresh instance of the managed service". Mirroring is load-bearing.
- Security features move from operational burden to product feature at the Migrate phase. Encryption in-transit, encryption at rest, RBAC, VPC peering, audit logs — all asserted before the cutover, validated in Test.
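The first guarantee, reads staying on the source until Validate signs off, amounts to a small piece of routing state in the application. A hypothetical sketch (any two objects with a `.get(key)` method stand in for the source and target clusters; the class name is an assumption, not a vendor API):

```python
class MigrationReadRouter:
    """Route reads to the source store until Validate has signed off."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.validated = False  # flipped only by an explicit sign-off

    def sign_off(self):
        """Record that the Validate phase has confirmed source/target parity."""
        self.validated = True

    def get(self, key):
        # The target never serves an authoritative read pre-sign-off,
        # no matter how far the replication tool has progressed.
        store = self.target if self.validated else self.source
        return store.get(key)
```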
Where it applies beyond MongoDB¶
- RDBMS self-hosted → RDS / Aurora / Cloud SQL. AWS DMS + binlog replication play the role of mongosync; the five-phase shape holds.
- Self-managed Kafka → MSK / Confluent Cloud. MirrorMaker 2 / Cluster Linking for the Migrate phase; topic-parity verification for Validate.
- Self-hosted Elasticsearch → OpenSearch Service / Elastic Cloud. Snapshot-and-restore or CCR for Migrate; query-result equivalence sampling for Validate.
- Self-managed Redis / Memcached → ElastiCache / MemoryStore. Online replication + dual-read window.
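Across all of these targets the Validate step has the same shape: read the same key through both systems and compare a sample. A hedged sketch (the reader callables and key list are assumptions; plug in a topic fetch, a search query, or a cache GET):

```python
import random

def sample_parity(source_reader, target_reader, keys, sample_size, seed=0):
    """Sampled equivalence check for the Validate phase.

    source_reader/target_reader: callables mapping key -> value.
    Returns the sampled keys whose values disagree between systems.
    A fixed seed keeps repeated runs comparable.
    """
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return [k for k in sample if source_reader(k) != target_reader(k)]
```

For topic parity the "value" might be a (count, checksum) pair per partition; for query-result equivalence it is the result set of a canned query. The sampling rate trades Validate-phase cost against drift-detection sensitivity.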
Anti-patterns it avoids¶
- "Point-in-time export and import." Skips Design + De-risk; cutover is the only validation window.
- "Forklift off-hours cutover." Assumes a downtime window large enough to copy 45 TB + validate.
- "Lift the bytes, worry about drivers later." Application-compat surprises found after cutover are rolled back to a source cluster that's now drifted.
- "Dev environment is the test environment." Skips the mirrored managed-service integration; test surfaces are narrower than prod surfaces.
Costs¶
- Vendor PS engagement. The playbook is load-bearing on having the vendor's migration tool + a team that has run it on comparable volumes. Self-driven versions exist but carry more Design + De-risk cost.
- Parallel infrastructure through the Migrate and Validate windows. Source + target cluster both running; mirrored Test environment running.
- In-transit encryption during Migrate. Required for regulated workloads; CPU / throughput cost on the replication tool.
- Driver / application changes surfaced in De-risk. Often force an application-release window that has to land before Migrate.
Trade-off vs. a simpler cutover¶
The five phases are justified when:
- The deployment is terabyte-scale or sharded — full downtime windows aren't feasible.
- The workload is regulated (fintech, healthcare, PCI, GDPR) — compliance evidence has to be producible for the cutover.
- Drivers / application versions / engine versions mismatch between source and target — de-risking saves a rollback.
- Disaster-recovery posture is part of the migration goal — the managed service is expected to absorb HA + failover, which has to be proven in Test before becoming production policy.
For a small, unregulated, single-cluster move, collapsing De-risk + Test + Validate into a single dual-write weekend is cheaper.
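That collapsed version can be as small as a dual-write wrapper that shadow-reads the target and records drift inline (a sketch assuming both stores expose dict-like get/set; not a production client, and the class name is hypothetical):

```python
class DualWriteStore:
    """Dual-write cutover for a small, unregulated move.

    Writes fan out to both stores; reads serve the source and
    shadow-read the target, logging any mismatch -- Validate
    collapsed into the read path.
    """

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.mismatches = []  # keys where source and target disagreed

    def put(self, key, value):
        self.source[key] = value
        self.target[key] = value

    def get(self, key):
        primary = self.source.get(key)
        shadow = self.target.get(key)
        if shadow != primary:
            self.mismatches.append(key)
        return primary
```

Once the mismatch log stays empty over a representative traffic window, reads flip to the target and the source is retired.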
Seen in¶
- sources/2025-09-21-mongodb-community-edition-to-atlas-a-migration-masterclass-with-bharatpe — BharatPE moved 45 TB across 3 sharded clusters (each with 1 primary + 2 secondaries) from self-hosted MongoDB Community Edition to Atlas using mongosync for the Migrate phase. Regulated Indian-fintech workload (UPI + zero-MDR payments, ~₹12,000 crore/month). Post-migration: 99.995% Atlas SLA, 40% self-reported query-response-time improvement, managed audit logs + RBAC + VPC peering + encryption replacing the self-hosted compliance-tooling stack.
Related¶
- patterns/shadow-migration — the Test phase inside this playbook.
- patterns/dual-system-sync-during-migration — what mongosync / DMS / MirrorMaker implement during Migrate.
- patterns/nondisruptive-migration — the property this playbook enables.
- patterns/achievable-target-first-migration — complementary lesson: pick a tractable first cluster when doing multi-cluster managed-service moves.
- concepts/shared-responsibility-model — the line that moves during the migration.
- systems/mongodb-atlas
- systems/mongosync