PATTERN Cited by 1 source

Phased migration with soak times

What it is

Phased migration with soak times is the migration pattern where a fleet-wide change is rolled out in discrete stages (environments, cohorts, clusters, tenants), and each stage is followed by an intentional waiting period — a soak — before the next stage begins. The soak is for latent issues to surface, not for immediate-regression detection.

Distinct from:

  • Canary rollout (1% → 5% → 25% → 100%), which is a traffic-shifting pattern on a deployed service.
  • Cohort percentage rollout, which operates on one system with tenants as cohorts.
  • Big-bang migration, where everything moves at once.

Phased-with-soak operates on independent units (clusters, environments, sub-platforms) and the soak is calendar time, not traffic volume.

What the soak catches

The soak-time window is for latent-defect discovery:

  • Second-order effects — a config change that looks fine immediately may interact badly with a weekly job that hasn't run yet.
  • Capacity drift — differences in workload scaling over time expose provisioning issues only hours or days after the migration.
  • Dependency-chain surprises — another team's workload lands on the migrated environment and breaks.
  • Customer-facing issues — user reports lag actual issues by hours/days.
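
One way to turn the list above into a soak length is to take the slowest-surfacing defect class and add a margin. The following is a minimal sketch under assumptions: the source names, the durations, and the `minimum_soak` helper are all hypothetical and are not taken from the Salesforce post.

```python
from datetime import timedelta

# Hypothetical latent-defect sources and how long each may take to surface.
# Names and durations are illustrative assumptions, not measured values.
SURFACE_TIMES = {
    "weekly_batch_job": timedelta(days=7),       # second-order effects
    "capacity_drift": timedelta(days=3),
    "dependent_team_deploy": timedelta(days=5),  # dependency-chain surprises
    "user_report_lag": timedelta(days=2),        # customer-facing issues
}


def minimum_soak(surface_times: dict[str, timedelta],
                 margin: timedelta = timedelta(days=1)) -> timedelta:
    """Return a soak that covers the slowest-surfacing defect class plus a margin."""
    if not surface_times:
        raise ValueError("need at least one latent-defect source")
    return max(surface_times.values()) + margin


print(minimum_soak(SURFACE_TIMES))  # 8 days, 0:00:00
```

With these assumed inputs the weekly job dominates: a soak shorter than seven days would end before that job has run even once on the migrated unit.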

Salesforce canonical instance

Salesforce's 2026-01-12 Karpenter migration post names the pattern explicitly:

*"A deliberate, phased rollout strategy was adopted:

  • Mid-2025 to Early 2026 – A multistage migration across internal environments with soak times between stages
  • Start with lower-risk environments – Less critical workloads were migrated first to validate tooling and operational processes
  • Risk-based sequencing – High-stakes production environments continue to be migrated last after testing the process"*

"By using this approach, Salesforce continuously learned and adapted, avoiding large-scale regressions." (Source: sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters)

The rollout is multi-stage and multi-month; its sequencing component is articulated separately (see patterns/risk-based-sequencing).

Shape

  1. Partition the fleet into migration units that can be migrated independently.
  2. Order the units by risk (low-risk first — see patterns/risk-based-sequencing).
  3. Size the soak — long enough to catch the longest expected latent-defect surface time (typically days, sometimes weeks for enterprise platforms).
  4. Apply each stage, observe through the soak, and advance only if the soak completes clean.
  5. Pause / revert / re-plan if the soak reveals issues — the pattern's value is that issues surface on a small cohort before they've touched the whole fleet.
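
The five steps above can be sketched as a small control loop. This is an illustrative sketch, not Salesforce's tooling: `Unit`, `phased_rollout`, and the three callbacks are hypothetical names. In a real rollout `soak_is_clean` would wait out the soak window while watching monitoring; here it is a stub so the example runs instantly.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Iterable


@dataclass(frozen=True)
class Unit:
    """One independently migratable unit (step 1). Hypothetical type."""
    name: str
    risk: int  # lower value = safer to migrate earlier


def phased_rollout(
    units: Iterable[Unit],
    soak: timedelta,
    migrate: Callable[[Unit], None],
    soak_is_clean: Callable[[Unit, timedelta], bool],
    revert: Callable[[Unit], None],
) -> list[str]:
    """Migrate units lowest-risk first; revert and stop at the first dirty soak.

    Returns the names of units that migrated and soaked clean.
    """
    done: list[str] = []
    for unit in sorted(units, key=lambda u: u.risk):  # step 2: order by risk
        migrate(unit)                                 # step 4: apply the stage
        if not soak_is_clean(unit, soak):             # step 4: observe the soak
            revert(unit)                              # step 5: revert, then re-plan
            break
        done.append(unit.name)
    return done


# Usage with stub callbacks: the "staging" soak is made to fail on purpose.
fleet = [Unit("prod-payments", 9), Unit("dev", 1), Unit("staging", 3)]
log: list[str] = []
result = phased_rollout(
    fleet,
    soak=timedelta(days=7),
    migrate=lambda u: log.append(f"migrate {u.name}"),
    soak_is_clean=lambda u, s: u.name != "staging",
    revert=lambda u: log.append(f"revert {u.name}"),
)
print(result)  # ['dev']
print(log)     # ['migrate dev', 'migrate staging', 'revert staging']
```

The high-risk unit is never touched in this run: the failure on the second stage halts the loop before `prod-payments` is reached, which is the containment property the pattern relies on.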

Why it works

  • Error containment. A migration issue affects only the already-migrated stages, not the whole fleet.
  • Tool maturation. The migration tool (see patterns/rollback-capable-migration-tool) and operational runbook mature across stages; by the time the risky stages arrive, the process is well-drilled.
  • Organizational learning loops. Each stage produces lessons that feed back into subsequent stages. Salesforce's five operational lessons (PDB hygiene, sequential cordoning, 63-char labels, singleton protection, ephemeral-storage mapping) were all discovered this way.
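
To make the error-containment point concrete, the sketch below computes how much of a fleet is exposed after each stage. The stage sizes are invented for illustration and do not describe Salesforce's actual staging; `exposure_by_stage` is a hypothetical helper.

```python
from itertools import accumulate

# Hypothetical stage sizes (units migrated per stage), ordered low-risk first.
STAGE_SIZES = [10, 40, 150, 300, 500]


def exposure_by_stage(stage_sizes: list[int]) -> list[float]:
    """Fraction of the fleet already migrated once each stage has been applied."""
    total = sum(stage_sizes)
    if total == 0:
        raise ValueError("fleet is empty")
    return [migrated / total for migrated in accumulate(stage_sizes)]


for stage, fraction in enumerate(exposure_by_stage(STAGE_SIZES), start=1):
    print(f"stage {stage}: {fraction:.0%} of fleet exposed")
```

Under these assumed sizes, a defect that surfaces during the first soak has touched 1% of the fleet, and one that surfaces during the third has touched 20%.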

Trade-offs

  • Calendar time cost. Soaks push total migration duration from weeks to quarters. A year-long Karpenter migration is not unusual at large scale.
  • Dual-system cost. While the migration is in flight, both old and new systems must be maintained.
  • Harder for shared resources. Some cross-cutting shared resources (networking, DNS, IAM) are hard to migrate per-stage; they force coordinated big-bang treatment on their own axis.
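
The calendar-time cost is simple arithmetic. The sketch below assumes a strictly sequential rollout in which every stage, including the last, is applied and then soaked; the stage count and durations are illustrative assumptions, and `total_duration` is a hypothetical helper.

```python
from datetime import timedelta


def total_duration(stages: int, apply_time: timedelta, soak: timedelta) -> timedelta:
    """Calendar time for a sequential rollout where each stage is applied, then soaked."""
    if stages < 1:
        raise ValueError("need at least one stage")
    return stages * (apply_time + soak)


no_soak = total_duration(12, timedelta(days=3), timedelta(0))
with_soak = total_duration(12, timedelta(days=3), timedelta(days=14))
print(no_soak.days, with_soak.days)  # 36 204
```

With these assumed numbers, twelve stages take about five weeks without soaks and about two and a quarter quarters with two-week soaks.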

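Seen in

  • sources/2026-01-12-aws-salesforce-karpenter-migration-1000-eks-clusters: Salesforce's multistage Karpenter migration across internal environments, with soak times between stages.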
