PATTERN Cited by 1 source
Incremental operator-by-operator migration¶
Pattern¶
When the source state of a migration is a catalogue of operator types (Airflow operators, code patterns, library calls — N distinct shapes that all need replacing), migrate one operator type at a time as its own mini-project. Don't attempt the full N-operator switchover in one phase. Each operator's deprecation has its own pilot, validation, rollout, and final cleanup; the next operator only starts after the previous one is at 100%.
This is slower than migrating all operators in parallel. Slack describes the trade-off in its 2026-05-05 retrospective:
"Progressive operator deprecation: We deprecated operators one at a time (CrunchExecOperator, then S3SyncOperator, etc.). Each deprecation was its own mini-project with testing and validation. While it was slower than migrating everything at once, it greatly mitigated the risk of the migration."
When the pattern applies¶
Apply when all of these are true:
- The migration source is a catalogue of distinct operator / client-pattern shapes, each used by many DAGs / call sites.
- Each shape has its own subtle semantics that risk surfacing per-shape failures during cutover.
- Failures during cutover are expensive (data-pipeline downtime, missed SLAs).
- You can stage the deprecations: code can be written that prevents new usage of the old shape while existing DAGs using it continue to run.
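The staging condition above can be sketched as a pre-merge check. This is a minimal illustration, not Slack's implementation: the module name `legacy_ops`, the file paths, and the idea of a "grandfathered" allowlist are all assumptions made for the example. It parses a DAG file's source with Python's standard `ast` module and flags imports of a legacy operator unless the file is already on the allowlist.

```python
import ast

# Operator type currently being deprecated (name taken from the Slack post).
LEGACY_OPERATORS = {"CrunchExecOperator"}


def legacy_imports(source: str) -> set[str]:
    """Return the legacy operator names imported via `from ... import ...`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            found.update(
                alias.name for alias in node.names if alias.name in LEGACY_OPERATORS
            )
    return found


def check_dag_file(path: str, source: str, grandfathered: set[str]) -> list[str]:
    """Return violation messages; files on the allowlist may keep legacy usage."""
    if path in grandfathered:
        return []
    return [
        f"{path}: new usage of deprecated {name}"
        for name in sorted(legacy_imports(source))
    ]


# Hypothetical example: one existing DAG is allowed, one new DAG is rejected.
dag_source = "from legacy_ops import CrunchExecOperator\n"
print(check_dag_file("dags/old_report.py", dag_source, {"dags/old_report.py"}))
print(check_dag_file("dags/new_report.py", dag_source, {"dags/old_report.py"}))
```

The sketch only catches `from module import Name` imports; a plain `import module` followed by `module.Name` would need an extra pass over attribute accesses.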
What "one operator at a time" looks like¶
Slack's structure for each operator type's deprecation:
- Build the replacement — a new operator that uses the target submission path (e.g. an Airflow operator that submits via Quarry REST API).
- Pilot in dev / staging — validate the replacement handles the shapes the legacy operator handled.
- Migrate consumers — DAG-by-DAG conversion of every call site to the new operator.
- Track progress — analytics dashboard backed by Airflow metadata-DB queries to identify remaining usage of the legacy operator.
- Restrict creation of new usage — once at high coverage, prevent new DAGs from importing the legacy operator.
- Final cleanup — migrate the remaining stragglers.
- Deprecate the legacy operator — remove from the codebase; declare 100% complete for this operator type.
- Move to the next operator type.
Slack named two operator types deprecated in this style: CrunchExecOperator and S3SyncOperator. The post implies five more operator types went through the same process (7 total).
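One way to picture the per-operator sequence is as a small state machine with two gates: the legacy operator cannot be removed while call sites remain, and the next operator is not started until the current one is removed. The sketch below is an assumption-laden model, not something described in the post: the `Stage` names, the `Deprecation` class, and `next_to_start` are hypothetical, and the "track progress" step is represented only by the `remaining_call_sites` counter.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    BUILD = 1      # build the replacement operator
    PILOT = 2      # validate in dev / staging
    MIGRATE = 3    # convert call sites DAG by DAG
    RESTRICT = 4   # block new usage of the legacy operator
    CLEANUP = 5    # migrate the stragglers
    REMOVED = 6    # legacy operator deleted from the codebase


@dataclass
class Deprecation:
    operator: str
    stage: Stage = Stage.BUILD
    remaining_call_sites: int = 0

    def advance(self) -> None:
        """Move to the next stage; removal requires zero remaining call sites."""
        if self.stage is Stage.REMOVED:
            raise ValueError(f"{self.operator} is already removed")
        nxt = Stage(self.stage.value + 1)
        if nxt is Stage.REMOVED and self.remaining_call_sites > 0:
            raise ValueError(
                f"{self.operator} still has {self.remaining_call_sites} call sites"
            )
        self.stage = nxt


def next_to_start(queue: List[Deprecation]) -> Optional[Deprecation]:
    """Return the single operator that may be worked on, in queue order."""
    for dep in queue:
        if dep.stage is not Stage.REMOVED:
            return dep
    return None
```

With a queue of `[Deprecation("CrunchExecOperator"), Deprecation("S3SyncOperator")]`, `next_to_start` keeps returning the first entry until it reaches `Stage.REMOVED`, which mirrors the rule that the next operator starts only after the previous one is at 100%.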
Why this beats big-bang switchover¶
- Per-operator failures are isolated. A bug in the new Spark operator doesn't risk the Hive migration; a network issue in Region A's CLI cutover doesn't risk Region B's Spark cutover.
- Each rollout informs the next. Lessons from CrunchExecOperator's deprecation feed into S3SyncOperator's deprecation: better network-topology mapping (the EKM-connectivity story), earlier resource-limit testing (the vmem-check story), better team-comms (the "better communication about operator restrictions" lesson from the post).
- Progress is visible. "Progress visibility kept the project moving." Each operator reaching 100% is a milestone worth celebrating, which sustains momentum on a 3-quarter project.
- Rollback scope is small. If a particular operator's cutover hits a blocker, only that operator's DAGs need reverting, not the whole platform.
What this pattern requires upfront¶
- A clear inventory of operator types in use (Slack used Airflow metadata-DB queries to enumerate them).
- A target architecture that all operators converge on (Slack's case: every operator funnels through Quarry's REST API). See patterns/rest-gateway-for-compute-engine-job-submission.
- Progress-tracking infrastructure built before the migration starts — a per-operator dashboard or burndown query (see concepts/observability-before-migration).
- An organisational forcing function, ideally exec-sponsored. Slack's Phase 3 made the migration an OKR Key Result with executive visibility.
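The inventory and burndown requirements above could be backed by a query along these lines. The post does not show Slack's actual queries, so this is only a sketch: the `task_usage` table is a hypothetical, simplified stand-in (a real Airflow metadata database has a different, version-dependent schema), `QuarryOperator` is an invented name for a replacement operator, and an in-memory SQLite database keeps the example self-contained.

```python
import sqlite3


def remaining_usage(conn: sqlite3.Connection, legacy_operators: list[str]) -> dict[str, int]:
    """Count distinct (dag_id, task_id) pairs still using each legacy operator."""
    placeholders = ",".join("?" for _ in legacy_operators)
    rows = conn.execute(
        f"""
        SELECT operator, COUNT(*) FROM (
            SELECT DISTINCT operator, dag_id, task_id
            FROM task_usage
            WHERE operator IN ({placeholders})
        ) GROUP BY operator
        """,
        legacy_operators,
    ).fetchall()
    counts = {name: 0 for name in legacy_operators}  # fully migrated operators show 0
    counts.update(dict(rows))
    return counts


# Hypothetical data: one row per task run, so the same task can appear twice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_usage (dag_id TEXT, task_id TEXT, operator TEXT)")
conn.executemany(
    "INSERT INTO task_usage VALUES (?, ?, ?)",
    [
        ("daily_report", "extract", "CrunchExecOperator"),
        ("daily_report", "extract", "CrunchExecOperator"),  # repeated run, counted once
        ("daily_report", "publish", "S3SyncOperator"),
        ("hourly_sync", "copy", "QuarryOperator"),  # already on the new path
    ],
)
counts = remaining_usage(conn, ["CrunchExecOperator", "S3SyncOperator", "SSHOperator"])
print(counts)  # {'CrunchExecOperator': 1, 'S3SyncOperator': 1, 'SSHOperator': 0}
```

Reporting an explicit zero for operators with no remaining usage makes the 100% mark visible on the dashboard rather than having the operator silently drop out of the results.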
Coordination across regions¶
Slack ran this pattern across 8 independent data regions. For a given operator type, each region had its own pilot and then its own rollout. The per-region rollouts were staggered rather than simultaneous, so region-specific failure modes (network configurations, data-sovereignty rules, cluster versions) didn't cascade across regions. The per-operator-per-region mini-project was the right granularity: N operators × M regions = N×M mini-projects, each bounded enough to manage.
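As a rough illustration of the N×M decomposition, the grid of mini-projects can be enumerated operator by operator, with regions taken in sequence inside each operator. The function name, the `wave` numbering, and the placeholder operator and region names are assumptions for the example; with the 7 operator types and 8 regions reported for Slack, the grid has 7 × 8 = 56 entries.

```python
from itertools import product


def mini_projects(operators: list[str], regions: list[str]) -> list[dict]:
    """Enumerate the N x M grid, finishing every region for one operator first."""
    return [
        {"wave": wave, "operator": op, "region": region}
        for wave, (op, region) in enumerate(product(operators, regions), start=1)
    ]


# Placeholder names; only the counts (7 operators, 8 regions) come from the source.
operators = [f"operator-{i}" for i in range(1, 8)]
regions = [f"region-{i}" for i in range(1, 9)]
grid = mini_projects(operators, regions)
print(len(grid))  # 56
print(grid[0])    # {'wave': 1, 'operator': 'operator-1', 'region': 'region-1'}
```

The wave number is only a nominal ordering. In practice each wave would still be gated on the previous one succeeding rather than on a fixed calendar.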
Composition with adjacent patterns¶
- patterns/phased-migration-with-soak-times — the project envelope (Slack's 5 phases — POC, Security Review, OKR Execution, Bulk Migration, Final Cleanup); operator-by-operator is the work-decomposition discipline inside the phase envelope.
- patterns/expand-migrate-contract — each operator's deprecation is a mini expand-migrate-contract: add the new operator (expand), migrate consumers (migrate), remove the legacy operator (contract).
- patterns/audit-then-refactor-migration — closely related intuition; this pattern is the unit-of-refactor discipline.
Anti-patterns this avoids¶
- All-operators-cutover-on-the-same-day. Maximises blast radius; minimises learnability; failures concentrate.
- Long-tail abandonment. Stopping at 80% because the last 20% is harder means the legacy operator never gets fully retired, and the migration's value (eliminating the old substrate) is never realised. Slack's Phase 5 "Final Cleanup" explicitly closed this tail.
- Unprioritised parallelism. Letting every team pick their own cutover order without a unified inventory; nobody knows what's left.
Failure modes¶
- Communication gap when restricting new usage. Slack flags this verbatim: "When we restricted SSHOperator to prevent new usage during the final migration phase, some teams weren't aware. Better advance notice to all Airflow users would've prevented confusion and friction."
- Last-operator drag. The final operator type often contains the hardest-to-migrate edge cases. Plan for it to take longer than the average operator.
- Per-operator dashboards drift. If progress dashboards aren't updated as new operator types are added or as DAGs are renamed, the burndown lies.
Seen in¶
- sources/2026-05-05-slack-from-ssh-to-rest-a-security-driven-modernization-of-slacks-emr-data-pipelines — canonical wiki source. Slack deprecated 7 operator types (CrunchExecOperator, S3SyncOperator named; 5 unnamed) one at a time across 8 regions over 3 quarters, with each deprecation as its own mini-project.
Related¶
- patterns/phased-migration-with-soak-times — the project-envelope sibling.
- patterns/rest-gateway-for-compute-engine-job-submission — the destination architecture.
- patterns/audit-then-refactor-migration — the audit-driven lineage of this discipline.
- patterns/expand-migrate-contract — the per-operator deprecation shape.
- concepts/observability-before-migration — the monitoring discipline that makes this pattern executable.
- concepts/rest-based-job-submission — the paradigm shift the migration converges on.
- concepts/ssh-job-execution-anti-pattern — the legacy shape Slack's instance was migrating away from.
- systems/apache-airflow — the orchestration substrate where Slack's operators lived.
- systems/slack-quarry — the destination operator surface.