PATTERN Cited by 1 source
Zero-incident LLM migration¶
Pattern¶
Migrate production LLM serving from one substrate to another through a four-step playbook — Compliance → Capacity → Quality → Rollout — that establishes parity with the prior substrate before any production traffic moves, then shifts traffic via feature flags with instant rollback.
The canonical wiki implementation: Slack's Phase 1 → Phase 2 migration from AWS SageMaker (with escrow VPC for Anthropic) to fully managed Amazon Bedrock in mid- 2024 (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud). Outcome: zero customer-facing incidents during the switch.
"Switching an entire backend while serving live traffic can sometimes be a recipe for disaster. We avoided that by being borderline obsessed with parity, and achieved zero customer-facing incidents. We didn't just run unit tests; we ran massive load tests and shadow requests to find the exact 'Model Unit' count that matched our old setup. We used feature flags to slowly bleed traffic over, so if anything looked even slightly off, we could yank it back in seconds. It wasn't magic – it was just a lot of cautious plumbing."
When to use it¶
- Cross-substrate or cross-provider LLM serving migrations — same-vendor model upgrade is simpler; cross-substrate is where this playbook earns its operational tax.
- High-stakes traffic — millions of users, FedRAMP / enterprise compliance posture, customer-trust-sensitive features.
- Both substrates produce comparable but not identical output — quality parity is non-trivial; the playbook forces the parity check up front.
- Capacity primitives differ between substrates — SageMaker GPU instances vs Bedrock Model Units required Slack to "map the exact number of Model Units (MUs) required to match our SageMaker baseline."
When NOT to use it¶
- Greenfield deployment — no prior substrate to maintain parity with.
- In-place model swap — same substrate, same API; simpler patterns suffice.
- Low-stakes / experimental workloads — the playbook's operational tax exceeds the value at low scale.
The four-step playbook (verbatim from Slack)¶
1. Compliance¶
"Secured Legal, Security, and FedRamp sign-offs before rerouting production traffic to maintain our existing high bar for data privacy."
The compliance step happens before any traffic moves. Reframes compliance from a gate at the end of the migration to a precondition at the start. Rationale: discovering a compliance gap mid-rollout requires rolling back, which defeats the zero-incident property.
2. Capacity¶
"Conducted extensive load tests to map the exact number of Model Units (MUs) required to match our SageMaker baseline across diverse traffic profiles."
The capacity step solves the substrate-specific capacity- primitive mismatch:
- Source substrate: SageMaker, denominated in GPU instances + their throughput characteristics.
- Target substrate: Bedrock, denominated in Model Units / minute.
The mapping requires load testing across diverse traffic profiles, not a back-of-envelope calculation. The output is "the exact number of MUs" needed for parity — quantified, not estimated.
3. Quality¶
"Used A/B testing and evaluation frameworks to compare environment outputs side-by-side, verifying both quality and latency parity."
Quality parity is verified before traffic moves. Specifically:
- A/B testing — same input, two substrates, side-by-side output comparison.
- Evaluation frameworks — automated quality scoring (the exact substrate isn't disclosed; LLM-as-judge is implied).
- Quality + latency parity — both axes verified, not just quality.
4. Rollout¶
"Implemented gradual traffic shifts via feature flags and instant rollback capabilities, ensuring 100% availability during the switch."
Two structural properties:
- Gradual — "slowly bleed traffic over" — small percentage increments, not step-function cutovers.
- Instant rollback — "if anything looked even slightly off, we could yank it back in seconds." The feature flag is bidirectional; a slowdown signal triggers immediate rollback, not a queue-and-wait.
Three structural pieces¶
Compliance Capacity Quality Rollout
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Legal │ │Load │ │A/B │ │Feature│
│Sec │ │tests │ │tests │ │flag │
│Compl │ │Shadow│ │Eval │ │Slow │
│sign- │ │reqs │ │frame-│ │bleed │
│offs │ │MU │ │work │ │Instant│
│ │ │match │ │ │ │roll- │
│ │ │ │ │ │ │back │
└──────┘ └──────┘ └──────┘ └──────┘
(precondition) (parity) (verification) (cutover)
What this pattern solves¶
- Compliance-gap-mid-rollout — moves the compliance gate to the start, eliminating the late-stage rollback.
- Capacity-mismatch-mid-rollout — load-tests the substrate-specific MU mapping before traffic moves.
- Quality-regression-mid-rollout — verifies parity in parallel before traffic moves, not after.
- Slow-rollback-on-incident — feature flags + small percentage increments enable seconds-to-rollback rather than minutes-to-hours.
Composition with other patterns¶
- patterns/feature-flagged-dual-implementation — the rollout step's mechanism.
- Shadow request testing (implicit) — Slack mentions "shadow requests" as part of the load test step; classic technique applied at LLM-serving scale.
- A/B testing for quality validation — the quality step's mechanism.
- patterns/multi-cloud-llm-serving — applies this playbook for each cross-cloud expansion (e.g. Slack's Phase 3 → Phase 4 GCP Vertex AI addition implicitly used the same playbook, gated by the cross-functional alignment Slack named as "XFN parity").
Trade-offs¶
| Compared to… | Wins | Loses |
|---|---|---|
| Big-bang cutover | Zero-incident; instant rollback; quality + capacity parity verified | Slower migration timeline; engineering investment in load testing + A/B framework |
| Manual phase rollout | Faster than ad-hoc | Less rigorous; capacity / quality parity might be skipped under deadline |
| Compliance-as-final-step | Faster perceived progress | Incurs late-stage compliance-gap rollback risk |
Engineering principles solidified¶
Slack's verbatim takeaway:
"This solidified a core Slack AI engineering principle: measure first, migrate gradually, and monitor continuously."
The three sub-disciplines map to:
- Measure first — Capacity + Quality steps.
- Migrate gradually — Rollout step.
- Monitor continuously — feature flag + observability + the rollback trigger.
Risks and mitigations¶
- Quality eval substrate is itself flawed — A/B framework reports false parity. Mitigation: human review of a sample
- downstream business-metric monitoring during rollout.
- Capacity load test doesn't reflect production traffic shape — MU mapping is right for the test profile but wrong for production. Mitigation: shadow requests before full cutover; gradual percentage rollout that reveals capacity mismatch at low risk.
- Compliance signs off on artifacts that drift before rollout — model version updates, region changes, etc. Mitigation: re-validate compliance immediately before rollout begins.
- Feature flag fails closed — flag system itself malfunctions; can't roll back. Mitigation: test the rollback path before rollout begins; per-feature flag observability.
Seen in¶
- sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the zero-incident LLM migration playbook as Slack's Phase 1 → Phase 2 SageMaker → Bedrock mid-2024 migration. Verbatim "borderline obsessed with parity" + "a lot of cautious plumbing" + "100% availability during the switch" + "yank it back in seconds" framings. Output: zero customer-facing incidents.
Related¶
- concepts/multi-cloud-llm-serving
- concepts/escrow-vpc-llm-serving
- concepts/llm-model-feature-lag
- concepts/model-units
- systems/aws-sagemaker-ai
- systems/amazon-bedrock
- systems/slack-ai
- patterns/multi-cloud-llm-serving
- patterns/feature-flagged-dual-implementation
- patterns/zero-downtime-reparent-on-degradation