Slack AI: The Path to Multi-Cloud
Summary
Three-year retrospective from the Slack AI infrastructure team on evolving the LLM serving substrate behind Slack AI from a single-region SageMaker deployment in early 2023 to a multi-cloud architecture spanning AWS SageMaker → Amazon Bedrock (Provisioned Throughput then On-Demand) → GCP Vertex AI in early 2026. The post canonicalises the multi-cloud LLM serving pattern as a four-phase evolution driven by structural pressures that no single-provider stack could resolve: GPU hardware scarcity, scaling latency, capacity over-provisioning, model feature lag, commitment lock-in, provider concentration risk, and the strategic need to access vendor-exclusive state-of-the-art models. The architectural endpoint is an Intelligent Routing Layer that abstracts away provider complexity behind a unified internal API, with metric-driven model selection, experimental A/B traffic shaping, and an automated circuit breaker that gradually ramps traffic back to recovering endpoints. Concrete reported outcomes from multi-cloud: ~10% improvement in quality metrics for complex reasoning tasks and ~67% reduction in latency for high-velocity, low-token workloads by binding specific features to the model with the right latent strengths.
Key takeaways
- Multi-cloud LLM serving is a destination reached through forcing functions, not a starting position. Phase 1 (SageMaker) was the natural starting point in early 2023: security, FedRAMP compliance, model availability, and the ability to host Anthropic models via an escrow VPC that established a zero-knowledge environment (Slack data private to Slack; provider weights inaccessible to Slack). Slack only moved to multi-cloud after each prior phase exposed structural limits that the next vendor or surface specifically resolved. (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud)
- The SageMaker phase exposed three operational taxes, quoted verbatim: "Scaling Latency: Initialization times prevented instantaneous scaling. Hardware Scarcity: Enterprise-grade Nvidia GPUs, such as the A100 (Ampere architecture) and the emerging H100 (Hopper architecture) instances, were often unavailable. Over-Provisioning: Maintaining idle resources to meet peak SLAs." By early 2024 these were partially mitigated via On-Demand Capacity Reservations (ODCR) and proactive cron-based scaling, but the post concedes "a hard truth: we were spending too many engineering cycles on plumbing. To scale, we needed automated capacity, not manual coordination."
- Model feature lag was the primary driver of the SageMaker → Bedrock migration. Verbatim: "Hosting Anthropic models via an escrow VPC led to a 'catch-up' cycle. Model iterations and optimizations often debuted on Bedrock weeks or months before SageMaker availability." The structural cause: AWS prioritised Bedrock as the primary launchpad for new LLMs. For Slack, "staying at the bleeding edge of model quality is a competitive necessity." (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud)
- Bedrock introduced Model Units (MUs) as the unit of capacity: "Each MU provides a deterministic amount of throughput, measured in tokens per minute. Shifting from GPU instances to MUs allowed us to abstract away the hardware and focus entirely on raw throughput." Slack's framing matches Databricks' MU framing from 2026-05-27: the MU is the LLM capacity primitive that decouples workload sizing from underlying hardware. This wiki treats Slack's case as the canonical customer-side adoption of the MU primitive at purchase scale, distinct from Databricks' platform-side coining of the same primitive.
- The Phase 2 zero-incident migration to Bedrock relied on a four-step playbook, canonicalised here as zero-incident LLM migration: (a) Compliance: Legal / Security / FedRAMP sign-offs "before rerouting production traffic to maintain our existing high bar for data privacy"; (b) Capacity: load tests to "map the exact number of Model Units (MUs) required to match our SageMaker baseline across diverse traffic profiles"; (c) Quality: A/B testing and evaluation frameworks for side-by-side output comparison, verifying quality and latency parity; (d) Rollout: feature-flag-gated traffic shifts with instant rollback. Verbatim: "It wasn't magic – it was just a lot of cautious plumbing." The migration solidified the Slack AI engineering principle: "measure first, migrate gradually, and monitor continuously."
- Provisioned Throughput hit two efficiency walls that drove the move to On-Demand. The first was the over-provisioning cycle: the US East/West Coast morning surge sets the global peak, leaving significant idle capacity during APAC/EU mornings and weekends ("a persistent efficiency gap"). The second was commitment lock-in, with PT contracts running one to six months: "In the fast-moving world of LLMs, where a state-of-the-art model can be superseded in weeks, these commitments effectively slowed down our ability to upgrade. Even when a superior model was released, we often chose to wait for our existing commitments to expire before migrating."
- Phase 3 Hybrid Routing canonicalises the PT-with-OD-spillover pattern. Verbatim: "We kept high-volume, latency-sensitive features on dedicated capacity (Provisioned Throughput) to ensure a consistent 'snappy' feel. Simultaneously, we moved asynchronous, bursty workloads – like nightly Recaps – to On-Demand capacity. To bridge the gap, we engineered a Spillover Pattern: if a sudden surge pushed us beyond our reserved limits, excess requests automatically 'spilled over' to on-demand endpoints, ensuring we never dropped a request due to capacity ceilings." For features with 10× variance between peak and off-peak hours, the efficiency gains were "substantial."
- OD trade-offs gave rise to an internal model fallback hierarchy. Bedrock OD operates on a shared-resource model, and the post discloses three distinct trade-offs: service-level variability (different uptime characteristics from PT's dedicated nature), regional capacity orchestration (success depends on the provider's ability to manage shared demand across customers in specific regions), and concentration risk ("Relying too heavily on a single provider's on-demand pool meant that any service-wide blip could have the potential to impact entire Slack AI features simultaneously"). The mitigation was a model hierarchy for every AI feature, allowing "the system to automatically fall back to different models if the primary model reached a degraded state. Some examples of regressions are elevated time to first token latencies, throttling errors, and downward trend in customer feedback." This canonicalises model fallback hierarchy with circuit breaker as a wiki pattern.
- Phase 4 Multi-Cloud expansion to GCP Vertex AI in early 2026 had four explicit drivers: infrastructural redundancy & high availability (provider-level disruption eliminated as a single point of failure), model-to-feature optimisation (canonicalised as concepts/model-to-feature-binding), access to innovation (vendor-exclusive state-of-the-art models), and dynamic workload orchestration (real-time telemetry-driven traffic shaping). Disclosed quantitative outcomes from this granular optimisation: ~10% improvement in quality metrics for complex reasoning tasks and ~67% reduction in latency for high-velocity, low-token workloads.
- The architectural endpoint is the Intelligent Routing Layer, the platform abstraction that absorbs provider complexity and exposes a unified internal contract. It has three named subsystems: (a) metric-driven model selection: "if our benchmarks show a specific LLM outperforms others for 'Recaps,' the router directs traffic accordingly. Crucially, we always designate backup models for every feature"; (b) experimental rules & A/B testing: "we were able to route a percentage of traffic to the new model with minimal code changes and an incredibly fast turnaround time", tightening the feedback loop "from weeks to days"; (c) automated circuit breaker & health monitoring: endpoint-level monitoring of time to first token (TTFT), 5xx error rates, and p90 latency thresholds, with a partial-open recovery state that gradually ramps traffic back as the endpoint demonstrates sustained health.
- Multi-cloud reality is a conscious operational tax, not a free win. Four costs are disclosed verbatim: API and behavioural friction ("Each provider has its own unique API patterns, proprietary error codes, and distinct rate-limiting behaviors. We had to build a robust normalization layer to ensure that a 'Rate Limit Exceeded' from one provider and a 'Throttling Exception' from another were handled identically by our application logic"; see patterns/api-normalization-layer-cross-provider); operational monitoring complexity ("We had to build a unified monitoring stack that integrates telemetry from the multiple clouds into a single view"); the attribution challenge (per-feature cost tracking "becomes significantly harder when workloads are shifting dynamically between clouds"); and the on-call knowledge gap ("engineers can no longer be specialists in just one ecosystem").
- Five reflections, with headings quoted verbatim: (1) "Scaling safely requires XFN parity": cross-functional (XFN) alignment of Legal / Risk / Compliance / Security with Engineering is the actual unblocker. (2) "The abstraction layer is a core requirement": agility and speed to market are the competitive edge, so the routing layer's design matters more than any single model choice. (3) "Treat architecture as a living document": "Managed services mature monthly. Because we remained provider-agnostic, we can now adopt breakthroughs in latency or reasoning without a total rewrite." (4) "Reliability requires provider agnosticism": internal failovers aren't enough. (5) "Redefining the meaning of 'Failure'": "An LLM service that is 'up' but slow is effectively broken." Soft failures (p90 spikes, feedback trends) are first-class triggers for the routing layer.
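The normalization layer named among the multi-cloud costs can be illustrated with a minimal Python sketch. It is not Slack's implementation: the provider labels (`provider_a`, `provider_b`), the `ErrorKind` taxonomy, and the retry policy are all assumptions made for illustration. Only the two raw error strings come from the article's own example.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    """Hypothetical provider-neutral error categories."""
    RATE_LIMITED = "rate_limited"
    UNAVAILABLE = "unavailable"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class NormalizedError:
    provider: str
    kind: ErrorKind
    retryable: bool
    raw_code: str  # kept for debugging and per-provider dashboards


# Assumed mapping table. The provider names are placeholders; the two
# rate-limit strings are the ones quoted in the article, the rest is invented.
_ERROR_MAP = {
    ("provider_a", "Rate Limit Exceeded"): ErrorKind.RATE_LIMITED,
    ("provider_b", "Throttling Exception"): ErrorKind.RATE_LIMITED,
    ("provider_a", "Service Unavailable"): ErrorKind.UNAVAILABLE,
}

# Assumed policy: these categories are safe to retry or fail over.
_RETRYABLE = {ErrorKind.RATE_LIMITED, ErrorKind.UNAVAILABLE}


def normalize_error(provider: str, raw_code: str) -> NormalizedError:
    """Translate a provider-specific error code into the shared taxonomy."""
    kind = _ERROR_MAP.get((provider, raw_code), ErrorKind.UNKNOWN)
    return NormalizedError(provider, kind, kind in _RETRYABLE, raw_code)


a = normalize_error("provider_a", "Rate Limit Exceeded")
b = normalize_error("provider_b", "Throttling Exception")
# Both provider-specific errors land in the same category, so application
# logic can branch on `kind` without knowing which cloud answered.
print(a.kind, b.kind, a.kind is b.kind)
```

Keeping `raw_code` on the normalized record is a design choice of this sketch: application code stays provider-agnostic while the original signal remains available for provider-level debugging.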
Architecture
The four phases
Phase 1 (early 2023)              Phase 2 (mid-2024)
┌──────────────────────┐          ┌──────────────────────┐
│ AWS SageMaker        │          │ Amazon Bedrock       │
│ - Anthropic via      │  ─────▶  │ - Provisioned        │
│   escrow VPC         │          │   Throughput (PT)    │
│ - Multi-region       │          │ - On-Demand (OD)     │
│ - Cross-region IAM   │          │ - Model Units (MU)   │
│ - ODCR + cron        │          │ - Fully managed      │
│   scaling            │          │ - Latest models      │
└──────────────────────┘          └──────────────────────┘

Phase 3 (Hybrid)                  Phase 4 (Multi-Cloud)
┌──────────────────────┐          ┌──────────────────────┐
│ Hybrid Routing       │          │ AWS Bedrock          │
│ - PT for snappy      │  ─────▶  │ + GCP Vertex AI      │
│   latency-sensitive  │          │ - Intelligent        │
│ - OD for bursty      │          │   Routing Layer      │
│ - Spillover PT→OD    │          │ - Model fallback     │
│ - Model fallback     │          │   hierarchy          │
│   hierarchy          │          │ - Circuit breaker    │
└──────────────────────┘          └──────────────────────┘
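The Phase 3 spillover behaviour (requests beyond reserved Provisioned Throughput overflow to On-Demand) can be sketched roughly as follows. This is a simplified illustration, not Slack's router: the `SpilloverRouter` class, the fixed one-minute accounting window, and the budget of 1,000 tokens per minute are assumptions. The article states only that MUs are measured in tokens per minute and that excess requests spill over.

```python
from dataclasses import dataclass, field


@dataclass
class SpilloverRouter:
    """Route to reserved capacity first; overflow goes to on-demand.

    Hypothetical sketch: uses a fixed one-minute window to track how many
    tokens have been sent to Provisioned Throughput (PT).
    """
    reserved_tokens_per_minute: int
    _window_start: float = field(default=0.0, init=False)
    _used: int = field(default=0, init=False)

    def route(self, tokens: int, now: float) -> str:
        # Start a fresh accounting window once the current minute has elapsed.
        if now - self._window_start >= 60.0:
            self._window_start = now
            self._used = 0
        # Serve from reserved capacity while the request still fits the budget.
        if self._used + tokens <= self.reserved_tokens_per_minute:
            self._used += tokens
            return "provisioned"
        # Otherwise spill over instead of rejecting the request.
        return "on_demand"


router = SpilloverRouter(reserved_tokens_per_minute=1000)
for tokens, now in [(600, 0.0), (600, 10.0), (400, 20.0), (600, 61.0)]:
    print(now, tokens, router.route(tokens, now))
```

A production router would more likely rely on the provider's own throttling signals or a sliding window than on local fixed-window accounting; the fixed window is used here only to keep the sketch short.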
The Intelligent Routing Layer
At Phase 4, the routing layer abstracts provider complexity:
┌──────────────────────────────────────────────┐
│              Slack AI Features               │
│ (Search, Recap, Summaries, AI Search, etc.)  │
└────────────────────┬─────────────────────────┘
                     │
┌────────────────────▼─────────────────────────┐
│          Intelligent Routing Layer           │
│     ┌──────────────────────────────────┐     │
│     │ Metric-driven model selection    │     │
│     │ Experimental rules / A-B testing │     │
│     │ Automated circuit breaker        │     │
│     │ API normalization layer          │     │
│     │ Secretless cross-cloud auth      │     │
│     └──────────────────────────────────┘     │
└────────┬──────────────┬──────────────────────┘
         │              │
┌────────▼─────┐  ┌─────▼─────────┐
│ AWS Bedrock  │  │ GCP Vertex AI │
│ - PT + OD    │  │ - Multiple    │
│ - Anthropic, │  │   providers   │
│   Mistral,   │  │   exclusive   │
│   Meta, etc. │  │   on GCP      │
└──────────────┘  └───────────────┘
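The interaction between the circuit breaker's partial-open recovery and the per-feature fallback hierarchy can be sketched as below. The article does not disclose thresholds or the ramp policy, so every number here (the TTFT, p90, and 5xx limits, the 10% → 50% → 100% ramp, three healthy checks per step) is a placeholder, and the names `EndpointBreaker`, `is_healthy`, and `pick_model` are hypothetical.

```python
from typing import Dict, List, Optional


def is_healthy(ttft_ms: float, p90_ms: float, error_rate_5xx: float, *,
               max_ttft_ms: float = 1500.0, max_p90_ms: float = 8000.0,
               max_5xx: float = 0.02) -> bool:
    """Soft and hard failure signals combined; all limits are made up."""
    return (ttft_ms <= max_ttft_ms and p90_ms <= max_p90_ms
            and error_rate_5xx <= max_5xx)


class EndpointBreaker:
    """Closed (weight 1.0), open (0.0), or partial-open (ramping back up)."""

    def __init__(self, ramp=(0.1, 0.5, 1.0), checks_per_step: int = 3):
        self.ramp = ramp                      # assumed recovery schedule
        self.checks_per_step = checks_per_step
        self.weight = 1.0                     # share of traffic this endpoint may take
        self._step = len(ramp) - 1
        self._streak = 0

    def record(self, healthy: bool) -> None:
        if not healthy:
            # Any unhealthy check trips the breaker and restarts the ramp.
            self.weight, self._step, self._streak = 0.0, -1, 0
            return
        if self.weight >= 1.0:
            return                            # already fully closed
        self._streak += 1
        if self._streak >= self.checks_per_step:
            # Sustained health: advance one step along the ramp.
            self._step = min(self._step + 1, len(self.ramp) - 1)
            self.weight = self.ramp[self._step]
            self._streak = 0


def pick_model(hierarchy: List[str], breakers: Dict[str, EndpointBreaker],
               draw: float) -> Optional[str]:
    """Walk the feature's hierarchy; `draw` is a uniform sample in [0, 1)."""
    for model in hierarchy:
        if draw < breakers[model].weight:
            return model
    return None  # every model in the hierarchy is currently open


breakers = {"primary": EndpointBreaker(), "backup": EndpointBreaker()}
hierarchy = ["primary", "backup"]
breakers["primary"].record(is_healthy(ttft_ms=5000, p90_ms=900, error_rate_5xx=0.0))
print(pick_model(hierarchy, breakers, draw=0.3))   # primary is open
for _ in range(3):
    breakers["primary"].record(True)
print(breakers["primary"].weight)                   # first ramp step
```

In use, the caller would pass `random.random()` as `draw`. While the primary sits at a partial weight, only that fraction of requests reaches it and the remainder falls through to the next model in the hierarchy, which is one way to realise the gradual ramp the article describes.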
Operational numbers
| Datum | Value | Phase / context |
|---|---|---|
| Slack AI launch | early 2023 | Phase 1 |
| Bedrock migration | mid-2024 | Phase 2 |
| GCP Vertex AI added | early 2026 | Phase 4 |
| Quality lift on complex reasoning | ~10% | Phase 4 |
| Latency reduction on high-velocity / low-token workloads | ~67% | Phase 4 |
| Peak / off-peak variance per feature | up to 10× | Phase 3 motivation |
| PT commitment terms | 1–6 months | Phase 2/3 |
| Model feature lag (Bedrock vs SageMaker for Anthropic) | weeks to months | Phase 1 → 2 motivation |
| Migration incident count (SageMaker → Bedrock) | 0 | Phase 2 takeaway |
| GPU types referenced | A100 (Ampere), H100 (Hopper) | Phase 1 |
| FedRAMP compliance bar | maintained across all phases | All phases |
| Number of cloud providers at endpoint | 2 (AWS + GCP) | Phase 4 |
Caveats
- Tier-2 article passes scope decisively on multi-cloud LLM serving architecture grounds — distributed-systems-internals, scaling trade-offs, infrastructure architecture, production capacity-management trade-offs, named subsystem design.
- No specific provider model SKUs disclosed — Anthropic "models", OpenAI / Google not enumerated; the article argues the abstraction is the point.
- No dollar figures or absolute throughput numbers — only the +10% quality and -67% latency outcomes for Phase 4 are quantified; PT vs OD savings are characterised as "substantial" without specific numbers.
- Routing-layer internals partially disclosed: the circuit-breaker signals (p90 latency, 5xx rate, TTFT), the partial-open ramp policy, and the metric-quality benchmark feedback loop are described qualitatively, but specific numerical thresholds are not stated.
- Cross-cloud egress / data-sovereignty story incomplete — the article references "adhering to our regional data boundaries" and Bedrock's cross-US-region routing, but the specifics of GCP region selection, US-vs-EU routing, and per-customer tenancy scope are not detailed.
- Workload-specific routing examples thin: "Recaps" is named as the canonical bursty / async / OD example; "channel summaries" as the canonical PT / latency-sensitive example; "AI Search" is mentioned in the high-reasoning quality lift; specific feature-to-provider bindings at Phase 4 are not enumerated.
- No traffic-share split between AWS and GCP disclosed: the article explicitly frames GCP as "not just as a failover for redundancy, but as a strategic engine to accelerate product innovation" without stating the fraction of traffic served from each cloud.
- API normalisation layer mechanism only sketched — example given is "a 'Rate Limit Exceeded' from one provider and a 'Throttling Exception' from another were handled identically by our application logic"; the actual schema, error-code mapping, telemetry conversion are not disclosed.
- Secretless authentication mentioned without depth — "we solved cold start engineering hurdles by implementing secretless authentication" — specific identity federation shape (workload identity federation, AWS-GCP IAM bridging, short-lived OIDC) is not specified.
- MLflow-style observability, evals, and benchmarking — mentioned as the basis for "metric-driven model selection" but the eval substrate is not named (in contrast to e.g. Databricks' MLflow LLM judges from the 2026-05-22 OTel-tracing post).
- Three-year arc framing is retrospective — phases presented as a clean four-phase progression; no acknowledgement of false starts, abandoned alternatives, or workloads that didn't fit the pattern.
- No on-call or cost-attribution tooling disclosed: the article mentions the "on-call knowledge gap" and the "attribution challenge" as multi-cloud taxes but doesn't describe the tooling Slack built (or chose) to close them.
Source
- Original: https://slack.engineering/slack-ai-the-path-to-multi-cloud/
- Raw markdown: raw/slack/2026-05-28-slack-ai-the-path-to-multi-cloud-6c75284c.md
Related
- systems/slack-ai — the consumer feature suite (Search, Recap, summaries, AI Search) whose serving substrate this article describes.
- systems/slack-intelligent-routing-layer — the architectural endpoint Phase 4 produces.
- systems/aws-sagemaker-ai — Phase 1 substrate.
- systems/amazon-bedrock — Phase 2/3 substrate.
- systems/gcp-vertex-ai — Phase 4 added cloud.
- concepts/multi-cloud-llm-serving — the architectural posture canonicalised here.
- concepts/escrow-vpc-llm-serving — Phase 1 zero-knowledge environment.
- concepts/llm-model-feature-lag — primary driver of Phase 1 → 2.
- concepts/provisioned-throughput-vs-on-demand-llm — the PT/OD axis.
- concepts/llm-over-provisioning-cycle — the Phase 2 → 3 driver.
- concepts/llm-provider-commitment-lock-in — the Phase 2 → 3 second driver.
- concepts/api-normalization-multi-cloud-llm — the Phase 4 cross-cloud abstraction.
- concepts/model-to-feature-binding — Phase 4 quality + latency wins.
- concepts/concentration-risk-single-cloud-llm — the Phase 3 → 4 reliability driver.
- concepts/automated-circuit-breaker-with-partial-open-state — the resilience primitive.
- concepts/model-units — the Bedrock capacity primitive Slack adopts on the customer side.
- patterns/multi-cloud-llm-serving — the meta-pattern.
- patterns/provisioned-throughput-with-on-demand-spillover — the Phase 3 hybrid.
- patterns/api-normalization-layer-cross-provider — the abstraction enabler.
- patterns/model-fallback-hierarchy-with-circuit-breaker — the resilience pattern.
- patterns/zero-incident-llm-migration — the Phase 2 migration playbook.
- companies/slack