PATTERN Cited by 1 source

Multi-cloud LLM serving

Pattern

Run production LLM-powered features against managed model-serving endpoints from two or more independent cloud providers, fronted by an in-house abstraction layer that unifies API shape, error codes, rate-limiting, telemetry, and authentication, and that routes requests based on metric-driven model selection, real-time health signals, and per-workload optimisation criteria.

The canonical wiki implementation is Slack's Intelligent Routing Layer, spanning AWS Bedrock + GCP Vertex AI and reached through a three-year, four-phase evolution (early 2023 → early 2026). (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud)

When to use it

Production LLM workloads at scale — millions of users multiply the customer-visible impact of any single-cloud outage.
Vendor-exclusive frontier models — when the state-of-the-art for a specific feature is fragmented across providers' catalogues.
High-stakes reliability requirements — internal failover within one cloud is insufficient against provider-wide disruption.
Per-feature optimisation matters — different features have different cost / latency / quality / reasoning profiles, and the optimal model differs per feature.
Compliance allows cross-cloud routing — legal / FedRAMP / sovereignty constraints don't pin you to one provider.

When NOT to use it

Single-cloud is sufficient — if all required models are on one cloud at the right SLAs and per-feature optimisation isn't worth the operational tax.
Engineering bandwidth is constrained — the four named taxes (API normalisation, monitoring, attribution, on-call expertise) require sustained investment.
Compliance pins you to one provider — federal contracts or regulated workloads with single-cloud sovereignty requirements.
No vendor-exclusivity pressure — the model frontier is catalogued on one cloud and your roadmap is comfortable there.

Five structural pieces

        ┌─────────────────────────────────────────────┐
        │     Application features (Slack AI suite)   │
        └─────────────────────┬───────────────────────┘
                              │  unified internal API
        ┌─────────────────────▼───────────────────────┐
        │   Intelligent Routing Layer                 │
        │                                             │
        │   1. Metric-driven model selection          │
        │      (primary + designated backup per feat) │
        │                                             │
        │   2. Experimental rules / A-B testing       │
        │      (% traffic shaping, in-prod evals)     │
        │                                             │
        │   3. Automated circuit breaker              │
        │      + partial-open recovery state          │
        │      (TTFT, p90 latency, 5xx error rate)    │
        │                                             │
        │   4. API normalization layer                │
        │      (errors, rate-limits, telemetry, auth) │
        │                                             │
        │   5. Secretless cross-cloud authentication  │
        └────────┬───────────────┬────────────────────┘
                 │               │
        ┌────────▼─────┐   ┌─────▼─────────┐
        │ AWS Bedrock  │   │ GCP Vertex AI │
        │ (PT + OD)    │   │ (multi-       │
        │              │   │  provider)    │
        └──────────────┘   └───────────────┘
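
Piece 4, the API normalization layer, is at its core a translation table from provider-specific failures to one internal vocabulary. A minimal sketch, assuming a hypothetical `ErrorClass` taxonomy and `normalize()` helper. The provider-side keys are Amazon Bedrock Runtime exception names and Google RPC status codes as surfaced by Vertex AI, but how Slack actually buckets them is not disclosed.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    """Hypothetical provider-neutral error taxonomy used by the routing layer."""
    RATE_LIMITED = "rate_limited"  # back off or spill to other capacity
    UNAVAILABLE = "unavailable"    # counts against endpoint health
    TIMEOUT = "timeout"            # counts against endpoint health
    BAD_REQUEST = "bad_request"    # caller bug: never fail over
    AUTH = "auth"                  # credential / permission problem
    UNKNOWN = "unknown"            # not yet mapped (normalisation drift)


# Bedrock Runtime exception names -> internal class (bucketing is an assumption).
_BEDROCK = {
    "ThrottlingException": ErrorClass.RATE_LIMITED,
    "ServiceQuotaExceededException": ErrorClass.RATE_LIMITED,
    "ServiceUnavailableException": ErrorClass.UNAVAILABLE,
    "InternalServerException": ErrorClass.UNAVAILABLE,
    "ModelTimeoutException": ErrorClass.TIMEOUT,
    "ValidationException": ErrorClass.BAD_REQUEST,
    "AccessDeniedException": ErrorClass.AUTH,
}

# Google RPC canonical status codes -> internal class (bucketing is an assumption).
_VERTEX = {
    "RESOURCE_EXHAUSTED": ErrorClass.RATE_LIMITED,
    "UNAVAILABLE": ErrorClass.UNAVAILABLE,
    "INTERNAL": ErrorClass.UNAVAILABLE,
    "DEADLINE_EXCEEDED": ErrorClass.TIMEOUT,
    "INVALID_ARGUMENT": ErrorClass.BAD_REQUEST,
    "PERMISSION_DENIED": ErrorClass.AUTH,
    "UNAUTHENTICATED": ErrorClass.AUTH,
}

_TABLES = {"bedrock": _BEDROCK, "vertex": _VERTEX}


@dataclass(frozen=True)
class NormalizedError:
    provider: str
    raw_code: str
    error_class: ErrorClass

    @property
    def retry_elsewhere(self) -> bool:
        # Only capacity / health failures justify failover; a malformed
        # request would fail identically on every provider.
        return self.error_class in (
            ErrorClass.RATE_LIMITED, ErrorClass.UNAVAILABLE, ErrorClass.TIMEOUT)


def normalize(provider: str, raw_code: str) -> NormalizedError:
    """Map a provider-specific error code onto the internal taxonomy."""
    table = _TABLES.get(provider, {})
    return NormalizedError(provider, raw_code, table.get(raw_code, ErrorClass.UNKNOWN))
```

Unknown codes deliberately fall through to `UNKNOWN` instead of raising, which turns the "API normalisation drift" risk into an observable metric rather than an outage.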

The pattern requires:

Abstraction layer with unified internal contract.
Per-feature model bindings with primary + backup.
Health-driven circuit breaker for endpoint-level degradation response.
API normalisation for cross-provider error / rate-limit / telemetry uniformity.
Secretless cross-cloud authentication plumbing.
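
Pieces 1 and 2 reduce to a per-feature lookup: a primary picked from offline metrics, a designated backup, and an optional experiment slice. A minimal sketch under stated assumptions: `FeatureBinding`, `select_endpoint`, and the model strings are hypothetical, the health predicate would come from the circuit breaker, and hashing a request key into percentage buckets is one common way to shape A/B traffic, not a technique Slack has confirmed.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Endpoint:
    provider: str  # e.g. "bedrock" or "vertex"
    model: str     # hypothetical model identifier


@dataclass(frozen=True)
class FeatureBinding:
    primary: Endpoint                     # chosen offline from per-feature evals
    backup: Endpoint                      # designated failover, ideally on the other cloud
    candidate: Optional[Endpoint] = None  # model under in-production evaluation
    experiment_pct: float = 0.0           # share of traffic (0-100) sent to candidate


def _bucket(request_key: str) -> float:
    """Deterministically map a key to [0, 100) so a user stays in one arm."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000 / 100.0


def select_endpoint(binding: FeatureBinding, request_key: str,
                    is_healthy: Callable[[Endpoint], bool]) -> Endpoint:
    # 1. Experimental rule: a healthy candidate receives its traffic slice.
    if (binding.candidate is not None
            and _bucket(request_key) < binding.experiment_pct
            and is_healthy(binding.candidate)):
        return binding.candidate
    # 2. Metric-driven default for this feature.
    if is_healthy(binding.primary):
        return binding.primary
    # 3. Designated backup; returned even if also degraded, so the caller
    #    still gets an answer (or a normalised error) rather than nothing.
    return binding.backup
```

Hashing the user or workspace ID, rather than drawing a random number per request, keeps each user in one arm for the whole experiment, which makes in-production quality comparisons cleaner.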

How to evolve to it (four phases)

Slack's three-year arc canonicalises the migration trajectory:

Phase 1: Single cloud, escrow VPC

Host Anthropic models in an escrow VPC on AWS SageMaker, multi-region within one cloud, with On-Demand Capacity Reservations (ODCR) plus cron-based scaling. This phase exposes model feature lag when the model provider prioritises a different launchpad (Bedrock).

Phase 2: Migrate to provider's primary launchpad

Move to fully managed Amazon Bedrock with Provisioned Throughput. Eliminates feature lag for that provider. Use the zero-incident LLM migration playbook (compliance / capacity / quality / rollout).

Phase 3: Hybrid PT + OD with spillover

Add On-Demand capacity for bursty workloads (patterns/provisioned-throughput-with-on-demand-spillover) and build an internal model fallback hierarchy on the same provider. This phase exposes concentration risk: failover within a single provider is insufficient against a provider-wide disruption.
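
Phase 3's spillover and same-provider fallback can be expressed as an ordered list of capacity tiers. A minimal sketch, assuming a hypothetical `Throttled` exception raised by the client wrapper and hypothetical tier names; real provider throttling errors would need to be mapped onto it first.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


class Throttled(Exception):
    """Hypothetical: raised by an invoke function when its capacity tier is exhausted."""


@dataclass(frozen=True)
class Tier:
    name: str                     # e.g. "pt:model-a", "od:model-a", "od:model-b"
    invoke: Callable[[str], str]  # hypothetical client call: prompt -> completion


def invoke_with_spillover(tiers: Sequence[Tier], prompt: str) -> Tuple[str, str]:
    """Try tiers in order; only a throttle moves the request to the next tier.

    Suggested ordering: prepaid Provisioned Throughput first, then On-Demand
    for the same model, then On-Demand for a fallback model on the same
    provider. Any other exception propagates unchanged, because a malformed
    request would fail the same way on every tier.
    """
    last_error: Optional[Exception] = None
    for tier in tiers:
        try:
            return tier.name, tier.invoke(prompt)
        except Throttled as exc:
            last_error = exc
    raise RuntimeError("all capacity tiers throttled") from last_error
```

Because only throttling triggers spillover, a malformed request fails fast instead of being replayed against every tier.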

Phase 4: Multi-cloud expansion

Add a second cloud (GCP Vertex AI in Slack's case). Build the Intelligent Routing Layer with API normalisation, cross-cloud auth, model-to-feature binding, and the partial-open circuit breaker. Disclosed outcome: ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads.
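
The partial-open circuit breaker can be modelled as the classic closed / open / half-open state machine, with the half-open state admitting a configurable share of traffic instead of a single probe. A minimal sketch, assuming hypothetical parameter names and defaults; Slack disclosed the trigger signals (TTFT, p90 latency, 5xx error rate) and the existence of a partial-open state, not its exact transitions.

```python
import random
import time
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    CLOSED = "closed"         # endpoint receives all of its traffic
    OPEN = "open"             # endpoint receives none; the backup serves
    PARTIAL_OPEN = "partial"  # endpoint receives a small probe share


class CircuitBreaker:
    """Breaker with a partial-open recovery state (hypothetical parameters).

    The caller decides per request whether the outcome was healthy, e.g. no
    5xx and TTFT / latency under threshold, so "up but slow" counts as failure.
    """

    def __init__(self, window: int = 50, trip_ratio: float = 0.2,
                 cooldown_s: float = 30.0, probe_pct: float = 0.1,
                 probes_to_close: int = 20,
                 clock: Callable[[], float] = time.monotonic,
                 rng: Optional[random.Random] = None) -> None:
        self.window = window            # outcomes considered while CLOSED
        self.trip_ratio = trip_ratio    # failure share that opens the breaker
        self.cooldown_s = cooldown_s    # time spent OPEN before probing
        self.probe_pct = probe_pct      # traffic share admitted while PARTIAL_OPEN
        self.probes_to_close = probes_to_close
        self.clock = clock
        self.rng = rng if rng is not None else random.Random()
        self.state = State.CLOSED
        self.outcomes: List[bool] = []  # True = healthy
        self.opened_at = 0.0
        self.probe_successes = 0

    def allow(self) -> bool:
        """Should this request go to the guarded endpoint (else use backup)?"""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state, self.probe_successes = State.PARTIAL_OPEN, 0
        if self.state is State.CLOSED:
            return True
        if self.state is State.PARTIAL_OPEN:
            return self.rng.random() < self.probe_pct
        return False

    def record(self, healthy: bool) -> None:
        if self.state is State.PARTIAL_OPEN:
            if not healthy:
                self._trip()            # any failed probe reopens
            else:
                self.probe_successes += 1
                if self.probe_successes >= self.probes_to_close:
                    self.state, self.outcomes = State.CLOSED, []
            return
        if self.state is State.OPEN:
            return                      # stragglers from before the trip
        self.outcomes = (self.outcomes + [healthy])[-self.window:]
        failures = self.outcomes.count(False)
        if len(self.outcomes) == self.window and failures / self.window > self.trip_ratio:
            self._trip()

    def _trip(self) -> None:
        self.state, self.opened_at = State.OPEN, self.clock()
```

When `allow()` returns `False`, the router sends the request to the feature's designated backup. The injectable `clock` and `rng` exist only to make the state machine testable.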

Trade-offs

Compared to…	Wins	Loses
Single-cloud LLM serving	Provider redundancy + best-of-breed model access + per-feature optimisation	API normalisation overhead + cross-cloud monitoring complexity + cost attribution complexity + on-call knowledge breadth
Multi-region single-cloud	Provider-level redundancy beyond regional outages	Same operational taxes; multi-region is cheaper if provider outages are rare
Multi-model single-cloud	Cross-provider model exclusivity coverage	Same operational taxes; multi-model alone doesn't address provider-wide outages
Self-hosted multi-cloud	Maximum flexibility	Loses managed-service operational savings; weights distribution / GPU procurement / scaling all become customer's problem

Operational taxes (Slack disclosed)

API and behavioural friction — addressed by API normalisation layer.
Operational monitoring complexity — unified dashboard pulling per-cloud telemetry.
The attribution challenge — per-feature cost tracking when traffic shifts dynamically.
The on-call knowledge gap — engineers can't be single-cloud specialists.
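
The attribution challenge is easiest to handle by tagging cost at request time rather than reconstructing it from invoices later. A minimal sketch, assuming hypothetical prices, feature names, and a `CostLedger` class; real pricing differs per model, tier, and contract, and Provisioned Throughput is billed for reserved capacity rather than per token, so a real ledger would need to amortise it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Dict, Tuple


@dataclass(frozen=True)
class Price:
    """Hypothetical per-1K-token on-demand prices."""
    input_per_1k: float
    output_per_1k: float


class CostLedger:
    """Attribute spend to the feature that caused it, whichever cloud served it.

    Cloud invoices don't know which product feature issued a request, so the
    routing layer records token usage at request time, keyed by feature.
    """

    def __init__(self, prices: Dict[Tuple[str, str], Price]) -> None:
        self.prices = prices  # (provider, model) -> Price
        # (feature, provider) -> accumulated cost
        self.spend: DefaultDict[Tuple[str, str], float] = defaultdict(float)

    def record(self, feature: str, provider: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        price = self.prices[(provider, model)]
        cost = (input_tokens / 1000 * price.input_per_1k
                + output_tokens / 1000 * price.output_per_1k)
        self.spend[(feature, provider)] += cost
        return cost

    def by_feature(self) -> Dict[str, float]:
        totals: DefaultDict[str, float] = defaultdict(float)
        for (feature, _provider), cost in self.spend.items():
            totals[feature] += cost
        return dict(totals)

    def by_provider(self, feature: str) -> Dict[str, float]:
        return {prov: cost for (feat, prov), cost in self.spend.items() if feat == feature}
```

Because spend is keyed by feature first, a dynamic shift from one cloud to the other leaves the feature's total continuous while `by_provider()` shows the split.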

Composition with other patterns

patterns/api-normalization-layer-cross-provider — the abstraction enabler.
patterns/model-fallback-hierarchy-with-circuit-breaker — the resilience pattern composed at routing-layer altitude.
patterns/provisioned-throughput-with-on-demand-spillover — composes inside one cloud as the cost / predictability trade-off; multi-cloud composes outside.
patterns/zero-incident-llm-migration — applies to each cross-cloud and cross-substrate migration step.

Reflections (Slack's five takeaways verbatim)

"Scaling safely requires XFN parity": Legal / Risk / Compliance / Security alignment with Engineering is the actual unblocker.
"The abstraction layer is a core requirement": agility and speed to market are the competitive edge, so the routing layer matters more than any single model choice.
"Treat architecture as a living document": provider-agnostic routing lets you adopt breakthroughs without a rewrite.
"Reliability requires provider agnosticism": internal failovers within one cloud aren't enough.
"Redefining the meaning of 'Failure'": soft failures (p90 spikes, feedback trends) are first-class triggers; an "LLM service that is 'up' but slow is effectively broken".

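Treating "up but slow" as broken means evaluating latency percentiles, not just error codes, over a rolling window. A minimal sketch, assuming hypothetical thresholds and a nearest-rank p90; its output could feed the per-request healthy/unhealthy decision of a circuit breaker. User-feedback trends, the other soft signal named above, are out of scope here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Sample:
    ttft_ms: float     # time to first token
    latency_ms: float  # full response latency
    status: int        # HTTP-style status code


def p90(values: List[float]) -> float:
    """Nearest-rank 90th percentile (integer arithmetic avoids float rounding)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n), 1-based
    return ordered[rank - 1]


class SoftFailureDetector:
    """Flag an endpoint that is 'up' but slow, not only one that is erroring.

    Thresholds are hypothetical defaults; real values would be tuned per feature.
    """

    def __init__(self, window: int = 200, max_p90_ttft_ms: float = 1500.0,
                 max_p90_latency_ms: float = 8000.0, max_5xx_rate: float = 0.05) -> None:
        self.samples: Deque[Sample] = deque(maxlen=window)  # rolling window
        self.max_p90_ttft_ms = max_p90_ttft_ms
        self.max_p90_latency_ms = max_p90_latency_ms
        self.max_5xx_rate = max_5xx_rate

    def observe(self, sample: Sample) -> None:
        self.samples.append(sample)

    def breaches(self) -> List[str]:
        """Names of the signals currently over threshold (empty = healthy)."""
        if not self.samples:
            return []
        window = list(self.samples)
        found = []
        if p90([s.ttft_ms for s in window]) > self.max_p90_ttft_ms:
            found.append("p90_ttft")
        if p90([s.latency_ms for s in window]) > self.max_p90_latency_ms:
            found.append("p90_latency")
        if sum(500 <= s.status < 600 for s in window) / len(window) > self.max_5xx_rate:
            found.append("5xx_rate")
        return found
```

Note that an endpoint returning only 200s can still be flagged: a latency regression in the slowest tenth of requests is enough.
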
Risks and mitigations

API normalisation drift → a provider introduces a new error code or API change that the layer does not yet normalise. Mitigation: per-provider integration tests + provider-API change monitoring.
Cost attribution gaps → multi-cloud billing complexity hides per-feature cost. Mitigation: deep instrumentation across billing systems.
Cross-cloud auth credential leak → secretless auth reduces this risk but doesn't eliminate it. Mitigation: short-lived tokens + auditable federation flows.
Model selection latency / cost overhead → the routing layer becomes a hot-path bottleneck. Mitigation: cache per-feature routing decisions and let only health signals trigger re-evaluation.
Compliance drift → multi-cloud expands the data-residency / privacy attack surface. Mitigation: per-cloud regional data boundaries codified as routing constraints.
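
Two of the mitigations above, residency constraints as routing filters and cached routing decisions re-evaluated only on health changes, fit in one small router. A minimal sketch, assuming a hypothetical `ConstrainedRouter`, boundary names, and an epoch counter bumped by the health monitor; the region strings are ordinary AWS and GCP region identifiers used purely as examples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Endpoint:
    provider: str  # e.g. "bedrock" or "vertex"
    region: str    # e.g. "us-east-1" (AWS) or "europe-west4" (GCP)
    model: str     # hypothetical model identifier


class ConstrainedRouter:
    """Residency-filtered routing with cached per-feature decisions."""

    def __init__(self, candidates: Dict[str, List[Endpoint]],
                 allowed_regions: Dict[str, Set[str]],
                 is_healthy: Callable[[Endpoint], bool]) -> None:
        self.candidates = candidates            # feature -> endpoints, preferred first
        self.allowed_regions = allowed_regions  # data boundary -> permitted regions
        self.is_healthy = is_healthy
        self.health_epoch = 0                   # bumped by the health monitor
        self._cache: Dict[Tuple[str, str], Tuple[int, Endpoint]] = {}

    def on_health_change(self) -> None:
        # Invalidate every cached decision lazily; nothing is recomputed
        # until the next request for that feature arrives.
        self.health_epoch += 1

    def route(self, feature: str, boundary: str) -> Endpoint:
        key = (feature, boundary)
        cached = self._cache.get(key)
        if cached is not None and cached[0] == self.health_epoch:
            return cached[1]  # hot path: no health checks, no scanning
        permitted = self.allowed_regions[boundary]
        for endpoint in self.candidates[feature]:
            # Residency is a hard constraint, checked before health.
            if endpoint.region in permitted and self.is_healthy(endpoint):
                self._cache[key] = (self.health_epoch, endpoint)
                return endpoint
        raise LookupError(f"no healthy endpoint for {feature!r} inside {boundary!r}")
```

Residency is applied as a hard filter before health, so a regional outage produces an explicit `LookupError` rather than a silent cross-boundary failover.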

Seen in

sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the multi-cloud LLM serving pattern as the architectural endpoint of Slack's three-year Slack AI evolution. Production substrate for millions of users on AWS Bedrock + GCP Vertex AI behind the Intelligent Routing Layer. Disclosed Phase 4 outcomes: ~10% quality lift on complex reasoning, ~67% latency reduction on high-velocity workloads.