
PATTERN Cited by 1 source

Multi-cloud LLM serving

Pattern

Run production LLM-powered features against managed model-serving endpoints from two or more independent cloud providers, fronted by an in-house abstraction layer that unifies API shape, error codes, rate-limiting, telemetry, and authentication, and that routes requests based on metric-driven model selection, real-time health signals, and per-workload optimisation criteria.

The canonical implementation on this wiki is Slack's Intelligent Routing Layer, which spans AWS Bedrock and GCP Vertex AI and was reached through a three-year, four-phase evolution (early 2023 → early 2026). (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud)

When to use it

  • Production LLM workloads at scale — millions of users multiply the customer-visible impact of any single-cloud outage.
  • Vendor-exclusive frontier models — when the state-of-the-art for a specific feature is fragmented across providers' catalogues.
  • High-stakes reliability requirements — internal failover within one cloud is insufficient against provider-wide disruption.
  • Per-feature optimisation matters — different features have different cost / latency / quality / reasoning profiles, and the optimal model differs per feature.
  • Compliance allows cross-cloud routing — legal / FedRAMP / sovereignty constraints don't pin you to one provider.

When NOT to use it

  • Single-cloud is sufficient — if all required models are on one cloud at the right SLAs and per-feature optimisation isn't worth the operational tax.
  • Engineering bandwidth is constrained — the four named taxes (API normalisation, monitoring, attribution, on-call expertise) require sustained investment.
  • Compliance pins you to one provider — federal contracts or regulated workloads with single-cloud sovereignty requirements.
  • No vendor-exclusivity pressure — the model frontier is catalogued on one cloud and your roadmap is comfortable there.

Five structural pieces

        ┌─────────────────────────────────────────────┐
        │     Application features (Slack AI suite)   │
        └─────────────────────┬───────────────────────┘
                              │  unified internal API
        ┌─────────────────────▼───────────────────────┐
        │   Intelligent Routing Layer                 │
        │                                             │
        │   1. Metric-driven model selection          │
        │      (primary + designated backup per feat) │
        │                                             │
        │   2. Experimental rules / A-B testing       │
        │      (% traffic shaping, in-prod evals)     │
        │                                             │
        │   3. Automated circuit breaker              │
        │      + partial-open recovery state          │
        │      (TTFT, p90 latency, 5xx error rate)    │
        │                                             │
        │   4. API normalisation layer                │
        │      (errors, rate-limits, telemetry, auth) │
        │                                             │
        │   5. Secretless cross-cloud authentication  │
        └────────┬───────────────┬────────────────────┘
                 │               │
        ┌────────▼─────┐   ┌─────▼─────────┐
        │ AWS Bedrock  │   │ GCP Vertex AI │
        │ (PT + OD)    │   │ (multi-       │
        │              │   │  provider)    │
        └──────────────┘   └───────────────┘
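
Item 2 in the diagram (percentage-based traffic shaping for in-production evaluation) could be sketched as a deterministic hash bucket. This is an illustrative sketch, not Slack's implementation: the function names, the experiment name, and the choice of SHA-256 are all assumptions.

```python
import hashlib


def bucket(request_id: str, experiment: str) -> float:
    """Map a request deterministically to a value in [0, 100)."""
    digest = hashlib.sha256(f"{experiment}:{request_id}".encode()).digest()
    # Use the first 8 bytes as an unsigned integer and scale to a percentage.
    return int.from_bytes(digest[:8], "big") / 2**64 * 100


def pick_model(request_id: str, experiment: str,
               control: str, candidate: str, candidate_pct: float) -> str:
    """Send roughly candidate_pct percent of traffic to the candidate model."""
    if bucket(request_id, experiment) < candidate_pct:
        return candidate
    return control
```

Hashing on the experiment name plus a stable request (or user) identifier keeps each caller in the same arm across retries, and raising `candidate_pct` only moves callers from control to candidate, never the other way.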

The pattern requires:

  1. Abstraction layer with unified internal contract.
  2. Per-feature model bindings with primary + backup.
  3. Health-driven circuit breaker for endpoint-level degradation response.
  4. API normalisation for cross-provider error / rate-limit / telemetry uniformity.
  5. Secretless cross-cloud authentication plumbing.
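
Pieces 1 and 2 can be sketched together as a router that looks up a per-feature binding and falls back to the designated backup when the primary raises a unified error. Everything here (`Router`, `Binding`, `ProviderError`, and the endpoint names in the usage note) is hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class ProviderError(Exception):
    """Unified error that every provider adapter is assumed to raise."""


@dataclass(frozen=True)
class Binding:
    primary: str  # endpoint name used by default for the feature
    backup: str   # designated fallback, ideally on a different cloud


# An endpoint adapter takes a prompt and returns completion text.
Endpoint = Callable[[str], str]


class Router:
    def __init__(self, bindings: Dict[str, Binding],
                 endpoints: Dict[str, Endpoint]) -> None:
        self.bindings = bindings
        self.endpoints = endpoints

    def complete(self, feature: str, prompt: str) -> str:
        """Call the feature's primary endpoint, then its backup on failure."""
        binding = self.bindings[feature]
        try:
            return self.endpoints[binding.primary](prompt)
        except ProviderError:
            # A backup failure propagates to the caller unchanged.
            return self.endpoints[binding.backup](prompt)
```

A feature would be wired up as, for example, `Router({"summarise": Binding("bedrock/model-a", "vertex/model-b")}, adapters)`, so application code names only the feature and never a provider.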

How to evolve to it (four phases)

Slack's three-year arc canonicalises the migration trajectory:

Phase 1 — Single cloud, escrow VPC

Anthropic models hosted in an escrow VPC on AWS SageMaker, multi-region within one cloud, with On-Demand Capacity Reservations (ODCR) and cron-based scaling. This phase exposes model feature lag when the provider prioritises a different launchpad (Bedrock).

Phase 2 — Migrate to provider's primary launchpad

Move to fully managed Amazon Bedrock with Provisioned Throughput (PT). This eliminates feature lag for that provider. Use the zero-incident LLM migration playbook (compliance / capacity / quality / rollout).

Phase 3 — Hybrid PT + OD with spillover

Add On-Demand (OD) capacity for bursty workloads (patterns/provisioned-throughput-with-on-demand-spillover). Build an internal model fallback hierarchy on the same provider. This phase exposes concentration risk: single-provider failover is insufficient against a provider-wide disruption.
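
The spillover idea can be sketched as follows, under a strong simplifying assumption: provisioned capacity is modelled as a fixed number of concurrent request slots, whereas real provisioned capacity is typically sized in throughput terms. `SpilloverPool` and the tier labels are hypothetical names.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class SpilloverPool:
    """Prefer provisioned capacity; overflow to on-demand when it is full."""

    def __init__(self, pt_slots: int) -> None:
        # Each slot stands for one in-flight request on provisioned capacity.
        self._slots = threading.BoundedSemaphore(pt_slots)

    @contextmanager
    def lease(self) -> Iterator[str]:
        """Yield the tier the current request should be sent to."""
        got_pt = self._slots.acquire(blocking=False)
        try:
            yield "provisioned" if got_pt else "on_demand"
        finally:
            if got_pt:
                self._slots.release()
```

A caller would wrap each request in `with pool.lease() as tier:` and pick the matching endpoint, so bursts beyond the provisioned slots go to on-demand instead of queueing.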

Phase 4 — Multi-cloud expansion

Add a second cloud (GCP Vertex AI in Slack's case). Build the Intelligent Routing Layer with API normalisation, cross-cloud auth, model-to-feature binding, and the partial-open circuit breaker. Disclosed outcome: ~10% quality lift on complex reasoning + ~67% latency reduction on high-velocity / low-token workloads.
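
A minimal sketch of a circuit breaker with a partial-open recovery state, assuming it trips on p90 latency or error rate over a sliding window and then admits a trickle of probe requests after a cooldown. The thresholds, the state names, and the probe policy are assumptions rather than Slack's disclosed values. The `latency_s` argument could carry TTFT or end-to-end latency.

```python
import math
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """closed -> open -> partial_open -> closed, driven by health samples."""

    def __init__(self, max_p90_s: float = 2.0, max_error_rate: float = 0.05,
                 window: int = 50, min_samples: int = 10,
                 cooldown_s: float = 30.0, probe_every: int = 4,
                 probe_successes: int = 5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_p90_s = max_p90_s
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.cooldown_s = cooldown_s
        self.probe_every = probe_every
        self.probe_successes = probe_successes
        self._clock = clock
        self._samples: Deque[Tuple[float, bool]] = deque(maxlen=window)
        self.state = "closed"
        self._opened_at = 0.0
        self._seen = 0
        self._probe_ok = 0

    def allow(self) -> bool:
        """Return True if the next request may go to this endpoint."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_s:
                return False
            # Cooldown elapsed: start admitting a fraction of traffic as probes.
            self.state, self._seen, self._probe_ok = "partial_open", 0, 0
        if self.state == "partial_open":
            self._seen += 1
            return (self._seen - 1) % self.probe_every == 0
        return True

    def record(self, latency_s: float, ok: bool) -> None:
        """Feed back one finished request; ok=False models a 5xx response."""
        if self.state == "open":
            return
        if self.state == "partial_open":
            if not ok or latency_s > self.max_p90_s:
                self._trip()  # a bad probe reopens the breaker
            else:
                self._probe_ok += 1
                if self._probe_ok >= self.probe_successes:
                    self.state = "closed"
            return
        self._samples.append((latency_s, ok))
        if len(self._samples) < self.min_samples:
            return
        latencies = sorted(s[0] for s in self._samples)
        p90 = latencies[math.ceil(0.9 * len(latencies)) - 1]  # nearest rank
        errors = sum(1 for s in self._samples if not s[1])
        if p90 > self.max_p90_s or errors / len(self._samples) > self.max_error_rate:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._samples.clear()
```

Because slow-but-successful responses count toward the p90 check, an endpoint that is "up" but degraded trips the breaker the same way a burst of 5xx errors does.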

Trade-offs

| Compared to… | Wins | Loses |
| --- | --- | --- |
| Single-cloud LLM serving | Provider redundancy + best-of-breed model access + per-feature optimisation | API normalisation overhead + cross-cloud monitoring complexity + cost attribution complexity + on-call knowledge breadth |
| Multi-region single-cloud | Provider-level redundancy beyond regional outages | Same operational taxes; multi-region is cheaper if provider outages are rare |
| Multi-model single-cloud | Cross-provider model exclusivity coverage | Same operational taxes; multi-model alone doesn't address provider-wide outages |
| Self-hosted multi-cloud | Maximum flexibility | Loses managed-service operational savings; weights distribution / GPU procurement / scaling all become customer's problem |

Operational taxes (Slack disclosed)

  1. API and behavioural friction — addressed by API normalisation layer.
  2. Operational monitoring complexity — unified dashboard pulling per-cloud telemetry.
  3. The attribution challenge — per-feature cost tracking when traffic shifts dynamically.
  4. The on-call knowledge gap — engineers can't be single-cloud specialists.
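
As an illustration of the first tax, error normalisation can be sketched as a lookup from (provider, provider-side code) to a unified internal code. The `UnifiedError` enum, the `normalise` helper, and the provider labels are hypothetical. The provider-side strings are examples of error names and status codes and should be verified against each provider's current documentation before being relied on.

```python
from enum import Enum
from typing import Dict, Tuple


class UnifiedError(Enum):
    RATE_LIMITED = "rate_limited"
    UNAVAILABLE = "unavailable"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


# (provider label, provider-side code) -> unified code. Illustrative entries.
_ERROR_MAP: Dict[Tuple[str, str], UnifiedError] = {
    ("bedrock", "ThrottlingException"): UnifiedError.RATE_LIMITED,
    ("bedrock", "ServiceUnavailableException"): UnifiedError.UNAVAILABLE,
    ("bedrock", "ModelTimeoutException"): UnifiedError.TIMEOUT,
    ("vertex", "RESOURCE_EXHAUSTED"): UnifiedError.RATE_LIMITED,
    ("vertex", "UNAVAILABLE"): UnifiedError.UNAVAILABLE,
    ("vertex", "DEADLINE_EXCEEDED"): UnifiedError.TIMEOUT,
}


def normalise(provider: str, code: str) -> UnifiedError:
    """Translate a provider-specific error code into the unified contract."""
    return _ERROR_MAP.get((provider, code), UnifiedError.UNKNOWN)
```

Returning `UNKNOWN` for unmapped codes, rather than raising, gives monitoring a single signal to alert on when a provider introduces an error the layer does not yet recognise.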

Composition with other patterns

Reflections (Slack's five takeaways verbatim)

  1. "Scaling safely requires XFN parity" — Legal / Risk / Compliance / Security alignment with Engineering as the actual unblocker.
  2. "The abstraction layer is a core requirement" — agility and speed to market are the competitive edge; the routing layer dominates the model choice.
  3. "Treat architecture as a living document" — provider-agnostic routing lets you adopt breakthroughs without a rewrite.
  4. "Reliability requires provider agnosticism" — internal failovers within one cloud aren't enough.
  5. "Redefining the meaning of 'Failure'" — soft failures (p90 spikes, feedback trends) are first-class triggers; an "LLM service that is 'up' but slow is effectively broken".

Risks and mitigations

  • API normalisation drift → a provider releases a new error code or API version and the layer stops normalising it. Mitigation: per-provider integration tests + provider-API change monitoring.
  • Cost attribution gaps → multi-cloud billing complexity hides per-feature cost. Mitigation: deep instrumentation across billing systems.
  • Cross-cloud auth credential leak → secretless auth reduces but doesn't eliminate the risk. Mitigation: short-lived tokens + auditable federation flows.
  • Model selection latency / cost overhead → routing layer becomes a hot path bottleneck. Mitigation: per-feature routing decisions cached; only health signals drive re-evaluation.
  • Compliance drift → multi-cloud expands the data-residency / privacy attack surface. Mitigation: per-cloud regional data boundaries codified as routing constraints.
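
The last mitigation, data boundaries codified as routing constraints, could look like a filter applied before any model selection or failover. This is a sketch under assumptions: `Endpoint`, `Policy`, and `eligible` are hypothetical names, and the endpoint names and regions in the usage note are placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Endpoint:
    name: str
    cloud: str
    region: str


@dataclass(frozen=True)
class Policy:
    """Per-workload boundary: which clouds and regions may see the data."""
    allowed_clouds: FrozenSet[str]
    allowed_regions: FrozenSet[str]


def eligible(endpoints: List[Endpoint], policy: Policy) -> List[Endpoint]:
    """Keep only endpoints inside the policy, preserving preference order."""
    return [
        e for e in endpoints
        if e.cloud in policy.allowed_clouds and e.region in policy.allowed_regions
    ]
```

For example, a policy of `Policy(frozenset({"aws"}), frozenset({"us-east-1"}))` applied to a candidate list containing one AWS endpoint in `us-east-1` and one GCP endpoint in `us-central1` leaves only the AWS endpoint, so failover can never route that workload across the boundary.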

Seen in

  • sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud — canonical wiki disclosure of the multi-cloud LLM serving pattern as the architectural endpoint of Slack's three-year Slack AI evolution. Production substrate for millions of users on AWS Bedrock + GCP Vertex AI behind the Intelligent Routing Layer. Disclosed Phase 4 outcomes: ~10% quality lift on complex reasoning, ~67% latency reduction on high-velocity workloads.