Slack Intelligent Routing Layer
The Intelligent Routing Layer is Slack AI's internal LLM abstraction layer that fronts every model provider Slack uses (initially Amazon Bedrock, expanded to GCP Vertex AI in early 2026) and exposes a unified internal contract to feature teams. It is the architectural endpoint of Slack AI's three-year multi-cloud evolution (Source: sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud).
The post frames it verbatim: "By enhancing our abstraction layer into an Intelligent Routing Layer, we ensured that users receive the fastest, highest-quality response available. If one model or provider slows down, the system instantly reroutes the request to a better-performing alternative, making the underlying complexity completely invisible to the user while maintaining a seamless experience."
Architecture
┌──────────────────────────────────────────────┐
│              Slack AI Features               │
│ (Search, Recap, Summaries, AI Search, etc.)  │
└────────────────────┬─────────────────────────┘
                     │ unified internal API
┌────────────────────▼─────────────────────────┐
│          Intelligent Routing Layer           │
│   ┌──────────────────────────────────────┐   │
│   │ 1. Metric-driven model selection     │   │
│   │    + designated backup models        │   │
│   ├──────────────────────────────────────┤   │
│   │ 2. Experimental rules / A-B testing  │   │
│   │    (% traffic shaping, fast loop)    │   │
│   ├──────────────────────────────────────┤   │
│   │ 3. Automated circuit breaker         │   │
│   │    + health monitoring (TTFT, p90,   │   │
│   │    5xx) + partial-open recovery      │   │
│   ├──────────────────────────────────────┤   │
│   │ 4. API normalization layer           │   │
│   │    (rate-limit, errors, telemetry)   │   │
│   ├──────────────────────────────────────┤   │
│   │ 5. Secretless cross-cloud auth       │   │
│   └──────────────────────────────────────┘   │
└────────┬──────────────┬──────────────────────┘
         │              │
┌────────▼─────┐  ┌─────▼─────────┐
│ AWS Bedrock  │  │ GCP Vertex AI │
│ (PT + OD)    │  │ (multi-       │
│              │  │  provider)    │
└──────────────┘  └───────────────┘
Five subsystems (disclosed)
1. Metric-driven model selection
Per-feature model bindings are chosen from internal quality metrics. Verbatim: "We use our internal quality metrics to determine the optimal model for each feature. For instance, if our benchmarks show a specific LLM outperforms others for 'Recaps,' the router directs traffic accordingly. Crucially, we always designate backup models for every feature; if the primary choice doesn't meet our performance or quality thresholds in real-time, the system knows exactly where to go next."
This canonicalises the primary + backup pair as a property of every feature configuration, not an exception. See concepts/model-to-feature-binding and patterns/model-fallback-hierarchy-with-circuit-breaker.
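A minimal sketch of what a primary-plus-backup binding might look like as data, assuming a simple ordered fallback. The names `FeatureBinding`, `select_model`, `meets_thresholds`, and the `model-a`/`model-b`/`model-c` identifiers are hypothetical; the post does not disclose Slack's actual configuration format or model names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FeatureBinding:
    """Hypothetical per-feature routing config: one primary, ordered backups."""
    feature: str
    primary: str
    backups: tuple[str, ...]

    def candidates(self) -> tuple[str, ...]:
        # Primary first, then backups in their designated order.
        return (self.primary, *self.backups)


def select_model(binding: FeatureBinding,
                 meets_thresholds: Callable[[str], bool]) -> str:
    """Return the first candidate that currently meets its thresholds."""
    for model in binding.candidates():
        if meets_thresholds(model):
            return model
    # Assumption: if no candidate qualifies, fall back to the primary
    # rather than failing the request outright.
    return binding.primary


# Placeholder model names, not Slack's real bindings.
recaps = FeatureBinding("recaps", primary="model-a", backups=("model-b", "model-c"))
```

Modelling the backups as a non-optional field of the binding mirrors the point that a backup is part of every feature's configuration rather than a special case.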
2. Experimental rules & A/B testing
Traffic-percentage routing for new-model evaluation. Verbatim: "This capability has fundamentally changed our release velocity. When we wanted to test the latest LLMs, after our security and compliance verifications, for our Recaps feature, we were able to route a percentage of traffic to the new model with minimal code changes and an incredibly fast turnaround time. This allowed us to validate performance in the wild and tighten our feedback loop from weeks to days."
Two structural properties:
- Code-light — A/B route changes are configuration, not application code.
- Production-grounded — "validate performance in the wild" rather than synthetic eval benchmarks alone.
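One common way to implement percentage-based routing as configuration is deterministic hash bucketing. The sketch below assumes that approach; the post does not say how Slack assigns traffic, and `Experiment`, `bucket`, `route`, and the model names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """Hypothetical experiment rule, meant to live in config rather than code."""
    feature: str
    control_model: str
    candidate_model: str
    candidate_percent: int  # 0-100


def bucket(request_key: str, salt: str) -> int:
    """Map a stable key to a bucket in [0, 100) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(exp: Experiment, request_key: str) -> str:
    """Send roughly candidate_percent of keys to the candidate model."""
    if bucket(request_key, salt=exp.feature) < exp.candidate_percent:
        return exp.candidate_model
    return exp.control_model


# Placeholder values: route about 10% of Recaps traffic to a new model.
exp = Experiment("recaps", "model-current", "model-new", candidate_percent=10)
```

Because the bucket depends only on the key and the salt, the same key always lands on the same model, and changing `candidate_percent` is a config edit rather than a code change.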
3. Automated circuit breaker & health monitoring
Endpoint-level real-time watchdog with partial-open recovery state. Verbatim: "This system acts as a real-time watchdog, constantly monitoring health signals at the endpoint level. If a specific provider or model begins to exhibit signs of distress – such as an elevated Time to First Token (TTFT), a spike in 5xx error rates, or crossing a latency p90 threshold – the circuit 'trips.' Once tripped, the routing layer automatically diverts traffic to a healthy alternative model based on the use case and complexity. Crucially, the breaker enters a partial-open state, allowing a small, controlled trickle of requests to reach the degraded endpoint. As the endpoint demonstrates sustained health, the system dynamically expands this trickle, incrementally ramping traffic back up until the breaker is fully 'closed' and normal operations resume. This ensures a graceful recovery without overwhelming a stabilizing service."
Three named health signals:
- Time To First Token (TTFT) — primary latency signal.
- 5xx error rate spike — primary error signal.
- p90 latency threshold — distribution-aware signal.
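The three signals could be evaluated over a window of recent requests roughly as follows. This is a sketch under stated assumptions: the threshold values are invented for illustration (the post discloses none), "elevated TTFT" is interpreted here as a mean over the window, and `Thresholds`, `Sample`, `p90`, and `should_trip` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; the real thresholds are not disclosed.
    max_mean_ttft_s: float = 1.5
    max_latency_p90_s: float = 10.0
    max_5xx_rate: float = 0.05


@dataclass(frozen=True)
class Sample:
    ttft_s: float      # time to first token, seconds
    latency_s: float   # total request latency, seconds
    status: int        # HTTP status code


def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile (integer arithmetic avoids float drift)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n)
    return ordered[rank - 1]


def should_trip(window: list[Sample], t: Thresholds) -> bool:
    """True if any one of the three health signals breaches its threshold."""
    if not window:
        return False
    rate_5xx = sum(500 <= s.status <= 599 for s in window) / len(window)
    return (
        fmean(s.ttft_s for s in window) > t.max_mean_ttft_s
        or p90([s.latency_s for s in window]) > t.max_latency_p90_s
        or rate_5xx > t.max_5xx_rate
    )
```

Using a percentile for latency means a single slow outlier does not trip the breaker, whereas a sustained tail does.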
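The partial-open ramp is the wiki's canonical instance of a gradual traffic ramp on circuit recovery. It refines the classical circuit breaker, which typically switches back to full traffic after a few successful probe requests, and so avoids a thundering herd hitting an endpoint that is still stabilising.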
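The recovery behaviour described in the quote can be sketched as a small state machine. Everything numeric here is an assumption: the initial trickle of 5%, the doubling ramp, and the reset-on-regression policy are illustrative choices, since the post does not disclose its ramp-rate policy. `RampingBreaker` and its methods are hypothetical names.

```python
import random
from enum import Enum


class State(Enum):
    CLOSED = "closed"              # normal operation, full traffic
    PARTIAL_OPEN = "partial_open"  # tripped, only a trickle admitted


class RampingBreaker:
    """Breaker whose recovery ramps traffic up instead of flipping back."""

    def __init__(self, initial_share=0.05, ramp_factor=2.0, rng=random.random):
        # Illustrative parameters, not disclosed values.
        self.state = State.CLOSED
        self.share = 1.0  # fraction of traffic sent to the guarded endpoint
        self.initial_share = initial_share
        self.ramp_factor = ramp_factor
        self._rng = rng

    def trip(self) -> None:
        """Health signals breached: keep only a small trickle on this endpoint."""
        self.state = State.PARTIAL_OPEN
        self.share = self.initial_share

    def record_health_check(self, healthy: bool) -> None:
        """Expand the trickle on sustained health; reset it on regression."""
        if self.state is not State.PARTIAL_OPEN:
            return
        if not healthy:
            self.share = self.initial_share
            return
        self.share = min(1.0, self.share * self.ramp_factor)
        if self.share >= 1.0:
            self.state = State.CLOSED

    def allow(self) -> bool:
        """True if this request goes to the guarded endpoint, else use the fallback."""
        return self._rng() < self.share
```

With these placeholder settings, five consecutive healthy checks take the share from 5% to 10%, 20%, 40%, 80%, and then 100%, at which point the breaker closes.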
4. API normalization layer
The cross-provider abstraction. Verbatim: "Each provider has its own unique API patterns, proprietary error codes, and distinct rate-limiting behaviors. We had to build a robust normalization layer to ensure that a 'Rate Limit Exceeded' from one provider and a 'Throttling Exception' from another were handled identically by our application logic."
Three named axes the normalisation covers:
- API patterns — request/response shapes vary across providers.
- Proprietary error codes — provider-specific failure taxonomies.
- Rate-limiting behaviours — different throttling and retry semantics.
See patterns/api-normalization-layer-cross-provider + concepts/api-normalization-multi-cloud-llm.
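As an illustration of the error-code axis, a normalisation layer can be as simple as a lookup table that maps provider-specific codes onto one internal taxonomy. The sketch below assumes that shape. The provider keys `provider_a` and `provider_b` and the code strings are placeholders modelled on the two examples in the quote; real values would have to come from each provider's API reference. `ErrorKind`, `NormalizedError`, and `normalize_error` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    RATE_LIMITED = "rate_limited"
    UNAVAILABLE = "unavailable"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class NormalizedError:
    kind: ErrorKind
    provider: str
    raw_code: str   # kept for telemetry and debugging
    retryable: bool


# Hypothetical mapping table; keys and codes are placeholders.
_ERROR_MAP = {
    ("provider_a", "ThrottlingException"): ErrorKind.RATE_LIMITED,
    ("provider_b", "RateLimitExceeded"): ErrorKind.RATE_LIMITED,
    ("provider_a", "ServiceUnavailable"): ErrorKind.UNAVAILABLE,
    ("provider_b", "UNAVAILABLE"): ErrorKind.UNAVAILABLE,
}

# Assumption: both known kinds are safe to retry or reroute.
_RETRYABLE = {ErrorKind.RATE_LIMITED, ErrorKind.UNAVAILABLE}


def normalize_error(provider: str, raw_code: str) -> NormalizedError:
    """Translate a provider-specific code into the unified taxonomy."""
    kind = _ERROR_MAP.get((provider, raw_code), ErrorKind.UNKNOWN)
    return NormalizedError(kind, provider, raw_code, retryable=kind in _RETRYABLE)
```

Application logic then branches only on `ErrorKind`, so two differently named throttling errors are handled by the same code path.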
5. Secretless cross-cloud authentication
Mentioned briefly as one of the "cold start engineering hurdles" solved during GCP Vertex AI integration. Verbatim: "To make this a reality, we solved cold start engineering hurdles by implementing secretless authentication and an API Normalization layer that translates disparate provider signals into a unified language for our application logic."
The specific identity-federation shape (workload identity federation, AWS-GCP IAM bridging, short-lived OIDC tokens) is not disclosed.
What the routing layer enables (disclosed outcomes)
| Outcome | Source mechanism |
|---|---|
| ~10% quality lift on complex reasoning | metric-driven model selection routing AI Search to high-reasoning models |
| ~67% latency reduction on high-velocity / low-token workloads | model-to-feature binding routing to the speed-optimised model per task |
| Feedback loop weeks → days | A/B traffic shaping enabled fast in-production model validation |
| Provider outage transparency to users | circuit breaker + fallback hierarchy + cross-cloud routing |
Composition with phases
The routing layer was iteratively built across phases:
- Phase 1 / 2: basic routing across SageMaker and Bedrock endpoints, plus intra-Bedrock model upgrades via feature flags and evaluation frameworks.
- Phase 3: adds Provisioned Throughput (PT) with On-Demand (OD) spillover and the model fallback hierarchy for in-Bedrock degradation events.
- Phase 4: "enhancing our abstraction layer into an Intelligent Routing Layer" with cross-cloud capability, API normalisation, secretless auth, and dynamic telemetry-driven traffic shaping.
Trade-offs
Per the post's "Multi-Cloud Reality" section, the routing layer's existence imposes four operational taxes Slack accepts in return for the abstraction:
- API and behavioural friction — addressed by the normalisation layer.
- Operational monitoring complexity — addressed by the unified monitoring stack.
- The attribution challenge — "Accurately tracking the cost per feature internally becomes significantly harder when workloads are shifting dynamically between clouds. This required deep instrumentation across multiple billing systems."
- The on-call knowledge gap — "engineers can no longer be specialists in just one ecosystem."
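To make the attribution challenge concrete: if every request is tagged with its feature at the routing layer, per-feature cost can be summed across providers regardless of where each request was served. This is a sketch of that idea only, not the post's instrumentation; `UsageRecord`, `PRICES`, `cost_per_feature`, the provider keys, and the prices are all invented placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UsageRecord:
    feature: str
    provider: str
    input_tokens: int
    output_tokens: int


# Placeholder prices per 1,000 tokens as (input, output); not real pricing.
PRICES = {
    "provider_a": (Decimal("0.003"), Decimal("0.015")),
    "provider_b": (Decimal("0.002"), Decimal("0.010")),
}


def cost_per_feature(records: list[UsageRecord]) -> dict[str, Decimal]:
    """Sum cost by feature, whichever provider served each request."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for r in records:
        in_price, out_price = PRICES[r.provider]
        totals[r.feature] += (
            r.input_tokens * in_price + r.output_tokens * out_price
        ) / 1000
    return dict(totals)
```

`Decimal` is used instead of floats so that summed costs do not accumulate binary rounding error.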
Open questions
- Numerical thresholds — exact TTFT, p90, 5xx thresholds that trip the breaker; ramp-rate policy for the partial-open state.
- Eval substrate — what tooling powers "internal quality metrics" feeding the metric-driven model selection (in-house? open source? MLflow LLM judges? other)?
- Traffic-share between AWS and GCP at Phase 4 — not disclosed.
- Per-feature primary/backup model bindings — concrete feature-to-model mappings not enumerated.
- Cross-cloud rate-limit coordination — when one provider's shared OD pool is saturated and the other is healthy, how does the layer decide to spill over vs slow down?
- Cold-start engineering specifics beyond "secretless authentication" — workload identity federation? OIDC? Long-lived service accounts disabled? Not specified.
- Telemetry stream shape — is there a single unified metric pipeline, or per-cloud collectors that feed a unified dashboard?
Seen in
- sources/2026-05-28-slack-slack-ai-the-path-to-multi-cloud: canonical wiki disclosure of the Intelligent Routing Layer as the architectural endpoint of Slack AI's three-year multi-cloud evolution; five-subsystem decomposition (metric-driven selection / A-B testing / circuit breaker / API normalisation / secretless auth) with verbatim TTFT + p90 + 5xx + partial-open ramp disclosure.
Related
- systems/slack-ai — the consumer feature suite.
- systems/amazon-bedrock — primary US-cloud endpoint.
- systems/gcp-vertex-ai — secondary US-cloud endpoint added in Phase 4.
- systems/aws-sagemaker-ai — Phase 1 endpoint.
- concepts/multi-cloud-llm-serving — the architectural posture.
- concepts/api-normalization-multi-cloud-llm — the cross-provider abstraction concept.
- concepts/automated-circuit-breaker-with-partial-open-state — the resilience primitive.
- concepts/model-to-feature-binding — the per-feature optimisation concept.
- concepts/concentration-risk-single-cloud-llm — the failure mode the layer mitigates.
- patterns/multi-cloud-llm-serving — the meta-pattern.
- patterns/api-normalization-layer-cross-provider — the abstraction enabler.
- patterns/model-fallback-hierarchy-with-circuit-breaker — the resilience pattern.
- patterns/provisioned-throughput-with-on-demand-spillover — the Phase 3 hybrid pattern.
- patterns/circuit-breaker — the classical pattern this refines.