
PATTERN

Centralised deployment orchestration across systems

Problem

A company at scale runs many different deployment systems — one per workload class, per cloud substrate, per era of infrastructure. Each system has its own rollout mechanism, its own rollback capability, its own metric-watching discipline, its own alert integration. A reliability program that wants 10-minute auto-remediate + detect-before-10%-fleet (see patterns/automated-detect-remediate-within-10-minutes and concepts/pre-10-percent-fleet-detection-goal) as a universal target cannot wire that target into each deploy system independently — it becomes a per-system bespoke integration effort with no economies of scale, no shared tooling, no shared alert / metric / rollback substrate.

Solution

Build a centralised deployment orchestration system that unifies metrics-based deployments with automatic remediation across multiple deploy substrates. The orchestration system provides:

  1. Pluggable deploy substrate interface. Each underlying deploy system (Kubernetes, Bedrock, EC2, Terraform, custom internal PaaS) plugs in via a common interface for deploying artifacts, advancing phases, and triggering rollback.
  2. Shared metric-watching substrate. The same metric-monitoring, alert-integration, and rollback-trigger logic is reused across all underlying systems.
  3. Shared rollout-policy substrate. Phase structure (1% → 10% → 50% → 100%), pause thresholds, halt criteria, and emergency-bypass escape hatches are defined once and applied to all underlying systems.
  4. Shared observability. One dashboard, one alert channel, one audit log for all deploys across all systems.

The centralisation is the wedge for lifting a reliability-program North Star target from one workload class to the whole fleet.
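The four pieces above can be sketched as a shared orchestrator driving pluggable substrate adapters. This is a minimal illustrative sketch, not Slack's actual API; every name here (`DeploySubstrate`, `Orchestrator`, the phase list) is hypothetical:

```python
from abc import ABC, abstractmethod
from typing import Callable


class DeploySubstrate(ABC):
    """Adapter each underlying deploy system (Kubernetes, EC2, ...) implements."""

    @abstractmethod
    def advance(self, artifact: str, percent: int) -> None:
        """Advance the rollout of `artifact` to `percent` of the fleet."""

    @abstractmethod
    def rollback(self, artifact: str) -> None:
        """Revert to the previous artifact."""


class Orchestrator:
    """Shared rollout policy, metric watching, and audit log for all substrates."""

    PHASES = [1, 10, 50, 100]  # phase structure defined once, applied everywhere

    def __init__(self, is_healthy: Callable[[str, int], bool]):
        self.is_healthy = is_healthy     # shared metric-watching substrate
        self.audit_log: list[str] = []   # shared observability

    def deploy(self, substrate: DeploySubstrate, artifact: str) -> bool:
        """Run the phased rollout; roll back and stop on the first unhealthy phase."""
        for percent in self.PHASES:
            substrate.advance(artifact, percent)
            self.audit_log.append(f"{artifact}: advanced to {percent}%")
            if not self.is_healthy(artifact, percent):
                substrate.rollback(artifact)
                self.audit_log.append(f"{artifact}: rolled back at {percent}%")
                return False
        return True
```

The point of the shape is that the phase list, the health check, and the audit log live in one place; each substrate contributes only the two primitives it is uniquely positioned to implement.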

Canonical disclosure

Slack's 2025-10-07 Deploy Safety retrospective canonicalises the pattern as the program's culminating investment phase (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

"Aligned further investment toward Slack's centralised deployment orchestration system inspired by ReleaseBot and the AWS Pipelines deployment system to unify the use of metrics-based deployments with automatic remediation beyond Slack Bedrock / Kubernetes to many other deployment systems"

Closing-section roadmap confirms the ongoing ambition:

"Centralised deployment orchestration tooling expansion into other infrastructure deployment patterns: EC2, Terraform and many more"

Substrate examples on Slack

  • Webapp backend — running on Slack Bedrock + Kubernetes (see systems/slack-bedrock, systems/kubernetes). First substrate with fully automated metrics-based deploys and automatic rollback (during the 2023-2024 program).
  • Webapp frontend — added manual rollback capability first; automatic rollback on roadmap.
  • Mobile apps — faster Mobile App issue detection disclosed as a program investment; the app-store review cycle is a structural constraint on the MTTR target (contrast concepts/platform-asymmetric-rollout-control / Shopify React Native post for Play Store vs App Store rollout mechanics).
  • Infra — a portion disclosed as under the centralised orchestration umbrella.
  • EC2 — on roadmap.
  • Terraform — on roadmap.

Inspirations disclosed

Slack names two inspirations for the centralised orchestration system:

  • ReleaseBot — Slack's earlier (2018-era) automated deployment system for Webapp backend. See systems/slack-releasebot.
  • AWS Pipelines — Amazon's internal deployment orchestration system with built-in alarm-based rollback.

The composition is essentially: "take the ReleaseBot metrics-based-deploy + automatic-rollback pattern, and generalise it across deploy substrates."

Required capabilities

To fit a new deploy substrate under the orchestration umbrella, the substrate needs:

  1. Phased rollout primitive. The substrate must be able to apply a change incrementally (not all-at-once).
  2. Fast-rollback primitive. Reverting to the previous artifact must be tested and fast.
  3. Metric-callback hook. The substrate must expose hooks for "is this phase healthy?" — either push-based (substrate emits phase events) or pull-based (orchestrator polls).
  4. Halt capability. The substrate must accept "halt" commands that freeze the rollout without rolling back.
  5. Audit log. Every rollout event must be loggable.

Deploy substrates without these capabilities (ad-hoc scripts, manual-only deploys, single-artifact updates) must be retrofitted or wrapped before they can fit under the orchestration.
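One way to enforce the five capabilities is a registration gate that refuses substrates missing any hook. This is a hypothetical sketch; the hook names (`advance_phase`, `rollback`, `phase_health`, `halt`, `log_event`) are illustrative, not a disclosed interface:

```python
# Capabilities 1-5 from the list above, as named hooks an adapter must expose.
REQUIRED_HOOKS = {
    "advance_phase",  # 1. phased rollout primitive
    "rollback",       # 2. fast-rollback primitive
    "phase_health",   # 3. metric-callback hook (push- or pull-based)
    "halt",           # 4. freeze the rollout without rolling back
    "log_event",      # 5. audit log
}


def missing_capabilities(adapter: object) -> set[str]:
    """Return the required hooks the candidate adapter does not implement."""
    return {h for h in REQUIRED_HOOKS if not callable(getattr(adapter, h, None))}


def register(registry: dict, name: str, adapter: object) -> None:
    """Admit a substrate under the orchestration umbrella, or reject it."""
    gaps = missing_capabilities(adapter)
    if gaps:
        raise ValueError(
            f"{name} must be wrapped or retrofitted; missing: {sorted(gaps)}"
        )
    registry[name] = adapter
```

A gate like this makes the "retrofit or wrap" requirement mechanical: an ad-hoc script substrate fails registration until someone wraps it with, at minimum, a halt command and an audit hook.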

Anti-pattern: per-substrate metrics-and-rollback integration

Before this pattern, the default shape is:

  • Kubernetes gets its own alert/rollback integration.
  • EC2 gets its own alert/rollback integration.
  • Each internal-PaaS gets its own alert/rollback integration.
  • … each with their own pause-thresholds, rollback-trigger-logic, phase-structure, dashboards, audit trails.

Per-substrate integration is expensive to build and expensive to maintain; each substrate's integration drifts independently; and each team learns its substrate's rollback UX separately.

The centralised-orchestration pattern collapses N substrate-specific integrations into 1 shared integration + N substrate adapters. The shared integration is the leverage point: a single improvement (better metric choice, better phase structure, better rollback mechanism) lifts all N substrates.

Relationship to wiki primitives

Related framings elsewhere in the wiki:

Caveats

  • The abstraction leaks. Substrates are not identical; their rollout semantics differ (Kubernetes pod-percentage vs EC2 ASG percentage vs Terraform-apply blast radius); the shared orchestrator must either handle the differences or defer to per-substrate configuration.
  • Backward-compat with pre-existing substrates. Lifting an existing substrate under the orchestrator is migration work, not a green-field effort; disruption risk must be managed.
  • The orchestrator is itself a deploy risk. The deployment-pipeline tooling can itself cause outages when updated (Cloudflare's 2025-12-05 outage is one canonical example); the orchestrator needs the same metrics-based-deploy discipline it enforces on workloads.
  • Not all workloads fit. Some deploys are fundamentally all-or-nothing (database schema changes, irreversible DDL, security patches that must reach 100% immediately) and cannot be gated by the same metrics-based-staged-rollout structure.
  • Organisational alignment required. A central orchestrator owned by one team serves N teams; cross-team engagement (see Slack's "direct outreach to individual teams" discipline) is a prerequisite.
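The first caveat (the abstraction leaks) is commonly absorbed with layered configuration: shared defaults, overridden per substrate only where rollout semantics genuinely differ. A minimal sketch, with hypothetical policy keys:

```python
# Shared rollout policy: one definition for every substrate.
SHARED_POLICY = {"phases": [1, 10, 50, 100], "pause_minutes": 10}

# Per-substrate overrides, only where semantics differ. Illustrative example:
# a Terraform apply has a coarser blast radius, so fewer, larger phases.
SUBSTRATE_OVERRIDES = {
    "terraform": {"phases": [10, 100]},
}


def effective_policy(substrate: str) -> dict:
    """Shared defaults, selectively overridden for one substrate."""
    return {**SHARED_POLICY, **SUBSTRATE_OVERRIDES.get(substrate, {})}
```

The design choice is that overrides are exceptions that must be justified per substrate, so the shared policy stays the leverage point rather than fragmenting back into N independent configurations.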
