PATTERN Cited by 1 source
Centralised deployment orchestration across systems¶
Problem¶
A company at scale runs many different deployment systems: one per workload class, per cloud substrate, per era of infrastructure. Each system has its own rollout mechanism, its own rollback capability, its own metric-watching discipline, and its own alert integration. A reliability program that sets 10-minute auto-remediation plus detect-before-10%-of-fleet (see patterns/automated-detect-remediate-within-10-minutes and concepts/pre-10-percent-fleet-detection-goal) as a universal target cannot wire that target into each deploy system independently: the work becomes a per-system bespoke integration effort with no economies of scale, no shared tooling, and no shared alert / metric / rollback substrate.
Solution¶
Build a centralised deployment orchestration system that unifies metrics-based deployments with automatic remediation across multiple deploy substrates. The orchestration system provides:
- Pluggable deploy substrate interface. Each underlying deploy system (Kubernetes, Bedrock, EC2, Terraform, custom internal PaaS) plugs in via a common interface for deploying artifacts, advancing phases, and triggering rollback.
- Shared metric-watching substrate. The same metric-monitoring, alert-integration, and rollback-trigger logic is reused across all underlying systems.
- Shared rollout-policy substrate. Phase structure (1% → 10% → 50% → 100%), pause thresholds, halt criteria, and emergency-bypass escape hatches are defined once and applied to all underlying systems.
- Shared observability. One dashboard, one alert channel, one audit log for all deploys across all systems.
The centralisation is the wedge for lifting a reliability-program North Star target from one workload class to the whole fleet.
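The pluggable substrate interface described above can be sketched as a minimal protocol. The method names and signatures here are hypothetical illustrations, not Slack's actual API:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DeploySubstrate(Protocol):
    """Common interface each underlying deploy system adapts to (hypothetical names)."""

    def deploy(self, artifact: str, percent: int) -> None:
        """Roll the artifact out to `percent` of the fleet."""

    def rollback(self) -> None:
        """Revert to the previously deployed artifact."""

    def halt(self) -> None:
        """Freeze the rollout in place without rolling back."""

    def is_healthy(self) -> bool:
        """Report whether the current phase's metrics look healthy."""
```

Each substrate (Kubernetes, Bedrock, EC2, Terraform, internal PaaS) would implement this protocol once; the shared metric-watching, rollout-policy, and observability logic then works against it uniformly.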
Canonical disclosure¶
Slack's 2025-10-07 Deploy Safety retrospective canonicalises the pattern as the program's culminating investment phase (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):
"Aligned further investment toward Slack's centralised deployment orchestration system inspired by ReleaseBot and the AWS Pipelines deployment system to unify the use of metrics-based deployments with automatic remediation beyond Slack Bedrock / Kubernetes to many other deployment systems"
Closing-section roadmap confirms the ongoing ambition:
"Centralised deployment orchestration tooling expansion into other infrastructure deployment patterns: EC2, Terraform and many more"
Substrate examples on Slack¶
- Webapp backend — running on Slack Bedrock + Kubernetes (see systems/slack-bedrock, systems/kubernetes). First substrate with fully-automated metrics-based deploy + automatic rollback (during program 2023-2024).
- Webapp frontend — added manual rollback capability first; automatic rollback on roadmap.
- Mobile apps — faster mobile-app issue detection disclosed as a program investment; the app-store review cycle is a structural constraint on the MTTR target (contrast concepts/platform-asymmetric-rollout-control and the Shopify React Native post on Play Store vs App Store rollout mechanics).
- Infra — a portion disclosed as under the centralised orchestration umbrella.
- EC2 — on roadmap.
- Terraform — on roadmap.
Inspirations disclosed¶
Slack names two inspirations for the centralised orchestration system:
- ReleaseBot — Slack's earlier (2018-era) automated deployment system for Webapp backend. See systems/slack-releasebot.
- AWS Pipelines — Amazon's internal deployment orchestration system with built-in alarm-based rollback.
The composition is essentially: "take the ReleaseBot metrics-based-deploy + automatic-rollback pattern, and generalise it across deploy substrates."
Required capabilities¶
To fit a new deploy substrate under the orchestration umbrella, the substrate needs:
- Phased rollout primitive. The substrate must be able to apply a change incrementally (not all-at-once).
- Fast-rollback primitive. Reverting to the previous artifact must be tested and fast.
- Metric-callback hook. The substrate must expose hooks for "is this phase healthy?" — either push-based (substrate emits phase events) or pull-based (orchestrator polls).
- Halt capability. The substrate must accept "halt" commands that freeze the rollout without rolling back.
- Audit log. Every rollout event must be loggable.
Deploy substrates without these capabilities (ad-hoc scripts, manual-only deploys, single-artifact updates) must be retrofitted or wrapped before they can fit under the orchestration.
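A pre-flight check against the capability list above could look like the following sketch. The helper and the capability names are illustrative assumptions, not tooling from the source:

```python
# Capabilities a substrate must expose before it can sit under the
# orchestration umbrella (names are illustrative, per the list above).
REQUIRED_CAPABILITIES = ("deploy", "rollback", "halt", "is_healthy", "audit_log")


def missing_capabilities(substrate: object) -> list[str]:
    """Return the required capabilities a candidate substrate does not expose."""
    return [
        cap
        for cap in REQUIRED_CAPABILITIES
        if not callable(getattr(substrate, cap, None))
    ]
```

A substrate that returns a non-empty list here is exactly the "retrofit or wrap" case: an adapter must supply the missing primitives before onboarding.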
Anti-pattern: per-substrate metrics-and-rollback integration¶
Before this pattern, the default shape is:
- Kubernetes gets its own alert/rollback integration.
- EC2 gets its own alert/rollback integration.
- Each internal-PaaS gets its own alert/rollback integration.
- … each with their own pause-thresholds, rollback-trigger-logic, phase-structure, dashboards, audit trails.
Each per-substrate integration is expensive to build and expensive to maintain; each substrate's integration drifts independently; each team learns its substrate's rollback UX separately.
The centralised-orchestration pattern collapses N substrate-specific integrations into 1 shared integration + N substrate adapters. The shared integration is the leverage point: a single improvement (better metric choice, better phase structure, better rollback mechanism) lifts all N substrates.
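The shared integration can be pictured as one rollout loop reused across every adapter. The phase percentages come from the pattern's stated phase structure; the function and adapter shape are assumptions for illustration:

```python
# Shared rollout policy: 1% → 10% → 50% → 100%, defined once for all substrates.
PHASES = (1, 10, 50, 100)


def run_rollout(substrate, artifact: str, audit=print) -> str:
    """Drive one artifact through the shared phase structure.

    `substrate` is any adapter exposing deploy / is_healthy / rollback
    (a hypothetical interface); `audit` receives one line per rollout
    event, standing in for the shared audit log.
    """
    for percent in PHASES:
        audit(f"deploying {artifact} to {percent}% of fleet")
        substrate.deploy(artifact, percent)
        if not substrate.is_healthy():
            audit(f"unhealthy at {percent}%: rolling back")
            substrate.rollback()
            return f"rolled-back@{percent}"
    audit("rollout complete")
    return "complete"
```

Improving this one loop (a better metric check, a different phase ladder, a smarter rollback trigger) improves every onboarded substrate at once, which is the leverage the pattern is after.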
Relationship to wiki primitives¶
- patterns/automated-detect-remediate-within-10-minutes — the North Star target that this pattern is the mechanism for lifting across substrates.
- patterns/fast-rollback — the per-substrate primitive the orchestration composes.
- patterns/staged-rollout — the per-substrate primitive the orchestration composes.
- concepts/feedback-control-loop-for-rollouts — the shared discipline the orchestration encodes as common logic.
- systems/slack-releasebot — one of the two cited inspirations; the substrate-specific predecessor Slack built for Webapp backend.
- systems/slack-deploy-safety-program — the program that drove the investment.
Related framings elsewhere in the wiki:
- patterns/unified-operator-for-cloud-and-self-managed (Redpanda) — a sibling "one orchestrator, many substrates" pattern but at the operator-altitude rather than deploy-pipeline altitude.
Caveats¶
- The abstraction leaks. Substrates are not identical; their rollout semantics differ (Kubernetes pod-percentage vs EC2 ASG percentage vs Terraform-apply blast radius); the shared orchestrator must either handle the differences or defer to per-substrate configuration.
- Backward-compat with pre-existing substrates. Lifting an existing substrate under the orchestrator is migration work, not a green-field effort; disruption risk must be managed.
- The orchestrator is itself a deploy risk. The deployment-pipeline tooling can itself cause outages when updated (Cloudflare's 2025-12-05 outage is one canonical example) — the orchestrator needs the same metrics-based-deploy discipline it enforces on workloads.
- Not all workloads fit. Some deploys are fundamentally all-or-nothing (database schema changes, irreversible DDL, security patches that must reach 100% immediately) and cannot be gated by the same metrics-based-staged-rollout structure.
- Organisational alignment required. A central orchestrator owned by one team serves N teams; cross-team engagement (see Slack's "direct outreach to individual teams" discipline) is a prerequisite.
Seen in¶
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — canonical Slack program-culminating investment; inspired by ReleaseBot + AWS Pipelines; ambition extends to Webapp frontend, infra, EC2, Terraform beyond Slack Bedrock / Kubernetes.