Slack Deploy Safety Program¶
What it is¶
The Slack Deploy Safety Program is an 18-month cross-org reliability program run at Slack from mid-2023 through at least January 2025, canonicalised in the 2025-10-07 retrospective Deploy Safety: Reducing customer impact from change.
The program reduced customer impact hours from change-triggered incidents by 90% from its peak quarter (Feb–Apr 2024) to January 2025 while "maintaining Slack's development velocity", a deliberate co-equal North Star that explicitly rejected the historical reliability reflex of "add manual change processes."
It is listed as a system on the wiki (rather than a pattern or concept) because it's a named, bounded, reviewable organisational artifact — with its own manifesto, OKR weight, exec sponsors, metric, cadence, and roadmap.
Program structure¶
Scope¶
"All of Slack's deployment systems & processes" for the highest-importance services (Webapp backend, Webapp frontend, Mobile apps, portions of infra). The estate spans hundreds of internal services across many different deployment systems.
Metric¶
Hours of customer impact from high-severity and selected medium-severity change-triggered incidents. Canonicalised as concepts/customer-impact-hours-metric.
The metric is explicitly an "imperfect analog" of customer sentiment, sitting in a three-layer chain:
Slack disclosed four metric-design criteria: measure results; understand real-vs-analog; consistency in measurement; continual validation.
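As a rough illustration of how such a metric could be tallied, the sketch below sums impact duration over change-triggered incidents that are high severity or hand-selected medium severity, then computes a percentage reduction from a peak period. The `Incident` fields, the severity labels, and every number in it are assumptions made for this example; the source does not disclose Slack's incident schema or raw figures.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class Incident:
    severity: str            # "high", "medium", or "low" (labels are assumptions)
    change_triggered: bool   # True if a deploy or other change caused it
    selected: bool           # medium severity only: hand-picked into the metric
    impact_start: datetime
    impact_end: datetime


def counts_toward_metric(incident: Incident) -> bool:
    """High-severity and selected medium-severity change-triggered incidents."""
    if not incident.change_triggered:
        return False
    if incident.severity == "high":
        return True
    return incident.severity == "medium" and incident.selected


def customer_impact_hours(incidents: Iterable[Incident]) -> float:
    """Total hours of customer impact across qualifying incidents."""
    total_seconds = sum(
        (i.impact_end - i.impact_start).total_seconds()
        for i in incidents
        if counts_toward_metric(i)
    )
    return total_seconds / 3600.0


def percent_reduction(peak_hours: float, current_hours: float) -> float:
    """Reduction relative to the peak period, as a percentage."""
    return 100.0 * (peak_hours - current_hours) / peak_hours


if __name__ == "__main__":
    # Illustrative incidents only; none of these are real Slack data.
    incidents = [
        Incident("high", True, False,
                 datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 12, 30)),
        Incident("medium", True, True,
                 datetime(2024, 3, 5, 9, 0), datetime(2024, 3, 5, 9, 45)),
        Incident("medium", True, False,   # not selected, so excluded
                 datetime(2024, 3, 7, 9, 0), datetime(2024, 3, 7, 10, 0)),
        Incident("high", False, False,    # not change-triggered, so excluded
                 datetime(2024, 3, 9, 9, 0), datetime(2024, 3, 9, 13, 0)),
    ]
    print(customer_impact_hours(incidents))   # 3.25
    print(percent_reduction(20.0, 2.0))       # 90.0
```

Keeping the inclusion rule in one function (`counts_toward_metric`) is one way to support the "consistency in measurement" criterion: the definition can be refined in a single place without changing how hours are summed.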
North Star goals¶
Verbatim (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):
- Reducing impact time from deployments
  - Automated detection & remediation within 10 minutes
  - Manual detection & remediation within 20 minutes
  - → canonicalised as patterns/automated-detect-remediate-within-10-minutes
- Reducing severity of impact
  - Detect problematic deployments prior to reaching 10% of the fleet
  - → canonicalised as concepts/pre-10-percent-fleet-detection-goal
- Maintaining Slack's development velocity
The initial North Star goals "evolved and expanded into a Deploy Safety Manifesto that now applies to all Slack's deployment systems & processes."
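The time and fleet-exposure goals above can be read as simple per-incident checks. The sketch below is one hypothetical encoding: the record fields are invented for illustration, and the source does not say when the 10- and 20-minute clocks start, so measuring from the start of customer impact is an assumption.

```python
from dataclasses import dataclass

AUTOMATED_TARGET_MIN = 10.0   # automated detection & remediation goal
MANUAL_TARGET_MIN = 20.0      # manual detection & remediation goal
FLEET_FRACTION_LIMIT = 0.10   # detect before the change reaches 10% of the fleet


@dataclass(frozen=True)
class DeployIncident:
    # All fields are hypothetical; the source does not disclose a schema.
    minutes_to_remediate: float         # assumed: impact start to remediation done
    automated: bool                     # True if detected and remediated automatically
    fleet_fraction_at_detection: float  # share of the fleet on the bad build (0.0 to 1.0)


def meets_time_goal(incident: DeployIncident) -> bool:
    """'Within N minutes' is read as less than or equal to the target."""
    target = AUTOMATED_TARGET_MIN if incident.automated else MANUAL_TARGET_MIN
    return incident.minutes_to_remediate <= target


def meets_severity_goal(incident: DeployIncident) -> bool:
    """'Prior to reaching 10%' is read as strictly below the limit."""
    return incident.fleet_fraction_at_detection < FLEET_FRACTION_LIMIT


if __name__ == "__main__":
    auto = DeployIncident(minutes_to_remediate=7.0, automated=True,
                          fleet_fraction_at_detection=0.02)
    manual = DeployIncident(minutes_to_remediate=25.0, automated=False,
                            fleet_fraction_at_detection=0.10)
    print(meets_time_goal(auto), meets_severity_goal(auto))      # True True
    print(meets_time_goal(manual), meets_severity_goal(manual))  # False False
```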
Investment strategy¶
Five axes (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):
- Invest widely initially and bias for action.
- Focus on areas of known pain first.
- Invest further in projects or patterns based on results.
- Curtail investment in the least impactful areas.
- Set a flexible shorter-term roadmap which may change based on results.
Canonicalised as patterns/invest-widely-then-double-down-on-impact.
Program-governance cadence¶
- Executive reviews every 4-6 weeks for continued alignment and support.
- High-level priority in company/engineering goals (OKR, V2MOM).
- Solid support from executive leadership with active project alignment, encouragement, and customer feedback. Named sponsors: SVP Milena Talavera, SVP Peter Secor, VP Cisco Vila.
Architectural deliverables¶
Webapp backend (first substrate, fully automated)¶
Quarterly investment sequence disclosed verbatim:
- Q1: Engineer automatic metric monitoring.
- Q2: Confirm customer-impact alignment via automatic alerts and manual rollback actions.
- Q3-Q4: Invest in automatic deployments and rollback.
- Q4+: Prove success with many automatic rollbacks keeping customer impact below 10 minutes.
- Q4+: Further investment to monitor additional metrics + invest in manual rollback optimisations.
- Q4+: Invest in a manual Frontend rollback capability.
- Q4+: Aligned further investment toward the centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines.
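The sequence above ends in metrics-gated deployment with automatic rollback. A minimal sketch of that general idea follows, assuming a staged rollout that checks health after each stage. The stage fractions, the error-rate threshold, and the callback names are all hypothetical: the source does not disclose the metrics watched, the thresholds, or the phase structure Slack uses.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy_to: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Advance through fleet fractions; roll back on the first unhealthy check.

    Returns True if every stage stayed healthy, False if a rollback happened.
    """
    for fraction in stages:
        deploy_to(fraction)
        if not is_healthy():
            rollback()
            return False
    return True


if __name__ == "__main__":
    exposure: List[float] = []    # history of fleet fractions on the new build
    ERROR_RATE_THRESHOLD = 0.01   # hypothetical alerting threshold

    def fake_error_rate() -> float:
        # Simulated metric: the bad build only misbehaves at 5% exposure or more.
        return 0.05 if exposure and exposure[-1] >= 0.05 else 0.001

    ok = staged_rollout(
        stages=[0.01, 0.05, 0.25, 1.0],
        deploy_to=exposure.append,
        is_healthy=lambda: fake_error_rate() < ERROR_RATE_THRESHOLD,
        rollback=lambda: exposure.append(0.0),
    )
    print(ok, exposure)   # False [0.01, 0.05, 0.0]
```

In this simulated run the bad build is caught at 5% exposure and never reaches the 25% stage, which is the shape of outcome the pre-10%-of-fleet detection goal asks for.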
Webapp frontend¶
Manual rollback capability added first; automatic rollback on the program's ongoing roadmap.
Mobile apps¶
Faster Mobile App issue detection disclosed as a program investment. Platform-asymmetric rollout mechanics (Play Store vs App Store) constrain the MTTR target differently per platform — see concepts/platform-asymmetric-rollout-control.
Centralised orchestration system¶
Unifies metrics-based deployments with automatic remediation "beyond Slack Bedrock / Kubernetes to many other deployment systems", with expansion to EC2 and Terraform on the roadmap. Inspired by ReleaseBot + AWS Pipelines. Canonicalised as patterns/centralised-deployment-orchestration-across-systems.
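One common way to put several deployment systems behind a single orchestrator is a shared adapter interface. The sketch below illustrates that general technique only; `DeploySystem`, `RecordingSystem`, and `orchestrate` are hypothetical names, and the source does not disclose the internals of Slack's orchestrator.

```python
from typing import Callable, Iterable, List, Protocol


class DeploySystem(Protocol):
    """Hypothetical adapter interface each deployment system would implement."""
    name: str

    def deploy(self, version: str) -> None: ...

    def rollback(self) -> None: ...


class RecordingSystem:
    """Stand-in adapter that only records calls; a real one would talk to a platform."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []

    def deploy(self, version: str) -> None:
        self.log.append(f"deploy {version}")

    def rollback(self) -> None:
        self.log.append("rollback")


def orchestrate(
    systems: Iterable[DeploySystem],
    version: str,
    healthy: Callable[[DeploySystem], bool],
) -> List[str]:
    """Deploy to each system in turn; roll back any that fail the health check.

    Returns the names of the systems that were rolled back.
    """
    rolled_back: List[str] = []
    for system in systems:
        system.deploy(version)
        if not healthy(system):
            system.rollback()
            rolled_back.append(system.name)
    return rolled_back


if __name__ == "__main__":
    systems = [RecordingSystem("kubernetes"), RecordingSystem("ec2")]
    # Simulated health signal: pretend only the EC2 deploy regresses.
    failed = orchestrate(systems, "v2", healthy=lambda s: s.name != "ec2")
    print(failed)           # ['ec2']
    print(systems[1].log)   # ['deploy v2', 'rollback']
```

The design point is that the health-check-then-remediate logic lives once in the orchestrator, while each adapter only needs to know how to deploy and roll back on its own platform.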
Results¶
- 90% reduction in customer impact hours from peak quarter (Feb–Apr 2024) to Jan 2025.
- Peak in Feb–Apr 2024 — after the first quarter of project delivery (Q4 2023 → Q1 2024 metric-monitoring infrastructure) but before automatic-rollback was broadly deployed.
- Dramatic improvement following introduction of automatic rollback.
- Non-linear progress — "the bar chart above tells quite a story of non-linear progress and the difficulties with using trailing metrics based on waiting for incident occurrence."
- 3-6 month lag from project delivery to full impact visibility on the metric.
- Lower 2025 target continuing the downward trend, with "focus on mitigating the risk of infrequent spikes through deployment system consistency."
Named organisational lessons¶
- Prioritisation + alignment (exec review + OKR weight).
- Patience with trailing metrics + faith under uncertainty — see concepts/trailing-metric-patience.
- Folks delay adoption until familiar, so use the tooling often; see the fluency discipline in patterns/automated-detect-remediate-within-10-minutes.
- Direct outreach to engineering teams is critical — "Not all teams and systems are the same."
- Metric consistency over metric perfection — "Pick a metric, be consistent, and make refinements based on validation of results."
- Maintain direct communication with engineering staff: an acknowledged gap that the program is actively working to close.
Program team (disclosed)¶
- Exec sponsors: SVP Milena Talavera, SVP Peter Secor, VP Cisco Vila.
- Team-engagement leads: Petr Pchelko, Harrison Page.
- Program team members: Dave Harrington, Sam Bailey, Sreedevi Rai, Vani Anantha, Matt Jennings, Nathan Steele, Sriganesh Krishnan.
Ongoing roadmap (as of 2025-10-07)¶
- Improvements in automatic metrics-based deployments + remediation.
- More consistency in use of Deploy Safe processes across all deployments — "to mitigate unexpected or infrequent spikes of customer impact."
- Migrating remaining manual deploy processes to code-based deploys using Deploy Safe processes. Explicit scope expansion of the program.
- Centralised deployment orchestration expansion into other infrastructure deployment patterns (EC2, Terraform).
- Automatic rollbacks for Frontend.
- Metric quality improvements — "do we have the right metrics for each system/service?"
- AI metric-based anomaly detection.
- Further rollout of AI-generated pre-production tests.
Positioning on the wiki¶
The Deploy Safety Program is a program-as-system canonicalisation distinct from earlier reliability-program entries in the corpus. Closest analogues:
- Meta's Capacity Efficiency Platform / Kernel Evolve are programs-as-systems at hyperscale altitude; see systems/meta-capacity-efficiency-platform.
- patterns/feedback-control-loop-for-rollouts is the per-deploy substrate the program implements organisationally-at-scale.
- concepts/dora-metrics is the industry measurement framework the program's third goal ("maintain development velocity") is implicitly aligned with.
Caveats¶
- Program-as-system framing is by analogy; this page canonicalises organisational, not code, structure.
- Load-bearing disclosures are at program-management altitude, not mechanism altitude. Metrics watched, pause thresholds, rollback triggers, phase structure, and orchestrator internals are not disclosed.
- 90% is program-aggregate; per-project attribution is not given.
- The manifesto itself is not public. The canonical disclosure names that a Deploy Safety Manifesto exists and has grown to cover "all Slack's deployment systems & processes" but the document isn't publicly available.
- Post-Salesforce-acquisition context. Trust as #1 value is the closing frame: the program lives in a Salesforce-trust-value context.
Seen in¶
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — canonical 18-month retrospective; 90% reduction in customer impact hours; North Star goals; investment strategy; exec sponsors; ongoing roadmap.
Related¶
- companies/slack
- systems/slack-releasebot
- systems/slack-bedrock
- concepts/customer-impact-hours-metric
- concepts/change-triggered-incident-rate
- concepts/pre-10-percent-fleet-detection-goal
- concepts/trailing-metric-patience
- patterns/automated-detect-remediate-within-10-minutes
- patterns/centralised-deployment-orchestration-across-systems
- patterns/invest-widely-then-double-down-on-impact