Slack Deploy Safety Program

What it is

The Slack Deploy Safety Program is an 18-month cross-org reliability program run at Slack from mid-2023 through at least January 2025, canonicalised in the 2025-10-07 retrospective Deploy Safety: Reducing customer impact from change.

The program reduced customer impact hours from change-triggered incidents by 90% from its peak quarter (Feb–Apr 2024) to January 2025, while "maintaining Slack's development velocity" — a deliberate co-equal North Star that explicitly rejected the historical reliability reflex of "add manual change processes."

It is listed as a system on the wiki (rather than a pattern or concept) because it's a named, bounded, reviewable organisational artifact — with its own manifesto, OKR weight, exec sponsors, metric, cadence, and roadmap.

Program structure

Scope

"All of Slack's deployment systems & processes" for the highest-importance services (Webapp backend, Webapp frontend, Mobile apps, portions of infra). Hundreds of internal services, many different deployment systems.

Metric

Hours of customer impact from high-severity and selected medium-severity change-triggered incidents. Canonicalised as concepts/customer-impact-hours-metric.

The metric is explicitly an "imperfect analog" of customer sentiment, sitting in a three-layer chain:

Customer sentiment ↔ Program Metric ↔ Project Metric

Slack disclosed four metric-design criteria: measure results; understand the real-versus-analog gap; be consistent in measurement; validate continually.
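The page doesn't disclose how the hours are computed, but the metric definition above implies a straightforward aggregation. A minimal sketch, assuming hypothetical severity labels (`sev1`, `sev2`) and a caller-supplied rule for which medium-severity incidents are "selected" — none of these names come from Slack's disclosure:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    severity: str            # hypothetical labels: "sev1" (high), "sev2" (medium)
    change_triggered: bool   # was the incident caused by a change/deploy?
    impact_start: datetime
    impact_end: datetime


def customer_impact_hours(incidents, include_sev2=lambda i: True):
    """Sum hours of customer impact from change-triggered incidents.

    High-severity incidents always count; medium-severity incidents are
    included selectively, here via a caller-supplied predicate.
    """
    total = 0.0
    for i in incidents:
        if not i.change_triggered:
            continue
        if i.severity == "sev1" or (i.severity == "sev2" and include_sev2(i)):
            total += (i.impact_end - i.impact_start).total_seconds() / 3600
    return total
```

The essential property for the program is not the exact formula but consistency: the same inclusion rules applied quarter over quarter, so the trend line stays comparable.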

North Star goals

Verbatim (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

The initial North Star goals "evolved and expanded into a Deploy Safety Manifesto that now applies to all Slack's deployment systems & processes."

Investment strategy

Five axes (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

  • Invest widely initially and bias for action.
  • Focus on areas of known pain first.
  • Invest further in projects or patterns based on results.
  • Curtail investment in the least impactful areas.
  • Set a flexible shorter-term roadmap which may change based on results.

Canonicalised as patterns/invest-widely-then-double-down-on-impact.

Program-governance cadence

  • Executive reviews every 4–6 weeks for continued alignment and support.
  • High-level priority in company/engineering goals (OKR, V2MOM).
  • Solid support from executive leadership with active project alignment, encouragement, and customer feedback. Named sponsors: SVP Milena Talavera, SVP Peter Secor, VP Cisco Vila.

Architectural deliverables

Webapp backend (first substrate, fully automated)

Quarterly investment sequence disclosed verbatim:

  1. Q1: Engineer automatic metric monitoring.
  2. Q2: Confirm customer-impact alignment via automatic alerts and manual rollback actions.
  3. Q3–Q4: Invest in automatic deployments and rollback.
  4. Q4+: Prove success with many automatic rollbacks keeping customer impact below 10 minutes.
  5. Q4+: Further investment to monitor additional metrics + invest in manual rollback optimisations.
  6. Q4+: Invest in a manual Frontend rollback capability.
  7. Q4+: Aligned further investment toward the centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines.
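The rollback mechanics themselves are not disclosed — the caveats below note that metrics watched, thresholds, and triggers stay at program-management altitude. But "many automatic rollbacks keeping customer impact below 10 minutes" implies a monitor-then-remediate loop of roughly this shape. A toy sketch, with the metric, threshold, and debounce count all invented for illustration:

```python
import time


def monitor_and_rollback(read_error_rate, rollback, threshold=0.02,
                         poll_seconds=30, max_polls=20):
    """Poll a deploy-health metric; trigger rollback on a sustained regression.

    Illustrative only: Slack's actual metrics, thresholds, and rollback
    mechanics are not public. Returns "rolled_back" or "healthy".
    """
    consecutive_bad = 0
    for _ in range(max_polls):
        if read_error_rate() > threshold:
            consecutive_bad += 1
            if consecutive_bad >= 2:   # debounce transient spikes
                rollback()
                return "rolled_back"
        else:
            consecutive_bad = 0
        time.sleep(poll_seconds)
    return "healthy"
```

With 30-second polls and a two-sample debounce, a genuine regression is detected and reverted within a couple of minutes — well inside a 10-minute customer-impact budget, which is the point of automating the loop rather than paging a human.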

Webapp frontend

Manual rollback capability added first; automatic rollback on the program's ongoing roadmap.

Mobile apps

Faster mobile-app issue detection is disclosed as a program investment. Platform-asymmetric rollout mechanics (Play Store vs App Store) constrain the MTTR target differently per platform — see concepts/platform-asymmetric-rollout-control.
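The asymmetry comes from what each store lets you do once a bad build is out. The capability table below is an assumption based on publicly documented store behaviour (staged rollouts on Play, phased release on the App Store), not on anything Slack disclosed about its tooling:

```python
# Illustrative capability flags per platform — assumptions from public
# store behaviour, not Slack's internal tooling.
PLATFORM_CONTROLS = {
    "android_play_store": {
        "staged_rollout_percent": True,     # hold at e.g. 1% / 10% / 50%
        "halt_rollout": True,               # stop further distribution
        "rollback_to_prior_binary": False,  # must ship a new build instead
    },
    "ios_app_store": {
        "staged_rollout_percent": True,     # 7-day phased release schedule
        "halt_rollout": True,               # pause the phased release
        "rollback_to_prior_binary": False,  # must ship a new build instead
    },
}


def fastest_remediation(platform):
    """Pick the quickest remediation lever available on a platform."""
    c = PLATFORM_CONTROLS[platform]
    if c["rollback_to_prior_binary"]:
        return "rollback"
    if c["halt_rollout"]:
        return "halt_and_hotfix"
    return "hotfix_only"
```

Because neither store allows reverting to a prior binary, fast *detection* carries most of the MTTR weight on mobile: the sooner a regression is caught at a small rollout percentage, the fewer users need the eventual hotfix.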

Centralised orchestration system

Unifies metrics-based deployments with automatic remediation "beyond Slack Bedrock / Kubernetes to many other deployment systems" — explicitly extending to EC2, Terraform on the roadmap. Inspired by ReleaseBot + AWS Pipelines. Canonicalised as patterns/centralised-deployment-orchestration-across-systems.
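The orchestrator's internals are not disclosed, but "unifying metrics-based deployments with automatic remediation" across heterogeneous systems suggests a common interface that each deployment backend (Bedrock/Kubernetes, EC2, Terraform, …) implements. A minimal sketch of that shape — the `DeploySystem` protocol and method names are invented:

```python
from typing import Protocol


class DeploySystem(Protocol):
    """Hypothetical common interface each deployment backend would implement."""
    def deploy(self, version: str) -> None: ...
    def health_ok(self) -> bool: ...
    def remediate(self) -> None: ...


def orchestrate(systems: dict, version: str) -> dict:
    """Run the same metrics-gated deploy-and-remediate loop across
    heterogeneous deployment systems."""
    outcomes = {}
    for name, system in systems.items():
        system.deploy(version)
        if system.health_ok():
            outcomes[name] = "deployed"
        else:
            system.remediate()
            outcomes[name] = "remediated"
    return outcomes
```

The design payoff of this shape is the roadmap item above: adding EC2 or Terraform means implementing one interface, not rebuilding the metrics-and-rollback loop per system.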

Results

  • 90% reduction in customer impact hours from peak quarter (Feb–Apr 2024) to Jan 2025.
  • Peak in Feb–Apr 2024 — after the first quarter of project delivery (Q4 2023 → Q1 2024 metric-monitoring infrastructure) but before automatic rollback was broadly deployed.
  • Dramatic improvement following the introduction of automatic rollback.
  • Non-linear progress: "the bar chart above tells quite a story of non-linear progress and the difficulties with using trailing metrics based on waiting for incident occurrence."
  • 3–6 month lag from project delivery to full impact visibility on the metric.
  • Lower 2025 target continuing the downward trend, with "focus on mitigating the risk of infrequent spikes through deployment system consistency."

Named organisational lessons

  1. Prioritisation + alignment (exec review + OKR weight).
  2. Patience with trailing metrics + faith under uncertainty — see concepts/trailing-metric-patience.
  3. Folks delay adoption until they're familiar with the tooling, so encourage frequent use — see the fluency discipline in patterns/automated-detect-remediate-within-10-minutes.
  4. Direct outreach to engineering teams is critical: "Not all teams and systems are the same."
  5. Metric consistency over metric perfection: "Pick a metric, be consistent, and make refinements based on validation of results."
  6. Maintain direct communication with engineering staff — acknowledged gap the program is actively working to improve.

Program team (disclosed)

  • Exec sponsors: SVP Milena Talavera, SVP Peter Secor, VP Cisco Vila.
  • Team-engagement leads: Petr Pchelko, Harrison Page.
  • Program team members: Dave Harrington, Sam Bailey, Sreedevi Rai, Vani Anantha, Matt Jennings, Nathan Steele, Sriganesh Krishnan.

Ongoing roadmap (as of 2025-10-07)

  • Improvements in automatic metrics-based deployments + remediation.
  • More consistency in use of Deploy Safe processes across all deployments — "to mitigate unexpected or infrequent spikes of customer impact."
  • Migrating remaining manual deploy processes to code-based deploys using Deploy Safe processes, an explicit scope expansion of the program.
  • Centralised deployment orchestration expansion into other infrastructure deployment patterns (EC2, Terraform).
  • Automatic rollbacks for Frontend.
  • Metric quality improvements — "do we have the right metrics for each system/service?"
  • AI metric-based anomaly detection.
  • Further rollout of AI-generated pre-production tests.

Positioning on the wiki

The Deploy Safety Program is a program-as-system canonicalisation distinct from earlier reliability-program entries in the corpus. Closest analogues:

Caveats

  • Program-as-system framing is by analogy; this page canonicalises organisational, not code, structure.
  • Load-bearing disclosures are at program-management altitude, not mechanism altitude. Metrics watched, pause thresholds, rollback triggers, phase structure, and orchestrator internals are not disclosed.
  • 90% is program-aggregate; per-project attribution is not given.
  • The manifesto itself is not public. The canonical disclosure names that a Deploy Safety Manifesto exists and has grown to cover "all Slack's deployment systems & processes" but the document isn't publicly available.
  • Post-Salesforce-acquisition context. Trust as the #1 value is the closing frame — the program lives in a Salesforce trust-value context.
