Slack Deploy Safety Program

What it is

The Slack Deploy Safety Program is an 18-month cross-org reliability program run at Slack from mid-2023 through at least January 2025, canonicalised in the 2025-10-07 retrospective Deploy Safety: Reducing customer impact from change.

The program reduced customer impact hours from change-triggered incidents by 90% from its peak quarter (Feb–Apr 2024) to January 2025, while "maintaining Slack's development velocity" — a deliberate co-equal North Star that explicitly rejected the historical reliability reflex of "add manual change processes."

It is listed as a system on the wiki (rather than a pattern or concept) because it's a named, bounded, reviewable organisational artifact — with its own manifesto, OKR weight, exec sponsors, metric, cadence, and roadmap.

Program structure

Scope

"All of Slack's deployment systems & processes" for the highest-importance services (Webapp backend, Webapp frontend, Mobile apps, portions of infra). Hundreds of internal services, many different deployment systems.

Metric

Hours of customer impact from high-severity and selected medium-severity change-triggered incidents. Canonicalised as concepts/customer-impact-hours-metric.

The metric is explicitly an "imperfect analog" of customer sentiment, sitting in a three-layer chain:

Customer sentiment ↔ Program Metric ↔ Project Metric

Slack disclosed four metric-design criteria: measure results; understand the real-versus-analog gap; be consistent in measurement; validate continually.
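The page doesn't disclose how the hours are computed, but the metric definition above implies a straightforward aggregation. A minimal sketch, assuming hypothetical severity labels (`sev1`, `sev2`) and a caller-supplied rule for which medium-severity incidents are "selected" — none of these names come from Slack's disclosure:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    severity: str            # hypothetical labels: "sev1" (high), "sev2" (medium)
    change_triggered: bool   # was the incident caused by a change/deploy?
    impact_start: datetime
    impact_end: datetime


def customer_impact_hours(incidents, include_sev2=lambda i: True):
    """Sum hours of customer impact from change-triggered incidents.

    High-severity incidents always count; medium-severity incidents are
    included selectively, here via a caller-supplied predicate.
    """
    total = 0.0
    for i in incidents:
        if not i.change_triggered:
            continue
        if i.severity == "sev1" or (i.severity == "sev2" and include_sev2(i)):
            total += (i.impact_end - i.impact_start).total_seconds() / 3600
    return total
```

The essential property for the program is not the exact formula but consistency: the same inclusion rules applied quarter over quarter, so the trend line stays comparable.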

North Star goals

Verbatim (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

The initial North Star goals "evolved and expanded into a Deploy Safety Manifesto that now applies to all Slack's deployment systems & processes."

Investment strategy

Five axes (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

  • Invest widely initially and bias for action.
  • Focus on areas of known pain first.
  • Invest further in projects or patterns based on results.
  • Curtail investment in the least impactful areas.
  • Set a flexible shorter-term roadmap which may change based on results.

Canonicalised as patterns/invest-widely-then-double-down-on-impact.

Program-governance cadence

  • Executive reviews every 4–6 weeks for continued alignment and support.
  • High-level priority in company/engineering goals (OKR, V2MOM).
  • Solid support from executive leadership with active project alignment, encouragement, and customer feedback. Named sponsors: SVP Milena Talavera, SVP Peter Secor, VP Cisco Vila.

Architectural deliverables

Webapp backend (first substrate, fully automated)

Quarterly investment sequence disclosed verbatim:

  1. Q1: Engineer automatic metric monitoring.
  2. Q2: Confirm customer-impact alignment via automatic alerts and manual rollback actions.
  3. Q3–Q4: Invest in automatic deployments and rollback.
  4. Q4+: Prove success with many automatic rollbacks keeping customer impact below 10 minutes.
  5. Q4+: Further investment to monitor additional metrics + invest in manual rollback optimisations.
  6. Q4+: Invest in a manual Frontend rollback capability.
  7. Q4+: Aligned further investment toward the centralised deployment orchestration system inspired by ReleaseBot + AWS Pipelines.
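The rollback mechanics themselves are not disclosed — the caveats below note that metrics watched, thresholds, and triggers stay at program-management altitude. But "many automatic rollbacks keeping customer impact below 10 minutes" implies a monitor-then-remediate loop of roughly this shape. A toy sketch, with the metric, threshold, and debounce count all invented for illustration:

```python
import time


def monitor_and_rollback(read_error_rate, rollback, threshold=0.02,
                         poll_seconds=30, max_polls=20):
    """Poll a deploy-health metric; trigger rollback on a sustained regression.

    Illustrative only: Slack's actual metrics, thresholds, and rollback
    mechanics are not public. Returns "rolled_back" or "healthy".
    """
    consecutive_bad = 0
    for _ in range(max_polls):
        if read_error_rate() > threshold:
            consecutive_bad += 1
            if consecutive_bad >= 2:   # debounce transient spikes
                rollback()
                return "rolled_back"
        else:
            consecutive_bad = 0
        time.sleep(poll_seconds)
    return "healthy"
```

With 30-second polls and a two-sample debounce, a genuine regression is detected and reverted within a couple of minutes — well inside a 10-minute customer-impact budget, which is the point of automating the loop rather than paging a human.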

Webapp frontend

Manual rollback capability added first; automatic rollback on the program's ongoing roadmap.

Mobile apps

Faster mobile-app issue detection is disclosed as a program investment. Platform-asymmetric rollout mechanics (Play Store vs App Store) constrain the MTTR target differently per platform — see concepts/platform-asymmetric-rollout-control.
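The asymmetry comes from what each store lets you do once a bad build is out. The capability table below is an assumption based on publicly documented store behaviour (staged rollouts on Play, phased release on the App Store), not on anything Slack disclosed about its tooling:

```python
# Illustrative capability flags per platform — assumptions from public
# store behaviour, not Slack's internal tooling.
PLATFORM_CONTROLS = {
    "android_play_store": {
        "staged_rollout_percent": True,     # hold at e.g. 1% / 10% / 50%
        "halt_rollout": True,               # stop further distribution
        "rollback_to_prior_binary": False,  # must ship a new build instead
    },
    "ios_app_store": {
        "staged_rollout_percent": True,     # 7-day phased release schedule
        "halt_rollout": True,               # pause the phased release
        "rollback_to_prior_binary": False,  # must ship a new build instead
    },
}


def fastest_remediation(platform):
    """Pick the quickest remediation lever available on a platform."""
    c = PLATFORM_CONTROLS[platform]
    if c["rollback_to_prior_binary"]:
        return "rollback"
    if c["halt_rollout"]:
        return "halt_and_hotfix"
    return "hotfix_only"
```

Because neither store allows reverting to a prior binary, fast *detection* carries most of the MTTR weight on mobile: the sooner a regression is caught at a small rollout percentage, the fewer users need the eventual hotfix.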

Centralised orchestration system

Unifies metrics-based deployments with automatic remediation "beyond Slack Bedrock / Kubernetes to many other deployment systems" — explicitly extending to EC2, Terraform on the roadmap. Inspired by ReleaseBot + AWS Pipelines. Canonicalised as patterns/centralised-deployment-orchestration-across-systems.
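The orchestrator's internals are not disclosed, but "unifying metrics-based deployments with automatic remediation" across heterogeneous systems suggests a common interface that each deployment backend (Bedrock/Kubernetes, EC2, Terraform, …) implements. A minimal sketch of that shape — the `DeploySystem` protocol and method names are invented:

```python
from typing import Protocol


class DeploySystem(Protocol):
    """Hypothetical common interface each deployment backend would implement."""
    def deploy(self, version: str) -> None: ...
    def health_ok(self) -> bool: ...
    def remediate(self) -> None: ...


def orchestrate(systems: dict, version: str) -> dict:
    """Run the same metrics-gated deploy-and-remediate loop across
    heterogeneous deployment systems."""
    outcomes = {}
    for name, system in systems.items():
        system.deploy(version)
        if system.health_ok():
            outcomes[name] = "deployed"
        else:
            system.remediate()
            outcomes[name] = "remediated"
    return outcomes
```

The design payoff of this shape is the roadmap item above: adding EC2 or Terraform means implementing one interface, not rebuilding the metrics-and-rollback loop per system.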

Results

  • 90% reduction in customer impact hours from peak quarter (Feb–Apr 2024) to Jan 2025.
  • Peak in Feb–Apr 2024 — after the first quarter of project delivery (Q4 2023 → Q1 2024 metric-monitoring infrastructure) but before automatic rollback was broadly deployed.
  • Dramatic improvement following the introduction of automatic rollback.
  • Non-linear progress: "the bar chart above tells quite a story of non-linear progress and the difficulties with using trailing metrics based on waiting for incident occurrence."
  • 3–6 month lag from project delivery to full impact visibility on the metric.
  • Lower 2025 target continuing the downward trend, with "focus on mitigating the risk of infrequent spikes through deployment system consistency."

Named organisational lessons

  1. Prioritisation + alignment (exec review + OKR weight).
  2. Patience with trailing metrics + faith under uncertainty — see concepts/trailing-metric-patience.
  3. Folks delay adoption until they're familiar with the tooling, so encourage frequent use — see the fluency discipline in patterns/automated-detect-remediate-within-10-minutes.
  4. Direct outreach to engineering teams is critical: "Not all teams and systems are the same."
  5. Metric consistency over metric perfection: "Pick a metric, be consistent, and make refinements based on validation of results."
  6. Maintain direct communication with engineering staff — acknowledged gap the program is actively working to improve.

Program team (disclosed)

  • Exec sponsors: SVP Milena Talavera, SVP Peter Secor, VP Cisco Vila.
  • Team-engagement leads: Petr Pchelko, Harrison Page.
  • Program team members: Dave Harrington, Sam Bailey, Sreedevi Rai, Vani Anantha, Matt Jennings, Nathan Steele, Sriganesh Krishnan.

Ongoing roadmap (as of 2025-10-07)

  • Improvements in automatic metrics-based deployments + remediation.
  • More consistency in use of Deploy Safe processes across all deployments — "to mitigate unexpected or infrequent spikes of customer impact."
  • Migrating remaining manual deploy processes to code-based deploys using Deploy Safe processes, an explicit scope expansion of the program.
  • Centralised deployment orchestration expansion into other infrastructure deployment patterns (EC2, Terraform).
  • Automatic rollbacks for Frontend.
  • Metric quality improvements — "do we have the right metrics for each system/service?"
  • AI metric-based anomaly detection.
  • Further rollout of AI-generated pre-production tests.

Positioning on the wiki

The Deploy Safety Program is a program-as-system canonicalisation distinct from earlier reliability-program entries in the corpus. Closest analogues:

Caveats

  • Program-as-system framing is by analogy; this page canonicalises organisational, not code, structure.
  • Load-bearing disclosures are at program-management altitude, not mechanism altitude. Metrics watched, pause thresholds, rollback triggers, phase structure, and orchestrator internals are not disclosed.
  • 90% is program-aggregate; per-project attribution is not given.
  • The manifesto itself is not public. The canonical disclosure names that a Deploy Safety Manifesto exists and has grown to cover "all Slack's deployment systems & processes" but the document isn't publicly available.
  • Post-Salesforce-acquisition context. Trust as the #1 value is the closing frame — the program lives in a Salesforce trust-value context.
