Pre-10% fleet detection goal¶
Definition¶
The pre-10% fleet detection goal is the deploy-safety North Star target stating that problematic deployments must be detected before they reach more than 10% of the production fleet. The number caps the blast radius of a bad change at 10% of user traffic (or 10% of host instances, depending on the definition used).
The goal is a rollout-gating target, not a technique: achieving it typically requires composing staged rollout, cohort-percentage rollout, fast per-phase observability signals, and a feedback control loop that can actually halt the rollout before exposure exceeds 10%.
Canonical disclosure¶
Slack's 2025-10-07 Deploy Safety retrospective canonicalises the goal alongside two MTTR goals (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):
"Reducing severity of impact — Detect problematic deployments prior to reaching 10% of the fleet"
Paired with:
- "Automated detection & remediation within 10 minutes"
- "Manual detection & remediation within 20 minutes"
Together, the three form Slack's Deploy Safety Manifesto North Star goals.
Why 10% rather than some other number¶
The 10% number is organisation-specific. The reasoning behind Slack's choice is implicit in the source, but the structural argument generalises:
- 1% is a typical canary cohort. A 1% detection goal means catching the regression in the canary before graduating to the next phase — very aggressive, and only achievable when the canary signal is sensitive enough to detect customer impact at that scale.
- 10% is the typical "beta" or "wave 1" phase in a 1% → 10% → 50% → 100% rollout. 10% means catching the regression by the end of the beta phase.
- 50% is a typical mid-rollout point. A 50% detection goal allows the rollout to reach half of production before gating; that is aggressive enough for most organisations but a low bar for high-impact services.
Slack's choice of 10% explicitly targets "reducing severity of impact" — the 10% cap bounds how bad a regression can get before detection, not whether a regression happens at all.
Composition with MTTR goals¶
The goal composes with the two MTTR North Stars; the three targets can be read as a triangle of independent bounds:
- 10-min auto MTTR bounds the worst-case impact given early detection.
- 20-min manual MTTR bounds the worst-case impact when automation isn't in place (human-mediated remediation).
- 10% fleet detection bounds the worst-case exposure scope, independent of the remediation speed.
Three independent axes, all with quantitative targets.
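The interaction of the exposure cap and the MTTR bounds can be sketched as a back-of-envelope upper bound on impacted user-minutes. This is an illustrative calculation only: the fleet size and the multiplicative impact model are assumptions, not figures from the Slack source.

```python
# Back-of-envelope: how the exposure cap and MTTR targets jointly bound
# worst-case impact. Fleet size and model are hypothetical, not Slack's.

def worst_case_user_minutes(fleet_users: int,
                            exposure_cap: float,
                            mttr_minutes: float) -> float:
    """Upper bound on impacted user-minutes, assuming detection fires
    exactly at the exposure cap and remediation takes the full MTTR."""
    return fleet_users * exposure_cap * mttr_minutes

FLEET = 1_000_000  # hypothetical fleet size

auto = worst_case_user_minutes(FLEET, 0.10, 10)    # automated path
manual = worst_case_user_minutes(FLEET, 0.10, 20)  # manual path

print(auto)    # 1000000.0 user-minutes
print(manual)  # 2000000.0 user-minutes
```

Under this model, tightening any one axis (cap, auto MTTR, manual MTTR) shrinks the bound independently of the other two, which is why the three goals are complementary rather than redundant.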
Required substrate¶
Achieving the goal requires, at minimum:
- Phased rollout substrate. Deploy must move through discrete cohorts (1% → 10% → …) with pause points.
- Fast detection signal. Metrics must surface customer impact within seconds to minutes — not tens of minutes or hours. A 15-minute aggregation window means the rollout has already moved to 20%+ exposure by the time the signal arrives.
- Halt mechanism. The rollout tooling must be able to halt between phases (automatically on metric-alarm, or manually on human decision).
- Customer-aligned signal. Generic error-rate alarms can miss per-customer-segment impact. The signal must align with what actually matters to customers.
See concepts/feedback-control-loop-for-rollouts for the full feedback-loop framing.
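The substrate above can be condensed into a minimal feedback-loop sketch: advance through phases, check a fast per-phase signal, and halt promotion on the first bad reading. The phase percentages, the `healthy` callback, and all names are illustrative assumptions, not Slack's tooling.

```python
# Minimal sketch of a rollout feedback loop for the pre-10% goal,
# assuming the phased-cohort substrate described above. Phase sizes,
# the signal callback, and function names are illustrative.

from typing import Callable

PHASES = [1, 10, 50, 100]  # cumulative % of fleet per phase

def run_rollout(healthy: Callable[[int], bool]) -> int:
    """Advance through phases, halting as soon as the detection signal
    reports customer impact. Returns the % of fleet exposed."""
    exposed = 0
    for pct in PHASES:
        exposed = pct          # deploy to this cohort
        if not healthy(pct):   # fast per-phase detection signal
            return exposed     # halt: no further promotion
    return exposed

# A regression visible at the 1% canary halts the rollout at 1%:
print(run_rollout(lambda pct: pct < 1))   # 1
# A healthy deploy promotes all the way to 100%:
print(run_rollout(lambda pct: True))      # 100
```

The sketch makes the dependency explicit: meeting the pre-10% goal requires both that `healthy` returns a trustworthy answer within the phase's dwell time and that the loop has authority to stop promotion.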
Relationship to blast-radius framing¶
The pre-10% goal is a fleet-level blast-radius cap. It fits into the wiki's blast-radius taxonomy at the deployment-cohort altitude (below AZ / region / tenant; above request).
See concepts/blast-radius for the full boundary hierarchy.
Related framings elsewhere in the wiki:
- concepts/active-multi-cluster-blast-radius — cluster-level blast-radius; multiple active clusters bound the impact of any single cluster's bad deploy.
- patterns/cell-based-architecture-for-blast-radius-reduction — cell-level bound; each cell serves a capped fraction of customers, so a cell-level outage caps impact at that fraction.
Operational implications¶
- Observability investment is a prerequisite. Slack's Webapp backend investment sequence began with "A quarter of work to engineer automatic metric monitoring" before any rollout-gating was possible. The metric-monitoring quality governs how early in the rollout you can detect.
- Small rollout phases are load-bearing. A 1% → 100% rollout cannot meet the goal: the single phase step jumps straight past 10%, leaving the 1% canary as the only detection window.
- Customer-segment sensitivity. A 10% cap on overall fleet exposure still permits 100% exposure for a specific customer segment if the deploy cohort happens to be correlated with customer segmentation. Rollout phase design should explicitly mix segments.
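One common way to mix segments across cohorts is segment-agnostic hashing of host identifiers, so that a 10% cohort draws roughly 10% of every segment rather than 100% of one. The following sketch assumes stable string host IDs; the function names and bucket count are illustrative, not from the source.

```python
# Sketch of segment-mixed cohort assignment via stable hashing.
# Uniform hashing of host IDs spreads every customer segment across
# rollout cohorts, so a 10% cohort exposes ~10% of each segment.
# Names and the 100-bucket scheme are illustrative assumptions.

import hashlib

def cohort_bucket(host_id: str, buckets: int = 100) -> int:
    """Stable bucket in [0, buckets) derived from a host id."""
    digest = hashlib.sha256(host_id.encode()).hexdigest()
    return int(digest, 16) % buckets

def in_cohort(host_id: str, cohort_pct: int) -> bool:
    """True if this host falls in the first cohort_pct% of the fleet."""
    return cohort_bucket(host_id) < cohort_pct

hosts = [f"host-{i}" for i in range(1000)]
canary = [h for h in hosts if in_cohort(h, 10)]
print(len(canary))  # roughly 100 hosts, drawn uniformly
```

Because assignment is stable (the same host always lands in the same bucket), a halted rollout can resume or roll back against exactly the cohort that was exposed.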
- The goal is per-change. Not every change needs to be rollout-gated to the same degree — emergency security patches may need faster-than-10% promotion; low-risk configuration changes may not need rollout at all.
Caveats¶
- Detection before 10% is a rollout-cadence goal, not an exposure-cap guarantee. If detection fails (the metric misses the regression), the rollout continues past 10%. The goal bounds the good case, not the bad.
- 10% can still be a lot. For a service with 100M DAU, 10% is 10M impacted users.
- The metric is per-deploy, not per-incident. A small percentage of bad deploys may still reach 100% — the goal is an aspirational target, not a hard gate.
- Customer-impact visibility at 10%. The goal presumes customer impact is detectable at 10%. For some failure modes (slow memory leak, batch-job-only regression), it may not be.
Seen in¶
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — canonical statement of the 10% fleet-exposure detection goal as one of three North Star deploy-safety goals at Slack.