Pre-10% fleet detection goal¶
Definition¶
The pre-10% fleet detection goal is the deploy-safety North Star target stating that problematic deployments must be detected before they reach more than 10% of the production fleet. The number caps the blast radius of a bad change at 10% of user traffic (or 10% of host instances, depending on the definition used).
The goal is a rollout-gating target, not a technique: achieving it typically requires composing staged rollout, cohort-percentage rollout, fast per-phase observability signals, and a feedback control loop that can actually halt the rollout before exposure exceeds 10%.
Canonical disclosure¶
Slack's 2025-10-07 Deploy Safety retrospective canonicalises the goal alongside two MTTR goals (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):
"Reducing severity of impact — Detect problematic deployments prior to reaching 10% of the fleet"
Paired with:
- "Automated detection & remediation within 10 minutes"
- "Manual detection & remediation within 20 minutes"
Together, the three form Slack's Deploy Safety Manifesto North Star goals.
Why 10% rather than some other number¶
The 10% number is organisation-specific. The reasoning behind Slack's choice is implicit in the source, but the structural argument generalises:
- 1% is a typical canary cohort. A 1% detection goal means catching the regression in the canary before graduating to the next phase — very aggressive, and only achievable when the canary signal is sensitive enough to detect customer impact at that scale.
- 10% is the typical "beta" or "wave 1" phase in a 1% → 10% → 50% → 100% rollout. 10% means catching the regression by the end of the beta phase.
- 50% is a typical mid-rollout point. A 50% detection goal allows the rollout to reach half of production before gating; that is aggressive enough for most organisations but a low bar for high-impact services.
Slack's choice of 10% explicitly targets "reducing severity of impact" — the 10% cap bounds how bad a regression can get before detection, not whether a regression happens at all.
Composition with MTTR goals¶
The goal composes with the two MTTR North Stars; the three targets can be read as a triangle of independent bounds:
- 10-min auto MTTR bounds the worst-case impact given early detection.
- 20-min manual MTTR bounds the worst-case impact when automation isn't in place (human-mediated remediation).
- 10% fleet detection bounds the worst-case exposure scope, independent of the remediation speed.
Three independent axes, all with quantitative targets.
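The interaction of the exposure cap and the MTTR bounds can be sketched as a back-of-envelope upper bound on impacted user-minutes. This is an illustrative calculation only: the fleet size and the multiplicative impact model are assumptions, not figures from the Slack source.

```python
# Back-of-envelope: how the exposure cap and MTTR targets jointly bound
# worst-case impact. Fleet size and model are hypothetical, not Slack's.

def worst_case_user_minutes(fleet_users: int,
                            exposure_cap: float,
                            mttr_minutes: float) -> float:
    """Upper bound on impacted user-minutes, assuming detection fires
    exactly at the exposure cap and remediation takes the full MTTR."""
    return fleet_users * exposure_cap * mttr_minutes

FLEET = 1_000_000  # hypothetical fleet size

auto = worst_case_user_minutes(FLEET, 0.10, 10)    # automated path
manual = worst_case_user_minutes(FLEET, 0.10, 20)  # manual path

print(auto)    # 1000000.0 user-minutes
print(manual)  # 2000000.0 user-minutes
```

Under this model, tightening any one axis (cap, auto MTTR, manual MTTR) shrinks the bound independently of the other two, which is why the three goals are complementary rather than redundant.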
Required substrate¶
Achieving the goal requires, at minimum:
- Phased rollout substrate. Deploy must move through discrete cohorts (1% → 10% → …) with pause points.
- Fast detection signal. Metrics must surface customer impact within seconds to minutes — not tens of minutes or hours. A 15-minute aggregation window means the rollout has already moved to 20%+ exposure by the time the signal arrives.
- Halt mechanism. The rollout tooling must be able to halt between phases (automatically on metric-alarm, or manually on human decision).
- Customer-aligned signal. Generic error-rate alarms can miss per-customer-segment impact. The signal must align with what actually matters to customers.
See concepts/feedback-control-loop-for-rollouts for the full feedback-loop framing.
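The substrate above can be condensed into a minimal feedback-loop sketch: advance through phases, check a fast per-phase signal, and halt promotion on the first bad reading. The phase percentages, the `healthy` callback, and all names are illustrative assumptions, not Slack's tooling.

```python
# Minimal sketch of a rollout feedback loop for the pre-10% goal,
# assuming the phased-cohort substrate described above. Phase sizes,
# the signal callback, and function names are illustrative.

from typing import Callable

PHASES = [1, 10, 50, 100]  # cumulative % of fleet per phase

def run_rollout(healthy: Callable[[int], bool]) -> int:
    """Advance through phases, halting as soon as the detection signal
    reports customer impact. Returns the % of fleet exposed."""
    exposed = 0
    for pct in PHASES:
        exposed = pct          # deploy to this cohort
        if not healthy(pct):   # fast per-phase detection signal
            return exposed     # halt: no further promotion
    return exposed

# A regression visible at the 1% canary halts the rollout at 1%:
print(run_rollout(lambda pct: pct < 1))   # 1
# A healthy deploy promotes all the way to 100%:
print(run_rollout(lambda pct: True))      # 100
```

The sketch makes the dependency explicit: meeting the pre-10% goal requires both that `healthy` returns a trustworthy answer within the phase's dwell time and that the loop has authority to stop promotion.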
Relationship to blast-radius framing¶
The pre-10% goal is a fleet-level blast-radius cap. It fits into the wiki's blast-radius taxonomy at the deployment-cohort altitude (below AZ / region / tenant; above request).
See concepts/blast-radius for the full boundary hierarchy.
Related framings elsewhere in the wiki:
- concepts/active-multi-cluster-blast-radius — cluster-level blast-radius; multiple active clusters bound the impact of any single cluster's bad deploy.
- patterns/cell-based-architecture-for-blast-radius-reduction — cell-level bound; each cell serves a capped fraction of customers, so a cell-level outage caps impact at that fraction.
Operational implications¶
- Observability investment is a prerequisite. Slack's Webapp backend investment sequence began with "A quarter of work to engineer automatic metric monitoring" before any rollout-gating was possible. The metric-monitoring quality governs how early in the rollout you can detect.
- Small rollout phases are load-bearing. A 1% → 100% rollout cannot meet the goal: the single phase step jumps straight past 10%, leaving the 1% canary as the only detection window.
- Customer-segment sensitivity. A 10% cap on overall fleet exposure still permits 100% exposure for a specific customer segment if the deploy cohort happens to be correlated with customer segmentation. Rollout phase design should explicitly mix segments.
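One common way to mix segments across cohorts is segment-agnostic hashing of host identifiers, so that a 10% cohort draws roughly 10% of every segment rather than 100% of one. The following sketch assumes stable string host IDs; the function names and bucket count are illustrative, not from the source.

```python
# Sketch of segment-mixed cohort assignment via stable hashing.
# Uniform hashing of host IDs spreads every customer segment across
# rollout cohorts, so a 10% cohort exposes ~10% of each segment.
# Names and the 100-bucket scheme are illustrative assumptions.

import hashlib

def cohort_bucket(host_id: str, buckets: int = 100) -> int:
    """Stable bucket in [0, buckets) derived from a host id."""
    digest = hashlib.sha256(host_id.encode()).hexdigest()
    return int(digest, 16) % buckets

def in_cohort(host_id: str, cohort_pct: int) -> bool:
    """True if this host falls in the first cohort_pct% of the fleet."""
    return cohort_bucket(host_id) < cohort_pct

hosts = [f"host-{i}" for i in range(1000)]
canary = [h for h in hosts if in_cohort(h, 10)]
print(len(canary))  # roughly 100 hosts, drawn uniformly
```

Because assignment is stable (the same host always lands in the same bucket), a halted rollout can resume or roll back against exactly the cohort that was exposed.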
- The goal is per-change. Not every change needs to be rollout-gated to the same degree — emergency security patches may need faster-than-10% promotion; low-risk configuration changes may not need rollout at all.
Caveats¶
- Detection before 10% is a rollout-cadence goal, not an exposure-cap guarantee. If detection fails (the metric misses the regression), the rollout continues past 10%. The goal bounds the good case, not the bad.
- 10% can still be a lot. For a service with 100M DAU, 10% is 10M impacted users.
- The metric is per-deploy, not per-incident. A small percentage of bad deploys may still reach 100% — the goal is an aspirational target, not a hard gate.
- Customer-impact visibility at 10%. The goal presumes customer impact is detectable at 10%. For some failure modes (slow memory leak, batch-job-only regression), it may not be.
Seen in¶
- sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change — canonical statement of the 10% fleet-exposure detection goal as one of three North Star deploy-safety goals at Slack.