
Pre-10% fleet detection goal

Definition

The pre-10% fleet detection goal is a deploy-safety North Star target: problematic deployments must be detected before they reach more than 10% of the production fleet. The number caps the blast radius of a bad change at 10% of user traffic (or 10% of host instances, depending on how the fleet is defined).

The goal is a rollout-gating target, not a technique: meeting it typically requires composing staged rollout, cohort-percentage rollout, fast per-phase observability signals, and a feedback control loop that can actually halt the rollout while exposure is still under 10%.
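As a minimal sketch of that composition — the phase fractions, function names, and health check here are illustrative assumptions, not Slack's tooling:

```python
# Hypothetical control loop: promote through staged cohorts, checking a
# fast health signal at each phase; a failed check halts the rollout
# while cumulative exposure is still bounded by the current phase.

PHASES = [0.01, 0.10, 0.50, 1.00]  # cumulative fraction of the fleet (assumed plan)

def run_rollout(deploy_to, healthy, phases=PHASES):
    """Promote phase by phase; halt on the first unhealthy signal."""
    for fraction in phases:
        deploy_to(fraction)        # expand the deploy to this cohort
        if not healthy(fraction):  # fast, customer-aligned signal
            return f"halted at {fraction:.0%}"
    return "completed at 100%"
```

A regression whose signal fires during the 1% canary phase halts the loop at 1% exposure, satisfying the pre-10% goal with room to spare.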

Canonical disclosure

Slack's 2025-10-07 Deploy Safety retrospective canonicalises the goal alongside two MTTR goals (Source: sources/2025-10-07-slack-deploy-safety-reducing-customer-impact-from-change):

"Reducing severity of impact — Detect problematic deployments prior to reaching 10% of the fleet"

Paired with:

  • "Automated detection & remediation within 10 minutes"
  • "Manual detection & remediation within 20 minutes"

Together, the three form Slack's Deploy Safety Manifesto North Star goals.

Why 10% rather than some other number

The 10% figure is organisation-specific. Slack does not justify the exact number, but the structural argument generalises:

  • 1% is a typical canary cohort. Detecting at 1% means catching the regression in the canary before it graduates to the next phase — very aggressive, and only achievable when the canary signal is sensitive enough to detect customer impact at that scale.
  • 10% is the typical "beta" or "wave 1" phase in a 1% → 10% → 50% → 100% rollout. 10% means catching the regression by the end of the beta phase.
  • 50% is a typical mid-rollout point. A 50%-detection goal allows the rollout to reach half of production before gating — adequate for many organisations but a low bar for high-impact services.

Slack's choice of 10% explicitly targets "reducing severity of impact" — the 10% cap bounds how bad a regression can get before detection, not whether a regression happens at all.

Composition with MTTR goals

The goal composes with the two MTTR North Stars to form a triangle:

     10-min automated MTTR
    /
   /
  10% fleet detection
   \
    \
     20-min manual MTTR

Interpretation:

  • 10-min auto MTTR bounds the worst-case impact given early detection.
  • 20-min manual MTTR bounds the worst-case impact when automation isn't in place (human-mediated remediation).
  • 10% fleet detection bounds the worst-case exposure scope, independent of the remediation speed.

Three independent axes, all with quantitative targets.
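A back-of-envelope combination of the three targets, assuming all hold simultaneously (the DAU figure is illustrative, not Slack's):

```python
# Worst-case bound on user-minutes of impact when detection and
# remediation both hit their targets: exposure cap x remediation time.

def worst_case_user_minutes(dau, exposure_cap, mttr_minutes):
    """Users exposed (DAU x cap) times minutes until remediation."""
    return dau * exposure_cap * mttr_minutes

# For a hypothetical 100M-DAU service: a 10% cap with 10-min automated
# MTTR bounds impact at 100M user-minutes; the 20-min manual path
# doubles that bound.
```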

Required substrate

Achieving the goal requires, at minimum:

  1. Phased rollout substrate. Deploy must move through discrete cohorts (1% → 10% → …) with pause points.
  2. Fast detection signal. Metrics must surface customer impact within seconds to minutes — not tens of minutes or hours. A 15-minute aggregation window means the rollout has already moved to 20%+ exposure by the time the signal arrives.
  3. Halt mechanism. The rollout tooling must be able to halt between phases (automatically on metric-alarm, or manually on human decision).
  4. Customer-aligned signal. Generic error-rate alarms can miss per-customer-segment impact. The signal must align with what actually matters to customers.
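The timing argument in point 2 can be made concrete; the phase cadence below is an assumption chosen only to illustrate the interaction:

```python
# If phases promote on a fixed cadence, a detection signal aggregated over
# `window` minutes can only halt at whatever cumulative exposure the
# rollout has reached by the time the window closes.

PHASES = [(0, 0.01), (5, 0.10), (10, 0.50), (15, 1.00)]  # (start minute, exposure)

def exposure_when_signal_fires(phases, regression_start, window):
    """Cumulative exposure at the moment an aggregated signal can fire."""
    fire_time = regression_start + window
    return max(e for t, e in phases if t <= fire_time)
```

Under this assumed 5-minute promotion cadence, a regression introduced in the canary but aggregated over a 15-minute window is only detectable once the rollout has already reached the full fleet; a 2-minute signal halts it at 1%.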

See concepts/feedback-control-loop-for-rollouts for the full feedback-loop framing.

Relationship to blast-radius framing

The pre-10% goal is a fleet-level blast-radius cap. It fits into the wiki's blast-radius taxonomy at the deployment-cohort altitude (below AZ / region / tenant; above request).

See concepts/blast-radius for the full boundary hierarchy.

Operational implications

  • Observability investment is a prerequisite. Slack's Webapp backend investment sequence began with "A quarter of work to engineer automatic metric monitoring" before any rollout gating was possible. The quality of the metric monitoring governs how early in the rollout detection is possible.
  • Small rollout phases are load-bearing. A 1% → 100% rollout cannot meet the goal — the phase step is too large.
  • Customer-segment sensitivity. A 10% cap on overall fleet exposure still permits 100% exposure for a specific customer segment if the deploy cohort happens to be correlated with customer segmentation. Rollout phase design should explicitly mix segments.
  • The goal is per-change. Not every change needs to be rollout-gated to the same degree — emergency security patches may need faster-than-10% promotion; low-risk configuration changes may not need rollout at all.
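One common way to get the segment mixing described above is stable hash-based bucketing, so every phase contains a proportional slice of every customer segment; this is a generic sketch, not Slack's mechanism:

```python
import hashlib

def in_cohort(host_id: str, fraction: float) -> bool:
    """Stable bucketing: roughly `fraction` of hosts match, and cohorts
    are nested (a host in the 1% cohort is also in the 10% cohort)."""
    digest = hashlib.sha256(host_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction
```

Because the bucket depends only on the host, a customer segment correlated with shard placement is still spread across phases — provided host IDs are not themselves segment-correlated.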

Caveats

  • Detection before 10% is a rollout-cadence goal, not an exposure-cap guarantee. If detection fails (the metric misses the regression), the rollout continues past 10%. The goal bounds the well-detected case, not the failure case.
  • 10% can still be a lot. For a service with 100M DAU, 10% is 10M impacted users.
  • The metric is per-deploy, not per-incident. A small percentage of bad deploys may still reach 100% — the goal is an aspirational target, not a hard gate.
  • Customer-impact visibility at 10%. The goal presumes customer impact is detectable at 10%. For some failure modes (slow memory leak, batch-job-only regression), it may not be.
