PATTERN Cited by 1 source
Cohort percentage rollout with explicit inclusion criteria¶
Specialisation of patterns/staged-rollout for fleet-wide enforcement rollouts (binary authorization, new auth model, mandatory MFA, stricter egress policy) where the per-device risk of blowing up productivity varies a lot across the population. Rather than picking cohort members randomly or by team, sort the fleet by a predictable-risk metric and advance cohorts in order from lowest-risk to highest-risk.
Shape¶
- Pick a risk metric derived from telemetry — usually the count of events that would have been blocked under the new enforcement, observed during a monitoring-mode precursor. Examples:
- Binary authz: "unique
UNKNOWNbinaries this user executed in the last 30 days". - Stricter egress: "unique denied domains this user's traffic hit in the last 30 days".
- MFA on internal apps: "unique auth sessions per day".
- Define cohort inclusion criteria explicitly — not just percentages. "10% with zero unknown binaries" ≠ "10% of users". The gate isn't just N%, it's which N%.
- Gate cohort-to-cohort advancement on in-cohort signal. Enter the next cohort only after the current cohort's block rate is acceptable. The enforcement path has the instrument (block events) built in, so this is a measured handoff.
- Include new hires at a specific cohort threshold. Pick a point (e.g., 70%) where new hires enter enforcement mode on day one — keeps the remaining cohort small and avoids the "new hire in monitoring mode while their team is in lockdown" asymmetry.
- Pause before 100% for a hardening sweep. At ~98%, stop advancing, investigate the tail (users who hit the escape hatch, users with unresolved blocks, users whose workflows need bespoke rule scoping), and add rules rather than force the tail into enforcement with broken rulesets.
- Provide a revert path during rollout — individual users can flip their device back to monitoring mode via a known command if lockdown breaks them.
- Retire the escape hatch at 100% — it's a rollout tool, not a steady-state bypass.
Why this shape¶
- Risk metric = pre-enforcement observable signal. The rollout never surprises you with an unexpected cohort of heavy-UNKNOWN users hitting lockdown — you already know who they are and you sequence them last.
- Advancing by risk tier surfaces highest-leverage rule work first. Solving the 10% cohort's small issues is cheap; the final 30% (engineers / data scientists / anyone with specialist toolchains) needs structural infra work (group-scoped rules, Compiler rule expansions, etc.) and wants the full prior rollout's data feeding into it.
- Pause-at-98% is critical. Forcing the tail into lockdown on an unfinished ruleset is exactly the kind of productivity-freeze the rollout was supposed to avoid.
When this applies vs plain staged rollout¶
- Plain staged rollout — uniform population, random or round-robin cohort selection, percentage thresholds are about blast-radius. The change doesn't interact with per-user behaviour much.
- Cohort percentage rollout with inclusion criteria — heterogeneous population, per-user risk is knowable from telemetry, cohort composition is a correctness decision not just a blast-radius decision. The change interacts with per-user behaviour in predictable ways.
The two aren't mutually exclusive — you can do canary-cell pod-% staged rollouts for the enforcement code itself (Santa agent update) while doing cohort-percentage rollout for the policy switch (lockdown vs monitoring).
Seen in¶
- sources/2026-04-21-figma-rolling-out-santa-without-freezing-productivity — Figma's three-month Santa rollout to 100% of the macOS fleet:
- 10%: users with zero unknown binaries last month
- 25% / 50% / 70%: users with <10 unique unknown binaries; all new hires enter lockdown at 70%
- Final 30%: engineers + data scientists, whose toolchain
(Anaconda,
codesign --sign -) produces per-machine unique binary hashes that fleet-wide permissive rules would over-admit — addressed by building group-based rule scoping in the sync server - 98%: pause one month, investigate
/santa disableusers and >10-unknown machines, add group/personal rules as needed - 100%: retire
/santa disable, promote lockdown into the Endpoint Security Baseline Canonical wiki instance.
Sibling patterns¶
- patterns/shadow-application-readiness — measures the residual blast-radius before rollout by running the new enforcer against captured traffic; this pattern is the rollout-time companion.
- patterns/golden-path-with-escapes — end-state shape: the global ruleset is the golden path, group-scoped / personal rules are the escapes.
- patterns/weighted-dns-traffic-shifting — percentage-weighted rollout at the traffic layer (service migration) rather than at the user/device layer; same skeleton, different target population.
Related¶
- patterns/staged-rollout — the general version this specialises.
- patterns/rollout-escape-hatch — mandatory paired pattern during rollout.
- patterns/data-driven-allowlist-monitoring-mode — where the risk metric comes from.
- patterns/fast-rollback — the other side of blast-radius bounding.