PATTERN Cited by 1 source

Invisible rollout via default-on validation¶

Problem¶

A platform team is adding fleet-wide correctness enforcement — a new validating admission webhook, a schema linter, a policy check — that rejects bad inputs many teams could theoretically be writing. The rollout question is "how do we get 100% of teams to adopt this?"

The two conventional answers both have failure modes:

Opt-in rollout: announce the new validation, document how to enable it, and let teams turn it on when they're ready. In practice, adoption tops out at the teams who already cared. The teams most likely to write invalid configuration are the teams least likely to opt in, so the rate of invalid inputs barely moves.
Hard cutover rollout: schedule a date, announce it, and switch to enforcing mode fleet-wide. This breaks teams who hadn't noticed the announcement, generates support load, and creates an adversarial dynamic ("the platform team broke us"), all of which builds pressure to roll the whole thing back.

Both miss what's actually true: if the new validation rules are correct, most teams' manifests already pass them. Only teams writing invalid inputs should see any difference. The rollout shape should match that reality.

Solution¶

Roll out the validation enabled-by-default, cluster-by-cluster, so that any team writing valid manifests sees no change at all: they don't opt in, they don't change anything, they don't even know the rollout happened. The only teams who notice are the ones whose manifests were invalid, and the error message the webhook returns tells them what to fix.

Operational structure:

Ship the webhook + validator with the enforcement code behind a feature flag (e.g. -enable-advanced-validation). The flag defaults to off.
Emit an invalid-route observability metric (skipper_route_invalid{route_id, reason} in the Skipper case) for every rejection the validator would make. While the flag is off, no writes are blocked; this is the shadow phase.
For each tier (dev / staging / smallest-prod-cluster-group / larger-cluster-group / rest-of-fleet), watch the shadow metric:
If reason values look like real user mistakes in that tier's traffic → the validator is correct for that tier → enable the flag in that tier.
If reason values look like false positives → fix the validator → return to shadow mode in that tier.
When the flag is enabled in a tier, teams writing valid manifests see no change (their applies still succeed), while teams writing invalid manifests see the rejection with the specific error ("predicate 'Foo' not found"). No announcement needed, no opt-in paperwork, no migration guide.
Continue tier-by-tier to the full fleet.
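
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Skipper's implementation: the `KNOWN_PREDICATES` registry, the `Decision` type, the route dictionary shape, and the `admit` function are all hypothetical, and a plain `Counter` stands in for a real metrics client.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical registry standing in for the runtime's real predicate list.
KNOWN_PREDICATES = {"Path", "Host", "Method"}

# Stand-in for a metric such as skipper_route_invalid{route_id, reason}.
invalid_routes = Counter()


@dataclass
class Decision:
    allowed: bool
    message: str = ""


def validate(route):
    """Return a reason string for the first problem found, or None if valid."""
    for name in route["predicates"]:
        if name not in KNOWN_PREDICATES:
            return f"predicate '{name}' not found"
    return None


def admit(route, enforce):
    """Decide one admission request; `enforce` plays the role of the feature flag."""
    reason = validate(route)
    if reason is None:
        return Decision(allowed=True)  # valid manifests never see a difference
    # Recorded in both modes, so shadow data and enforced data stay comparable.
    invalid_routes[(route["id"], reason)] += 1
    if not enforce:
        return Decision(allowed=True)  # shadow mode: observe only, block nothing
    return Decision(allowed=False, message=reason)
```

With `enforce=False` an invalid route is still admitted but shows up in the counter; with `enforce=True` the same route is rejected with the message that names the missing predicate. A valid route gets the same answer in both modes, which is the property the pattern depends on.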

The test the pattern should pass: at a later moment (team offsite, internal talk, architecture review), when someone asks "how do we enable this?", the honest answer is "you don't need to, it's already on."

What makes it work¶

The validator must be correct for very nearly every input. If 5% of valid manifests are false-positively rejected, "invisible" becomes "fleet-wide surprise incident." This is why this pattern pairs with patterns/reuse-runtime-logic-on-admission-path: the only way to make the admission validator match production behaviour at this precision is to run the same validation logic that the runtime itself uses.
A feature flag is available for unexpected problems. If a tier is enabled and a false positive emerges anyway (something the shadow metric didn't catch), the flag-off recovery is seconds, not hours (concepts/feature-flag-rollback-for-validator).
Tier-by-tier, not simultaneous. Full-fleet enablement violates the invisibility guarantee: any true positive in any cluster's traffic patterns becomes a rejection at the same instant. Tier-by-tier gives you time to observe one tier's reaction before moving on.
Error messages that name the fix. "Validation failed" breaks the invisibility because now teams have to come ask you what went wrong. "Predicate 'NonExistingPredicate' not found" lets them fix it in place.
The failure mode is already a known incident class for the affected teams. The teams whose manifests were rejected were already going to have a broken route — now they get the error at apply time instead of in a support channel days later. From their perspective, it's a faster feedback loop, not a new problem.
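
The tier-by-tier gate and the flag-off recovery described above might look roughly like the following. It is a sketch only: the tier names are shortened from the list earlier on this page, and the `triage` labels ("user_mistake", "false_positive") are a hypothetical way of recording a human's reading of the shadow metric's reason values.

```python
TIERS = ["dev", "staging", "smallest-prod", "larger-prod", "rest-of-fleet"]


def advance_rollout(enabled, triage):
    """Enable the next tier in order, unless its shadow findings block it.

    enabled: set of tier names where enforcement is already on.
    triage:  dict mapping tier name -> list of labels a human assigned to that
             tier's shadow-metric reasons ("user_mistake" or "false_positive").
    Returns (new_enabled, note); at most one tier changes per call.
    """
    for tier in TIERS:
        if tier in enabled:
            continue
        labels = triage.get(tier)
        if labels is None:
            return enabled, f"{tier}: no shadow data reviewed yet, keep observing"
        if "false_positive" in labels:
            return enabled, f"{tier}: fix the validator, stay in shadow mode"
        return enabled | {tier}, f"{tier}: enforcement enabled"
    return enabled, "rollout complete"


def rollback(enabled, tier):
    """Flag-off recovery for one tier after an unexpected false positive."""
    return enabled - {tier}
```

Advancing one tier per call is a deliberate choice here: it forces a pause to observe each tier's reaction before the next one is considered.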

Contrasts with other rollout shapes¶

patterns/three-mode-rollout-off-shadow-exec — a closely related shape from Zalando's 2025-02 routesrv rollout. Three-mode is off / shadow / exec (explicit shadow phase with active comparison to the old behaviour before switching). Invisible-rollout-via-default-on can use a three-mode flag mechanism, but its distinctive property is the user experience of the result: silence for correctness, helpful rejection for mistakes, and no announcement layer.
patterns/phased-rollout-across-release-channels — stable / canary / beta release channels for a library or service; users adopt by upgrading. Opt-in by design; not invisible in the same way.
patterns/staged-rollout — generic staged rollout; doesn't specify whether users have to do anything to get the new behaviour.
patterns/shadow-mode-alert-before-paging — the shadow phase only; deciding when to move from "alarm on this signal" to "page on this signal". Can be a building block for this pattern.
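
For comparison, a three-mode flag (off / shadow / exec) can be sketched generically as below. This is one reading of the three-mode shape, not the routesrv implementation: the `Mode` enum, the `decide` function, and the boolean decisions are all assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    OFF = "off"
    SHADOW = "shadow"
    EXEC = "exec"


def decide(mode, old_decision, new_decision, mismatches):
    """Pick the decision to act on; record disagreements while shadowing.

    old_decision / new_decision are booleans (True = allow).
    mismatches is a list the caller owns; it stands in for a metric or log.
    """
    if mode is Mode.OFF:
        return old_decision  # the new logic's answer is ignored
    if mode is Mode.SHADOW:
        if new_decision != old_decision:
            mismatches.append((old_decision, new_decision))
        return old_decision  # still act on the old behaviour
    return new_decision  # EXEC: the new logic is authoritative
```

The mechanism is the same kind of switch either way; what distinguishes the invisible rollout is that users never have to touch or even know about it.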

When not to use¶

The validator rejects legitimate-today behaviour to enforce a future standard. That's a migration, not a bug catch. Users need to change their manifests. That needs opt-in, announcements, migration guides. Invisible rollout is for "this was always supposed to be wrong, and now we catch it."
The cost of a false-positive rejection is very high (financial transaction blocked, safety check blocked). The pattern's silence is a feature only when the corresponding failure class is "team's deploy fails, they see why, they fix it." If the failure class is "transactions stop processing revenue," even low false-positive rates are unacceptable.
You need proof of adoption for compliance. Some regulatory regimes want an audit trail of teams actively opting in to a control. Invisible rollout leaves no such trail; active adoption records do.

Seen in¶

sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time — canonical instance. Zavodskikh presented Skipper's validating-admission-webhook solution at an internal Zalando conference; the first audience question was how teams could enable it in their clusters. "The satisfying part was answering that they did not need to do anything, because it was already enabled. That is probably the best possible result for this kind of rollout." The rollout discipline that got them there: reuse Skipper's own validator as the admission validator so the error-rate was low enough to be invisible to valid-manifest authors (patterns/reuse-runtime-logic-on-admission-path); feature-flagged with -enable-advanced-validation for fast rollback (concepts/feature-flag-rollback-for-validator); tier-by-tier enablement guided by the skipper_route_invalid{route_id, reason} metric (concepts/invalid-route-observability-metric); and a rejection message that names the specific filter / predicate / backend problem so teams can fix in place.