CONCEPT Cited by 1 source

Feature flag rollback for validator¶

Definition¶

Feature flag rollback for validator is the discipline of running a new or stricter validator (typically at admission time, in CI, or at some other write-path gate) behind a feature flag. The flag exists specifically so that false-positive rejections, cases where the validator rejects a valid input, can be reverted by flipping it off, without rolling the surrounding service back to an older binary.

It is a specialisation of the general feature flag pattern applied to the narrow case of enforcement code, where the failure mode of a bug is "valid manifests stop being accepted" rather than "a feature doesn't work for some users."

Why validators need their own fast-rollback lever¶

A validator that is too lenient lets through bad input — the same bad input the system would have accepted without the validator. Risk is flat against baseline.

A validator that is too strict blocks valid input. That is worse than baseline, because engineers who could ship before cannot ship now. At fleet scale this means blocked CI/CD pipelines and stalled operator workflows across many teams at once (concepts/control-plane-change-blast-radius). This blast-radius asymmetry is the reason the flag exists: the over-strict failure mode is far worse than the original problem the validator was supposed to fix.

A full binary rollback is slow — hours, not seconds — and incurs a redeploy to every cluster. A flag flip:

Takes effect in seconds.
Keeps the webhook/validator infrastructure running (so the "basic" validation that's still useful stays on).
Leaves the new code path in place so you can fix it and re-enable without redeploying.

Zalando's `-enable-advanced-validation` as the canonical example¶

The 2026-04-08 Zalando Skipper validating-admission-webhook post documents this pattern by name:

"We also kept advanced validation behind the -enable-advanced-validation feature flag. That gave us a fast rollback path without removing the webhook itself. During the rollout, we did encounter cases where some routes were rejected even though they should have been accepted. In those cases, we turned advanced validation off, fixed the issues, and continued the rollout once the behavior was correct again."

Structure:

The webhook itself is always deployed as a Kubernetes ValidatingWebhookConfiguration (a concepts/validating-admission-webhook pointed at the Skipper-hosted handler).
The old, basic, mostly-syntactic validation path is always on — it was there before this work shipped.
The new, Skipper-specific, advanced validation path (filter existence, predicate-arg type check, backend parse) is gated by -enable-advanced-validation.
Rollout is flag-off → turn flag on in tier 1 → watch skipper_route_invalid metric → turn on in tier 2 → ... → tier N. Any unexpected rejection → flag back off → fix the validator → resume.
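The rollout sequence above can be sketched as a small loop. This is a minimal illustration, not Zalando's tooling: `staged_rollout`, `set_flag`, and `false_positives` are hypothetical names, the `false_positives` callable stands in for a human or alert inspecting the skipper_route_invalid metric, and turning the flag off in every cluster enabled so far (rather than only in the failing tier) is an assumption.

```python
from typing import Callable, Dict, List


def staged_rollout(
    tiers: List[List[str]],
    set_flag: Callable[[str, bool], None],
    false_positives: Callable[[str], int],
) -> Dict[str, bool]:
    """Enable the flag tier by tier; on any unexpected rejection, flip it
    off everywhere it was enabled and stop so the validator can be fixed."""
    state: Dict[str, bool] = {c: False for tier in tiers for c in tier}
    enabled: List[str] = []
    for tier in tiers:
        for cluster in tier:
            set_flag(cluster, True)
            state[cluster] = True
            enabled.append(cluster)
        # Watch the tier that was just enabled before widening further.
        if any(false_positives(cluster) > 0 for cluster in tier):
            # Assumption: revert every cluster enabled so far, not only this tier.
            for cluster in enabled:
                set_flag(cluster, False)
                state[cluster] = False
            return state
    return state
```

After the validator is fixed, calling the function again resumes the rollout from tier 1; a real rollout would more likely resume from the tier that failed.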

The three-level decomposition matters: if the flag had gated the whole webhook, flipping it off would revert to no write-path validation at all; by gating only the advanced layer, the baseline validation the fleet already had stays on during an advanced-validation rollback.
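A minimal sketch of that layering, assuming a simplified route model. The `Route` class, the check functions, and the `KNOWN_FILTERS` set are hypothetical stand-ins, not Skipper's actual types or filter registry; only the shape matters: the baseline checks always run, and the flag is consulted at the boundary of the new layer.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for a filter registry; not Skipper's real one.
KNOWN_FILTERS = {"setPath", "status"}


@dataclass
class Route:
    route_id: str
    filters: List[str]
    backend: str


def basic_errors(route: Route) -> List[str]:
    """Baseline layer: always on, regardless of the flag."""
    errors = []
    if not route.route_id:
        errors.append("route id is empty")
    if not route.backend:
        errors.append("backend is empty")
    return errors


def advanced_errors(route: Route) -> List[str]:
    """New, stricter layer: here only a filter-existence check."""
    return [
        f"unknown filter {name!r}"
        for name in route.filters
        if name not in KNOWN_FILTERS
    ]


def admit(route: Route, enable_advanced_validation: bool) -> List[str]:
    """Return rejection reasons; an empty list means the route is admitted."""
    errors = basic_errors(route)
    # The flag check sits at the boundary of the new logic, so turning it
    # off removes only the new rejections and leaves the baseline intact.
    if enable_advanced_validation:
        errors += advanced_errors(route)
    return errors
```

With the flag off, a route that trips only the advanced layer is admitted again, while a route with an empty backend is still rejected by the baseline layer.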

Versus other rollback mechanisms¶

Binary rollback: redeploy the prior Skipper binary fleet-wide. Slow, affects the data plane too (Skipper also serves live traffic in non-webhook mode), wide blast radius.
`failurePolicy: Ignore`: a Kubernetes setting that admits the write when the webhook call itself fails (the webhook is unreachable, times out, or returns an error). This is not a rollback; it is an availability-versus-correctness trade-off applied to the whole webhook. It does not help with false positives, because a rejection is a well-formed webhook response rather than a call failure and is enforced under either failure policy. When a call does fail, Ignore bypasses all validation, not just the advanced layer.
`matchPolicy` / selector scoping: narrow the webhook to fewer resources (for example through namespaceSelector or objectSelector). Useful to reduce blast radius up front, but not a rollback lever once the scope is already wide.
Admission-skip annotations: allow an individual object to bypass validation via an annotation. Per-object, not per-rule; requires every impacted manifest to change.

The feature flag sits at the right granularity: per-validator-rule-set, cluster-wide, flip-in-seconds.

Design rules this implies¶

Separate the new validation layer from the old one in code. The flag check must be at the boundary of the new logic so flipping it cleanly disables only the new rejections.
The old path must still work correctly when the new path is off. No cross-contamination of state.
Emit observability for both paths. You need to see what would be rejected by the new validator even when the flag is off, so you can tune the validator before flipping it on — this is the role skipper_route_invalid plays in the Zalando rollout.
Document what "on" means per cluster. In a 250+ cluster fleet, tracking which clusters are at what tier is a deployment-config concern; the flag belongs in the same config layer as the rest of the Skipper configuration.
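One way to satisfy the first three rules together is to evaluate the new layer on every request but enforce its findings only when the flag is on. The sketch below is an assumption about how such shadow evaluation could look, not a description of how Skipper or skipper_route_invalid works; the `validate` function and the counter names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

Check = Callable[[Dict], List[str]]


def validate(
    manifest: Dict,
    basic: Check,
    advanced: Check,
    enforce_advanced: bool,
    metrics: Counter,
) -> List[str]:
    """Run both layers, count their findings, enforce advanced only if flagged on."""
    errors = list(basic(manifest))
    if errors:
        metrics["basic_rejected"] += 1

    # The advanced layer is always evaluated so its findings are observable,
    # but they only become rejections when the flag is on.
    findings = list(advanced(manifest))
    if findings:
        label = "advanced_rejected" if enforce_advanced else "advanced_would_reject"
        metrics[label] += 1
    if enforce_advanced:
        errors.extend(findings)
    return errors
```

A non-zero `advanced_would_reject` count on manifests known to be valid is a signal to fix the validator before the flag is turned on in that tier.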

Seen in¶

sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time — canonical instance. The -enable-advanced-validation flag on Skipper's admission webhook is the lever Zavodskikh names as "a fast rollback path without removing the webhook itself." The rollout playbook explicitly uses the flag: when valid routes were being rejected, they turned advanced validation off, fixed the validator, then continued the rollout. Generalises the feature-flag pattern into the narrow enforcement-code rollback shape.