CONCEPT Cited by 1 source
Invalid route observability metric
Definition
An invalid route observability metric is a per-label-set counter emitted by a validator for every rejection, exposed before the validator's rejections actually block writes, so that operators can:
- See what would be rejected under the new rules without blocking anyone.
- Distinguish correct rejections (real user mistakes — ship the flag) from false positives (validator bugs — fix the validator first).
- Attribute rejections back to the specific manifest / owner (via a route_id label) and the specific rule (via a reason label) so the right team gets the right bug.
It is the observability half of the feature-flag rollback for validator pattern. The flag provides reversibility; the metric provides visibility into what reversibility would affect.
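A minimal sketch of the definition, assuming an in-memory counter in place of a real metrics registry. The validate rules, the route dictionaries, and the admit function are hypothetical illustrations, not Skipper's implementation; the point is only that the counter is incremented on every rejection whether or not the write is blocked.

```python
from collections import Counter

# Hypothetical stand-in for a labelled counter; key is (route_id, reason).
route_invalid = Counter()


def validate(route):
    """Return a rejection reason, or None if the route passes.

    The two rules below are illustrative placeholders, not Skipper's rules.
    """
    if not route.get("backend"):
        return "missing_backend"
    if route.get("predicate") not in {"Path", "Host", "Method"}:
        return "unknown_predicate"
    return None


def admit(route, enforce):
    """Count every rejection; block the write only when enforcing."""
    reason = validate(route)
    if reason is None:
        return True
    route_invalid[(route["id"], reason)] += 1
    return not enforce  # shadow mode (enforce=False) still admits the write


routes = [
    {"id": "r1", "predicate": "Path", "backend": "http://svc"},
    {"id": "r2", "predicate": "Pathh", "backend": "http://svc"},
    {"id": "r3", "predicate": "Host", "backend": ""},
]
shadow_results = [admit(r, enforce=False) for r in routes]
```

With enforce=False every write is admitted, yet the counter already holds one entry per would-be rejection, which is the visibility the flag alone does not give.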
The canonical shape: skipper_route_invalid{route_id, reason}
Zalando's 2026-04 Skipper-webhook post names the metric explicitly:
"The most useful signal was skipper_route_invalid{route_id, reason}, which told me exactly which route failed validation and why. That made it much easier to distinguish real configuration mistakes from false positives in the validator."
The label design is load-bearing:
- route_id identifies which route is invalid, so the rejection can be traced back to an Ingress or RouteGroup object and its owner.
- reason identifies why it is invalid: unknown predicate, bad filter args, unparseable backend, etc. This is what lets an operator tell "this reason has been firing legitimately for the last three hours" from "this reason started firing when we turned the flag on so it's a validator bug."
A metric without reason collapses all rejection kinds into a
single rate; you cannot tell signal from noise. A metric
without route_id reports a rate but no handle to fix the
underlying object.
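To illustrate what each label contributes, here is a small sketch that aggregates hypothetical samples over one label at a time. The sample values, route names, and the sum_by helper are assumptions made up for the example; they only mimic what a sum-by query over the metric would return.

```python
from collections import Counter

# Hypothetical samples of the metric, keyed by (route_id, reason).
samples = {
    ("team-a/checkout", "unknown_predicate"): 4,
    ("team-b/search", "unknown_predicate"): 1,
    ("team-b/search", "bad_filter_args"): 7,
}


def sum_by(samples, label_index):
    """Aggregate counts over one label position (0 = route_id, 1 = reason)."""
    totals = Counter()
    for labels, count in samples.items():
        totals[labels[label_index]] += count
    return dict(totals)


by_reason = sum_by(samples, 1)   # which rule is firing
by_route = sum_by(samples, 0)    # whose object is affected
total = sum(samples.values())    # all a label-free metric could report
```

The single total says only that something is being rejected; the by_reason and by_route views are what turn it into a rule to inspect and an owner to contact.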
The rollout loop it enables
1. Deploy webhook with advanced validation BEHIND a flag (off)
2. Emit skipper_route_invalid{route_id, reason} for every route
   the advanced validator WOULD reject (shadow mode: no writes
   actually blocked)
3. For each reason that fires:
   - Is it a real bug in a team's manifest? → reach out, fix
   - Is it a false positive? → fix the validator
4. When the reason signal is clean (reasons firing = reasons
   you expect), flip the flag on for tier 1
5. Watch the webhook's actual rejection rate match
   skipper_route_invalid: the same curve, just now it's
   blocking the write
6. Repeat tier-by-tier across the fleet
This is the shadow → enforce pattern (see patterns/shadow-mode-alert-before-paging, patterns/three-mode-rollout-off-shadow-exec, concepts/shadow-mode-alert-validation) applied specifically to admission-time enforcement.
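Steps 4 and 5 of the loop can be sketched as two small checks. This is an illustration under stated assumptions: the function names, the expected-reasons set, and the 10% tolerance are hypothetical and do not come from the Zalando post.

```python
def unexpected_reasons(shadow_counts, expected_reasons):
    """Reasons that fired in shadow mode but are not on the expected list."""
    fired = {reason for (_route_id, reason), n in shadow_counts.items() if n > 0}
    return sorted(fired - set(expected_reasons))


def can_enforce(shadow_counts, expected_reasons):
    """Step 4: flip the flag only when every firing reason is expected."""
    return not unexpected_reasons(shadow_counts, expected_reasons)


def rates_match(shadow_rate, enforced_rate, tolerance=0.1):
    """Step 5: enforced rejections should track the shadow metric.

    The relative tolerance is an assumed value, chosen for illustration.
    """
    if shadow_rate == 0:
        return enforced_rate == 0
    return abs(enforced_rate - shadow_rate) / shadow_rate <= tolerance


# Hypothetical shadow-mode counts, keyed by (route_id, reason).
shadow = {("r1", "unknown_predicate"): 3, ("r2", "bad_filter_args"): 2}
blockers = unexpected_reasons(shadow, {"unknown_predicate"})
```

Here blockers lists the reasons that still need triage (a manifest fix or a validator fix) before the tier can be switched to enforcement.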
Why the metric is necessary, not optional, for fleet-scale rollout
At a handful of clusters you can eyeball rejections in webhook logs. At 250+ clusters, 15k+ ingresses, ~200k routes:
- There are too many routes for visual inspection; any rollout decision has to be made from aggregated signal, not by reading every rejected object.
- The cost of an unknown false positive is high and diffuse — some team's CI starts failing in a region you don't operate in; they file a ticket; you backtrack. The metric fires the alarm first.
- The cost of an unknown true positive is that a team's route was already broken and they did not know. The metric tells them before the new enforcement blocks them, so the failure is not a surprise.
Without the metric, you're rolling out blind: the only signal is the rejection itself, and rejections are user-visible only after the flag is already on.
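As a rough sketch of what "aggregated signal" could look like at fleet scale, the following rolls per-cluster samples up into one row per reason. The tuple layout, cluster names, and summarize helper are assumptions for illustration; a real setup would more likely run an equivalent query in its metrics backend.

```python
from collections import defaultdict


def summarize(samples):
    """Roll (cluster, route_id, reason, count) samples up by reason."""
    rows = defaultdict(lambda: {"count": 0, "clusters": set(), "routes": set()})
    for cluster, route_id, reason, count in samples:
        row = rows[reason]
        row["count"] += count
        row["clusters"].add(cluster)
        # A route is identified by cluster plus route_id in this sketch.
        row["routes"].add((cluster, route_id))
    return {
        reason: {
            "count": row["count"],
            "clusters": len(row["clusters"]),
            "routes": len(row["routes"]),
        }
        for reason, row in rows.items()
    }


# Hypothetical samples scraped from two clusters.
fleet = [
    ("c1", "r1", "unknown_predicate", 2),
    ("c2", "r1", "unknown_predicate", 1),
    ("c1", "r2", "bad_filter_args", 5),
]
summary = summarize(fleet)
```

One row per reason, with the number of clusters and routes it touches, is small enough to review by hand even when the underlying route count is not.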
Generalisation beyond ingress validation
The same shape applies anywhere a new enforcement rule is being added to a write path:
- New IAM policy evaluator → emit policy_violation{principal, rule} before enforcing denials.
- New schema linter → emit schema_lint_fail{file, rule} before blocking CI.
- New SQL guard → emit query_reject{app, reason} before rejecting.
The label design pattern — {actor_or_object, rule_or_reason}
— is transferable. It maps one-to-one to the two questions an
operator asks during rollout: whose code is this about and
which rule fired.
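The transferable shape can be sketched as a small generic counter keyed by the two labels. The class name, method names, and the example policy rules are hypothetical; the sketch only shows that the same two lookups answer both operator questions regardless of domain.

```python
from collections import Counter


class ShadowRejections:
    """Counter keyed by (actor_or_object, rule_or_reason)."""

    def __init__(self, name):
        self.name = name
        self.counts = Counter()

    def record(self, actor_or_object, rule_or_reason):
        """Count one would-be rejection without blocking anything."""
        self.counts[(actor_or_object, rule_or_reason)] += 1

    def whose(self, rule_or_reason):
        """Actors or objects that tripped the given rule."""
        return sorted(a for (a, r) in self.counts if r == rule_or_reason)

    def which(self, actor_or_object):
        """Rules that fired for the given actor or object."""
        return sorted(r for (a, r) in self.counts if a == actor_or_object)


# Hypothetical IAM example: principals and rule names are made up.
violations = ShadowRejections("policy_violation")
violations.record("svc-a", "deny_wildcard")
violations.record("svc-b", "deny_wildcard")
violations.record("svc-a", "mfa_required")
```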
Seen in
- sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time: canonical instance. Zavodskikh names skipper_route_invalid{route_id, reason} as the "most useful signal" for the tier-by-tier rollout of the Skipper admission webhook across 250+ Kubernetes clusters. The metric distinguished "real configuration mistakes from false positives in the validator," making it the gate on flipping the -enable-advanced-validation flag per tier. Paired with the feature-flag lever (concepts/feature-flag-rollback-for-validator) and the reuse-of-runtime-validator architectural move (patterns/reuse-runtime-logic-on-admission-path) to make the invisible rollout safe.
Related
- concepts/validating-admission-webhook · concepts/feature-flag-rollback-for-validator · concepts/shift-left-validation · concepts/shadow-mode-alert-validation
- patterns/shadow-mode-alert-before-paging · patterns/three-mode-rollout-off-shadow-exec · patterns/invisible-rollout-via-default-on-validation · patterns/reuse-runtime-logic-on-admission-path