CONCEPT Cited by 1 source

Invalid route observability metric

Definition

An invalid route observability metric is a labelled counter that a validator increments for every route it rejects or would reject. It is exposed before those rejections actually block writes, so that operators can:

  1. See what would be rejected under the new rules without blocking anyone.
  2. Distinguish correct rejections (real user mistakes — ship the flag) from false positives (validator bugs — fix the validator first).
  3. Attribute rejections back to the specific manifest / owner (via a route_id label) and the specific rule (via a reason label) so the right team gets the right bug.

It is the observability half of the feature-flag-rollback-for-validator pattern. The flag provides reversibility; the metric provides visibility into what turning the flag on, or back off, would affect.

The canonical shape: skipper_route_invalid{route_id, reason}

Zalando's 2026-04 Skipper-webhook post names the metric explicitly:

"The most useful signal was skipper_route_invalid{route_id, reason}, which told me exactly which route failed validation and why. That made it much easier to distinguish real configuration mistakes from false positives in the validator."

The label design is load-bearing:

  • route_id — identifies which route is invalid, so the rejection can be traced back to an Ingress or RouteGroup object and its owner.
  • reason — identifies why it's invalid: unknown predicate, bad filter args, unparseable backend, etc. This is what lets an operator tell "this reason has been firing legitimately for the last three hours" from "this reason started firing when we turned the flag on so it's a validator bug."

A metric without reason collapses all rejection kinds into a single rate, so you cannot tell signal from noise. A metric without route_id reports a rate but gives you no handle on the underlying object that needs fixing.
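
To make the role of the two labels concrete, here is a minimal in-memory sketch of a counter keyed by (route_id, reason). It is not Skipper's implementation: the LabelledCounter class, the route IDs, and the reason strings are all hypothetical, and a real deployment would use its metrics library rather than a hand-rolled counter.

```python
from collections import Counter


class LabelledCounter:
    """Hypothetical stand-in for a labelled counter metric."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, **labels):
        # Require exactly the declared labels so every sample stays attributable.
        if set(labels) != set(self.label_names):
            raise ValueError(
                f"expected labels {self.label_names}, got {sorted(labels)}"
            )
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += 1

    def sum_by(self, label):
        # Aggregate over every other label, similar to a sum-by query.
        idx = self.label_names.index(label)
        out = Counter()
        for key, value in self._values.items():
            out[key[idx]] += value
        return out


# Route IDs and reason values below are made up for illustration.
route_invalid = LabelledCounter("skipper_route_invalid", ["route_id", "reason"])
route_invalid.inc(route_id="shop_r1", reason="unknown_predicate")
route_invalid.inc(route_id="shop_r1", reason="unknown_predicate")
route_invalid.inc(route_id="cart_r2", reason="invalid_backend")

by_reason = route_invalid.sum_by("reason")    # which rule is firing
by_route = route_invalid.sum_by("route_id")   # whose object is affected
```

Summing by reason answers "which rule is firing"; summing by route_id answers "which object do I go and fix". Dropping either label removes one of those two views.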

The rollout loop it enables

1. Deploy webhook with advanced validation BEHIND a flag (off)
2. Emit skipper_route_invalid{route_id, reason} for every route
   the advanced validator WOULD reject (shadow mode — no writes
   actually blocked)
3. For each reason that fires:
   - Is it a real bug in a team's manifest? → reach out, fix
   - Is it a false positive? → fix the validator
4. When the reason signal is clean (reasons firing = reasons
   you expect), flip the flag on for tier 1
5. Watch the webhook's actual rejection rate match
   skipper_route_invalid — the same curve, just now it's
   blocking the write
6. Repeat tier-by-tier across the fleet

This is the shadow → enforce pattern (see patterns/shadow-mode-alert-before-paging, patterns/three-mode-rollout-off-shadow-exec, concepts/shadow-mode-alert-validation) applied specifically to admission-time enforcement.
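
The shadow and enforce steps of the loop above can be sketched as a single admission function. This is an illustration under assumptions, not the webhook's actual code: Route, KNOWN_PREDICATES, advanced_validate, and admit are hypothetical names, and the single "unknown predicate" check stands in for whatever the advanced validator really does.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    route_id: str
    predicates: tuple


# Illustrative allow-list; the real validator's rules are not shown in the source.
KNOWN_PREDICATES = {"Path", "Host", "Method"}

# Stand-in for skipper_route_invalid{route_id, reason}.
invalid_counter = Counter()


def advanced_validate(route: Route) -> Optional[str]:
    """Return a reason string if the route is invalid, else None."""
    for predicate in route.predicates:
        if predicate not in KNOWN_PREDICATES:
            return "unknown_predicate"
    return None


def admit(route: Route, enforce: bool) -> bool:
    """Count every would-be rejection; block the write only when enforcing."""
    reason = advanced_validate(route)
    if reason is None:
        return True
    invalid_counter[(route.route_id, reason)] += 1
    return not enforce


bad = Route("shop_r1", ("Path", "Bogus"))
shadow_result = admit(bad, enforce=False)    # counted, write still admitted
enforced_result = admit(bad, enforce=True)   # counted, write rejected
```

Because the counter is incremented on the same code path in both modes, the curve observed in shadow mode is the curve you should expect to see as real rejections once the flag is on.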

Why the metric is necessary, not optional, for fleet-scale rollout

At a handful of clusters you can eyeball rejections in webhook logs. At 250+ clusters, 15k+ ingresses, and ~200k routes, that stops working:

  • There are too many routes for visual inspection; any rollout decision has to be made from aggregated signal, not by reading every rejected object.
  • The cost of an unknown false positive is high and diffuse — some team's CI starts failing in a region you don't operate in; they file a ticket; you backtrack. The metric fires the alarm first.
  • The cost of an unknown true positive is that a team's route is already broken and they don't know it; the metric tells them before the new enforcement blocks them, so the failure isn't a surprise.

Without the metric, you're rolling out blind: the only signal is the rejection itself, and rejections are user-visible only after the flag is already on.
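
As a rough sketch of what "deciding from aggregated signal" could look like, the function below rolls per-cluster samples up by reason and flags any reason you did not expect to see. The summarise function, the cluster field (assumed to be attached at collection time, since the metric itself only carries route_id and reason), and the sample values are all hypothetical.

```python
from collections import defaultdict


def summarise(samples, expected_reasons):
    """Roll up (cluster, route_id, reason, count) samples by reason.

    Returns a per-reason summary plus the sorted list of reasons that
    are firing but are not in expected_reasons.
    """
    totals = defaultdict(int)
    routes = defaultdict(set)
    for cluster, route_id, reason, count in samples:
        totals[reason] += count
        routes[reason].add((cluster, route_id))
    summary = {
        reason: {"rejections": totals[reason], "routes": len(routes[reason])}
        for reason in totals
    }
    unexpected = sorted(set(totals) - set(expected_reasons))
    return summary, unexpected


# Made-up samples from two clusters.
samples = [
    ("cluster-a", "shop_r1", "unknown_predicate", 3),
    ("cluster-b", "shop_r1", "unknown_predicate", 1),
    ("cluster-b", "cart_r9", "invalid_backend", 2),
]
summary, unexpected = summarise(samples, expected_reasons={"unknown_predicate"})
```

An empty unexpected list corresponds to the "reasons firing = reasons you expect" condition in the rollout loop; a non-empty one is a prompt to check for a validator bug before enforcing.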

Generalisation beyond ingress validation

The same shape applies anywhere a new enforcement rule is being added to a write path:

  • New IAM policy evaluator → emit policy_violation{principal, rule} before enforcing denials.
  • New schema linter → emit schema_lint_fail{file, rule} before blocking CI.
  • New SQL guard → emit query_reject{app, reason} before rejecting.

The label design pattern — {actor_or_object, rule_or_reason} — is transferable. It maps one-to-one to the two questions an operator asks during rollout: whose code is this about and which rule fired.
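
The transferable part can be sketched as a small wrapper that is independent of what is being checked. ShadowedRule and lint_schema are hypothetical names, and the "missing primary key" rule is invented purely to exercise the schema_lint_fail{file, rule} example.

```python
from collections import Counter


class ShadowedRule:
    """Wrap any check so violations are counted before they are enforced."""

    def __init__(self, metric_name, check, enforce=False):
        self.metric_name = metric_name
        self.check = check            # returns a rule/reason string, or None
        self.enforce = enforce
        # Keyed by (actor_or_object, rule_or_reason).
        self.violations = Counter()

    def allow(self, actor_or_object, payload):
        rule = self.check(payload)
        if rule is None:
            return True
        self.violations[(actor_or_object, rule)] += 1
        return not self.enforce


def lint_schema(schema):
    # Invented rule for illustration only.
    if "id" not in schema:
        return "missing_primary_key"
    return None


linter = ShadowedRule("schema_lint_fail", lint_schema)
allowed_in_shadow = linter.allow("orders.sql", {"name": "text"})
```

Only the check function and the meaning of the first label change between the IAM, schema, and SQL cases; the count-then-maybe-block structure stays the same.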

Seen in
