Zalando — Rejecting Invalid Ingress Routes at Apply Time¶
Summary¶
Zalando runs Skipper as the default
Kubernetes ingress controller across 250+ clusters serving
15k+ Ingress objects, ~200k routes, and 500k–2M RPS. Skipper
extends the standard Ingress/RouteGroup model with
Skipper-specific predicates, filters, and backend syntax
that Kubernetes cannot validate — the API server happily accepts
syntactically valid YAML whose zalando.org/skipper-predicate
annotation references a predicate that doesn't exist, or whose
filter has the wrong argument type. At Zalando's scale, even 1%
invalid routes is real production risk: the manifest applies,
the broken route ships, and the failure surfaces later — in a
different place, to a different team, long after the change that
caused it. Roman Zavodskikh's 2026-04-08 post documents the fix:
a Kubernetes validating admission webhook
(ingress-admitter.teapot.zalan.do) that runs Skipper's own
filter registry, predicate specs, route validation, and backend
checks on the admission path, so kubectl apply rejects the
manifest immediately with an actionable error. The webhook
shipped behind the -enable-advanced-validation feature flag, was
rolled out tier-by-tier guided by a
skipper_route_invalid{route_id, reason} metric, and the
satisfying test came when engineers asked how to enable it and
the answer was "you don't need to — it's already on." The
implementation is in open-source Skipper v0.24.18.
Key takeaways¶
- "Does the string parse?" is the wrong admission-time question for routing. Standard Kubernetes admission controllers can only say whether a manifest is syntactically valid; they cannot know whether the Skipper-specific predicates and filters referenced inside annotations exist or accept the arguments given. The right question is "would Skipper accept this route?", which requires Skipper's own validation logic (Source: sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time).
- Reuse the production validator as the admission-time validator. Rather than reimplement a checker, Zalando ran the same filter registry, predicate specifications, route validation, and backend parsers that Skipper uses at request time inside the webhook handler. The canonical architectural move is to treat the production validation stack as a library and call it from the admission path (patterns/reuse-runtime-logic-on-admission-path).
- Failure stays attached to the change that caused it. Before the webhook, `kubectl apply` succeeded and the broken route was discovered later in the routing layer by whoever was watching support channels. After the webhook, the `apply` fails with the Skipper-specific error, so the developer who made the change fixes it immediately, with no runtime log archaeology (concepts/shift-left-validation).
- Actionable error text. The webhook propagates Skipper's own rejection message verbatim through Kubernetes's admission-denied response. Example: an `Ingress` with `zalando.org/skipper-predicate: NonExistingPredicate()` fails `kubectl apply` with "invalid 'zalando.org/skipper-predicate' annotation: unknown_predicate: predicate 'NonExistingPredicate' not found". The message tells the engineer which annotation is wrong and why, not just that admission was denied.
- Blast radius is on writes, not on traffic. A bad admission webhook doesn't drop live customer requests; it blocks `kubectl apply`. That sounds benign, but if the validator is wrong it means blocked CI/CD pipelines, delayed service updates, and engineers unable to ship changes fleet-wide. The risk class is control-plane-at-the-write-path, not data-plane (concepts/control-plane-change-blast-radius).
- Feature flag as fast rollback. `-enable-advanced-validation` sat in front of the new Skipper-specific validation so Zalando could turn it off per cluster if false-positive rejections started blocking valid configuration, without removing the webhook itself. During the rollout they did hit cases where valid routes were being rejected; the playbook was flag-off → fix the validator → flag-on again, not roll back the entire webhook (concepts/feature-flag-rollback-for-validator).
- `skipper_route_invalid{route_id, reason}` was the rollout signal. Before enabling the flag on the next tier, Zalando watched this metric to distinguish real configuration mistakes (which the webhook should reject) from false positives in the validator (which meant the webhook was wrong). The metric was disclosed by name, not just as a category (concepts/invalid-route-observability-metric).
- Invisible rollout is the ideal outcome for control-plane changes. Zavodskikh presented this solution at an internal Zalando conference; the first audience question was how teams could enable it in their clusters. The answer was that they didn't need to: it was already enabled. Canonical shape for rolling out fleet-wide validation: default-on, seamless for users who weren't writing invalid manifests (patterns/invisible-rollout-via-default-on-validation).
- Upstreamed, not Zalando-only. The validator lives in open-source Skipper (`github.com/zalando/skipper`) from v0.24.18 onward, so any Skipper operator can set `-enable-advanced-validation` in their own admission webhook deployment. Consistent with Zalando's pattern of keeping the Skipper ingress stack (routesrv, CHLB, bounded-load, admission validation) upstream rather than forked.
- Scale numbers as risk-framing. The post leads with "250+ Kubernetes clusters, 15k+ ingresses, ~200k routes, 500k-2M RPS" and with the framing that 1% invalid routes is still 2,000 broken routes. The case for the webhook isn't made in the abstract; it's made with the observation that at this scale, noise floors become production risk.
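The takeaways on validator reuse and actionable errors can be sketched as a small admission handler. This is a minimal illustration in Python, not Skipper's Go implementation: `KNOWN_PREDICATES`, `validate_predicates`, and `review` are hypothetical names, the registry holds only a few predicate names, and the regular expression stands in for Skipper's real route parser. Only the `AdmissionReview` request and response shape follows the Kubernetes `admission.k8s.io/v1` API.

```python
import re

ANNOTATION = "zalando.org/skipper-predicate"

# Hypothetical stand-in for Skipper's predicate registry (a small subset).
KNOWN_PREDICATES = {"Path", "PathSubtree", "Host", "Method", "Header"}

# Toy matcher for predicate names such as Path("/api"). A real webhook would
# call Skipper's own parser and registry instead of a regular expression.
PREDICATE_NAME = re.compile(r"([A-Za-z_]\w*)\s*\(")


def validate_predicates(expression):
    """Return an error string for the first unknown predicate, else None."""
    for name in PREDICATE_NAME.findall(expression):
        if name not in KNOWN_PREDICATES:
            return f"unknown_predicate: predicate '{name}' not found"
    return None


def review(admission_review):
    """Build an AdmissionReview response for an incoming AdmissionReview."""
    request = admission_review["request"]
    annotations = request["object"]["metadata"].get("annotations", {})
    response = {"uid": request["uid"], "allowed": True}

    error = validate_predicates(annotations.get(ANNOTATION, ""))
    if error is not None:
        response["allowed"] = False
        # The validator's own message is passed through to the client.
        response["status"] = {
            "message": f"invalid '{ANNOTATION}' annotation: {error}"
        }

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Passing an `Ingress` annotated with `NonExistingPredicate()` through `review` yields `allowed: False` and a status message of the same form as the one quoted from the post, which is what `kubectl apply` would surface to the engineer.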
Systems extracted¶
- Skipper: the HTTP router that Zalando ships as the K8s ingress proxy fleet-wide and that now also runs as the admission-webhook validator. The symmetry of "Skipper validates Skipper" is the whole architectural move: one codebase produces both the admission-time and request-time answers, so they cannot drift.
- Kubernetes: supplies the `ValidatingAdmissionWebhook` primitive. For matching `CREATE`/`UPDATE` operations on `Ingress`/`RouteGroup`, the API server sends an `AdmissionReview` to the webhook and honours its allow/deny verdict before persisting the object to etcd. The webhook mechanism is the substrate Zalando plugs Skipper's validator into.
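To make the "matching operations" step concrete, here is a rough Python mirror of a `ValidatingWebhookConfiguration` together with a simplified matcher. The field names follow `admissionregistration.k8s.io/v1`; the object name, the service reference, and the choice to cover both resources with one webhook entry are assumptions for illustration, since the post does not reproduce Zalando's manifest. `webhook_applies` is a hypothetical helper that ignores wildcards, scopes, and selectors, which the real API server also evaluates.

```python
# Python mirror of a ValidatingWebhookConfiguration manifest.
WEBHOOK_CONFIG = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingWebhookConfiguration",
    "metadata": {"name": "ingress-admitter"},  # hypothetical object name
    "webhooks": [{
        "name": "ingress-admitter.teapot.zalan.do",
        "admissionReviewVersions": ["v1"],
        "sideEffects": "None",
        "clientConfig": {
            # Hypothetical in-cluster service that serves the webhook.
            "service": {
                "namespace": "kube-system",
                "name": "skipper-admission",
                "path": "/ingresses",
            },
        },
        "rules": [
            {"apiGroups": ["networking.k8s.io"], "apiVersions": ["v1"],
             "operations": ["CREATE", "UPDATE"], "resources": ["ingresses"]},
            {"apiGroups": ["zalando.org"], "apiVersions": ["v1"],
             "operations": ["CREATE", "UPDATE"], "resources": ["routegroups"]},
        ],
    }],
}


def webhook_applies(config, group, version, resource, operation):
    """Simplified rule matching: exact list membership only."""
    for webhook in config["webhooks"]:
        for rule in webhook["rules"]:
            if (group in rule["apiGroups"]
                    and version in rule["apiVersions"]
                    and resource in rule["resources"]
                    and operation in rule["operations"]):
                return True
    return False
```

Under these rules a `CREATE` on an `Ingress` or an `UPDATE` on a `RouteGroup` is sent to the webhook, while a `DELETE` or a write to an unrelated resource bypasses it.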
Concepts extracted¶
- concepts/validating-admission-webhook: Kubernetes' `admissionregistration.k8s.io/v1` primitive for synchronous allow/deny decisions on write-path API operations. Zalando's canonical deployment is the webhook named `ingress-admitter.teapot.zalan.do`; the mutating and validating phases of admission control are covered in the Kubernetes docs.
- concepts/shift-left-validation: the general engineering stance of moving a correctness check from runtime to write-time (apply-time, build-time, commit-time). Zalando's webhook is the ingress-routing instance of this.
- concepts/feature-flag-rollback-for-validator: running a new validator behind a feature flag (`-enable-advanced-validation`) specifically so that reverting false-positive rejections takes a flag flip, not a code rollback. Sub-class of feature flag specialised for control-plane enforcement code.
- concepts/invalid-route-observability-metric: the `skipper_route_invalid{route_id, reason}` metric as the per-tier rollout gate. It distinguishes the validator's correct rejections (real user mistakes) from wrong rejections (validator bug). Generalised: before shipping fleet-wide enforcement, ship the observability of what would be rejected first.
- concepts/control-plane-change-blast-radius: specialised framing of concepts/blast-radius for control-plane changes. The worst failure mode of an overly strict admission webhook is not dropped customer requests (those keep flowing on the old routing table) but a frozen write path, with CI/CD and manual operator `kubectl apply` blocked fleet-wide. The risk class matters because it informs the rollback design (a feature flag is strongly preferred over a full rollback).
Patterns extracted¶
- patterns/reuse-runtime-logic-on-admission-path: run the same validator library the production data path uses when accepting/rejecting objects at admission time, so the two answers cannot drift. Zalando's canonical instance is Skipper validating Skipper; the generalised form applies to any controller + CRD stack with its own DSL beyond Kubernetes's built-in schema validation.
- patterns/invisible-rollout-via-default-on-validation: roll out fleet-wide validation such that teams writing valid manifests observe no difference. When users ask "how do I opt in?" the correct answer is "you don't." Tier-by-tier rollout, the advanced-validation feature flag, and the `skipper_route_invalid` observability metric together make this invisible rollout safe.
- patterns/feature-flagged-dual-implementation (existing): applied here to the validator itself. The webhook shipped with both the old basic-syntactic-check path and the new advanced-validation path available, toggled by `-enable-advanced-validation`. Existing pattern; this ingress-validator use is a new Seen-in.
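The flag-gated dual path and the rejection metric can be sketched together. This is a hedged illustration only: `basic_check`, `advanced_check`, `admit`, the reason strings, and the `route_invalid` counter are hypothetical, and the counter merely imitates the label set of `skipper_route_invalid{route_id, reason}`. Where Skipper actually records that metric is not described in this summary.

```python
from collections import Counter

# Hypothetical stand-in for a metric labelled by (route_id, reason).
route_invalid = Counter()

# Small illustrative subset of filter names, not Skipper's full registry.
KNOWN_FILTERS = {"setPath", "status", "inlineContent"}


def basic_check(route):
    """Old path: only checks that the route has an id and a backend."""
    if not route.get("id") or not route.get("backend"):
        return "missing_field"
    return None


def advanced_check(route):
    """New path: also rejects filters that are missing from the registry."""
    for name in route.get("filters", []):
        if name not in KNOWN_FILTERS:
            return "unknown_filter"
    return None


def admit(route, enable_advanced_validation):
    """Return True if the route is admitted; count rejections by reason."""
    reason = basic_check(route)
    if reason is None and enable_advanced_validation:
        reason = advanced_check(route)
    if reason is not None:
        route_invalid[(route.get("id", ""), reason)] += 1
        return False
    return True
```

With the flag off, a route that references an unknown filter is still admitted; with the flag on, the same route is rejected and counted. Turning the flag off again restores the old behaviour without redeploying the webhook, which is the rollback property the pattern relies on.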
Operational numbers¶
| Quantity | Value | Context |
|---|---|---|
| Kubernetes clusters | 250+ | Zalando fleet running Skipper |
| Ingress objects | 15k+ | Across the fleet |
| Eskip routes | ~200k | Derived from the Ingresses + RouteGroups |
| Peak RPS | 500k – 2M | Span of traffic envelope Skipper fronts |
| Invalid route tolerance | 1% → 2k broken | Zalando's framing that 1% at 200k is not background noise |
| Skipper release exposing the validator | v0.24.18 | Feature-flagged under -enable-advanced-validation |
| Admission webhook name | `ingress-admitter.teapot.zalan.do` | Per the reproduced error message in the post |
Caveats and gaps¶
- Webhook latency and availability not disclosed. The post does not quantify the P50/P99 latency that `ingress-admitter.teapot.zalan.do` adds to `kubectl apply`, or the availability / replica count / failure mode (`failurePolicy: Fail` vs `Ignore`) configured on the `ValidatingWebhookConfiguration`. For a webhook that gates all Ingress writes fleet-wide, these numbers matter for blast-radius analysis; their absence is a gap the Seen-in for concepts/validating-admission-webhook flags.
- No false-positive rate numbers. The post says they encountered cases where valid routes were rejected during rollout, but doesn't quantify how many or from which validator rules. The `-enable-advanced-validation` flag is confirmed to have been needed; the frequency of flag-offs is not disclosed.
- No quantified before/after on runtime incidents. The motivation ("people showed up in support channels asking why their requests don't work") is the qualitative before-state; the post does not claim a specific reduction in routing-layer incidents attributable to the webhook. The value proposition is framed as moving the feedback loop, not as measurable incident reduction.
- No comparison with server-side apply / Gatekeeper / Kyverno. Zalando chose to run Skipper's own validator in a webhook rather than encode the rules in a policy engine like Kyverno or OPA/Gatekeeper. The trade-off is acknowledged implicitly (the whole post's thesis is that reusing Skipper's validator is better than reimplementing the rules elsewhere) but a policy-engine alternative is not named.
- No discussion of `RouteGroup` vs `Ingress` differential validation. Skipper reads both; the post shows an `Ingress` example but doesn't enumerate whether the webhook treats the two resources identically or has RouteGroup-specific validation. Small gap for a future deep-dive.
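Why the undisclosed `failurePolicy` matters can be shown with a small simulation of the API server's decision when the webhook cannot be reached. This is a sketch of the documented Kubernetes semantics (`Fail` rejects the write, `Ignore` lets it through); `api_server_decision` and `unreachable` are hypothetical names, and nothing here reflects which setting Zalando actually uses.

```python
def api_server_decision(call_webhook, failure_policy):
    """Return True if the write is admitted.

    call_webhook: callable returning True (allow) or False (deny), or raising
    TimeoutError / ConnectionError when the webhook cannot be reached.
    failure_policy: "Fail" or "Ignore".
    """
    try:
        return call_webhook()
    except (TimeoutError, ConnectionError):
        # Fail: an unreachable webhook blocks the write.
        # Ignore: the write goes through without validation.
        return failure_policy == "Ignore"


def unreachable():
    """Simulate a webhook that does not answer in time."""
    raise TimeoutError("webhook did not answer in time")
```

With `Fail`, a webhook outage freezes Ingress writes, which is the write-path blast radius discussed earlier; with `Ignore`, writes continue but invalid routes can slip through during the outage. An explicit deny from a reachable webhook is honoured under either policy.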
Source¶
- Original: https://engineering.zalando.com/posts/2026/04/skipper-validating-admission-webhook.html
- Raw markdown: `raw/zalando/2026-04-08-rejecting-invalid-ingress-routes-at-apply-time-974a460a.md`
Related¶
- systems/skipper-proxy · systems/kubernetes · systems/zalando-route-server · companies/zalando
- concepts/validating-admission-webhook · concepts/shift-left-validation · concepts/feature-flag-rollback-for-validator · concepts/invalid-route-observability-metric · concepts/control-plane-change-blast-radius · concepts/feature-flag · concepts/blast-radius · concepts/control-plane-data-plane-separation
- patterns/reuse-runtime-logic-on-admission-path · patterns/invisible-rollout-via-default-on-validation · patterns/feature-flagged-dual-implementation · patterns/three-mode-rollout-off-shadow-exec