ZALANDO 2026-04-08

Zalando — Rejecting Invalid Ingress Routes at Apply Time

Summary

Zalando runs Skipper as the default Kubernetes ingress controller across 250+ clusters serving 15k+ Ingress objects, ~200k routes, and 500k–2M RPS. Skipper extends the standard Ingress/RouteGroup model with Skipper-specific predicates, filters, and backend syntax that Kubernetes cannot validate: the API server happily accepts syntactically valid YAML whose zalando.org/skipper-predicate annotation references a predicate that doesn't exist, or whose filter has the wrong argument type. At Zalando's scale, even 1% invalid routes (roughly 2,000 of ~200k) is a real production risk: the manifest applies, the broken route ships, and the failure surfaces later, in a different place, to a different team, long after the change that caused it.

Roman Zavodskikh's 2026-04-08 post documents the fix: a Kubernetes validating admission webhook (ingress-admitter.teapot.zalan.do) that runs Skipper's own filter registry, predicate specs, route validation, and backend checks on the admission path, so kubectl apply rejects the manifest immediately with an actionable error. The webhook shipped behind the -enable-advanced-validation feature flag and was rolled out tier by tier, guided by a skipper_route_invalid{route_id, reason} metric. The satisfying test came when engineers asked how to enable it and the answer was "you don't need to; it's already on." The implementation is in open-source Skipper v0.24.18.

Key takeaways

  1. "Does the string parse?" is the wrong admission-time question for routing. Standard Kubernetes admission controllers can only say whether a manifest is syntactically valid; they cannot know whether the Skipper-specific predicates and filters referenced inside annotations exist or accept the arguments given. The right question is "would Skipper accept this route?" — which requires Skipper's own validation logic (Source: sources/2026-04-08-zalando-rejecting-invalid-ingress-routes-at-apply-time).

  2. Reuse the production validator as the admission-time validator. Rather than reimplement a checker, Zalando ran the same filter registry, predicate specifications, route validation, and backend parsers that Skipper uses at request time inside the webhook handler. The canonical architectural move is to treat the production validation stack as a library and call it from the admission path (patterns/reuse-runtime-logic-on-admission-path).

  3. Failure stays attached to the change that caused it. Before the webhook, kubectl apply succeeded and the broken route was discovered later in the routing layer by whoever was watching support channels. After the webhook, the apply fails with the Skipper-specific error — the developer who made the change fixes it immediately, no runtime log archaeology (concepts/shift-left-validation).

  4. Actionable error text. The webhook propagates Skipper's own rejection message verbatim through Kubernetes's admission-denied response. Example: an Ingress with zalando.org/skipper-predicate: NonExistingPredicate() fails kubectl apply with "invalid 'zalando.org/skipper-predicate' annotation: unknown_predicate: predicate 'NonExistingPredicate' not found" — the message tells the engineer what annotation is wrong and why, not just that admission was denied.

  5. Blast radius is on writes, not on traffic. A bad admission webhook doesn't drop live customer requests; it blocks kubectl apply. That sounds benign, but if the validator is wrong it means blocked CI/CD pipelines, delayed service updates, and engineers unable to ship changes fleet-wide. The risk class is control-plane-at-the-write-path, not data-plane (concepts/control-plane-change-blast-radius).

  6. Feature flag as fast rollback. -enable-advanced-validation sat in front of the new Skipper-specific validation so Zalando could turn it off per-cluster if false-positive rejections started blocking real valid configuration, without removing the webhook itself. During the rollout they did hit cases where valid routes were being rejected; the playbook was flag-off → fix the validator → flag-on again, not roll back the entire webhook (concepts/feature-flag-rollback-for-validator).

  7. skipper_route_invalid{route_id, reason} was the rollout signal. Before flipping the flag tier-up, Zalando watched this metric to distinguish real configuration mistakes (which the webhook should reject) from false positives in the validator (which meant the webhook was wrong). The metric was disclosed by name, not just as a category (concepts/invalid-route-observability-metric).

  8. Invisible rollout is the ideal outcome for control-plane changes. Zavodskikh presented this solution at an internal Zalando conference; the first audience question was how teams could enable it in their clusters. The answer was that they didn't need to — it was already enabled. Canonical shape for rolling out fleet-wide validation: default-on, seamless for users who weren't writing invalid manifests (patterns/invisible-rollout-via-default-on-validation).

  9. Upstreamed, not Zalando-only. The validator lives in open-source Skipper (github.com/zalando/skipper) from v0.24.18 onward, so any Skipper operator can enable -enable-advanced-validation in their own admission webhook deployment. Consistent with Zalando's pattern of keeping the Skipper ingress stack (routesrv, CHLB, bounded-load, admission validation) upstream rather than forked.

  10. Scale numbers as risk-framing. The post leads with "250+ Kubernetes clusters, 15k+ ingresses, ~200k routes, 500k-2M RPS", and specifically with the framing that 1% invalid routes is still 2,000 broken routes. The case for the webhook isn't made in the abstract; it's made with the observation that at this scale, noise floors become production risk.
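
Takeaways 2 and 4 can be sketched together: the admission handler calls the router's own validator and passes its error text through unchanged. This is a minimal Python illustration, not Skipper's code (Skipper is written in Go). The registry contents, the regex-based parsing, and the function names are all assumptions made for the example; only the annotation name and the error wording come from the post, and the response envelope follows the admission.k8s.io/v1 AdmissionReview shape.

```python
import re

# Hypothetical stand-in for Skipper's predicate registry; the real registry
# lives in Skipper's Go code and is much larger.
KNOWN_PREDICATES = {"Path", "Host", "Method", "Header"}

PREDICATE_ANNOTATION = "zalando.org/skipper-predicate"

# Naive name extraction: any identifier directly followed by "(".
# It does not handle parentheses inside string arguments.
_PREDICATE_NAME = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def validate_predicates(expression):
    """Return None if every predicate is registered, else an error string."""
    for name in _PREDICATE_NAME.findall(expression):
        if name not in KNOWN_PREDICATES:
            return (
                f"invalid '{PREDICATE_ANNOTATION}' annotation: "
                f"unknown_predicate: predicate '{name}' not found"
            )
    return None


def review(admission_review):
    """Build an admission.k8s.io/v1 AdmissionReview response for one request."""
    request = admission_review["request"]
    annotations = request["object"].get("metadata", {}).get("annotations", {})
    error = validate_predicates(annotations.get(PREDICATE_ANNOTATION, ""))

    response = {"uid": request["uid"], "allowed": error is None}
    if error is not None:
        # The validator's message is passed through unchanged so the person
        # running kubectl apply sees which annotation is wrong and why.
        response["status"] = {"code": 400, "message": error}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


def make_request(uid, predicate):
    """Usage helper: a pared-down AdmissionReview request for an Ingress."""
    return {
        "request": {
            "uid": uid,
            "object": {
                "kind": "Ingress",
                "metadata": {"annotations": {PREDICATE_ANNOTATION: predicate}},
            },
        }
    }


denied = review(make_request("uid-1", "NonExistingPredicate()"))
allowed = review(make_request("uid-2", 'Path("/checkout") && Method("GET")'))
print(denied["response"]["status"]["message"])
print(allowed["response"]["allowed"])
```

The point of the design is that validate_predicates is the only place that knows what a valid route looks like; the handler around it is just transport.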

Systems extracted

  • Skipper — the HTTP router that Zalando ships as the K8s ingress proxy fleet-wide and that now also runs as the admission-webhook validator. The symmetry of "Skipper validates Skipper" is the whole architectural move: one codebase produces both the admission-time and request-time answers, so they cannot drift.
  • Kubernetes — supplies the ValidatingAdmissionWebhook primitive: for matching CREATE/UPDATE operations on Ingress / RouteGroup, the API server sends an AdmissionReview to the webhook and honours its allow/deny verdict before persisting the object to etcd. The webhook mechanism is the substrate Zalando plugs Skipper's validator into.
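
As a rough illustration of how the API server is told to consult the webhook, the registration object can be sketched as a Python dict and serialized to JSON (kubectl accepts JSON manifests). Only the webhook name is taken from the post. The object name, Service namespace, Service name, path, timeout, and failurePolicy value are assumptions for illustration; the post does not disclose them.

```python
import json

# Operations the webhook is consulted for, per the description above.
WRITE_OPERATIONS = ["CREATE", "UPDATE"]

webhook_config = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "ValidatingWebhookConfiguration",
    "metadata": {"name": "ingress-admitter"},  # hypothetical object name
    "webhooks": [
        {
            "name": "ingress-admitter.teapot.zalan.do",  # name from the post
            "admissionReviewVersions": ["v1"],
            "sideEffects": "None",
            "failurePolicy": "Fail",  # assumption: not disclosed in the post
            "timeoutSeconds": 5,  # assumption
            "clientConfig": {
                # Hypothetical Service coordinates. A real manifest also
                # needs a caBundle so the API server can verify TLS.
                "service": {
                    "namespace": "kube-system",
                    "name": "skipper-admission-webhook",
                    "path": "/validate",
                }
            },
            "rules": [
                {
                    "operations": WRITE_OPERATIONS,
                    "apiGroups": ["networking.k8s.io"],
                    "apiVersions": ["v1"],
                    "resources": ["ingresses"],
                },
                {
                    "operations": WRITE_OPERATIONS,
                    "apiGroups": ["zalando.org"],
                    "apiVersions": ["v1"],
                    "resources": ["routegroups"],
                },
            ],
        }
    ],
}

manifest = json.dumps(webhook_config, indent=2)
print(manifest)
```

The failurePolicy choice is the one that matters most for blast radius: Fail blocks matching writes when the webhook is unreachable, while Ignore lets them through unvalidated.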

Concepts extracted

  • concepts/validating-admission-webhook — Kubernetes' admissionregistration.k8s.io/v1 primitive for synchronous allow/deny decisions on write-path API operations. Zalando's canonical deployment is the webhook named ingress-admitter.teapot.zalan.do; the mutating and validating phases of Kubernetes admission control are covered in the Kubernetes docs.
  • concepts/shift-left-validation — the general engineering stance of moving a correctness check from runtime to write-time (apply-time, build-time, commit-time). Zalando's webhook is the ingress-routing instance of this.
  • concepts/feature-flag-rollback-for-validator — running a new validator behind a feature flag (-enable-advanced-validation) specifically so that false-positive rejections are a flag-flip away from reverting, not a code rollback. Sub-class of feature flag specialised for control-plane enforcement code.
  • concepts/invalid-route-observability-metric — the skipper_route_invalid{route_id, reason} metric as the per-tier rollout gate: distinguishes the validator's correct rejections (real user mistakes) from wrong rejections (validator bug). Generalised: before shipping fleet-wide enforcement, ship the observability of what would be rejected first.
  • concepts/control-plane-change-blast-radius — specialised framing of concepts/blast-radius for control-plane changes: the worst failure mode of an overly-strict admission webhook is not dropped customer requests (those keep flowing on the old routing table) but a frozen write path — CI/CD and manual operator kubectl apply blocked fleet-wide. The risk class matters because it informs the rollback design (feature flag ≫ full rollback).
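
The flag-as-rollback idea can be sketched as a validator with an always-on basic check and a stricter check that runs only when the flag is set. This is a hypothetical Python illustration: the flag name mirrors -enable-advanced-validation from the post, but the two check functions, the filter list, and the Verdict type are invented for the example and do not reflect Skipper's internals.

```python
import re
from dataclasses import dataclass

# Toy stand-in for a filter registry; not Skipper's actual list.
KNOWN_FILTERS = {"setPath", "setRequestHeader", "status"}

_NAME = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\s*\(")


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str = ""


def basic_check(expression):
    """Always-on syntactic check: parentheses must balance."""
    depth = 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth < 0:
            return Verdict(False, "unbalanced parentheses")
    if depth != 0:
        return Verdict(False, "unbalanced parentheses")
    return Verdict(True)


def advanced_check(expression):
    """Flag-gated semantic check: every filter name must be registered."""
    for name in _NAME.findall(expression):
        if name not in KNOWN_FILTERS:
            return Verdict(False, f"unknown filter '{name}'")
    return Verdict(True)


def admit(expression, enable_advanced_validation):
    """Run the basic check, then the advanced one only if the flag is on."""
    verdict = basic_check(expression)
    if verdict.allowed and enable_advanced_validation:
        verdict = advanced_check(expression)
    return verdict


route_filters = 'setPath("/v2") -> noSuchFilter()'
flag_off = admit(route_filters, enable_advanced_validation=False)
flag_on = admit(route_filters, enable_advanced_validation=True)
print(flag_off)
print(flag_on)
```

With the flag off, the webhook keeps running and keeps its basic protection, so a false positive in the advanced path can be switched off without unregistering the webhook.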

Patterns extracted

  • patterns/reuse-runtime-logic-on-admission-path — run the same validator library the production data path uses when accepting/rejecting objects at admission time, so the two answers cannot drift. Zalando's canonical instance is Skipper validating Skipper; generalised form applies to any controller + CRD stack with its own DSL beyond Kubernetes's built-in schema validation.
  • patterns/invisible-rollout-via-default-on-validation — roll out fleet-wide validation such that teams writing valid manifests observe no difference. When users ask "how do I opt in?" the correct answer is "you don't." Tier-by-tier rollout, advanced-validation feature flag, and the skipper_route_invalid observability metric together make this invisible rollout safe.
  • patterns/feature-flagged-dual-implementation (existing) — applied here to the validator itself: the webhook shipped with both the old basic-syntactic-check path and the new advanced-validation path available, toggled by -enable-advanced-validation. Existing pattern; this ingress-validator use is a new Seen-in.
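
One way to picture the metric-gated, tier-by-tier rollout: collect the label sets of skipper_route_invalid for a tier, triage each invalid route, and enable enforcement only when nothing untriaged remains. The sketch below is hypothetical. The metric and label names come from the post; the route ids, the second reason value, and the gate functions are invented, and the post does not describe Zalando's actual triage procedure.

```python
from collections import defaultdict

# Hypothetical label sets scraped from skipper_route_invalid{route_id, reason}
# for one rollout tier. Route ids and "invalid_filter_params" are invented.
tier_samples = [
    {"route_id": "shop_checkout_0", "reason": "unknown_predicate"},
    {"route_id": "shop_cart_0", "reason": "unknown_predicate"},
    {"route_id": "pay_api_0", "reason": "invalid_filter_params"},
]


def routes_by_reason(samples):
    """Group invalid route ids by rejection reason for human triage."""
    grouped = defaultdict(set)
    for sample in samples:
        grouped[sample["reason"]].add(sample["route_id"])
    return {reason: sorted(ids) for reason, ids in grouped.items()}


def untriaged_routes(samples, confirmed_user_errors):
    """Route ids not yet confirmed as genuine configuration mistakes."""
    return sorted({s["route_id"] for s in samples} - set(confirmed_user_errors))


def tier_ready(samples, confirmed_user_errors):
    """True once no observed rejection could still be a validator bug."""
    return not untriaged_routes(samples, confirmed_user_errors)


report = routes_by_reason(tier_samples)
ready_before = tier_ready(tier_samples, {"shop_checkout_0", "shop_cart_0"})
ready_after = tier_ready(
    tier_samples, {"shop_checkout_0", "shop_cart_0", "pay_api_0"}
)
print(report)
print(ready_before, ready_after)
```

Any route left in untriaged_routes is either a user mistake nobody has confirmed yet or a validator false positive, and in both cases enforcing on that tier would be premature.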

Operational numbers

| Quantity | Value | Context |
|---|---|---|
| Kubernetes clusters | 250+ | Zalando fleet running Skipper |
| Ingress objects | 15k+ | Across the fleet |
| Eskip routes | ~200k | Derived from the Ingresses + RouteGroups |
| Peak RPS | 500k – 2M | Span of traffic envelope Skipper fronts |
| Invalid route tolerance | 1% → 2k broken | Zalando's framing that 1% at 200k is not background noise |
| Skipper release exposing the validator | v0.24.18 | Feature-flagged under -enable-advanced-validation |
| Admission-webhook service name | ingress-admitter.teapot.zalan.do | Per the reproduced error message in the post |
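
The "1% → 2k broken" figure is plain arithmetic on the numbers above, worked here for reference. The per-cluster average is a rough derived figure, not one the post states, and since the fleet is "250+" clusters it is an upper bound on the average.

```python
TOTAL_ROUTES = 200_000  # "~200k routes"
CLUSTERS = 250          # "250+ clusters", so a lower bound on fleet size
INVALID_PERCENT = 1     # the post's 1% framing

# Integer arithmetic avoids floating-point rounding on the percentage.
broken_routes = TOTAL_ROUTES * INVALID_PERCENT // 100

# Rough fleet-wide average; real clusters will differ widely.
broken_per_cluster = broken_routes / CLUSTERS

print(broken_routes)       # 2000
print(broken_per_cluster)  # 8.0
```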

Caveats and gaps

  • Webhook latency and availability not disclosed. The post does not quantify the P50/P99 latency that ingress-admitter.teapot.zalan.do adds to kubectl apply, or the availability, replica count, and failure mode (failurePolicy: Fail vs Ignore) configured on the ValidatingWebhookConfiguration. For a webhook that gates all Ingress writes fleet-wide, these numbers matter for blast-radius analysis; the Seen-in entry for concepts/validating-admission-webhook flags their absence as a gap.
  • No false-positive rate numbers. The post says they encountered cases where valid routes were rejected during rollout, but doesn't quantify how many or from which validator rules. The -enable-advanced-validation flag is confirmed to have been needed; the frequency of flag-offs is not disclosed.
  • No quantified before/after on runtime incidents. The motivation — "people showed up in support channels asking why their requests don't work" — is the qualitative before-state; the post does not claim a specific reduction in routing-layer incidents attributable to the webhook. The value proposition is framed as moving the feedback loop, not as measurable incident reduction.
  • No comparison with server-side apply / Gatekeeper / Kyverno. Zalando chose to run Skipper's own validator in a webhook rather than encode the rules in a policy engine like Kyverno or OPA/Gatekeeper. The trade-off is acknowledged implicitly (the whole post's thesis is that reusing Skipper's validator is better than reimplementing the rules elsewhere) but a policy-engine alternative is not named.
  • No discussion of RouteGroup vs Ingress differential validation. Skipper reads both; the post shows an Ingress example but doesn't enumerate whether the webhook treats the two resources identically or has RouteGroup-specific validation. Small gap for a future deep-dive.
