
Zalando — Open Policy Agent in Skipper Ingress¶

Summary¶

Zalando integrated Open Policy Agent (OPA) as a library inside Skipper, their Go-based Kubernetes ingress proxy, to deliver Authorization as a Service to application teams across their 15,000-ingress / 5,000-routegroup / 2,000,000 rps fleet. Enabling OPA for an application is a single annotation (opaAuthorizeRequest("my-application")) on the Kubernetes Ingress — no new deployment, no YAML sprawl, no bespoke monitoring. Inside Skipper, one virtual OPA instance per referenced application coexists in the same Go process, sharing memory across routes and riding a grace-period GC buffer against high-frequency route churn. Policy bundles are authored in Rego inside application Git repos, published through Styra DAS (their commercial OPA control plane), and distributed via AWS S3 so the data-plane keeps running even when the control plane is down. Observability goes to Lightstep via two span paths — per-decision spans and control-plane spans — so decision IDs in traces link back into Styra DAS decision logs. The post ends with three explicit trade-offs (embedding latency vs OOM risk, OPA flexibility vs CPU cost, OPA by default vs on-demand bootstrap) and a differentiation against the vanilla Envoy OPA plugin: multiple virtual OPAs per deployment, and standalone HTTP-serving OPA for SPAs / legacy IAM migrations.

Key takeaways¶

Authorization as a platform capability. Platform engineers own how OPA runs (storage, telemetry, bundle distribution); app engineers own the policies themselves in Rego inside their own Git repos. "Enabling OPA for a specific application is as easy as just stating 'application X should be protected' without touching multiple YAML files, adding monitoring, and inheriting many more responsibilities to be compliant." This is the concepts/platform-team-vs-application-team-split applied to authorization — canonicalised as patterns/ingress-layer-authorization-offload (Source: sources/2024-12-05-zalando-open-policy-agent-in-skipper-ingress).
OPA embedded as a Go library in Skipper, not as a sidecar or separate deployment. "Embedding OPA directly within Skipper as a library ensures minimal latency in policy enforcement by keeping policy decisions local to the ingress data plane. It also is cost efficient compared to running an OPA deployment per application or as sidecars." Canonicalised as concepts/embedded-opa-library-in-proxy and patterns/embedded-opa-in-proxy (Source: sources/2024-12-05-zalando-open-policy-agent-in-skipper-ingress).
Multiple virtual OPA instances coexist within a single Skipper process, one per application referenced in any route. "Inside Skipper, we create one virtual OPA instance per application that is referenced in at least one of the routes. This allows us to re-use memory and also provides a buffer against high-frequency route changes by having a grace period for garbage collection." This is structurally distinct from the vanilla Envoy OPA plugin model of "one OPA process per application". Canonicalised as concepts/virtual-opa-instance-per-application and patterns/virtual-policy-instance-per-application (Source: sources/2024-12-05-zalando-open-policy-agent-in-skipper-ingress).
S3 is the data-plane policy-bundle source; Styra DAS is only the author / publish path. "To reduce the likelihood of outages due to an authorization infrastructure failure, we use AWS S3 and its availability promises as the source for policy bundles. Styra DAS, a commercial control plane for Open Policy Agent is used to source the bundles and publish them to S3. … This approach allows us to scale and fail-over despite failures of our OPA control plane and only depends on S3 being available." Canonicalised as patterns/s3-as-policy-bundle-source-for-availability — a direct instance of concepts/control-plane-data-plane-separation applied to authorization (Source: sources/2024-12-05-zalando-open-policy-agent-in-skipper-ingress).
Input-schema alignment with the upstream OPA Envoy plugin. "We chose to align closely with the OPA Envoy plugin's input structures to leverage existing documentation, examples, and training resources. This minimises the learning curve for our developers and keeps Zalando-isms at bay." Canonicalised as patterns/align-with-upstream-plugin-input-schema (Source: sources/2024-12-05-zalando-open-policy-agent-in-skipper-ingress).
Two OTel span paths for observability, both sent to Lightstep: one span per policy decision (decision ID, outcome, bundle name, OPA labels) and one per control-plane round-trip (bundle fetch, status and decision reporting). "This allows linking directly into the full decision as stored in Styra DAS but also allows capturing metrics right in Lightstep and only based on the traces."
Bounded data structures throughout to cap OOM risk from embedding. "We mitigated this by implementing strict limits on bundle size and also doing constrained memory consumption for advanced features like request body parsing. Telemetry like decision streaming and status reports also use bounded data structures to avoid memory exhaustion." Canonicalised as patterns/bounded-telemetry-data-structures-for-policy-engine.
Two features that distinguish the Skipper integration from the vanilla Envoy OPA plugin: (1) multiple virtual OPA instances per process, (2) OPA can serve HTTP responses independently of the target application — useful for "migrating existing legacy IAM services" and "supporting single-page applications (SPAs) that require precomputed authorization decisions or lists of permissions for the current users."
Developer-facing contract is one filter annotation. The full opt-in for an application:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    zalando.org/skipper-filter: |
      opaAuthorizeRequest("my-application")
  labels:
    application: my-application
  name: my-application
spec:
  rules:
  - host: zalando.example
    http:
      paths:
      - backend:
          service:
            name: my-application
            port:
              number: 8080
        pathType: ImplementationSpecific

The parameter is both the OPA policy bundle name and the registered application ID — bundle naming is piggybacked on Zalando's application registry, so platform-wide governance structure provides policy-structure discipline for free.

On-demand bootstrap, not OPA-by-default. "OPA is only enabled and bootstrapped only if at least one application uses OPA in a Kubernetes cluster and if the cluster is enabled to support OPA. Skipper instances which have OPA-enabled routes are generally scaled up to compensate for higher cpu consumption due to policy execution." — the cost of OPA is paid per-cluster only when someone actually uses it, and Skipper replica count is adjusted to absorb the CPU hit.

Systems extracted¶

Skipper — the Go-based Kubernetes ingress proxy that hosts the embedded OPA library; filter chain composition point for the opaAuthorizeRequest filter.
Open Policy Agent (OPA) — CNCF policy engine; embedded as a Go library inside Skipper.
Rego — OPA's Datalog-inspired policy language, used to author the per-application policy bundles.
Styra DAS — commercial OPA control plane (Styra Declarative Authorization Service) used to source + publish bundles; also receives decision + status logs.
AWS S3 — the data-plane source of policy bundles (not Styra DAS); chosen for its availability promises.
Envoy — referenced as the canonical OPA input structure to align with (Zalando does not deploy Envoy for ingress, since Skipper serves Kubernetes Ingress, but the integration matches the upstream OPA Envoy plugin's input contract).
Kubernetes Ingress Controller for AWS — provisions the AWS NLB in front of Skipper.
External DNS — DNS sync controller creating the A-record pointing to the NLB.
AWS NLB — fronts Skipper with TLS termination (referenced as part of the ingress deployment context).
Lightstep — observability backend; receives two kinds of OTel spans (policy decisions + control-plane traffic).
OpenTelemetry — span emission transport for both decision spans and control-plane spans.
Kubernetes — underlying orchestration; the Ingress + annotation is the developer contract.

Concepts extracted¶

concepts/authorization-as-a-service — authorization delivered as a platform capability where app teams supply only policies, not infrastructure ("application X should be protected" is the whole opt-in).
concepts/embedded-opa-library-in-proxy — running OPA as a library inside the ingress data-plane process, not as a sidecar or separate deployment. Minimises policy-decision latency + removes the per-app deployment cost tax.
concepts/virtual-opa-instance-per-application — multiple logically isolated OPA instances coexist within a single Go process, one per application, sharing runtime and benefitting from grace-period GC during route churn.
concepts/policy-bundle — OPA's distribution unit: a tarball of .rego policies + data, fetched from a remote source (here S3), periodically refreshed, versioned, and status-reported back to the control plane.
concepts/control-plane-data-plane-separation — already a wiki anchor; this ingest canonicalises the authorization instance: Styra DAS is control plane (author + publish), S3 is data-plane substrate, Skipper is enforcement point. Outage of the control plane does not impact enforcement.
concepts/externalised-authorization — policies live outside application code + outside application deploys; apps stop inheriting authorization responsibility.
concepts/platform-team-vs-application-team-split — generalisable axis: platform owns how authorization runs; app teams own the policies themselves. The filter parameter ("my-application") is the seam.

Patterns extracted¶

patterns/embedded-opa-in-proxy — fold OPA into the ingress proxy's process as a library rather than running it as a sidecar or separate deployment.
patterns/virtual-policy-instance-per-application — one policy-engine instance per tenant, all coexisting in a single host process with shared runtime + GC grace period against churn.
patterns/s3-as-policy-bundle-source-for-availability — point the policy-engine bundle loader at object storage (S3) so the data plane stays operational during control-plane outages; author-side publishes to the same bucket.
patterns/ingress-layer-authorization-offload — move authorization into the ingress filter chain so app services can be authorization-free; one-line Ingress annotation opts in.
patterns/align-with-upstream-plugin-input-schema — when building a variant integration of an OSS policy engine (or any plugin with a published input contract), match the canonical schema to inherit docs, training, and third-party examples.
patterns/bounded-telemetry-data-structures-for-policy-engine — cap every in-memory state path (bundles, decision streams, status reports, request-body inspection buffers) because the library shares the host's OOM fate.

Operational numbers¶

15,000 Ingresses across the Zalando fleet.
5,000 routegroups.
Up to 2,000,000 requests per second peak traffic.
80–90 % of traffic is authenticated service-to-service.
500,000 to 1,000,000 rps is the daily range of service-to-service traffic across the service fleet.

Architecture (from the post's two diagrams)¶

Skipper process layout — a single Skipper Go process hosts N routes. A subset of those routes references an opaAuthorizeRequest filter naming an application ID. The process starts one virtual OPA instance per referenced application ID; a virtual instance is a logically isolated runtime with its own policy bundle, its own labels, and its own decision log buffer. When a route is removed, the corresponding virtual instance is not destroyed immediately: a grace period absorbs high-frequency route churn, and garbage collection reaps unused instances later.
Control plane — policy authors in application Git repos push Rego bundles; Styra DAS is the commercial authoring / publish layer that writes the built bundle to an S3 bucket. Skipper virtual OPA instances poll S3 as their bundle source. Each virtual instance also emits status updates + decision logs back toward Styra DAS via its status / decision-log plugins, and OpenTelemetry spans toward Lightstep. If Styra DAS is down, S3 continues to serve bundles and Skipper continues to enforce — the only hard dependency in the data-plane is S3.

Trade-offs (explicit in the post)¶

Axis	Choice	Cost of choice	Mitigation
Latency vs memory	Embed OPA in proxy	OOM risk, blast radius coupled to proxy	Strict bundle-size limit, bounded request-body parsing, bounded decision / status buffers
Flexibility vs cost	General-purpose Rego engine	Higher CPU than bespoke token validation	Accept cost; bet that fine-grained policy value > CPU tax
Always-on vs on-demand	OPA bootstraps only when ≥1 app opts in AND cluster-level feature enabled	More bootstrap conditions to reason about	Scale Skipper replicas up on OPA-enabled routes to absorb CPU

Caveats / what the post does not say¶

No concrete latency overhead numbers for the OPA decision path are quoted: "minimal latency" is stated but not measured in the post.
No bundle size / policy-evaluation throughput SLOs are disclosed.
The opaAuthorizeRequest filter is described at schema level, but its exact Rego input payload (the Envoy-plugin-aligned input) is not reproduced in the post beyond the shape reference.
Multi-region / cross-region bundle distribution is implicit in the S3 choice but not discussed.
Rollout mechanics, soak period, and per-application opt-in history are not disclosed.
Primary use cases quoted: "employee- or partner facing applications and APIs where access models and authorization rules are generally more complex" — i.e. not the highest-volume customer checkout flows (this post is about where OPA is used, not about all Zalando authorization).

Source¶

systems/skipper-proxy — filter-chain host for the embedded OPA
systems/open-policy-agent — the policy engine being embedded
systems/rego — policy authoring language
systems/styra-das — commercial OPA control plane
systems/aws-s3 — data-plane bundle source
systems/envoy — upstream OPA Envoy plugin whose input schema is aligned
systems/lightstep — trace backend
concepts/authorization-as-a-service
concepts/embedded-opa-library-in-proxy
concepts/virtual-opa-instance-per-application
concepts/policy-bundle
concepts/control-plane-data-plane-separation
patterns/embedded-opa-in-proxy
patterns/virtual-policy-instance-per-application
patterns/s3-as-policy-bundle-source-for-availability
patterns/ingress-layer-authorization-offload
patterns/align-with-upstream-plugin-input-schema
patterns/bounded-telemetry-data-structures-for-policy-engine
companies/zalando