

Bounded telemetry data structures for policy engine

Problem

When you embed a policy engine inside an ingress proxy, the engine and the proxy share an OOM fate. Any unbounded in-memory path inside the engine becomes a data-plane incident. The three canonical unbounded paths are:

  1. Policy bundles. A policy author can publish a bundle of arbitrary size, and compiling an accidentally huge bundle can expand it by an arbitrary multiple in memory.
  2. Request-body parsing. Advanced policy features (JSON body predicates, XML, GraphQL) may parse and retain the incoming request body in memory.
  3. Telemetry buffers. Decision logs and status reports are streamed to the control plane asynchronously; if the consumer is slow or down, the queue grows without bound.

Solution

Cap every in-memory path with hard bounds, not soft warnings. Specifically:

  • Bundle-size limit at fetch time. Reject bundles exceeding the cap; fail closed on publish, not on enforcement.
  • Request-body-parse-size limit. When a policy needs to parse the body, cap the bytes the engine will read and retain per request. Beyond the cap, the policy sees a truncated body and can choose to fail closed (reject the request) or fall back to a body-less evaluation.
  • Telemetry bounded buffers. Decision-log and status-report queues have a fixed max size; on pressure, drop + increment a counter, rather than growing unbounded. The engine's own emission must not consume the proxy's memory headroom.

Why this must be a discipline, not an afterthought

Each of the three paths is a fan-in of outside-controlled input into in-process memory:

  • Bundle size is controlled by policy authors.
  • Request-body size is controlled by clients.
  • Telemetry queue depth is controlled by control-plane slowness.

Any of them can push the proxy over its memory limit and crash the data plane. In a traditional OPA-as-sidecar deployment this is only an OPA problem; in the embedded model, the sidecar is the ingress.

The Zalando quote

"Latency vs. Memory Consumption: Embedding OPA reduces latency but increases memory consumption, raising the risk of out-of-memory (OOM) issues. We mitigated this by implementing strict limits on bundle size and also doing constrained memory consumption for advanced features like request body parsing. Telemetry like decision streaming and status reports also use bounded data structures to avoid memory exhaustion."

Relationship to backpressure

The telemetry path is a textbook backpressure problem, inverted: the engine is the producer and the control plane is the consumer. When the consumer is slow, the engine must not absorb the backlog into its own memory. Bounded buffers with drop-on-overflow are the canonical answer; the alternative, an unbounded queue, is the canonical bug.
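The drop-on-overflow buffer falls out naturally from a non-blocking send into a fixed-capacity channel. A sketch in Go, assuming a hypothetical `boundedLogQueue` type (this is not OPA's actual plugin API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// decisionLog stands in for one decision-log event.
type decisionLog struct{ ID string }

// boundedLogQueue has a fixed capacity; on pressure it drops the event
// and increments a counter instead of growing without bound.
type boundedLogQueue struct {
	ch      chan decisionLog
	dropped atomic.Int64
}

func newBoundedLogQueue(capacity int) *boundedLogQueue {
	return &boundedLogQueue{ch: make(chan decisionLog, capacity)}
}

// Offer enqueues if there is room; otherwise it drops and counts.
// It never blocks the data-plane caller, so a slow or absent consumer
// cannot consume the proxy's memory headroom.
func (q *boundedLogQueue) Offer(ev decisionLog) bool {
	select {
	case q.ch <- ev:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

func (q *boundedLogQueue) Dropped() int64 { return q.dropped.Load() }

func main() {
	q := newBoundedLogQueue(2)
	for i := 0; i < 5; i++ {
		q.Offer(decisionLog{ID: fmt.Sprint(i)})
	}
	fmt.Println(q.Dropped()) // 3: capacity 2, five offers, no consumer
}
```

The drop counter is the observable half of the trade-off: it is what dashboards surface when audit gaps appear.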

Trade-offs

  • Dropped decision logs are gone for good, leaving audit gaps. Accept this trade-off explicitly, emit a drop counter, and surface it on dashboards.
  • Truncated-body policies degrade silently. Either fail closed on truncation (safer) or fail open with a counter (more available). Document the choice per policy.
  • Bundle-size caps require coordination with policy authors. If authors legitimately need large data-tables, support streaming / indexed reference data instead of a giant bundle.
