CONCEPT Cited by 1 source

X-Opaque-Id client attribution

Definition

X-Opaque-Id client attribution is the discipline of propagating a caller-supplied identifier on the HTTP X-Opaque-Id request header so that server-side observability surfaces (slow-query log, tracing, rejection log) can attribute each request to the caller service that sent it. The header is "opaque" to the server: the server doesn't interpret it, only logs it. That is exactly the property that makes it useful, because each client team can stamp its own taxonomy without coordinating with the server-side schema.

In Elasticsearch specifically, the slow-query log's X-Opaque-Id integration records the header value alongside each slow query so per-client slow-query rates and aggregation shapes become queryable by caller, not just by cluster-aggregate symptoms.
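A minimal client-side sketch, using only the Python standard library. The host http://localhost:9200, the index name products, and the helper name build_search_request are hypothetical; only the X-Opaque-Id header name and the _search endpoint come from Elasticsearch itself.

```python
import json
import urllib.request


def build_search_request(base_url: str, index: str, query: dict,
                         opaque_id: str) -> urllib.request.Request:
    """Build (but do not send) a search request stamped with the caller's ID."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The server only logs this value; it never interprets it.
            "X-Opaque-Id": opaque_id,
        },
    )


# Hypothetical host, index, and identifier, for illustration only.
req = build_search_request(
    "http://localhost:9200",
    "products",
    {"query": {"match_all": {}}},
    opaque_id="catalog-api-facets-brand-filter",
)
# Sending it would be: urllib.request.urlopen(req)
```

Most HTTP and Elasticsearch client libraries allow default headers to be set once per client, which is usually a better place for the identifier than each call site.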

Why it's load-bearing for self-inflicted-DoS debugging

Zalando's 2025-12-16 post-mortem lists the five reasons the incident's root cause hid from operators; two of them are direct consequences of missing per-caller attribution on Elasticsearch queries:

"Because the slow queries, while being monitored, were not being analyzed in depth. The team was focused on the overall cluster health and performance metrics, and the slow queries were just a symptom of the larger issue."

"Because the slow queries didn't have any specific tags or identifiers that would link them to the client application. They were just faceting queries, indistinguishable from any other faceting queries that might be executed by legitimate users."

The second quote is, in effect, a statement of the problem X-Opaque-Id solves. Zalando's explicit remediation:

"We extended the slow query logging to capture more details about the queries being executed, including client identifiers via the X-Opaque-Id request header. Based on that, we also extended the dashboards to monitor per-client slow query rates, and specifically aggregating queries and the aggregation sizes." (Source: sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search.)

The operational capability this unlocks — per-caller slow-query dashboards — is canonicalised as patterns/per-client-slow-query-dashboard.
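As a rough illustration of what feeds such a dashboard, the sketch below counts slow-log entries per caller and how many of them carried aggregations. It assumes JSON-formatted slow-log lines; the field names elasticsearch.slowlog.id and elasticsearch.slowlog.source are assumptions that should be checked against the Elasticsearch version in use, which is why they are parameters. The function name and the "<untagged>" bucket are hypothetical, and this is not Zalando's pipeline.

```python
import json
from collections import Counter


def per_client_slow_counts(lines,
                           id_field="elasticsearch.slowlog.id",
                           source_field="elasticsearch.slowlog.source"):
    """Return (total, with_aggs): slow-log entry counts keyed by caller ID.

    Field names are assumptions; adjust them to match your log format.
    """
    total, with_aggs = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        # Requests sent without the header land in a visible bucket.
        client = entry.get(id_field) or "<untagged>"
        total[client] += 1
        # The query body is assumed to be logged as a JSON string.
        source = entry.get(source_field, "")
        if '"aggs"' in source or '"aggregations"' in source:
            with_aggs[client] += 1
    return total, with_aggs
```

Keeping untagged requests in their own bucket makes the remaining attribution gap measurable instead of silently dropping those entries.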

What belongs in the header value

The header value is a hierarchical, greppable, caller-chosen identifier, similar to throttler client identity. Common shapes:

  • <service>-<operation> — e.g. catalog-api-facets-brand-filter
  • <service>/<pod>/<request-uuid> — for sub-service routing
  • <team>/<service>/<workload> — when the team ownership is the debug pivot

Load-bearing properties:

  1. Stable across retries of the same logical operation — so slow-query dashboards aggregate meaningfully.
  2. Distinct per logical caller — at least one axis must vary so (A.foo, A.bar) are distinguishable in the log.
  3. Opaque to the server — no parsing, no schema evolution coupling. Changing the value doesn't require a server change.
  4. Not a secret / PII — it ends up in the slow-query log and traces, which have different security postures than app data.
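The properties above can be partly enforced where the value is built. The following is a minimal sketch: the helper name make_opaque_id, the allowed character set, and the 128-character cap are all assumptions chosen for illustration, not limits imposed by Elasticsearch.

```python
import re

# Assumed charset: keeps values greppable and rejects things like e-mail
# addresses, a cheap guard against accidentally stamping PII.
_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")
MAX_LENGTH = 128  # hypothetical cap, not an Elasticsearch limit


def make_opaque_id(*segments: str, sep: str = "/") -> str:
    """Join hierarchical segments, e.g. team/service/workload.

    The result depends only on the segments passed in, so a retry of the
    same logical operation produces the same value.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    for segment in segments:
        if not _SEGMENT.fullmatch(segment):
            raise ValueError(f"invalid segment: {segment!r}")
    value = sep.join(segments)
    if len(value) > MAX_LENGTH:
        raise ValueError("identifier too long")
    return value
```

For example, make_opaque_id("search-team", "catalog-api", "facets") yields a value in the <team>/<service>/<workload> shape. Nothing here can prove a value is not a secret; the character check only catches the obvious cases.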

Scope of applicability

This concept is not Elasticsearch-specific. Any shared backend that:

  • serves many internal callers,
  • has a slow-query / slow-request log,
  • and cannot tell callers apart by IP (service mesh / NAT / shared egress),

has the same attribution gap and benefits from the same fix. Databases (MySQL set_var / SQL-comment tagging, see concepts/sqlcommenter-query-tagging), HTTP APIs (the conventional X-Request-Id header / trace IDs), and message-queue backends all have dialect-specific analogues.
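For the database analogue, a minimal sketch of SQL-comment tagging is shown below. It loosely follows the sqlcommenter convention (sorted keys, URL-encoded values in single quotes, comment appended to the statement); the function name tag_sql and the caller key are hypothetical, and statements ending in a semicolon are not handled.

```python
from urllib.parse import quote


def tag_sql(sql: str, tags: dict) -> str:
    """Append caller attribution to a SQL statement as a trailing comment."""
    pairs = []
    for key, value in sorted(tags.items()):
        encoded_key = quote(str(key), safe="")
        encoded_value = quote(str(value), safe="")
        pairs.append(f"{encoded_key}='{encoded_value}'")
    if not pairs:
        return sql  # nothing to attribute, leave the statement untouched
    return f"{sql} /*{','.join(pairs)}*/"
```

The comment travels with the statement into the database's slow-query log, giving the same per-caller pivot that X-Opaque-Id gives for Elasticsearch.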

Relationship to distributed-tracing IDs

  • Trace ID (W3C traceparent, Datadog x-datadog-trace-id) is unique per request: good for reconstructing one request's path, less useful for aggregate "which caller is slowest?" dashboards because every value is different.
  • X-Opaque-Id is per-logical-caller-class — designed to aggregate. Multiple requests from the same caller share the same value.
  • Both can coexist — the trace ID routes you from a dashboard pivot to a specific request; the opaque ID gives you the dashboard pivot in the first place.
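The contrast between the two identifiers can be sketched as a header builder: the opaque ID is passed in unchanged, while a fresh W3C traceparent is generated for each request. The function name request_headers is hypothetical, and in a real service the traceparent would normally come from the tracing library, not be generated by hand.

```python
import secrets


def request_headers(opaque_id: str) -> dict:
    """Headers for one outgoing request: stable caller ID plus a fresh trace ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    parent_id = secrets.token_hex(8)  # 16 hex characters
    return {
        # Same value for every request from this logical caller.
        "X-Opaque-Id": opaque_id,
        # W3C format: version-traceid-parentid-flags; differs on every call.
        "traceparent": f"00-{trace_id}-{parent_id}-01",
    }
```

Two calls with the same opaque_id agree on X-Opaque-Id and differ in traceparent, which is the aggregate-versus-individual split described in the list above.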

Seen in

  • sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search: canonical wiki instance. Zalando's Search & Browse team extended their Elasticsearch slow-query logging pipeline to capture X-Opaque-Id values from the calling services (Catalog API, NER query builder, internal analytics workloads, etc.) so per-client slow-query dashboards could flag caller anomalies (high per-query cost, sudden volume spike on aggregation queries) that cluster-aggregate dashboards missed. The follow-up dashboard specifically tracks "per-client slow query rates, and specifically aggregating queries and the aggregation sizes", which is exactly the cross-section the incident needed and didn't have.