PATTERN

Per-client slow-query dashboard


The pattern

A per-client slow-query dashboard breaks the slow-query log down by caller identity, surfacing "which caller produced how many slow queries, and of what shape" as a first-class observability primitive. Without this breakdown, a slow-query log is just a stream of "there are slow queries": useful for monitoring tail latency, useless for self-inflicted-DoS defence, because you cannot tell which caller is responsible.

The dashboard's load-bearing axes:

  1. Per-caller slow-query rate (slow queries per second, grouped by caller identity)
  2. Per-caller slow-query shape — for databases this means SQL templates; for Elasticsearch it means which aggregations run, and on which fields
  3. Per-caller aggregation size — when an aggregation query produces N buckets, N is itself a predictor of future pathology
  4. Trend / spike detection — a caller going from 2 slow-queries/min to 200 slow-queries/min is a clearer signal than the absolute rate
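The trend axis (4) can be sketched as a baseline-ratio check. This is an illustrative sketch, not from the source; the function name and the 10x factor are assumptions:

```python
def is_spike(recent_rates, current_rate, factor=10.0):
    """Flag a caller whose slow-query rate jumped well above its
    recent baseline (e.g. 2 slow-queries/min -> 200 slow-queries/min).

    recent_rates: per-interval slow-query rates for this caller.
    current_rate: the latest observed rate.
    """
    if not recent_rates:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(recent_rates) / len(recent_rates)
    return baseline > 0 and current_rate >= factor * baseline

# A caller averaging ~2 slow-queries/min that suddenly hits 200/min:
print(is_spike([2, 2, 3, 2], 200))  # → True
```

The point is that the alert keys on the ratio to the caller's own baseline, not on an absolute cluster-wide threshold.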

The enabling mechanism is per-request caller attribution on the query path, most often via X-Opaque-Id for Elasticsearch or SQL comment tags for relational databases.
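Both attribution mechanisms can be sketched in a few lines; the helper names here are hypothetical, and the `<service>-<workload>-<uuid>` shape follows the taxonomy described below:

```python
import uuid

def opaque_id(service: str, workload: str) -> str:
    """Build a caller identity for the X-Opaque-Id header,
    following the <service>-<workload>-<uuid> taxonomy."""
    return f"{service}-{workload}-{uuid.uuid4()}"

def tag_sql(sql: str, service: str, workload: str) -> str:
    """Prepend a comment tag so a relational slow-query log
    can attribute the statement to its caller."""
    return f"/* caller={service}-{workload} */ {sql}"

# Elasticsearch path: ship the identity as a request header.
headers = {"X-Opaque-Id": opaque_id("catalog-api", "facet-search")}

# Relational path: embed the identity in the statement text.
print(tag_sql("SELECT 1", "catalog-api", "facet-search"))
```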

Why the default slow-query log is insufficient

The default slow-query log in most backends (Elasticsearch, MySQL, PostgreSQL, MongoDB) records:

  • timestamp
  • query shape / text
  • execution time
  • shard / index / table

It does not record "who asked this." The failure mode is exactly what Zalando's Search & Browse team hit:

"Because the slow queries, while being monitored, were not being analyzed in depth. The team was focused on the overall cluster health and performance metrics, and the slow queries were just a symptom of the larger issue. Because the slow queries didn't have any specific tags or identifiers that would link them to the client application. They were just faceting queries, indistinguishable from any other faceting queries that might be executed by legitimate users."

sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search

The three-piece retrofit

Zalando's canonical deployment ships this pattern as three coordinated changes:

  1. Client side: every outbound request from a query-issuing service populates X-Opaque-Id: <service>-<workload>-<uuid> (or equivalent caller taxonomy).
  2. Backend side: slow-query log is configured to include the opaque-id field. For Elasticsearch, this uses the built-in X-Opaque-Id slow-query-log integration.
  3. Dashboard: the slow-query log ingestion pipeline (usually the observability platform's log shipper → time-series store) keys on the opaque-id field so dashboards can group and alert per caller.
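For step 2, Elasticsearch's search slow log is enabled per index by setting latency thresholds, and slow-log entries then carry the request's X-Opaque-Id. A sketch of the settings call (index name and threshold values are illustrative):

```
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "1s"
}
```

With thresholds in place, the ingestion pipeline in step 3 only needs to parse the opaque-id field out of each slow-log line.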

Zalando's specific commitment, verbatim:

"We extended the slow query logging to capture more details about the queries being executed, including client identifiers via the X-Opaque-Id request header. Based on that, we also extended the dashboards to monitor per-client slow query rates, and specifically aggregating queries and the aggregation sizes."

What the dashboard is for (and what it isn't)

For:

  • Detection of pathological callers — anomaly detection on per-caller slow-query rate, routed to the caller team's on-call.
  • Root-cause acceleration during incidents — during a cluster-saturation incident, the dashboard's "top N callers by slow-query count in the last 15 minutes" pivot identifies candidates within minutes.
  • Capacity-planning conversations — which teams are producing the most expensive queries, as input to shared-infrastructure cost allocation.
  • Query-review feedback loop — when a team's aggregation size crosses a threshold, auto-file a ticket back to that team with a link to the slow query.
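The incident pivot above reduces to a group-and-count over slow-log records keyed on the opaque id. A minimal sketch, assuming each ingested record exposes an `opaque_id` field in the `<service>-<workload>-<uuid>` shape:

```python
from collections import Counter

def top_callers(slow_log_records, n=5):
    """'Top N callers by slow-query count' pivot: strip the trailing
    per-request uuid and group entries by service + workload."""
    counts = Counter(rec["opaque_id"].rsplit("-", 1)[0]
                     for rec in slow_log_records)
    return counts.most_common(n)

records = [{"opaque_id": "catalog-api-facet-1"},
           {"opaque_id": "catalog-api-facet-2"},
           {"opaque_id": "analytics-batch-3"}]
print(top_callers(records))  # → [('catalog-api-facet', 2), ('analytics-batch', 1)]
```

In practice the time-series store does this grouping; the sketch only shows why the opaque-id taxonomy must put the stable caller identity in a fixed prefix.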

Not for:

  • Real-time admission control. Dashboards are diagnostic, not gates. Admission control is the job of patterns/application-side-query-limit-with-dynamic-threshold and patterns/cluster-wide-aggregation-guardrail. A dashboard watching the fire is not the same as a shutoff valve.
  • Attributing external traffic. The opaque-id is populated by callers inside the authenticated perimeter; external traffic won't have it unless an edge layer synthesises one.
  • Replacing tracing. Traces remain the right tool for single-request root-cause. The dashboard is for aggregate, caller-keyed patterns.

Interaction with the zebra-hunt playbook

The dashboard is a zebra-hunt accelerator. When the first-line-playbook horses have been eliminated, the dashboard's per-caller cross-section surfaces low-volume, high-cost callers that would otherwise be invisible in volume-based monitoring — exactly the failure mode Zalando hit.

Adjacent / sibling patterns

Seen in

  • sources/2025-12-16-zalando-the-day-our-own-queries-dosed-us-inside-zalando-search — canonical wiki instance. Follow-up action after the self-inflicted DoS incident. Zalando's Search & Browse team wired X-Opaque-Id through every calling service (Catalog API, NER query builder, and internal analytics / maintenance callers) into the Elasticsearch slow-query log, and built dashboards keyed on per-caller slow-query rate plus per-caller aggregation sizes. The operational commitment: callers that exhibit anomalous slow-query patterns become visible without waiting for a cluster-wide incident.