Metric-granularity mismatch¶
Metric-granularity mismatch is the observability failure mode where a dashboard or integration surfaces a metric at the wrong aggregation level for the question the operator is asking — most commonly, per-leaf (per-shard, per-worker, per-partition) timing masquerading as end-to-end user-visible latency.
In fan-out systems the gap can be enormous: a user query that fans out to N workers has latency max(worker latencies) + a coordinator tax, but the average worker latency looks impressively small. Reading the leaf metric as if it were the top-level one understates the real number by the fan-out ratio.
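A quick simulation makes the gap concrete. The numbers below are illustrative (an ~8 ms exponential per-worker latency and a small fixed coordinator tax are assumptions, not measurements from the source): the per-leaf average stays near 8 ms while the end-to-end figure is dominated by the maximum over 500 workers.

```python
import random

def fanout_latency(n_workers, coordinator_tax_ms=5.0):
    """One fan-out query: end-to-end latency is the slowest worker
    plus a fixed coordinator overhead (illustrative numbers)."""
    workers = [random.expovariate(1 / 8.0) for _ in range(n_workers)]  # ~8 ms mean
    return workers, max(workers) + coordinator_tax_ms

random.seed(0)
all_workers, totals = [], []
for _ in range(1000):
    w, total = fanout_latency(n_workers=500)
    all_workers.extend(w)
    totals.append(total)

avg_leaf = sum(all_workers) / len(all_workers)    # ~8 ms: the misleading metric
avg_total = sum(totals) / len(totals)             # max-of-500: several times larger
```

Reading avg_leaf as user-visible latency is exactly the mismatch: both numbers are "average query latency", at different granularities.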
Canonical instance (Figma, 2026)¶
Figma's DataDog integration reported an "average OpenSearch query" of 8 ms while the service's p99 was ~1 s. The 8 ms was the per-shard query time between coordinator and worker nodes; for Figma's configuration, up to ~500 per-shard queries fanned out per user query, so coordinator-view latency was ~150 ms avg / 200–400 ms p99 / 40 ms min — with min > DataDog's reported "max", which was the red flag.
Key observability nuance: OpenSearch does not emit overall query time at all in its metrics or logs. The only overall-latency field is the took value in the search API response body. Figma's fix was to parse took out of every search response and publish it as a custom metric.
(Source: sources/2026-04-21-figma-the-search-for-speed-in-figma)
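A minimal sketch of that fix, assuming only that the search response is the standard OpenSearch JSON body (which carries took in milliseconds) and that some metrics client exposes a callable to emit a value; the metric name and the emit callable here are hypothetical stand-ins, not Figma's actual code.

```python
def record_overall_latency(search_response: dict, emit):
    """Pull the overall query time (took, in ms) out of an OpenSearch
    search response body and publish it via `emit`, any
    callable(metric_name, value_ms) such as a StatsD histogram."""
    took_ms = search_response.get("took")
    if took_ms is not None:
        emit("opensearch.query.took_ms", took_ms)  # hypothetical metric name
    return took_ms

# Usage, with a list standing in for a real metrics client:
samples = []
record_overall_latency({"took": 142, "timed_out": False}, lambda name, v: samples.append((name, v)))
```

The point is simply that the ground-truth number lives in the response body, so the only place to capture it is the code path that already holds every response.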
Tell-tale signs¶
- Reported "max" is lower than observed "min". Your timing wrapper and their integration can't both be right.
- Averages look amazing, p99 looks terrible, and the ratio is suspiciously close to the fan-out width.
- Vendor integration surfaces internal-system telemetry verbatim rather than normalising to API-call granularity.
- Integration-level metrics and response-body-level fields disagree: OpenSearch took, gRPC trailers, Postgres pg_stat_statements.total_exec_time vs application spans.
How to avoid / detect it¶
- Sanity-check with two independent vantage points. Wrap API calls at the client boundary and compare. If the client says 150 ms and the "built-in" metric says 8 ms, the built-in is not measuring what you think.
- Read the docs for what each metric definition actually covers — "average query latency" in a fan-out system is usually "average worker latency."
- Publish a ground-truth latency metric yourself (from the response-body field that quotes overall time, or from a wrapping span). Don't rely on vendor-default dashboards for capacity planning.
- For fan-out systems, capture the fan-out width alongside the per-leaf latency — then at least the mis-scaled number is recoverable from the pair.
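On the last point: once you have the (per-leaf latency, fan-out width) pair, an order-of-magnitude recovery is possible even without the ground-truth metric. The sketch below assumes roughly exponential per-leaf latencies, for which the expected maximum of N draws is the mean times the N-th harmonic number; this is a crude model, not the source's method, but it turns a misleading 8 ms into the right ballpark.

```python
def estimate_end_to_end_ms(per_leaf_avg_ms: float, fanout_width: int) -> float:
    """Rough recovery of end-to-end latency from the per-leaf average:
    for exponential leaves, E[max of N] = mean * H_N (harmonic number)."""
    harmonic = sum(1.0 / k for k in range(1, fanout_width + 1))
    return per_leaf_avg_ms * harmonic

# Figma-style numbers: 8 ms per shard, ~500-way fan-out.
estimate = estimate_end_to_end_ms(8.0, 500)  # ~54 ms, far closer to the ~150 ms coordinator view than 8 ms is
```

The estimate still undershoots the measured coordinator average (queueing and the coordinator tax are not modeled), but it is recoverable from the pair, which the bare 8 ms is not.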
Adjacent concepts¶
- concepts/tail-latency-at-scale — why max-of-N dominates even when averages are fine. The granularity mismatch makes that math invisible.
- concepts/queueing-theory — per-layer queue vs end-to-end queue: same class of mistake at the observability layer.
- concepts/observability — broader umbrella.
Seen in¶
- sources/2026-04-21-figma-the-search-for-speed-in-figma — OpenSearch 8 ms (per-shard avg) vs 150 ms (coordinator avg) vs 1 s (API p99); hidden for months until the min-above-their-max signal tripped. Fix: publish took as a custom metric.