CONCEPT Cited by 1 source

Metric-granularity mismatch

Metric-granularity mismatch is the observability failure mode where a dashboard or integration surfaces a metric at the wrong aggregation level for the question the operator is asking — most commonly, per-leaf (per-shard, per-worker, per-partition) timing masquerading as end-to-end user-visible latency.

In fan-out systems the gap can be enormous: a user query that fans out to N workers has latency max(worker latencies) + a coordinator tax, but the average worker latency looks impressively small. Reading the leaf metric as if it were the top-level one can understate the real latency by an order of magnitude or more, and the gap grows with the fan-out width.
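A minimal simulation makes the gap concrete. The latency distribution and the 5 ms coordinator tax below are made-up illustrative values, not measurements from any real system:

```python
import random

def fanout_query(width, draw_leaf_latency_ms):
    """Simulate one user query fanning out to `width` leaves.
    End-to-end latency is gated by the slowest leaf, plus an
    assumed coordinator merge/serialisation cost."""
    leaves = [draw_leaf_latency_ms() for _ in range(width)]
    coordinator_tax_ms = 5.0
    return leaves, max(leaves) + coordinator_tax_ms

random.seed(0)
# Log-normal leaf latencies: most leaves are fast, a few are slow.
leaves, end_to_end = fanout_query(500, lambda: random.lognormvariate(1.5, 0.8))
avg_leaf = sum(leaves) / len(leaves)
print(f"avg per-leaf: {avg_leaf:.1f} ms, end-to-end: {end_to_end:.1f} ms")
```

The per-leaf average stays in the single-digit milliseconds while the user-visible number, dominated by the slowest of the 500 leaves, is many times larger.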

Canonical instance (Figma, 2026)

Figma's DataDog integration reported an "average OpenSearch query" of 8 ms while the service's p99 was ~1 s. The 8 ms was the per-shard query time between coordinator and worker nodes; for Figma's configuration, up to ~500 per-shard queries fanned out per user query, so coordinator-view latency was ~150 ms avg / 200–400 ms p99 / 40 ms min — with min > DataDog's reported "max", which was the red flag.

Key observability nuance: OpenSearch does not emit overall query time at all in its metrics or logs. The only overall-latency field is the took value in the query API response body. Figma's fix was to parse took out of every search response and publish it as a custom metric. (Source: sources/2026-04-21-figma-the-search-for-speed-in-figma)
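The fix can be sketched as a thin wrapper around the search call. Everything here is an illustrative stand-in: emit_metric and do_search are hypothetical placeholders for a metrics client and a search client, not Figma's code or the opensearch-py API:

```python
def emit_metric(name, value_ms):
    """Hypothetical metrics client; emits a statsd-style timing line."""
    print(f"{name}:{value_ms}|ms")

def search_with_took_metric(do_search, index, body):
    """Run a search, then publish the response body's overall `took`
    (milliseconds, as OpenSearch reports it) as a custom metric."""
    response = do_search(index=index, body=body)  # raw OpenSearch JSON
    took_ms = response["took"]                    # overall query time
    emit_metric("opensearch.query.took", took_ms)
    return response

# Usage with a fake search client that returns a canned response.
fake_search = lambda index, body: {"took": 142, "hits": {"total": {"value": 1}}}
resp = search_with_took_metric(fake_search, "files", {"query": {"match_all": {}}})
```

The point is that the ground-truth number lives only in the response body, so the application layer has to lift it into the metrics pipeline itself.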

Tell-tale signs

  • Reported "max" is lower than observed "min". Your timing wrapper and their integration can't both be right.
  • Averages look amazing, p99 looks terrible, and the ratio is suspiciously close to the fan-out width.
  • Vendor integration surfaces internal-system telemetry verbatim rather than normalising to API-call granularity.
  • Search-, query-, or RPC-level metrics disagree with response-body-level fields (OpenSearch took, gRPC trailers, DB pg_stat_statements.total_exec_time vs application spans).
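The first two signs can be checked mechanically. This is an illustrative helper, not any vendor's API; the 5x threshold and the sample numbers (chosen in the spirit of the Figma incident above) are assumptions:

```python
def granularity_red_flags(reported, observed):
    """Compare a vendor-reported latency summary with measurements
    taken at your own client boundary. Both arguments are dicts
    with 'min', 'avg', and 'max' keys, in milliseconds."""
    flags = []
    if reported["max"] < observed["min"]:
        flags.append("reported max below observed min: "
                     "likely a different aggregation level")
    ratio = observed["avg"] / reported["avg"]
    if ratio > 5:
        flags.append(f"observed avg is {ratio:.0f}x reported avg: "
                     "possibly a per-leaf metric")
    return flags

vendor = {"min": 1, "avg": 8, "max": 30}     # integration's view
client = {"min": 40, "avg": 150, "max": 400}  # your wrapping span
for flag in granularity_red_flags(vendor, client):
    print("WARNING:", flag)
```

Both warnings fire on these numbers, which is exactly the "can't both be right" situation described above.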

How to avoid / detect it

  • Sanity-check with two independent vantage points. Wrap API calls at the client boundary and compare. If the client says 150 ms and the "built-in" metric says 8 ms, the built-in is not measuring what you think.
  • Read the docs for what each metric definition actually covers — "average query latency" in a fan-out system is usually "average worker latency."
  • Publish a ground-truth latency metric yourself (from the response-body field that quotes overall time, or from a wrapping span). Don't rely on vendor-default dashboards for capacity planning.
  • For fan-out systems, capture the fan-out width alongside the per-leaf latency — then at least the mis-scaled number is recoverable from the pair.
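The last point can be as simple as tagging the per-leaf metric with its granularity and fan-out width. The metric name, tag names, and emit callback below are all hypothetical:

```python
def publish_leaf_metric(emit, leaf_ms, fanout_width):
    """Emit a per-leaf timing together with the fan-out width, so a
    dashboard reader can tell the number is per-leaf and recover a
    sense of the end-to-end scale from the (latency, width) pair.
    `emit` is a stand-in for a metrics client taking (name, value, tags)."""
    emit("search.leaf_latency_ms", leaf_ms,
         tags={"granularity": "per-leaf", "fanout_width": fanout_width})

# Usage with a capturing callback in place of a real metrics client.
captured = []
publish_leaf_metric(lambda name, value, tags: captured.append((name, value, tags)),
                    leaf_ms=8.0, fanout_width=500)
```

An 8 ms value tagged fanout_width=500 is much harder to misread as end-to-end latency than a bare "average query: 8 ms".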

Adjacent concepts

Seen in
