
CONCEPT

Custom histogram buckets

Definition

Custom histogram buckets are explicit bucket boundaries passed to a metrics histogram instrument, overriding the library default, to match the expected distribution of the measured quantity. For OpenTelemetry JS, this is declared via the SDK's view API or the newer MetricAdvice.explicitBucketBoundaries at histogram-creation time.

Why defaults fail for some workloads

Histogram instruments record how many observations fall into each bucket range. OTel JS ships a default bucket layout tuned for server-side latency in milliseconds: many observations per second, mostly in the 0-10000 ms range. The default set is [0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000] (Source: OTel JS Aggregation.ts).

When the measured quantity has a range that differs significantly from 0-10000 ms, the default buckets produce skewed distributions: most observations collapse into one or two buckets, and percentile queries over the histogram return meaningless results.

Canonical failure case: Core Web Vitals

LCP typically lives between 600 ms and 2000 ms. With OTel defaults, that entire range spans only three buckets (500-750, 750-1000, 1000-2500 ms), and everything above 1000 ms collapses into the single 1000-2500 ms bucket, so p50 and p75 are often indistinguishable. CLS is unitless, 0-1; every CLS observation lands in the first bucket (0-5), giving zero resolution.
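To make the collapse concrete, here is a self-contained sketch that maps observations onto the default boundaries the way an explicit-bucket histogram does. The sample values are synthetic and purely illustrative; `bucketIndex` is a hypothetical helper, not an OTel API:

```typescript
// Default OTel JS explicit bucket boundaries (from Aggregation.ts).
const DEFAULT_BOUNDARIES = [
  0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000,
];

// Bucket i counts values v with boundaries[i-1] < v <= boundaries[i];
// values above the last boundary go into one overflow bucket.
function bucketIndex(boundaries: number[], value: number): number {
  for (let i = 0; i < boundaries.length; i++) {
    if (value <= boundaries[i]) return i;
  }
  return boundaries.length;
}

// Synthetic LCP-like samples spread across the 600-2000 ms band.
const lcpSamples = [620, 780, 900, 1050, 1200, 1400, 1600, 1850, 1990];
const counts = new Map<number, number>();
for (const v of lcpSamples) {
  const i = bucketIndex(DEFAULT_BOUNDARIES, v);
  counts.set(i, (counts.get(i) ?? 0) + 1);
}
// Nine samples land in just three buckets; the six above 1000 ms all
// share the single (1000, 2500] bucket, so p50 vs p75 is unrecoverable.
console.log([...counts.entries()]);
```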

Zalando's solution

Declare per-metric buckets via OTel's view API (Source: sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando). Excerpt from the post, showing the approach:

const metricBuckets = {
  fcp: [0, 100, 200, 300, 350, 400, 450, 500, 550, 650, 750,
        850, 900, 950, 1000, 1100, 1200, 1500, 2000, 2500, 5000],
  lcp: [/* ~32 buckets, dense around the 1000-2000 ms band */],
  cumulativeLayoutShift: [0, 0.025, 0.05, 0.075, 0.1, 0.125,
    0.15, 0.175, 0.2, 0.225, 0.25, 0.275, 0.3, 0.35, 0.4, 0.45,
    0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95,
    1, 1.25, 1.5, 1.75, 2],
  // ...
};
  • FCP buckets: denser around 300-1000 ms, where most of the observed distribution sits.
  • CLS buckets: 0.025 steps below 0.25 (good ≤0.1, needs improvement ≤0.25 per Google guidelines), coarser above.
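Wiring such boundaries in goes through the view API or MetricAdvice, as noted in the definition above. A minimal configuration sketch, assuming the `@opentelemetry/sdk-metrics` 1.x `View` class shape (the config object changed in sdk-metrics 2.x); the instrument names and the abbreviated boundary array are hypothetical, not from the post:

```typescript
import {
  MeterProvider,
  View,
  ExplicitBucketHistogramAggregation,
} from '@opentelemetry/sdk-metrics';

// Abbreviated CLS boundaries for illustration; in practice use the full
// array from the post's metricBuckets.cumulativeLayoutShift.
const clsBuckets = [0, 0.025, 0.05, 0.075, 0.1, 0.15, 0.25, 0.5, 1];

// Option 1: override the aggregation via a view at SDK setup time.
const provider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'web_vitals.cls', // hypothetical instrument name
      aggregation: new ExplicitBucketHistogramAggregation(clsBuckets),
    }),
  ],
});

// Option 2: hint the boundaries at histogram-creation time via
// MetricAdvice (@opentelemetry/api >= 1.7).
const meter = provider.getMeter('web-vitals');
const clsHistogram = meter.createHistogram('web_vitals.cls', {
  advice: { explicitBucketBoundaries: clsBuckets },
});
```

The view wins if both are present; advice is only a hint the SDK applies when no view overrides the aggregation.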

Derivation strategies

The post doesn't specify how the buckets were derived, but reasonable strategies include:

  • Observed-distribution fit — pull p10/p50/p90 from pre-rollout data, pack buckets densely around those values.
  • Threshold-aligned — match Google Web Vitals good / needs-improvement / poor thresholds (LCP at 2500/4000 ms; CLS at 0.1/0.25).
  • Log-spaced near p50 — traditional for latency distributions.
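The log-spaced strategy can be sketched as a small helper. Everything here is an assumption for illustration: the function name, the 4x span on either side of the median, and the example p50 are not from the post:

```typescript
// Hypothetical helper: derive log-spaced (geometric) bucket boundaries
// centered on an observed median, covering p50/spanFactor .. p50*spanFactor.
function logSpacedBuckets(
  p50: number,
  bucketsPerSide: number,
  spanFactor = 4,
): number[] {
  const lo = p50 / spanFactor;
  const hi = p50 * spanFactor;
  // Constant ratio between consecutive boundaries.
  const step = Math.pow(hi / lo, 1 / (2 * bucketsPerSide));
  const out: number[] = [];
  for (let i = 0; i <= 2 * bucketsPerSide; i++) {
    out.push(Math.round(lo * Math.pow(step, i)));
  }
  return out;
}

// e.g. an LCP-like median of 1200 ms -> 17 boundaries from 300 to 4800 ms,
// densest (in absolute terms) near the low end, with p50 as the midpoint.
console.log(logSpacedBuckets(1200, 8));
```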

Alternative: events API

For single-value-per-event metrics (like CWV), OTel's events API may be more appropriate than any histogram bucket arrangement. Zalando flagged this at KubeCon Paris but hasn't migrated.
