Custom histogram buckets¶
Definition¶
Custom histogram buckets are explicit bucket boundaries
supplied to a metrics histogram instrument in place of the
library default, chosen to match the expected distribution of
the measured quantity. In OpenTelemetry JS they are declared
either through the SDK's view API or through the newer
MetricAdvice.explicitBucketBoundaries
hint at histogram-creation time.
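Both declaration styles can be sketched as follows. This is a sketch, not Zalando's code: the meter and instrument names and the boundary values are illustrative, and the class-based View shown here is the @opentelemetry/sdk-metrics 1.x shape (2.x replaced the View class with plain config objects).

```typescript
import {
  MeterProvider,
  View,
  ExplicitBucketHistogramAggregation,
} from '@opentelemetry/sdk-metrics';

// Illustrative boundaries, dense where LCP values actually land.
const lcpBoundaries = [0, 500, 750, 1000, 1250, 1500, 2000, 2500, 4000];

// Option 1: the view API. The SDK re-buckets the matched instrument
// at aggregation time, regardless of how the instrument was created.
const provider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'lcp', // hypothetical instrument name
      aggregation: new ExplicitBucketHistogramAggregation(lcpBoundaries),
    }),
  ],
});

// Option 2: MetricAdvice. The instrument suggests its own boundaries;
// the SDK uses them unless a view overrides the aggregation.
const meter = provider.getMeter('web-vitals'); // hypothetical meter name
const lcp = meter.createHistogram('lcp', {
  unit: 'ms',
  advice: { explicitBucketBoundaries: lcpBoundaries },
});
lcp.record(1234);
```

The advice form travels with the instrument (useful for instrumentation libraries), while views let the application operator override buckets centrally without touching instrumentation code.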
Why defaults fail for some workloads¶
Histogram instruments count how many observations fall into
each bucket range. The default bucket layout is tuned for
server latency recorded in milliseconds over time: many
observations per second, mostly between 0 and 10000 ms.
OTel JS's default set is
[0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]
(Source:
OTel JS Aggregation.ts).
When the measured quantity's range differs significantly from
0-10000 ms, the default buckets produce skewed
distributions: most observations collapse into one or two
buckets, and percentile queries over the histogram return
meaningless results.
Canonical failure case: Core Web Vitals¶
LCP typically lives between
600 ms and 2000 ms. With the OTel defaults, everything from
1000 to 2500 ms lands in a single bucket, so p50
and p75 are often indistinguishable.
CLS is unitless, typically between 0 and 1; every CLS
observation falls into the first default bucket (0-5), so the
histogram has zero resolution.
Zalando's solution¶
Declare per-metric buckets via OTel's view API (Source:
sources/2024-07-28-zalando-opentelemetry-for-javascript-observability-at-zalando).
Excerpt from the post, showing the approach:
const metricBuckets = {
fcp: [0, 100, 200, 300, 350, 400, 450, 500, 550, 650, 750,
850, 900, 950, 1000, 1100, 1200, 1500, 2000, 2500, 5000],
lcp: [/* ~32 buckets, dense around the 1000-2000 ms band */],
cumulativeLayoutShift: [0, 0.025, 0.05, 0.075, 0.1, 0.125,
0.15, 0.175, 0.2, 0.225, 0.25, 0.275, 0.3, 0.35, 0.4, 0.45,
0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95,
1, 1.25, 1.5, 1.75, 2],
// ...
};
- FCP buckets: denser around 300-1000 ms where the good / needs-improvement boundaries live.
- CLS buckets: 0.025 steps below 0.25 (good ≤0.1, needs improvement ≤0.25 per Google guidelines), coarser above.
Derivation strategies¶
The post doesn't specify the derivation method, but reasonable strategies are:
- Observed-distribution fit — pull p10/p50/p90 from pre-rollout data, pack buckets densely around those values.
- Threshold-aligned — match Google Web Vitals good / needs-improvement / poor thresholds (LCP at 2500/4000 ms; CLS at 0.1/0.25).
- Log-spaced near p50 — traditional for latency distributions.
Alternative: events API¶
For single-value-per-event metrics (like CWV), OTel's events API may be more appropriate than any histogram bucket arrangement. Zalando flagged this at KubeCon Paris but hasn't migrated.