
Metric sampling interval

Definition

Metric sampling interval is the period between successive measurements of a metric. A metric captured once per second has a sampling interval of 1 s; a metric captured on demand has a sampling interval bounded by the query frequency.

For signals used in control loops (throttlers, autoscalers, alert evaluators), the sampling interval is a parameter of the control loop, not just of the observability system. Too-long intervals cause stale decisions; too-short intervals cost CPU / network / storage.

The stale-sample problem

If the throttler's decision is based on a sample that is T seconds old, the throttler is controlling the system as it was T seconds ago — not as it is now. With long intervals:

  1. Miss the uptick. Load spike happens between samples; system degrades for ~T seconds before throttler engages.
  2. Miss the recovery. When the metric clears, the throttler keeps blocking until the next sample captures the new value — wasting ~T seconds of available capacity.
  3. Worked example (Noach, Anatomy of a Throttler, part 1): heartbeat injected at 12:00:00.000; a sample at 12:00:00.995 captures that heartbeat; a client checking at 12:00:01.990 gets a response from a cached sample that is itself ~1 s old and that measured a ~1 s-old heartbeat — the throttler is acting on a picture of reality that is ~2 s stale.
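The arithmetic in the worked example can be sketched directly (a minimal illustration; the variable names are mine, not from any throttler codebase):

```python
# Timestamps from the worked example, in seconds relative to 12:00:00.000.
heartbeat_ts = 0.000   # primary injects the heartbeat
sample_ts = 0.995      # monitor samples the heartbeat table
check_ts = 1.990       # client asks the throttler for a verdict

metric_age_at_sample = sample_ts - heartbeat_ts  # ~1 s: sample already lags
sample_age_at_check = check_ts - sample_ts       # ~1 s: cached value has aged
effective_staleness = check_ts - heartbeat_ts    # ~2 s behind reality

print(effective_staleness)  # 1.99
```

The two ~1 s delays are independent and additive, which is why tightening only one of them does not fix the staleness.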

The jitter problem

Even with a 1-s interval, the phase of the sample against the event stream matters. A heartbeat injected at t=0.000 and sampled at t=0.995 is ~1 s behind reality at the moment of sampling, and ~2 s behind by the time a client reads the cached value.

Granularity is bounded by the slowest step

For replication-lag measurement (concepts/replication-lag), the effective granularity is the largest of:

  • The heartbeat-injection interval (how often the primary writes a detectable event),
  • The sampling interval (how often the replica monitors the heartbeat table),
  • The staleness of the last read (how recently the throttler refreshed its cached value).

All three must be tightened to tighten the overall observability latency.
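The slowest-step rule can be stated as a one-line function (a sketch; the function and parameter names are illustrative):

```python
def effective_granularity(heartbeat_interval_s: float,
                          sampling_interval_s: float,
                          cache_staleness_s: float) -> float:
    """Effective observability granularity is bounded by the slowest of
    the three steps: heartbeat injection, sampling, and cached-read age."""
    return max(heartbeat_interval_s, sampling_interval_s, cache_staleness_s)

# Tightening one step is wasted while another dominates:
print(effective_granularity(0.1, 1.0, 2.0))  # 2.0 — the stale cache wins
```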

The oversampling rule of thumb

Networking-hardware design carries a rule of thumb: sample at 2–5× the rate of the signal you care about to avoid aliasing. For throttler design, this becomes:

"If the acceptable replication lag is at 5 seconds, then it's best to have a heartbeat/sampling interval of 1–2 seconds." — Noach

See concepts/oversampling-metric-interval for the full articulation.
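The rule of thumb translates to a trivial helper (illustrative only; the default factor of 4 is my assumption within the stated 2–5× range):

```python
def sampling_interval_for(acceptable_lag_s: float,
                          oversample_factor: float = 4.0) -> float:
    """Pick a heartbeat/sampling interval by oversampling the signal of
    interest, per the 2-5x networking rule of thumb."""
    return acceptable_lag_s / oversample_factor

print(sampling_interval_for(5.0))  # 1.25 — within the quoted 1-2 s guidance
```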

The release-thundering-herd problem

Long sampling intervals plus multiple throttled jobs plus a shared metric cause synchronised release:

T:  0s     metric > threshold, all jobs blocked
T:  5s     metric drops, sample still shows old value
T: 10s     next sample fires; all jobs see "clear" simultaneously
T: 10s+ε   all jobs push concurrent subtasks; metric spikes
T: 10s+2ε  all jobs blocked again

Shorter intervals smooth this out by giving different jobs slightly different pictures of the metric at their individual check moments, desynchronising the release burst.
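A minimal sketch of why the release synchronises (an assumed model, not any particular throttler: each job acts on the most recent sample before its own check time):

```python
sampling_interval = 10                            # long interval, seconds
job_check_times = [10.1, 10.3, 10.7, 11.4, 12.9]  # five throttled jobs

# Epoch of the last sample each job observed at its check moment:
coarse = [round(t - t % sampling_interval, 6) for t in job_check_times]
fine = [round(t - t % 1, 6) for t in job_check_times]  # 1 s interval

print(coarse)  # [10.0, 10.0, 10.0, 10.0, 10.0] -> all release together
print(fine)    # [10.0, 10.0, 10.0, 11.0, 12.0] -> releases staggered
```

With the 10 s interval, every check between samples sees the identical cached value, so all five jobs release at the same tick; with the 1 s interval, jobs checking at different moments see different sample epochs and the burst spreads out.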

Cost of shorter intervals

  • Observability storage scales linearly with sampling rate.
  • Monitor-side CPU (computing the metric) scales linearly.
  • Source-side cost (e.g. heartbeat writes on the primary) scales linearly — every heartbeat is one extra write that ships through every replica's changelog. At high fan-out, this becomes a non-trivial overhead on the replication substrate.
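A back-of-envelope sketch of the source-side cost (all numbers here are assumptions for illustration):

```python
heartbeat_interval_s = 1   # one heartbeat write per second on the primary
replicas = 50              # fan-out: every write ships through each changelog

writes_per_day = 86_400 // heartbeat_interval_s
replicated_events_per_day = writes_per_day * replicas

print(writes_per_day)             # 86400 extra writes on the primary
print(replicated_events_per_day)  # 4320000 extra replicated events
```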

The trade-off is explicit: "lower intervals and more accurate metrics reduce spikes and spread the workload more efficiently. That, too, comes at a cost, which we will discuss in a later post."
