Aggregation shield for TSDB cardinality

Place a dedicated metric-aggregation tier in front of the TSDB that drops expensive labels during ingestion, converting high-cardinality raw series into lower-cardinality aggregated series before they reach storage. The TSDB's cardinality growth is then governed by the (much smaller) set of distinct aggregation keys, not by the upstream infrastructure growth rate.
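
A minimal sketch of the core transform, assuming a simple rule shape (all names here are illustrative, not the production implementation; the actual tier is built on Telegraf, per the scaling numbers below):

  package main

  import (
      "fmt"
      "sort"
      "strings"
  )

  // Rule: which labels to drop for a metric, and how to merge samples.
  // Hypothetical shape; real deployments load thousands of such rules
  // from centrally managed config.
  type Rule struct {
      Metric     string
      DropLabels map[string]bool // e.g. pod_id, node_id
      Merge      func(acc, v float64) float64
  }

  // aggregationKey strips dropped labels and renders the rest as a
  // stable series key.
  func aggregationKey(metric string, labels map[string]string, r Rule) string {
      kept := make([]string, 0, len(labels))
      for k, v := range labels {
          if !r.DropLabels[k] {
              kept = append(kept, k+"="+v)
          }
      }
      sort.Strings(kept) // deterministic key regardless of label order
      return metric + "{" + strings.Join(kept, ",") + "}"
  }

  func main() {
      rule := Rule{
          Metric:     "http_requests_total",
          DropLabels: map[string]bool{"pod_id": true, "node_id": true},
          Merge:      func(acc, v float64) float64 { return acc + v },
      }

      // Three per-pod raw series collapse into one aggregated series.
      agg := map[string]float64{}
      for _, s := range []struct {
          labels map[string]string
          value  float64
      }{
          {map[string]string{"region": "us-west", "pod_id": "a"}, 10},
          {map[string]string{"region": "us-west", "pod_id": "b"}, 7},
          {map[string]string{"region": "us-west", "pod_id": "c"}, 3},
      } {
          key := aggregationKey(rule.Metric, s.labels, rule)
          agg[key] = rule.Merge(agg[key], s.value)
      }
      fmt.Println(agg) // map[http_requests_total{region=us-west}:20]
  }

The stored key space now grows with regions and tenants, not with pods, which is exactly the bound the pattern promises.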

The shield insulates the TSDB from two independent pressures:

  1. Long-term cardinality growth driven by fleet expansion and serverless workload churn.
  2. Incident-driven metric surges — when something breaks, services often emit more metrics, not fewer (retries, error series with new labels, autoscaler-triggered new pods). The TSDB is under query pressure at exactly the moment it's also under ingestion pressure.

Shape

  Applications                        Aggregation tier                 TSDB
  (all labels)      ─────▶     (drops expensive labels)     ─────▶   (bounded)
   pod_id=...                        ↓                               aggregated
   tenant_id=...                     metric{region=..., tenant=...}  series
   node_id=...                       sum / avg / histogram / …
   region=...

Key design decisions:

  • Stateful aggregation — the aggregators hold in-memory state (running counters, percentile reservoirs, histogram buckets). This state needs to survive redeployments; see the companion pattern patterns/sticky-routing-for-aggregator-state.
  • Rule-driven — aggregation rules define which labels to drop and which aggregation function to apply per metric. Rules are centrally managed.
  • Monotonic counter preservation — when an input series disappears (pod termination), the aggregated counter must not reset. Aggregators track per-input-series last-seen values and fold each sample in as a delta, so a terminated series' contribution is retained; a sketch of this bookkeeping follows the list.
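
A sketch of that bookkeeping, assuming sum-of-counters semantics (the types and method names are hypothetical, not the actual aggregator API):

  package shield

  // CounterAggregate keeps an aggregated counter monotonic even as the
  // input series behind it appear and disappear.
  type CounterAggregate struct {
      total    float64            // published aggregated counter value
      lastSeen map[string]float64 // last raw value per input-series ID
  }

  func NewCounterAggregate() *CounterAggregate {
      return &CounterAggregate{lastSeen: map[string]float64{}}
  }

  // Observe folds one raw counter sample into the aggregate as a delta.
  func (c *CounterAggregate) Observe(seriesID string, value float64) {
      prev, known := c.lastSeen[seriesID]
      switch {
      case !known:
          // First sample from a new series (a fresh pod). Counting the
          // full value assumes the raw counter started near zero.
          c.total += value
      case value >= prev:
          c.total += value - prev
      default:
          // Raw counter reset (process restart): count the post-reset
          // value as new growth so the aggregate never moves backwards.
          c.total += value
      }
      c.lastSeen[seriesID] = value
  }

  // Forget drops bookkeeping for a terminated series. Its contribution
  // stays in total, so the aggregated counter does not reset when the
  // pod behind it dies.
  func (c *CounterAggregate) Forget(seriesID string) {
      delete(c.lastSeen, seriesID)
  }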

Production validation: surge absorption

The shield's load-bearing property — separating TSDB scale from infra scale — is validated by surge absorption:

  • During an infra incident at Databricks, metric volume surged 2-5×.
  • The Telegraf-based aggregation tier absorbed the bulk of that surge.
  • Pantheon, the TSDB behind the shield, saw only a ~20% surge — enough headroom for engineers to run debugging and alerting queries without impact.

The shield is most valuable precisely during incidents. Without it, the TSDB would be fighting query load at exactly the moments it is also being hammered with ingestion.

The cost: debugging fidelity lost

Aggregation drops exactly the dimensions needed during incidents — "which pod crashed", "which tenant", "which node". Hyperscale deployments complement the shield with a raw-data tier stored elsewhere (see systems/hydra and patterns/dual-tier-observability-tsdb-plus-lakehouse) to recover debugging fidelity without pushing cardinality back onto the TSDB.

Scaling numbers

At Databricks:

  • >1 GB/s aggregation throughput in the largest region.
  • Thousands of aggregation rules.
  • Built on Telegraf + Dicer sticky routing.
