
Telegraf

Telegraf is InfluxData's open-source server agent for collecting, processing, aggregating, and writing metrics. It is plugin-based: input plugins (OTel, Prometheus scrape, StatsD, system metrics, ...), processor plugins (aggregate, rename, dedup), aggregator plugins (histograms, percentiles, min/max/mean), and output plugins (Prometheus remote-write, InfluxDB, Kafka, S3, ...).
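A minimal pipeline sketch in Telegraf's TOML configuration, wiring one plugin of each type together. The plugin names are real Telegraf plugins; the URLs, token, and org/bucket names are placeholders:

```toml
# Input: scrape Prometheus-format metrics (placeholder URL)
[[inputs.prometheus]]
  urls = ["http://localhost:9100/metrics"]

# Processor: rename a tag in flight
[[processors.rename]]
  [[processors.rename.replace]]
    tag = "instance"
    dest = "host"

# Aggregator: emit min/max/mean every period, drop the raw points
[[aggregators.basicstats]]
  period = "30s"
  drop_original = true
  stats = ["min", "max", "mean"]

# Output: write to InfluxDB v2 (placeholder endpoint and credentials)
[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "metrics"
```

With `drop_original = true`, only the aggregated statistics leave the agent; the raw points never reach the output tier.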

Canonical in-corpus role: the metric-aggregation tier in front of a scaled TSDB, where it acts as a cardinality shield by dropping expensive labels before they reach storage.

At Databricks — the cardinality shield in front of Pantheon

Pantheon (Databricks' systems/thanos fork) would not scale if it had to ingest full-cardinality metrics directly from Databricks' fleet — serverless VMs launching "tens of millions daily" drive unbounded label churn. The solution: a Telegraf aggregation pipeline that drops high-cardinality labels (pod IDs, tenant IDs) during ingestion while providing aggregated fleetwide views to service owners.
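The shield behavior can be sketched with Telegraf's standard metric modifiers (the tag names here are illustrative; Databricks' actual aggregation rules are not public):

```toml
# One hypothetical aggregation rule: sum counters fleetwide,
# stripping the per-pod and per-tenant dimensions before storage.
# tagexclude is a standard Telegraf metric modifier.
[[aggregators.basicstats]]
  period = "60s"
  drop_original = true                  # only the aggregate leaves this tier
  stats = ["sum", "count"]
  tagexclude = ["pod_id", "tenant_id"]  # drop high-cardinality labels
```

Because the expensive labels are removed at this tier, the TSDB's series count tracks the number of aggregation rules rather than the number of short-lived VMs.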

Scale: >1 GB/s aggregation throughput in the largest region, across thousands of aggregation rules.

Key design choice — sticky routing via Dicer, not Kafka: "These problems are often solved by using a messaging system like Kafka for partitioning assignments and maintaining previous data; this is costly at our scale and adds ingestion delay that impacts real-time usecases." Databricks instead built on Dicer's auto-sharder: metric series are routed stickily to the same Telegraf aggregator across redeployments, so in-memory aggregator state survives without a Kafka-backed durability layer. See patterns/sticky-routing-for-aggregator-state for the full trade-off.
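The sticky-routing idea can be sketched as consistent hashing of the series identity onto a ring of aggregators. This is an illustrative stand-in, not Dicer's actual protocol; class and series names are hypothetical:

```python
import hashlib
from bisect import bisect_left

class StickyRouter:
    """Route each metric series to a stable aggregator via a hash ring.

    As long as an aggregator stays in the ring, its series keep landing
    on it across client redeployments, so the aggregator's in-memory
    state stays valid without a Kafka-backed assignment log.
    """

    def __init__(self, aggregators, vnodes=64):
        # Place each aggregator at many virtual points for even spread.
        self.ring = sorted(
            (self._hash(f"{agg}#{i}"), agg)
            for agg in aggregators
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def route(self, series_key):
        # First ring point at or after the series hash, wrapping around.
        i = bisect_left(self.keys, self._hash(series_key)) % len(self.ring)
        return self.ring[i][1]

router = StickyRouter(["agg-0", "agg-1", "agg-2"])
# The same series always maps to the same aggregator:
assert router.route("cpu{host=a}") == router.route("cpu{host=a}")
```

The consistent-hashing property also bounds reshuffling: when one aggregator is added or removed, only the series hashing near its ring points move, which is what lets state survive redeployments.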

Production validation: during a Databricks infra incident, metrics load surged 2-5× across regions. Telegraf absorbed most of it; Pantheon only saw a 20% surge, so debugging and alerting queries ran unaffected. Canonical validation datum for the aggregation-shield pattern.

Edge-case handling

The main correctness problem for a stateful aggregator tier is counter resets on input pod churn: if an input time series disappears and a new one takes its place (starting from zero), the aggregated output counter should continue to increase monotonically rather than dip. Telegraf's native counter semantics plus Databricks' sticky-routing substrate together preserve this invariant without Kafka.
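A minimal sketch of the invariant (hypothetical code, not Telegraf internals): the aggregator folds the final value of every disappeared series into a retained offset, so the summed output never decreases when one series vanishes and a fresh one starts from zero.

```python
class MonotonicSumAggregator:
    """Aggregate cumulative counters from churning input series.

    Each live input series contributes its latest observed value; when
    a series vanishes (pod dies) or resets, its prior total is banked
    into `retired` so the aggregate output never dips.
    """

    def __init__(self):
        self.latest = {}    # series_id -> last observed counter value
        self.retired = 0.0  # banked totals of vanished/reset series

    def observe(self, series_id, value):
        prev = self.latest.get(series_id, 0.0)
        if value < prev:          # per-series counter reset detected
            self.retired += prev  # bank the pre-reset total
        self.latest[series_id] = value

    def retire(self, series_id):
        # Series disappeared: keep its final value in the aggregate.
        self.retired += self.latest.pop(series_id, 0.0)

    def value(self):
        return self.retired + sum(self.latest.values())

agg = MonotonicSumAggregator()
agg.observe("pod-a", 100)
agg.observe("pod-b", 50)
agg.retire("pod-a")        # pod-a churns out at 100...
agg.observe("pod-c", 5)    # ...pod-c replaces it, starting from zero
assert agg.value() == 155  # 150 -> 155: monotone, no dip to 55
```

The sticky-routing substrate matters here because this state lives in one aggregator's memory: only if a series keeps landing on the same aggregator can that aggregator see the disappearance or reset and bank the offset.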
