Telegraf¶
Telegraf is InfluxData's open-source server agent for collecting, processing, aggregating, and writing metrics. It is plugin-based: input plugins (OTel, Prometheus scrape, StatsD, system metrics, ...), processor plugins (aggregate, rename, dedup), aggregator plugins (histograms, percentiles, min/max/mean), and output plugins (Prometheus remote write, InfluxDB, Kafka, S3, ...).
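A minimal Go sketch of the pipeline shape this implies: samples flow from an input, through processors, into an aggregator, and out to storage. The types and names here are illustrative only, not Telegraf's actual plugin interfaces.

```go
package main

import "fmt"

// Metric is a simplified point: name, label set, value. Telegraf's real
// model also carries timestamps and typed fields; this sketch keeps only
// what the pipeline shape needs.
type Metric struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// The four plugin stages, reduced to function types (illustrative names).
type (
	Input      func() []Metric         // e.g. scrape, StatsD, system stats
	Processor  func(Metric) Metric     // e.g. rename, dedup, drop labels
	Aggregator func([]Metric) []Metric // e.g. histograms, min/max/mean
	Output     func([]Metric) error    // e.g. remote write, InfluxDB, Kafka
)

// runPipeline wires the stages together for one collection interval.
func runPipeline(in Input, proc Processor, agg Aggregator, out Output) error {
	raw := in()
	processed := make([]Metric, 0, len(raw))
	for _, m := range raw {
		processed = append(processed, proc(m))
	}
	return out(agg(processed))
}

func main() {
	// Trivial stand-ins for each stage.
	in := func() []Metric {
		return []Metric{{Name: "requests_total", Labels: map[string]string{"pod": "a"}, Value: 3}}
	}
	identity := func(m Metric) Metric { return m }
	passthrough := func(ms []Metric) []Metric { return ms }
	print := func(ms []Metric) error { fmt.Println(ms); return nil }

	_ = runPipeline(in, identity, passthrough, print)
}
```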
Canonical in-corpus role: the metric-aggregation tier in front of a scaled TSDB, where it acts as a cardinality shield by dropping expensive labels before they reach storage.
At Databricks — the cardinality shield in front of Pantheon¶
Pantheon (Databricks' fork of systems/thanos) would not scale if it had to ingest full-cardinality metrics directly from Databricks' fleet — serverless VMs launching "tens of millions daily" drive unbounded label churn. The solution: a Telegraf aggregation pipeline that drops high-cardinality labels (pod IDs, tenant IDs) during ingestion while providing aggregated fleetwide views to service owners.
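A minimal Go sketch of the shield idea, assuming cumulative counters and using illustrative label names (`pod_id`, `tenant_id`) as stand-ins for the identifiers named in the source; it is not Databricks' actual aggregation rule set.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one incoming counter sample with its full label set.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// highCardLabels are dropped before aggregation so they never reach the TSDB.
// Illustrative names only, not a real rule set.
var highCardLabels = map[string]bool{"pod_id": true, "tenant_id": true}

// seriesKey builds a stable key from the labels that survive the drop.
func seriesKey(labels map[string]string) string {
	kept := make([]string, 0, len(labels))
	for k, v := range labels {
		if !highCardLabels[k] {
			kept = append(kept, k+"="+v)
		}
	}
	sort.Strings(kept)
	return strings.Join(kept, ",")
}

// aggregate sums counter samples into the reduced label space.
func aggregate(samples []Sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[seriesKey(s.Labels)] += s.Value
	}
	return out
}

func main() {
	// Two pods of the same service collapse into a single output series.
	samples := []Sample{
		{Labels: map[string]string{"service": "jobs", "region": "us-west", "pod_id": "p-1"}, Value: 10},
		{Labels: map[string]string{"service": "jobs", "region": "us-west", "pod_id": "p-2"}, Value: 7},
	}
	fmt.Println(aggregate(samples)) // map[region=us-west,service=jobs:17]
}
```

The cardinality win comes entirely from the key function: however many pods or tenants exist, storage only ever sees one series per surviving label combination.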
Scale: >1 GB/s aggregation throughput in the largest region, across thousands of aggregation rules.
Key design choice — sticky routing via Dicer, not Kafka: "These problems are often solved by using a messaging system like Kafka for partitioning assignments and maintaining previous data; this is costly at our scale and adds ingestion delay that impacts real-time usecases." Databricks instead built on Dicer's auto-sharder: metric series are routed stickily to the same Telegraf aggregator across redeployments, so in-memory aggregator state survives without a Kafka-backed durability layer. See patterns/sticky-routing-for-aggregator-state for the full trade-off.
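The source does not detail Dicer's assignment algorithm; the sketch below only shows the general sticky-routing idea, hashing each series key to one aggregator instance so that a series' in-memory state is never split across nodes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeSeries deterministically maps a series key to one of n aggregator
// instances. Because the mapping depends only on the key, every sample of a
// series lands on the same aggregator, keeping its in-memory counter state
// on a single node. This is a sketch of the general idea, not Dicer's
// actual sharding logic.
func routeSeries(seriesKey string, n int) int {
	h := fnv.New64a()
	h.Write([]byte(seriesKey))
	return int(h.Sum64() % uint64(n))
}

func main() {
	aggregators := 8
	for _, key := range []string{
		"requests_total{service=jobs,region=us-west}",
		"requests_total{service=sql,region=us-west}",
	} {
		fmt.Printf("%s -> aggregator %d\n", key, routeSeries(key, aggregators))
	}
}
```

A plain hash-mod mapping like this reshuffles most keys whenever the aggregator count changes; keeping assignments stable across redeployments and rescales is exactly the part an auto-sharder such as Dicer has to provide (e.g. via consistent hashing or an explicit assignment table — the source does not say which).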
Production validation: during a Databricks infra incident, metrics load surged 2-5× across regions. Telegraf absorbed most of it; Pantheon only saw a 20% surge, so debugging and alerting queries ran unaffected. Canonical validation datum for the aggregation-shield pattern.
Edge-case handling¶
The main correctness problem for a stateful aggregator tier is counter resets on input pod churn — if an input timeseries disappears and a new one takes its place, the aggregated output counter should continue to increase monotonically rather than dip. Telegraf's native counter semantics plus Databricks' sticky-routing substrate together preserve this invariant without Kafka.
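A minimal sketch, assuming cumulative input counters, of how an aggregator can keep the summed output monotonic when an input series resets or is replaced by a fresh one starting at zero. It illustrates the invariant described above, not Telegraf's internal counter handling.

```go
package main

import "fmt"

// counterAggregator folds many cumulative input counters into one
// monotonically increasing output by accumulating per-series deltas.
// When an input series resets or a replacement pod starts from zero,
// the delta logic treats the new reading as growth from zero instead
// of letting the output dip.
type counterAggregator struct {
	last  map[string]float64 // last observed value per input series
	total float64            // aggregated, monotonically increasing output
}

func newCounterAggregator() *counterAggregator {
	return &counterAggregator{last: make(map[string]float64)}
}

// Observe ingests one cumulative reading for an input series.
func (a *counterAggregator) Observe(series string, value float64) {
	prev, seen := a.last[series]
	switch {
	case !seen || value < prev:
		// New series, or a counter reset: count the full value as growth.
		a.total += value
	default:
		a.total += value - prev
	}
	a.last[series] = value
}

func main() {
	agg := newCounterAggregator()
	agg.Observe("pod-a", 100)
	agg.Observe("pod-a", 150)
	// pod-a churns away; pod-b replaces it and starts from zero.
	agg.Observe("pod-b", 30)
	fmt.Println(agg.total) // 180: the aggregate never dips
}
```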
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical source. Telegraf + systems/dicer is the aggregation tier in front of Pantheon. Sustains >1 GB/s in the largest region, thousands of aggregation rules, and absorbs 2-5× metric surges during incidents so Pantheon sees only a 20% surge. Sticky routing via Dicer trades Kafka's explicit durability for cheaper, lower-latency in-memory state.
Related¶
- systems/dicer — auto-sharder that drives sticky routing
- systems/pantheon — downstream TSDB
- systems/thanos
- systems/databricks
- companies/databricks
- concepts/metric-aggregation-as-cardinality-shield
- concepts/metric-cardinality
- concepts/sticky-routing
- patterns/aggregation-shield-for-tsdb-cardinality
- patterns/sticky-routing-for-aggregator-state