Thanos Receive groups with memory-retention tiers¶
Run two (or more) Thanos Receive groups with different memory-retention windows, each tuned for the lifespan of the workloads whose metrics it ingests. Short retention for ephemeral (serverless / autoscaled) workloads; long retention for persistent services.
This is a direct counter-pattern to the default "one Receive group / one retention window for everything" posture, which forces the retention knob to be set for the worst-case workload — wasteful when most metrics churn fast.
Shape¶
- Long-lived Receive group — 2 hours memory retention. Receives metrics from long-running services (control-plane components, stateful databases, long-lived microservices). The longer window is justified because these metrics tend to be queried at full fidelity, and retaining them in memory avoids object-storage fetch latency for interactive queries.
- Ephemeral Receive group — 30 minutes memory retention. Receives metrics from short-lived serverless workloads whose pod / VM IDs churn rapidly (see concepts/serverless-workload-churn-cardinality). A short window bounds the in-memory series count for these workloads because most label values expire before the retention window closes.
Each group independently implements the standard Thanos Receive design: quorum writes across replicas, block flush to the on-disk tier at the 2h / 30m mark, and eventual upload to object storage.
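A minimal sketch of the two-group shape, expressed as a Go config struct rather than Thanos' actual flag set; the field names, group names, and the replication factor of 3 are illustrative assumptions, while the 2h / 30m windows come from the pattern itself.

```go
package main

import (
	"fmt"
	"time"
)

// ReceiveGroupConfig is an illustrative stand-in for one Receive group's
// tuning knobs; the field names are assumptions, not actual Thanos flags.
type ReceiveGroupConfig struct {
	Name              string
	MemoryRetention   time.Duration // how long series stay in the in-memory head before block flush
	ReplicationFactor int           // quorum writes across this many replicas (assumed value)
}

func main() {
	groups := []ReceiveGroupConfig{
		{Name: "receive-longlived", MemoryRetention: 2 * time.Hour, ReplicationFactor: 3},
		{Name: "receive-ephemeral", MemoryRetention: 30 * time.Minute, ReplicationFactor: 3},
	}
	for _, g := range groups {
		fmt.Printf("%s: flush blocks every %s, quorum across %d replicas\n",
			g.Name, g.MemoryRetention, g.ReplicationFactor)
	}
}
```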
Why it works¶
Thanos' dominant memory cost is proportional to active in-memory series count × retention window. If 80% of series originate from short-lived workloads with a mean lifetime of ~5 minutes, keeping them in memory for 2 hours means most of those series are already "dead" (no more samples arriving) but still held in the index. Cutting the window to 30 minutes drops the memory cost of those series while still preserving the most recent data — the data that engineers actually query during incidents.
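A back-of-the-envelope model of why the window cut pays off. The churn rate below is an assumed number for illustration, not a Databricks figure; the point is only that the held-series count for churning workloads scales with the window, so cutting 2h to 30m is roughly a 4× reduction for that slice of the series population.

```go
package main

import (
	"fmt"
	"time"
)

// For churning workloads, the steady-state in-memory series count is roughly
// (series creation rate) x (retention window): a series created at time t is
// held in the head index until the window closes, even if it stopped
// receiving samples minutes after it appeared.
func inMemorySeries(creationRatePerSec float64, retention time.Duration) float64 {
	return creationRatePerSec * retention.Seconds()
}

func main() {
	// Assumed churn rate, for illustration only: 10k new ephemeral series/s.
	const churnRate = 10_000.0

	longWindow := inMemorySeries(churnRate, 2*time.Hour)     // single-group baseline
	shortWindow := inMemorySeries(churnRate, 30*time.Minute) // dedicated ephemeral group

	fmt.Printf("2h window:  %.0f series held in memory\n", longWindow)  // 72000000
	fmt.Printf("30m window: %.0f series held in memory\n", shortWindow) // 18000000
	fmt.Printf("reduction:  %.0fx\n", longWindow/shortWindow)           // 4x
}
```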
Assignment: which group gets which metric?¶
Router-layer rule-based tenant attribution. The router inspects metric names / selected labels and routes each series to the appropriate Receive group. No upstream client changes are required — a service that moves from a persistent to an ephemeral deployment model still emits the same metric, and the router re-routes it.
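A minimal sketch of what rule-based attribution at the router could look like, assuming a simple label-match rule shape; the label names, group names, and fallback behavior are hypothetical, not Pantheon's actual routing configuration.

```go
package main

import "fmt"

// RoutingRule maps a label match to a target Receive group. The label and
// group names used below are illustrative assumptions.
type RoutingRule struct {
	LabelName  string
	LabelValue string
	Group      string
}

// routeSeries picks a Receive group from a series' labels. Anything not
// matched by an explicit rule falls back to the long-lived group, so a
// missing rule errs toward retaining data longer rather than shorter.
func routeSeries(labels map[string]string, rules []RoutingRule) string {
	for _, r := range rules {
		if labels[r.LabelName] == r.LabelValue {
			return r.Group
		}
	}
	return "receive-longlived"
}

func main() {
	rules := []RoutingRule{
		{LabelName: "workload_type", LabelValue: "serverless", Group: "receive-ephemeral"},
		{LabelName: "workload_type", LabelValue: "autoscaled", Group: "receive-ephemeral"},
	}

	fmt.Println(routeSeries(map[string]string{"workload_type": "serverless"}, rules)) // receive-ephemeral
	fmt.Println(routeSeries(map[string]string{"workload_type": "postgres"}, rules))   // receive-longlived
}
```

Defaulting unmatched series to the long-lived group is one way to soften the misclassification failure mode listed below, at the cost of some extra memory in the long-lived group.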
Observed outcome¶
At Databricks:
- 5B active timeseries across the fleet.
- 300M active series on the largest single Pantheon instance.
- ~1,000 PromQL QPS on that instance.
- Memory footprint significantly reduced vs single-group equivalent.
Companions¶
- patterns/aggregation-shield-for-tsdb-cardinality — the upstream shield that bounds what either Receive group has to hold. Works in series with this pattern.
- patterns/purpose-built-control-plane-for-stateful-tsdb — the control-plane automation that makes running multiple Receive groups operationally tractable (Rollout Operator / Hashring Controller coordinate across all groups).
Failure modes¶
- Misclassified workloads — a workload wrongly assigned to the ephemeral group has its series age out of memory sooner than expected, so interactive queries fall back to object storage earlier than intended. Mitigation: explicit routing rules + regular audit.
- Cross-group queries — PromQL queries spanning both groups need Querier federation; any bug in deduplication surfaces as duplicate or missing samples.
- Group imbalance — traffic imbalance between groups leaves one starved for resources and the other over-provisioned. Autoscaling signals must be per-group.
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — canonical instance. Pantheon runs exactly this pattern: 2h Receive group + 30m Receive group, router-layer tenant attribution, each group as three isolated Kubernetes StatefulSets. "This split reflects the lifespan we observed for serverless workloads at Databricks, and significantly reduces memory footprint and cloud cost while preserving correctness."