SYSTEM Cited by 9 sources
Prometheus¶
Prometheus is the open-source CNCF time-series database and monitoring system that defines the de-facto standard for metrics in cloud-native infrastructure. Originally built at SoundCloud (2012), it became a CNCF graduated project (2018) and is the default metrics backend for Kubernetes, OpenMetrics, and the OpenTelemetry metrics ecosystem.
Core shape¶
- Pull-based scrape model: Prometheus servers periodically `GET /metrics` from targets exposing the Prometheus text/OpenMetrics format. Targets don't push; Prometheus pulls.
- TSDB storage: local block-based time-series database, append-only with periodic compaction. Retention is a function of disk budget, typically days to weeks locally.
- Metric types: counter, gauge, histogram, summary. Native histograms added later for compact, quantile-friendly bucket encoding.
- Label-based data model: each time series is `metric_name{label=value, ...}`. Cardinality is the first-order scaling concern.
- PromQL: functional query language optimized for the label model. De-facto query language for cloud-native metrics.
- Rule manager: evaluates recording rules (pre-computed series) and alerting rules on a schedule, writes results back into the TSDB or hands them to Alertmanager.
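As a rough illustration of the scrape payload and the label model described above, the following sketch renders a counter in the Prometheus text exposition format using only the Python standard library. The class name `CounterFamily`, the metric name, and the label names are hypothetical and chosen for illustration; real services would normally use an official client library rather than hand-rolling this.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class CounterFamily:
    """Hypothetical minimal counter: one value per distinct label set."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted (label, value) tuple -> running total

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters may only increase")
        # Sorting makes the series identity independent of label order.
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Return the body a target would serve on its metrics endpoint."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            value = self._values[key]
            if key:
                label_str = ",".join(
                    f'{k}="{escape_label_value(v)}"' for k, v in key
                )
                lines.append(f"{self.name}{{{label_str}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests = CounterFamily("http_requests_total", "Total HTTP requests.")
    requests.inc(method="GET", code="200")
    requests.inc(2, code="200", method="GET")  # same series, different order
    requests.inc(method="POST", code="500")
    print(requests.render(), end="")
```

Each distinct combination of label values becomes its own line, and therefore its own time series once scraped, which is why the label set drives cardinality.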
In production at scale¶
A single Prometheus server scales to millions of series but eventually hits a limit (tens of millions, depending on hardware and query shape). Large orgs therefore build extended architectures on top of Prometheus:
- Remote-write to a central storage backend (VictoriaMetrics, Mimir, Thanos, Cortex) for long-term storage.
- Streaming-aggregation tiers (e.g., systems/vmagent) to collapse high-cardinality raw series before storage.
- Federation proxies (e.g., systems/promxy, Thanos Querier) to present one logical Prometheus over N underlying clusters.
- Custom forks / extensions that hook into Prometheus's `rules/manager.go` for alert backtesting, recording-rule offloading, or compatibility wrappers.
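The streaming-aggregation idea in the list above can be sketched as follows: drop a high-cardinality label before storage and sum what remains. This is a simplified illustration, not how vmagent is implemented. The function name `aggregate_without` and the sample data are hypothetical, and the values are assumed to be per-interval increments rather than raw cumulative counters.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Labels = Tuple[Tuple[str, str], ...]
Sample = Tuple[str, Dict[str, str], float]  # (metric name, labels, increment)


def aggregate_without(
    samples: Iterable[Sample], drop: Iterable[str]
) -> Dict[Tuple[str, Labels], float]:
    """Sum increments after removing the `drop` labels from every series."""
    dropped = set(drop)
    totals: Dict[Tuple[str, Labels], float] = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        totals[(name, kept)] += value
    return dict(totals)


if __name__ == "__main__":
    raw = [
        ("requests", {"svc": "api", "pod": "a"}, 3.0),
        ("requests", {"svc": "api", "pod": "b"}, 4.0),
        ("requests", {"svc": "web", "pod": "c"}, 1.0),
    ]
    # Three input series collapse to two once the per-pod label is removed.
    for key, total in aggregate_without(raw, ["pod"]).items():
        print(key, total)
```

Aggregating before storage trades per-pod detail for a smaller series count, so it fits labels that nobody queries individually.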
In the wiki¶
Prometheus is the reference point — directly or by contrast — for most of the observability storage posts in the corpus:
- Airbnb observability platform (systems/airbnb-observability-platform) — Prometheus + PromQL as the user-facing query engine over an in-house multi-cluster storage system; vmagent for aggregation, Promxy for federation, Grafana's K8s rollout operators for coordinated deploy.
- Airbnb fault-tolerant metrics storage (systems/airbnb-metrics-storage) — Airbnb's distinct multi-cluster Prometheus storage fleet, ingesting 50M samples/sec across 1.3B active time series (2.5 PB logical data) with per-tenant shuffle sharding.
- Airbnb's alert backtest framework (patterns/alert-backtesting) hooks directly into Prometheus's `rules/manager.go` to simulate alert changes against historical TSDB blocks.
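The per-tenant shuffle sharding mentioned above can be illustrated with a small sketch. It assumes each tenant is assigned a fixed-size subset of shards chosen by ranking shards on a per-tenant hash; the source does not describe Airbnb's actual assignment scheme, so the function `shuffle_shard`, the shard names, and the subset size are all hypothetical.

```python
import hashlib
from typing import List


def shuffle_shard(tenant: str, shards: List[str], size: int) -> List[str]:
    """Deterministically pick `size` shards for a tenant.

    Shards are ranked by a hash of (tenant, shard), so different tenants
    tend to get different, only partially overlapping subsets.
    """
    if not 0 < size <= len(shards):
        raise ValueError("size must be between 1 and the number of shards")

    def score(shard: str) -> bytes:
        return hashlib.sha256(f"{tenant}/{shard}".encode("utf-8")).digest()

    return sorted(shards, key=score)[:size]


if __name__ == "__main__":
    shards = [f"shard-{i}" for i in range(8)]
    for tenant in ("payments", "search", "checkout"):
        print(tenant, shuffle_shard(tenant, shards, size=3))
```

The point of the technique is blast-radius control: a tenant that overloads its own shards affects only the tenants that happen to share them, not the whole fleet.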
Key trade-offs / caveats¶
- Pull model is simple but requires service-discovery machinery to know what to scrape.
- Single-instance Prometheus is not HA; most production deployments run HA pairs plus remote-write, or hand off to one of the long-term-storage backends (VictoriaMetrics, Mimir, Thanos, Cortex).
- Cardinality blow-ups (e.g., a label with a user ID in it) are the most common production incident mode.
- PromQL is powerful but expensive on unbounded windows or cardinality; query guardrails matter.
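To make the cardinality caveat above concrete: the worst-case number of series for one metric is the product of the number of distinct values of each label. The sketch below computes that upper bound; the label names and counts are made-up illustrative figures, not measurements from any system on the wiki.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    base = {"method": 5, "status": 8, "instance": 50}
    print(worst_case_series(base))  # 5 * 8 * 50

    # Adding a hypothetical user_id label with 100,000 distinct values
    # multiplies the bound by 100,000.
    print(worst_case_series({**base, "user_id": 100_000}))
```

Real series counts are usually below this bound because not every combination occurs, but an unbounded label such as a user ID still multiplies whatever the metric already had.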
Known extensions / forks / backends on the wiki¶
- systems/vmagent — VictoriaMetrics agent; streaming aggregation.
- systems/promxy — Prometheus federation proxy with custom Airbnb additions (native histogram support, query fanout optimization).
- systems/grafana — the visualization layer most orgs pair with Prometheus.
- systems/opentelemetry — the vendor-neutral collector + SDKs that, via OTLP → Prometheus remote-write, feeds Prometheus-compatible backends.
Seen in¶
- sources/2026-05-05-databricks-10-trillion-samples-a-day-scaling-beyond-traditional-monitoring — Thanos-fork TSDB face at hyperscale. Databricks' old Prometheus-lineage TSDBs — "built for an order of magnitude lower scale" — became "the #1 reliability problem for the entire monitoring infrastructure" because the scale curve made scale-ups daily events rather than rare ones (canonical instance of concepts/tsdb-scaling-bottleneck). The response is Pantheon, a fork of CNCF Thanos scaled to 160+ instances / 5B active timeseries / 10 trillion samples/day / ~70 cloud regions on 3 major clouds. concepts/metric-cardinality is the primary scaling factor, and serverless workload churn ("tens of millions of VMs daily") forced the team to add two Receive groups on distinct memory-retention tiers — see patterns/thanos-receive-groups-with-memory-retention-tiers. Canonical instance of the post-Prometheus-scaling-wall architecture layered on top of the Prometheus query language (PromQL is preserved; storage substrate is fundamentally different).
- sources/2026-05-05-airbnb-monitoring-reliably-at-scale — Prometheus as the meta-monitoring tier at Airbnb: a separate Prometheus fleet dedicated to watching the main observability stack. Verbatim: "At Airbnb, we run a separate set of Prometheus instances dedicated to monitoring our observability stack." Deployment discipline: HA pairs on Kubernetes nodes isolated from the observability stack + distinct availability zones + pair-level anti-affinity with the systems/alertmanager HA pairs. The Prometheus meta-tier emits an always-firing alerting rule that Alertmanager pushes to AWS SNS — the rate of those messages is watched by CloudWatch as a dead-man's switch. Canonical wiki instance of Prometheus in the meta-monitoring role (distinct from the primary production-metrics role). Composes with patterns/ha-set-anti-affinity-across-shared-infra and patterns/heartbeat-absence-as-alert-trigger.
- — Brian Morrison II's 2023-11-15 PlanetScale best-practices post names Prometheus as PlanetScale's chosen replication monitoring stack: "At PlanetScale, we use Prometheus to monitor replication, along with other metrics, for the clusters we manage." Named alongside SolarWinds Database Performance (formerly VividCortex) as the two production options Morrison discusses. Canonical wiki datum: Prometheus is the fleet-metric tier underneath the PlanetScale observability stack (systems/planetscale-insights for query-tier; Vitess tablet throttler systems/vitess-throttler for replication-lag-driven admission control; Prometheus for fleet metrics and the load-bearing replication-lag signal that operators use to notice a quietly failing replica before failover).
- sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — storage system built on Prometheus at 50M samples/sec with per-tenant shuffle sharding, query sharding, and multi-cluster federation via Promxy.
- sources/2026-03-17-airbnb-observability-ownership-migration — Airbnb's migration from a vendor to Prometheus + PromQL.
- sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline — OTLP and StatsD data paths into Prometheus-compatible storage.
- sources/2026-03-04-airbnb-alert-backtesting-change-reports — hooking into `rules/manager.go` for backtesting at CI time.
- sources/2026-03-31-slack-from-custom-to-open-scalable-network-probing-and-http3-readiness — Slack names its Prometheus stack as the monitoring backbone for edge network probing; the Prometheus-ecosystem Blackbox Exporter is "a cornerstone of our monitoring", and Slack intern Sebastian Feliciano open-sourced HTTP/3/QUIC support into BBE upstream (with an in-house integration running in parallel) to close the concepts/http-3-probing-gap before HTTP/3 rolled out on the edge. Canonical wiki datum: Prometheus + BBE is the client-side black-box probing substrate, distinct from Prometheus's primary scrape path from services' own `/metrics` endpoints.
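The dead-man's switch from the Airbnb meta-monitoring entry above (an always-firing alert whose absence is the real signal) can be sketched as a check on heartbeat arrival times. This is a generic illustration under stated assumptions, not Airbnb's CloudWatch configuration: the function `heartbeat_missing`, the window length, and the timestamps are hypothetical.

```python
from typing import Iterable


def heartbeat_missing(
    arrivals: Iterable[float], now: float, window: float, min_count: int = 1
) -> bool:
    """Return True when fewer than `min_count` heartbeats arrived in the
    last `window` seconds, i.e. when the watched system has gone quiet."""
    recent = sum(1 for t in arrivals if now - window < t <= now)
    return recent < min_count


if __name__ == "__main__":
    beats = [0.0, 60.0, 120.0]  # heartbeat timestamps in seconds
    print(heartbeat_missing(beats, now=130.0, window=90.0))  # heartbeats still arriving
    print(heartbeat_missing(beats, now=400.0, window=90.0))  # silence past the window
```

The check has to run somewhere that does not share failure modes with the system sending the heartbeats, which is why the source places it outside the observability stack.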