
Prometheus

Prometheus is the open-source CNCF time-series database and monitoring system that defines the de-facto standard for metrics in cloud-native infrastructure. Originally built at SoundCloud (2012), it became a CNCF graduated project (2018); its data model and exposition format underpin Kubernetes monitoring, the OpenMetrics standard, and much of the OpenTelemetry metrics ecosystem.

Core shape

  • Pull-based scrape model: Prometheus servers periodically GET /metrics from targets exposing the Prometheus text/OpenMetrics format. Targets don't push; Prometheus pulls.
  • TSDB storage: local block-based time-series database, append-only with periodic compaction. Retention is a function of disk budget, typically days to weeks locally.
  • Metric types: counter, gauge, histogram, summary. Native histograms added later for compact, quantile-friendly bucket encoding.
  • Label-based data model: each time series is identified by metric_name{label="value", ...}. Cardinality is the first-order scaling concern.
  • PromQL: functional query language optimized for the label model. De-facto query language for cloud-native metrics.
  • Rule manager: evaluates recording rules (pre-computed series) and alerting rules on a schedule, writes results back into the TSDB or hands them to Alertmanager.
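The pull model and text exposition format above can be sketched with a minimal scrape target using only the Python standard library. The metric name, label, port, and help text are illustrative; a real service would use an official Prometheus client library instead of hand-rendering the format.

```python
# Minimal sketch of a Prometheus scrape target (assumptions: metric
# name app_requests_total and port 8000 are illustrative).
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # a counter: monotonically increasing


def render_metrics() -> str:
    # Prometheus text exposition format: # HELP and # TYPE comment
    # lines, then one line per series as metric_name{labels} value.
    return (
        "# HELP app_requests_total Total HTTP requests served.\n"
        "# TYPE app_requests_total counter\n"
        f'app_requests_total{{path="/metrics"}} {REQUESTS_TOTAL}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_TOTAL
        if self.path == "/metrics":
            REQUESTS_TOTAL += 1
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


# To expose the endpoint, a Prometheus server would be configured to
# periodically GET http://<host>:8000/metrics from:
#   HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

The target only answers GETs; all scheduling, retry, and retention logic lives on the Prometheus side, which is what keeps instrumented services simple.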

In production at scale

A single Prometheus server scales to millions of series but eventually hits a limit (tens of millions, depending on hardware and query shape). Large orgs therefore build extended architectures on top of Prometheus:

  • Remote-write to a central storage backend (VictoriaMetrics, Mimir, Thanos, Cortex) for long-term storage.
  • Streaming-aggregation tiers (e.g., systems/vmagent) to collapse high-cardinality raw series before storage.
  • Federation proxies (e.g., systems/promxy, Thanos Querier) to present one logical Prometheus over N underlying clusters.
  • Custom forks / extensions that hook into Prometheus's rules/manager.go for alert backtesting, recording-rule offloading, or compatibility wrappers.
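Conceptually, a streaming-aggregation tier like the one above collapses raw series by dropping high-cardinality labels and merging the series that become identical. A toy sketch, with hypothetical label names and a simple sum (roughly what such tiers do for counters):

```python
# Sketch of label-drop-and-sum aggregation; label names ("path",
# "pod") are hypothetical, not taken from any real config.
from collections import defaultdict


def aggregate(samples, drop_labels):
    """Remove the given labels from each series and sum the values of
    series whose remaining label sets collide."""
    out = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in drop_labels
        ))
        out[kept] += value
    return dict(out)


raw = [
    ({"path": "/a", "pod": "web-1"}, 3.0),
    ({"path": "/a", "pod": "web-2"}, 4.0),
    ({"path": "/b", "pod": "web-1"}, 5.0),
]
# Dropping the per-pod label collapses three raw series into two
# stored series, which is the cardinality win.
agg = aggregate(raw, {"pod"})
```

The trade-off is lossy: once the per-pod dimension is aggregated away before storage, no query can recover it.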

In the wiki

Prometheus is the reference point — directly or by contrast — for most of the observability storage posts in the corpus:

  • Airbnb observability platform (systems/airbnb-observability-platform) — Prometheus + PromQL as the user-facing query engine over an in-house multi-cluster storage system; vmagent for aggregation, Promxy for federation, Grafana's K8s rollout operators for coordinated deploy.
  • Airbnb fault-tolerant metrics storage (systems/airbnb-metrics-storage) — Airbnb's distinct multi-cluster Prometheus storage fleet, ingesting 50M samples/sec across 1.3B active time series (2.5 PB logical data) with per-tenant shuffle sharding.
  • Airbnb alert backtesting (patterns/alert-backtesting) — hooks directly into Prometheus's rules/manager.go to simulate alert changes against historical TSDB blocks.

Key trade-offs / caveats

  • Pull model is simple but requires service-discovery machinery to know what to scrape.
  • Single-instance Prometheus is not HA — most production deployments run identical scraping pairs plus remote-write, or hand off to one of the long-term-storage backends.
  • Cardinality blow-ups (e.g., a label with a user ID in it) are the most common production incident mode.
  • PromQL is powerful but expensive on unbounded windows or cardinality; query guardrails matter.
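The cardinality caveat above is easy to make concrete: counting distinct values per label name across a set of series immediately surfaces the offending label. A small sketch (the "user_id" label is a hypothetical example of a blow-up):

```python
# Sketch of a cardinality audit over series label sets; the series
# and label names below are synthetic examples.
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name. Labels with huge counts
    (user IDs, request IDs, raw URLs) are the usual culprits behind
    series explosions."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vs) for name, vs in values.items()}


series = [{"path": "/login", "user_id": str(i)} for i in range(1000)]
card = label_cardinality(series)
# card["user_id"] is 1000 while card["path"] is 1: user_id multiplies
# the series count and should be dropped or hashed at instrumentation.
```

In practice the same idea is applied via TSDB status endpoints and scrape-side relabeling rather than ad-hoc scripts, but the diagnosis is identical: find the label whose value set scales with traffic.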

Known extensions / forks / backends on the wiki

  • systems/vmagent — VictoriaMetrics agent; streaming aggregation.
  • systems/promxy — Prometheus federation proxy with custom Airbnb additions (native histogram support, query fanout optimization).
  • systems/grafana — the visualization layer most orgs pair with Prometheus.
  • systems/opentelemetry — the vendor-neutral collector + SDKs that, via OTLP → Prometheus remote-write, feeds Prometheus-compatible backends.
