SYSTEM Cited by 4 sources
Prometheus
Prometheus is the open-source CNCF time-series database and monitoring system that defines the de-facto standard for metrics in cloud-native infrastructure. Originally built at SoundCloud (2012), it became a CNCF graduated project (2018). It is the de-facto metrics backend for Kubernetes, its text exposition format is the basis of OpenMetrics, and it is a common target for the OpenTelemetry metrics ecosystem.
Core shape
- Pull-based scrape model: Prometheus servers periodically `GET /metrics` from targets exposing the Prometheus text/OpenMetrics format. Targets don't push; Prometheus pulls.
- TSDB storage: local block-based time-series database, append-only with periodic compaction. Retention is a function of disk budget, typically days to weeks locally.
- Metric types: counter, gauge, histogram, summary. Native histograms added later for compact, quantile-friendly bucket encoding.
- Label-based data model: each time series is `metric_name{label=value, ...}`. Cardinality is the first-order scaling concern.
- PromQL: functional query language optimized for the label model. De-facto query language for cloud-native metrics.
- Rule manager: evaluates recording rules (pre-computed series) and alerting rules on a schedule, writes results back into the TSDB or hands them to Alertmanager.
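As a rough illustration of the label model and of what a scrape returns, the sketch below keeps counters in memory and renders them as `metric_name{label="value"} value` lines. `CounterRegistry` and `series_key` are hypothetical names invented for this example, not the official client-library API, and the sketch omits `# TYPE`/`# HELP` lines and label-value escaping.

```python
from collections import defaultdict


def series_key(name, labels):
    """Identity of a time series: metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


class CounterRegistry:
    """Hypothetical in-memory registry (illustration only)."""

    def __init__(self):
        self._values = defaultdict(float)

    def inc(self, name, labels, amount=1.0):
        # Counters are monotonic: reject negative increments.
        if amount < 0:
            raise ValueError("counters only go up")
        self._values[series_key(name, labels)] += amount

    def render(self):
        """Render every series as one text-format sample line."""
        lines = []
        for (name, pairs), value in sorted(self._values.items()):
            if pairs:
                label_str = ",".join(f'{k}="{v}"' for k, v in pairs)
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


reg = CounterRegistry()
reg.inc("http_requests_total", {"method": "GET", "code": "200"})
reg.inc("http_requests_total", {"code": "200", "method": "GET"})  # same series
reg.inc("http_requests_total", {"method": "POST", "code": "500"})
print(reg.render(), end="")
# http_requests_total{code="200",method="GET"} 2.0
# http_requests_total{code="500",method="POST"} 1.0
```

Label order does not matter for series identity, so the first two increments land on the same series, while every distinct label combination creates a new one.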
In production at scale
A single Prometheus server scales to millions of series but eventually hits a limit (tens of millions, depending on hardware and query shape). Large orgs therefore build extended architectures on top of Prometheus:
- Remote-write to a central storage backend (VictoriaMetrics, Mimir, Thanos, Cortex) for long-term storage.
- Streaming-aggregation tiers (e.g., systems/vmagent) to collapse high-cardinality raw series before storage.
- Federation proxies (e.g., systems/promxy, Thanos Querier) to present one logical Prometheus over N underlying clusters.
- Custom forks / extensions that hook into Prometheus's `rules/manager.go` for alert backtesting, recording-rule offloading, or compatibility wrappers.
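The streaming-aggregation idea can be sketched as dropping a high-cardinality label and summing what remains, conceptually similar to PromQL's `sum without (pod)`. This is only an illustration: `aggregate_without` is a hypothetical helper, the input values are assumed to be per-interval increments for a single metric, and it does not reflect how vmagent is actually configured or implemented.

```python
from collections import defaultdict


def aggregate_without(samples, drop_labels):
    """Sum values across series after removing the labels in drop_labels.

    samples: iterable of (labels_dict, value) for one metric in one interval.
    Returns a dict mapping the remaining sorted label pairs to the summed value.
    """
    drop = set(drop_labels)
    out = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[kept] += value
    return dict(out)


raw = [
    ({"service": "api", "pod": "api-0"}, 3.0),
    ({"service": "api", "pod": "api-1"}, 4.0),
    ({"service": "web", "pod": "web-0"}, 1.0),
]
collapsed = aggregate_without(raw, ["pod"])
print(len(raw), "->", len(collapsed))  # 3 -> 2
```

Three raw series collapse to two stored series here; with thousands of pods per service the reduction is what keeps the storage tier's cardinality bounded.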
In the wiki
Prometheus is the reference point — directly or by contrast — for most of the observability storage posts in the corpus:
- Airbnb observability platform (systems/airbnb-observability-platform) — Prometheus + PromQL as the user-facing query engine over an in-house multi-cluster storage system; vmagent for aggregation, Promxy for federation, Grafana's K8s rollout operators for coordinated deploy.
- Airbnb fault-tolerant metrics storage (systems/airbnb-metrics-storage) — Airbnb's distinct multi-cluster Prometheus storage fleet, ingesting 50M samples/sec across 1.3B active time series (2.5 PB logical data) with per-tenant shuffle sharding.
- Airbnb's alert backtest framework (patterns/alert-backtesting) hooks directly into Prometheus's `rules/manager.go` to simulate alert changes against historical TSDB blocks.
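Per-tenant shuffle sharding, mentioned above, generally means giving each tenant a small, deterministic subset of the available shards so that one noisy tenant only affects the shards it shares. The sketch below shows the general technique only; the function name, the hashing scheme, and the shard counts are assumptions for illustration and are not taken from the Airbnb sources.

```python
import hashlib
import random


def shards_for_tenant(tenant_id, num_shards, shards_per_tenant):
    """Pick a deterministic pseudo-random subset of shards for a tenant."""
    if not 0 < shards_per_tenant <= num_shards:
        raise ValueError("shards_per_tenant must be in 1..num_shards")
    # Derive a stable seed from the tenant ID so every caller computes
    # the same subset without coordination.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    # Note: the exact subset depends on Python's sampling algorithm, so a
    # production system would pin its own selection procedure.
    return sorted(rng.sample(range(num_shards), shards_per_tenant))


print(shards_for_tenant("tenant-a", 16, 4))
print(shards_for_tenant("tenant-b", 16, 4))
```

With 16 shards and 4 per tenant there are 1,820 possible subsets, so two tenants rarely share their entire shard set, which limits the blast radius of a single misbehaving tenant.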
Key trade-offs / caveats
- Pull model is simple but requires service-discovery machinery to know what to scrape.
- Single-instance Prometheus is not HA; most production deployments run redundant pairs plus remote-write, or hand off to one of the long-term-storage backends.
- Cardinality blow-ups (e.g., a label with a user ID in it) are the most common production incident mode.
- PromQL is powerful but expensive over unbounded time windows or high-cardinality selectors; query guardrails matter.
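The cardinality caveat is easiest to see as arithmetic: the worst-case number of series for one metric is the product of the distinct values of each label. The label names and counts below are made-up illustrative numbers, and `max_series` is a hypothetical helper that gives an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


base = {"method": 5, "code": 10, "instance": 200}
with_user = {**base, "user_id": 1_000_000}

print(max_series(base))       # 10000
print(max_series(with_user))  # 10000000000
```

Adding a single user-ID label turns a ten-thousand-series metric into one with a potential ten billion series, far beyond what a single server holds.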
Known extensions / forks / backends on the wiki
- systems/vmagent — VictoriaMetrics agent; streaming aggregation.
- systems/promxy — Prometheus federation proxy with custom Airbnb additions (native histogram support, query fanout optimization).
- systems/grafana — the visualization layer most orgs pair with Prometheus.
- systems/opentelemetry — the vendor-neutral collector + SDKs that, via OTLP → Prometheus remote-write, feeds Prometheus-compatible backends.
Seen in
- sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — storage system built on Prometheus at 50M samples/sec with per-tenant shuffle sharding, query sharding, and multi-cluster federation via Promxy.
- sources/2026-03-17-airbnb-observability-ownership-migration — Airbnb's migration from a vendor to Prometheus + PromQL.
- sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline — OTLP and StatsD data paths into Prometheus-compatible storage.
- sources/2026-03-04-airbnb-alert-backtesting-change-reports — hooking into `rules/manager.go` for backtesting at CI time.