Promxy¶
Promxy is an open-source proxy over Prometheus that presents N independent Prometheus servers / storage backends as a single logical Prometheus. Applications (Grafana dashboards, alerting rules, API clients) talk to Promxy as if they were talking to one Prometheus; Promxy fans out each query to the underlying clusters, aggregates the responses, and returns a unified result.
Core responsibilities¶
- Query fanout: parse the incoming PromQL, figure out which backends might own matching series, dispatch sub-queries in parallel, and aggregate responses on the fanout side.
- Aggregation: combine partial results; e.g., sum() over series that exist in multiple backends merges the counts rather than double-counting.
- Unified service discovery: one endpoint for dashboards; the existence of multiple backend clusters is transparent.
- Graceful degradation: when a backend is unhealthy, Promxy surfaces the partial result (with error annotations) rather than failing the whole query.
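The fanout, aggregation, and graceful-degradation responsibilities above can be sketched roughly as follows. This is a simplified illustration, not Promxy's actual implementation: the backends are hypothetical in-memory callables (`make_backend`), series are keyed by a frozenset of label pairs, and the merge rule is assumed to be "keep each series once".

```python
from concurrent.futures import ThreadPoolExecutor


def make_backend(data, healthy=True):
    """Hypothetical stand-in for one Prometheus backend.

    Returns a callable that maps a query string to {label_set: value}.
    """
    def query(_promql):
        if not healthy:
            raise ConnectionError("backend unavailable")
        return dict(data)
    return query


def fanout(backends, promql):
    """Query every backend in parallel, merge series, collect warnings."""
    merged, warnings = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(fn, promql) for name, fn in backends.items()}
    # Leaving the `with` block waits for all sub-queries to finish.
    for name, future in futures.items():
        try:
            result = future.result()
        except Exception as exc:
            # Graceful degradation: record the failure, keep the other results.
            warnings.append(f"{name}: {exc}")
            continue
        for labels, value in result.items():
            # A series present in several backends is kept once, not summed.
            merged.setdefault(labels, value)
    return merged, warnings
```

With two healthy backends that both hold the same series and one unhealthy backend, a sum over `merged.values()` counts the shared series once, and `warnings` names the backend that failed.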
Why orgs pick it¶
Once a Prometheus deployment is large enough to be split across multiple clusters (for blast-radius, regional, or workload-segregation reasons; see concepts/active-multi-cluster-blast-radius), there's a choice: either teach every downstream consumer about the cluster topology, or put a proxy in front of the clusters that hides it. Promxy is the proxy.
Alternatives in the same space: Thanos Querier, Cortex frontend. Promxy is the lighter, single-binary option.
Airbnb's custom additions¶
According to the 2026-04-21 Airbnb fault-tolerant-metrics-storage post, Airbnb runs open-source Promxy with custom functionality added for its own needs:
- Native histogram support — Prometheus native histograms (the compact, quantile-friendly bucket encoding) weren't supported in upstream Promxy at the time; Airbnb added it so cross-cluster queries over histogram series produce correct p99/p95 numbers.
- Query fanout optimization — the default fanout dispatches every sub-query to every backend. Airbnb's custom logic narrows the fanout to only the relevant cluster(s) based on the tenant → cluster map, reducing wasted work and federated-query cost.
These enhancements enable cross-cluster querying and alerting tailored to Airbnb's multi-cluster metrics storage. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)
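The fanout narrowing described above can be pictured with a small sketch. The source does not describe how Airbnb derives the tenant or stores the mapping, so everything here is assumed for illustration: the `TENANT_CLUSTERS` table, the cluster names, and the choice to fall back to a full fanout when the tenant is unknown.

```python
# Hypothetical tenant -> cluster map; names are illustrative only.
TENANT_CLUSTERS = {
    "payments": {"cluster-a"},
    "search": {"cluster-b", "cluster-c"},
}
ALL_CLUSTERS = {"cluster-a", "cluster-b", "cluster-c"}


def select_clusters(tenant, tenant_clusters=TENANT_CLUSTERS,
                    all_clusters=ALL_CLUSTERS):
    """Return the clusters a query should be dispatched to.

    Known tenants are routed only to their mapped clusters. Unknown or
    missing tenants fall back to the full fanout (an assumed policy) so
    that no data is silently skipped.
    """
    if tenant is None:
        return set(all_clusters)
    return set(tenant_clusters.get(tenant, all_clusters))
```

Under these assumptions, a query for the `payments` tenant touches one cluster instead of three, which is where the saving in wasted work would come from.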
Trade-offs¶
- Federated-query cost: even with fanout optimization, cross-cluster queries remain more expensive than single-cluster queries; Airbnb reports 5–10× cost amplification. See concepts/cross-cluster-federated-query-cost.
- Proxy as shared failure domain: if Promxy goes down, all downstream consumers lose access to all clusters, even though each underlying cluster is healthy. Typical mitigation: run multiple Promxy replicas behind a load balancer so that no single proxy instance is a point of failure.
- Partial-result ambiguity: a query that returned data from 9/10 backends looks superficially similar to one that returned from 10/10; dashboards need to surface the difference or users may silently make wrong decisions.
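One way a client could tell partial results from complete ones is to inspect the response body. The sketch below relies on the Prometheus HTTP API response envelope, which carries a `status` field and an optional `warnings` array; it assumes the proxy reports skipped backends through `warnings`, which should be verified against the deployed version. The function name `classify_response` is hypothetical.

```python
import json


def classify_response(body):
    """Classify a query response body as 'complete', 'partial', or 'failed'.

    Assumes a Prometheus-style JSON envelope, and assumes that a non-empty
    `warnings` list means at least one backend did not contribute.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        return "failed"
    return "partial" if payload.get("warnings") else "complete"
```

A dashboard or alerting client could use the `partial` classification to show a banner or suppress a decision rather than treating the data as authoritative.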
Seen in¶
- sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — Airbnb uses Promxy as the federation layer across their multi-cluster Prometheus-compatible storage fleet, with custom native-histogram support and query-fanout optimization to tame the 5–10× cross-cluster query-cost tax.