
CONCEPT Cited by 1 source

Cross-cluster federated query cost

The observation

When a multi-tenant storage system is split into multiple failure-domain clusters for blast-radius reasons (see concepts/active-multi-cluster-blast-radius), queries that span clusters (federated queries) are materially more expensive than queries that stay within a single cluster. Airbnb's observability team measured this concretely:

"Federated queries are significantly more resource-intensive, typically 5–10× costlier than queries within a single cluster." (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)

The consequence is operational, not just a matter of performance: a small number of expensive federated queries can cause read reliability issues across multiple clusters simultaneously. The very clusters that were separated for blast-radius reasons become coupled again through the federation proxy's query load.

Why federated queries are costlier

  • Fanout coordination: the federation proxy has to dispatch one sub-query per cluster, wait for the slowest responder, and aggregate. Tail latency dominates.
  • More series scanned: a cross-cluster query like "join application metrics with host metrics" has to scan series in every cluster that could conceivably hold either side.
  • No pushdown guarantees: aggregation steps that would collapse cardinality locally (sum by (region)) may not be pushed down to each cluster's query engine, so the proxy receives unaggregated series and has to do the aggregation itself.
  • Network transfer: the proxy receives the full series payload from each cluster, not the aggregate.
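The pushdown and network-transfer points can be illustrated with a toy model. The sketch below is not Promxy code: the cluster names, series layout, and both query functions are hypothetical, and "transfer" is simply counted as the number of series each cluster returns to the proxy.

```python
from collections import defaultdict

# Hypothetical data: each cluster maps (region, instance) -> latest sample value.
cluster_a = {("us-east", f"host-{i}"): 1.0 for i in range(1000)}
cluster_b = {("us-east", f"host-{i}"): 2.0 for i in range(1000, 1500)}
cluster_b.update({("eu-west", f"host-{i}"): 1.0 for i in range(500)})
CLUSTERS = {"cluster-a": cluster_a, "cluster-b": cluster_b}


def sum_by_region(series):
    """Equivalent of `sum by (region)` over one set of series."""
    totals = defaultdict(float)
    for (region, _instance), value in series.items():
        totals[region] += value
    return dict(totals)


def query_without_pushdown(clusters):
    """Proxy pulls every raw series, then aggregates by itself."""
    transferred = 0
    merged = {}
    for series in clusters.values():
        transferred += len(series)  # full payload crosses the network
        merged.update(series)
    return sum_by_region(merged), transferred


def query_with_pushdown(clusters):
    """Each cluster aggregates locally; proxy only merges partial sums."""
    transferred = 0
    totals = defaultdict(float)
    for series in clusters.values():
        partial = sum_by_region(series)  # runs inside the cluster
        transferred += len(partial)      # only aggregates cross the network
        for region, value in partial.items():
            totals[region] += value
    return dict(totals), transferred


raw_result, raw_series = query_without_pushdown(CLUSTERS)
pushed_result, pushed_series = query_with_pushdown(CLUSTERS)
print(raw_series, pushed_series)  # 2000 series vs. 3 series for the same answer
```

This only works so cleanly because `sum` can be merged from partial sums. Aggregations such as averages need extra state (sum and count), and quantiles cannot in general be recombined exactly from per-cluster quantiles, which is one reason a proxy may not push them down.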

Design consequences

Airbnb adjusted their tenant-consolidation strategy because of this cost. The initial premise ("shard tenants across clusters for blast-radius reasons") had to be balanced against read locality: tenants whose hot queries join data owned by multiple clusters needed either to be co-located in the same cluster or to have their queries rewritten so they no longer span clusters.

Key heuristics that emerge:

  1. Identify hot cross-cluster query patterns early — they are load-bearing for tenant-placement decisions.
  2. Don't optimise only for write-path isolation; read-path cost must inform the clusterization strategy.
  3. Cap cross-cluster concurrency in the federation proxy — without guardrails a handful of expensive queries can saturate the fanout layer and take down reads across the fleet.
  4. Invest in federation-proxy query optimisations (histogram support, pushdown, fanout limits) — Airbnb built custom additions to Promxy specifically to tame cross-cluster cost.
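Heuristic 3 can be sketched as a concurrency cap in front of the fanout step. This is a minimal illustration using `asyncio.Semaphore`, not a description of how Airbnb or Promxy implement it; `FanoutLimiter`, `fake_cluster_query`, and the limit of 2 are all hypothetical.

```python
import asyncio


class FanoutLimiter:
    """Caps how many federated queries may fan out at once (hypothetical helper)."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0  # highest number of federated queries observed in flight

    async def run(self, query_fn, clusters):
        # Queries beyond the cap wait here instead of hitting every cluster.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                # Fan out one sub-query per cluster and wait for all of them.
                return await asyncio.gather(*(query_fn(c) for c in clusters))
            finally:
                self.in_flight -= 1


async def fake_cluster_query(cluster: str) -> str:
    """Stand-in for a real per-cluster sub-query."""
    await asyncio.sleep(0.01)
    return f"{cluster}:ok"


async def main():
    limiter = FanoutLimiter(max_concurrent=2)  # assumed limit, for illustration
    clusters = ["cluster-a", "cluster-b", "cluster-c"]
    # Ten federated queries arrive at once; at most two fan out concurrently.
    results = await asyncio.gather(
        *(limiter.run(fake_cluster_query, clusters) for _ in range(10))
    )
    return limiter.peak, results


peak, results = asyncio.run(main())
print(peak, len(results))
```

This sketch queues excess queries rather than rejecting them. A production proxy might prefer to shed load or apply per-tenant limits, since an unbounded queue only moves the pressure elsewhere.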

Contrast with single-cluster queries

A pure single-cluster query benefits from:

  • Co-located series: the scan happens locally, no network fanout.
  • Local aggregation: query-sharding and chunk-prefetching within the cluster's query engine.
  • Predictable guardrails: per-tenant read quotas apply cleanly without federation-proxy-layer approximation.

Most queries on Airbnb's metrics system are intra-cluster; the cross-cluster 5–10× tax only applies to the minority of queries that legitimately need multiple clusters' data. But because a small number of those queries can dominate global load, they warrant dedicated attention.
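A back-of-envelope model shows how a minority of queries can account for a large share of load. Only the 5–10× multiplier comes from the source; the 10% federated fraction below is an invented figure for illustration, and the model deliberately assumes uniform cost within each query class.

```python
def federated_load_share(federated_fraction: float, cost_multiplier: float) -> float:
    """Share of total query cost attributable to federated queries.

    Assumes every single-cluster query costs 1 unit and every federated
    query costs `cost_multiplier` units (a deliberate simplification).
    """
    federated_cost = federated_fraction * cost_multiplier
    local_cost = 1.0 - federated_fraction
    return federated_cost / (federated_cost + local_cost)


# ASSUMPTION: 10% of queries are federated (illustrative, not from the source).
for multiplier in (5, 10):
    share = federated_load_share(0.10, multiplier)
    print(f"{multiplier}x multiplier -> {share:.0%} of total query cost")
```

Under these assumptions, one query in ten being federated already accounts for roughly a third to a half of total query cost. Real workloads are more skewed than this, since individual federated queries vary widely in cost.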

Caveats

  • The 5–10× number is specific to Airbnb's Promxy-based Prometheus federation and to their query-payload distribution. Other systems (Thanos, Mimir, Cortex) should be expected to show different multipliers.
  • The multiplier is a function of query shape (simple lookup vs. big join), cluster count, and cross-cluster network characteristics.
  • Queries that could be pushed down as aggregates may close the gap significantly.

Seen in

  • sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — Airbnb quantifies federated-query cost at 5–10× single-cluster cost in their Promxy-fronted multi-cluster metrics storage, and calls out that a handful of expensive federated queries caused read reliability issues across multiple clusters — forcing them to adjust tenant-consolidation strategy around hot read patterns.