# Airbnb — Building a fault-tolerant metrics storage system at Airbnb
## Summary
Airbnb's Observability team moved their metrics stack from a hosted vendor to an internally operated system that ingests 50M samples/sec and persists 2.5 PB of logical time-series data across 1.3B active time series. Author Rishabh Kumar walks through four design challenges and their solutions: (1) multi-tenancy (rejected tenant-per-team; chose tenant-per-application for ~1,000 services; shuffle sharding for per-tenant read/write isolation; a consolidated control plane for auto-onboarding), (2) single-cluster reliability (per-replica limits, query sharding, compaction sharding at 8M series/worker, zone-aware stateful components across three zones), (3) multi-cluster federation (dedicated clusters for specialised workloads, progressive rollout from test → internal → application → infrastructure clusters, custom Promxy extensions for cross-cluster querying), and (4) operational philosophy (Grafana Kubernetes rollout operators replace multi-day manual deploys; clusters as cattle, not pets; cross-cluster queries are 5–10× costlier than single-cluster and must be actively managed).
## Key takeaways
- Tenant-per-application beats tenant-per-team because team ownership of apps changes frequently, while services are long-lived and attribute metric growth precisely. ~1,000 services = ~1,000 tenants. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)
- Shuffle sharding isolates both reads and writes: each tenant hashes to a subset of ingesters and a subset of query workers. A DDoS or bad query from tenant A can knock out A's shuffle set without impacting tenants D & E on disjoint shards. Visual: 4 shards, 5 tenants, attacker-affected shards ∩ victim shards = the blast radius. See concepts/shuffle-sharding.
- Expose few guardrails, derive the rest. Airbnb exposes per-tenant series limits and derives ingestion rate + burst size from them. Fewer knobs = less operator confusion when a tenant hits a limit. See patterns/tenant-per-application.
- Query payload variability is the hard thing to benchmark. Write components benchmark cleanly against samples-per-second. Read components don't — payload size (series count / chunks / MB) dominates cost. Guardrails: max 5,000 series / max 500 MB per query are the order-of-magnitude limits beyond which Airbnb saw degradation.
- Separate evaluation query path from ad-hoc query path. Alert rule evaluation is critical-path; dashboard / CI queries are bursty and less critical. Splitting them prevents a dashboard user from degrading alerting.
- Compaction at scale is workload-sharded. For very large tenants, compaction workers process up to 8M series each, ensuring data being read is always compacted (vs. being partway through compaction, which degrades query performance).
- Zone-aware stateful components across three zones is the baseline for fault tolerance — survives zonal outages, survives node rotation during deploys.
- Multi-cluster architecture for blast radius. Dedicated clusters per specialised workload (compute, mesh) + multiple application clusters. Failure in one doesn't cascade. See concepts/active-multi-cluster-blast-radius.
- Progressive cluster rollout by criticality: test → internal → application → infrastructure. The last tier carries the highest-criticality workloads where a regression would cause "flying blind" — they roll out last, with the longest soak. This is the mechanism that hits the >99.9% availability target. See patterns/progressive-cluster-rollout.
- Cross-cluster (federated) queries are 5–10× more expensive than queries within a single cluster. A handful of expensive federated queries was enough to cause read reliability issues across multiple clusters. Forced adjustments to tenant-consolidation strategy around hot read patterns. See concepts/cross-cluster-federated-query-cost.
- Deployment consistency is the long-term cost of multi-cluster. Manual stateful-app deploys were too slow (days) and caused config drift. Grafana's Kubernetes OSS rollout operators, customised to respect Airbnb's pod-disruption-budget requirements, replaced the manual process. Seamless deploys also reduce configuration drift across clusters.
- Custom Promxy for cross-cluster querying: Airbnb added native histogram support and query fanout optimization on top of the OSS systems/promxy. Upstream Promxy didn't handle native histograms at the time, breaking p99/p95 correctness across clusters; the fanout optimization narrows dispatch to relevant cluster(s) rather than broadcasting to every backend.
- Clusters as cattle, not pets. Single-cluster scaling and stabilisation work made clusters self-tune, enabling new clusters to be added or replaced with minimal operational overhead — the operational north-star philosophy for the multi-cluster era.
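The shuffle-sharding assignment described above can be sketched as follows. The shard count, subset size, and tenant names are illustrative only (the post shows a 4-shard, 5-tenant diagram and does not disclose Airbnb's actual parameters):

```python
import hashlib
from itertools import combinations


def shuffle_shard(tenant: str, num_shards: int, k: int) -> frozenset:
    """Deterministically map a tenant to k of num_shards shards.

    Hash the tenant name to pick one of the C(num_shards, k) possible
    shard subsets, giving each tenant a stable, pseudo-random shuffle set.
    """
    subsets = list(combinations(range(num_shards), k))
    digest = hashlib.sha256(tenant.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(subsets)
    return frozenset(subsets[index])


# Illustrative: 4 shards, 2-shard shuffle sets for 5 tenants A..E.
shards = {t: shuffle_shard(t, num_shards=4, k=2) for t in "ABCDE"}

# A noisy tenant A only impacts tenants whose shuffle sets overlap A's;
# tenants on disjoint shard subsets are outside the blast radius.
blast_radius = {t for t in shards if shards[t] & shards["A"]}
```

With small shard counts, overlaps are common; the isolation guarantee strengthens as the number of shards grows relative to k, which is why the note warns the math only works well with enough tenants and shards.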
## Systems / concepts / patterns extracted
- Systems: systems/airbnb-metrics-storage, systems/airbnb-observability-platform, systems/prometheus, systems/promxy, systems/vmagent, systems/grafana.
- Concepts: concepts/shuffle-sharding, concepts/active-multi-cluster-blast-radius, concepts/cross-cluster-federated-query-cost, concepts/blast-radius, concepts/tenant-isolation, concepts/performance-isolation, concepts/observability.
- Patterns: patterns/tenant-per-application, patterns/progressive-cluster-rollout, patterns/workload-segregated-clusters.
## Operational numbers
- 50M samples/sec ingestion (peak).
- 1.3B active time series on average.
- 2.5 PB logical time-series data stored.
- ~1,000 services = ~1,000 tenants.
- 10,000 dashboards supported.
- 500,000 alerts in flight.
- p99 query execution time < 30 s — SLO target.
- >99.9% availability — achieved with multi-cluster federation.
- >5,000 series / >500 MB per query — the payload size at which query performance degrades noticeably (numbers will vary per querying system).
- 8M series per compaction worker — sharding unit for large-tenant compaction.
- 3 zones — zonal deployment baseline for stateful components.
- 5–10× cost amplification on cross-cluster queries vs. single-cluster.
- Days → seamless — deployment duration for stateful apps, before → after Grafana K8s rollout operators + Airbnb pod-disruption-budget compatibility work.
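The "expose few guardrails, derive the rest" idea plus the read-payload limits above can be sketched as a tiny limits module. The post says ingestion rate and burst size are derived from the per-tenant series limit but gives no formulas, so the 1-sample/sec/series and 10-second-burst factors below are hypothetical assumptions; only the 5,000-series / 500 MB query guardrails come from the post:

```python
from dataclasses import dataclass


@dataclass
class TenantLimits:
    """Per-tenant write guardrails; only max_series is operator-set."""
    max_series: int

    @property
    def max_samples_per_sec(self) -> int:
        # Hypothetical rule: assume each active series reports
        # at most ~1 sample/sec on average.
        return self.max_series

    @property
    def burst_size(self) -> int:
        # Hypothetical rule: tolerate ~10 seconds of sustained rate.
        return 10 * self.max_samples_per_sec


def query_within_guardrails(series_count: int, payload_mb: float,
                            max_series: int = 5_000,
                            max_mb: float = 500.0) -> bool:
    """Order-of-magnitude read limits beyond which degradation was seen."""
    return series_count <= max_series and payload_mb <= max_mb
```

Deriving the write limits keeps one operator-facing knob per tenant, which is the "fewer knobs = less confusion" point from the takeaways.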
## Caveats
- The post is a high-level survey of the storage system's design — it doesn't name the actual TSDB implementation (VictoriaMetrics? Mimir? Cortex? A custom fork?). We can infer from context (vmagent + Promxy + PromQL + "treated as cattle" ops philosophy) that it's VictoriaMetrics / VM-compatible, but this is not stated in the post.
- No sample architecture diagram included in the extracted markdown (the raw file has "Press enter or click to view image in full size" placeholders). Key diagrams referenced: the shuffle-sharding tenant-impact diagram, the progressive-cluster-rollout diagram.
- "5–10× more expensive" for federated queries is unitless — the post doesn't specify CPU-seconds vs. wall-clock vs. bytes-scanned. The multiplier is a rough order-of-magnitude claim, not a measured distribution.
- No failure-mode details on the Promxy custom extensions — how the query-fanout optimization picks which clusters to fan out to, how tenant-cluster mapping is kept in sync with the routing layer, how partial-result errors surface.
- No cost numbers — the post doesn't disclose cost-per-sample or the overall footprint dollar value. Scale numbers only.
- No vendor-to-Airbnb migration timeline in this post — referenced in a separate 2026-03-17 migration retrospective (sources/2026-03-17-airbnb-observability-ownership-migration).
- Per-tenant shuffle-set sizing (K value) not disclosed — only the diagram (4 shards, 5 tenants) as an illustrative example.
- The ~1,000 services number is a snapshot; service count trends up with product growth and will change. Shuffle-sharding math only works well above ~100 tenants.
- Autoscaling specifics (triggers, scale-up/down SLAs) not disclosed for the read tier.
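On the undisclosed fanout mechanism: one plausible minimal shape for "narrow dispatch to relevant clusters, broadcast only when routing is unknown" is sketched below. The routing table, tenant names, and fallback behaviour are entirely hypothetical — the post does not describe how Airbnb's Promxy extension maintains or consults its tenant-cluster mapping:

```python
# Hypothetical tenant → cluster routing table and cluster names;
# none of these identifiers come from the post.
TENANT_CLUSTERS: dict[str, set[str]] = {
    "search-service": {"app-cluster-1"},
    "mesh-proxy": {"mesh-cluster"},
    "scheduler": {"compute-cluster", "app-cluster-2"},
}
ALL_CLUSTERS: set[str] = {"app-cluster-1", "app-cluster-2",
                          "mesh-cluster", "compute-cluster",
                          "infra-cluster"}


def clusters_for_query(tenants: set[str]) -> set[str]:
    """Narrow fanout to clusters that hold the queried tenants' series.

    Falls back to a full broadcast when any tenant's location is
    unknown, trading query cost for completeness.
    """
    targets: set[str] = set()
    for tenant in tenants:
        known = TENANT_CLUSTERS.get(tenant)
        if known is None:
            return set(ALL_CLUSTERS)  # unknown tenant: broadcast
        targets |= known
    return targets
```

A real implementation would also have to handle the open questions listed above: keeping the mapping in sync with tenant placement and surfacing partial-result errors from individual backends.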
## Source
- Original: https://medium.com/airbnb-engineering/building-a-fault-tolerant-metrics-storage-system-at-airbnb-26a01a6e7017?source=rss----53c7c27702d5---4
- Raw markdown:
raw/airbnb/2026-04-21-building-a-fault-tolerant-metrics-storage-system-at-airbnb-4a3fea55.md
## Related
- systems/airbnb-metrics-storage
- systems/airbnb-observability-platform
- systems/prometheus
- systems/promxy
- concepts/shuffle-sharding
- concepts/active-multi-cluster-blast-radius
- concepts/cross-cluster-federated-query-cost
- concepts/blast-radius
- concepts/tenant-isolation
- patterns/tenant-per-application
- patterns/progressive-cluster-rollout
- patterns/workload-segregated-clusters
- companies/airbnb