Airbnb fault-tolerant metrics storage¶
The storage-plane subsystem inside systems/airbnb-observability-platform: the multi-cluster, multi-tenant time-series storage fleet that persists and serves Airbnb's metrics. Airbnb moved this workload from a hosted metrics provider to an internally operated system. This is the system described in the 2026-04-21 post Building a fault-tolerant metrics storage system at Airbnb by Rishabh Kumar.
Scale¶
- ~1,000 services → ~1,000 tenants (one-to-one; see patterns/tenant-per-application).
- 50M samples/sec peak ingestion.
- 1.3B active time series on average.
- 2.5 PB of logical time-series data stored.
- SLO target: p99 query execution time < 30 s with 10,000 dashboards + 500,000 alerts in flight.
- Availability: >99.9% with multi-cluster federation.
Architecture — four load-bearing ideas¶
1. Shuffle sharding for tenant isolation¶
Every tenant (application / service) is mapped to a subset of storage nodes on both the write and read paths:
- Writes: ingesters hash the tenant → K-node shuffle set. A tenant's DDoS / ingestion burst can saturate at most its K-node shuffle set, not the whole fleet.
- Reads: Grafana queries hash the tenant → K-node subset of query workers. A runaway query (e.g., >5,000 series or 500 MB+ payload) impacts at most K workers.
- Diagram from the post: four shards, five tenants A–E. DDoS from tenant A can take down shards 1 & 2 (affecting overlapping tenants B & C) but tenants D & E remain healthy on shards 3 & 4.
See concepts/shuffle-sharding for the general primitive.
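The tenant → K-node mapping above can be sketched as follows. This is a minimal illustration of the shuffle-sharding primitive, not Airbnb's implementation: node names, the hash scheme, and the shard size are all assumptions.

```python
import hashlib

def shuffle_shard(tenant: str, nodes: list[str], k: int) -> list[str]:
    """Deterministically pick a k-node subset for a tenant.

    Rank every node by a hash of (tenant, node) and take the first k,
    so each tenant gets a stable pseudo-random subset and two tenants
    rarely share their entire shard. A burst from one tenant can then
    saturate at most its k nodes, not the whole fleet.
    """
    def rank(node: str) -> int:
        digest = hashlib.sha256(f"{tenant}:{node}".encode()).hexdigest()
        return int(digest, 16)

    return sorted(nodes, key=rank)[:k]

fleet = [f"node-{i}" for i in range(8)]  # hypothetical 8-node fleet
print(shuffle_shard("tenant-A", fleet, 2))  # stable 2-node subset
print(shuffle_shard("tenant-B", fleet, 2))  # usually a different subset
```

The same function can serve both paths: ingesters use it to place writes, and the query frontend uses it to pick query workers.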
2. Tenant-per-application with per-tenant guardrails¶
Alternatives rejected:
- Tenant-per-team: team ownership changes too often, so the grouping is unstable.
Chosen: tenant-per-application. With ~1,000 services, this is a logical and stable grouping. Benefits:
- Precise attribution of metric growth to specific applications.
- Foundation for future chargeback.
- Meaningful per-tenant guardrails.
Guardrails exposed vs. derived:
- Exposed: series limit.
- Derived from the series limit: ingestion rate, ingestion burst size.
- Other per-tenant knobs: rule count, evaluation interval / offset, max series per query, max chunks per query.
Automated control plane: creating a new service auto-enrolls it as a tenant; config changes apply via a single deployment rather than component-by-component manual editing.
See patterns/tenant-per-application.
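The exposed-vs.-derived split can be sketched like this: the series limit is the one knob a tenant sees, and the write-path limits fall out of it. The derivation constants (scrape interval, burst factor) are illustrative assumptions, not Airbnb's actual values.

```python
from dataclasses import dataclass

@dataclass
class TenantLimits:
    max_series: int      # the one exposed knob
    ingestion_rate: int  # samples/sec, derived
    ingestion_burst: int # samples, derived

def derive_limits(max_series: int,
                  scrape_interval_s: int = 30,
                  burst_factor: float = 2.0) -> TenantLimits:
    """Derive write-path limits from the single exposed series limit.

    Assumes roughly one sample per active series per scrape interval;
    burst headroom is a fixed multiple of the steady-state rate.
    """
    rate = max_series // scrape_interval_s
    return TenantLimits(max_series, rate, int(rate * burst_factor))

print(derive_limits(1_200_000))  # rate and burst follow from max_series
```

Keeping one exposed knob keeps the control plane simple: auto-enrolling a new tenant only requires choosing its series limit.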
3. Single-cluster reliability, then multi-cluster federation¶
Phase 1 — run a single cluster reliably.
- Writes stabilisation: benchmark components, set per-replica limits based on ingest rate + inflight requests, set write guardrails starting with max time series per tenant (one-month lookback period, continuously adjusted).
- Reads stabilisation: query sharding to normalise load on query workers; per-tenant read guardrails (fetched series per query, fetched chunks per query); separate evaluation query path from ad-hoc query path (dashboards, CI jobs) because the criticality is different; autoscaling the read tier for intraday throughput variation.
- Compaction stabilisation: for large tenants, compaction workloads sharded such that each worker processes up to 8M series, ensuring reads always see compacted data.
- Zone-awareness: both write and read stateful components deployed across three zones, so a zonal outage or node rotation is survivable.
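The compaction-sharding step above can be sketched as a simple capacity split: the 8M-series-per-worker cap is from the post, but the even range split is an illustrative placement, not the real assignment scheme.

```python
import math

def compaction_shards(active_series: int,
                      per_worker_cap: int = 8_000_000) -> list[tuple[int, int]]:
    """Split a large tenant's compaction workload so that no worker
    processes more than per_worker_cap series (8M in the post).

    Returns [start, end) series-index ranges, one per worker.
    """
    n_workers = max(1, math.ceil(active_series / per_worker_cap))
    step = math.ceil(active_series / n_workers)
    return [(i * step, min((i + 1) * step, active_series))
            for i in range(n_workers)]

# A hypothetical 20M-series tenant needs 3 workers, each under the cap.
print(compaction_shards(20_000_000))
```

Bounding per-worker load this way keeps compaction finishing on time, which is what guarantees reads always see compacted data.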
Phase 2 — move to multi-cluster for blast-radius reasons.
Concerns that drove the shift:
- Bad queries or bad deploys can degrade an entire cluster, creating a "flying blind" risk for everyone on that cluster.
- Desire for regional flexibility.
Clusterization strategy:
- Dedicated clusters for specialized workloads (compute and mesh infrastructure).
- Multiple application clusters for general workloads.
- Failure in one cluster does not cascade to others.
Rollout strategy: progressive cluster rollout — test clusters → internal clusters → application clusters → infrastructure clusters — sequenced by criticality, not size. This is what achieves >99.9% availability in practice. See patterns/progressive-cluster-rollout.
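The criticality-ordered sequencing can be sketched as an ordered plan; the tier ordering matches the post, while the cluster names are hypothetical.

```python
# Rollout tiers ordered by criticality (least critical first), as in
# the post: test -> internal -> application -> infrastructure.
ROLLOUT_TIERS = [
    ("test", ["test-1"]),
    ("internal", ["internal-1"]),
    ("application", ["app-1", "app-2"]),
    ("infrastructure", ["infra-compute", "infra-mesh"]),
]

def rollout_plan(tiers=ROLLOUT_TIERS):
    """Yield (tier, cluster) in rollout order. A bad change surfaces
    on a low-criticality tier before it can reach the infrastructure
    clusters that everything else depends on."""
    for tier, clusters in tiers:
        for cluster in clusters:
            yield tier, cluster

for tier, cluster in rollout_plan():
    print(f"{tier}: deploy to {cluster}, verify health, then continue")
```

Sequencing by criticality rather than size is the key choice: a small infrastructure cluster still rolls last.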
4. Federation via Promxy with custom query-fanout¶
Cross-cluster querying is served by a Promxy-based federation proxy with Airbnb-specific additions:
- Native histogram support for correct p99 / p95 over histogram series that span clusters.
- Query fanout optimization to narrow fanout to only the relevant cluster(s), reducing wasted work.
See systems/promxy for the proxy; concepts/cross-cluster-federated-query-cost for the 5–10× cost amplification observation that forced adjustments in tenant-consolidation strategy.
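The fanout optimization can be sketched as a routing lookup: instead of broadcasting a query to every cluster, consult a tenant → cluster index and query only the matches. The index itself is an assumption about how such routing metadata might be kept; the post only describes the narrowing behaviour.

```python
def plan_fanout(tenant: str,
                tenant_to_clusters: dict[str, set[str]],
                all_clusters: set[str]) -> set[str]:
    """Return only the clusters that hold this tenant's series.

    Falls back to querying every cluster when the tenant is not in
    the index, which preserves correctness at the old (expensive)
    fanout cost.
    """
    return tenant_to_clusters.get(tenant, all_clusters) & all_clusters

clusters = {"app-1", "app-2", "infra-compute"}   # hypothetical fleet
index = {"checkout-service": {"app-1"}}          # hypothetical index
print(plan_fanout("checkout-service", index, clusters))  # {'app-1'}
```

With the 5–10× cross-cluster cost amplification, trimming even one unnecessary cluster from the fanout pays off quickly.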
Deployment¶
- Stateful apps: deployed via Grafana's Kubernetes OSS rollout operators, enabling coordinated multi-AZ rolls across StatefulSets within a namespace. Replaced a manual, sequential process that took days.
- Rollout operators made compatible with Airbnb's cloud infra, respecting pod disruption budget requirements while rollouts and node operations happen concurrently.
- Automation here is what enabled the "treat clusters as cattle, not pets" operational philosophy — new clusters can be added or replaced with minimal operational overhead.
Key design choices / lessons¶
- Per-replica limits provide actionable scaling signals for fleet management (vs. rate-based or tenant-aggregate limits which don't tell an on-call what to scale).
- Tenant-level controls shield the system from disruptive behaviour — both malicious (DDoS) and accidental (runaway query, cardinality explosion).
- Multi-zone stateful deployments enhance both fault tolerance and deployment agility (you can drain a zone for maintenance during a rollout).
- Cross-cluster queries are 5–10× more expensive than single-cluster queries; a handful of expensive federated queries were enough to cause read reliability issues across clusters. This forced adjustments to the tenant-consolidation strategy.
- Deployment consistency matters: manual stateful-app deploys were the single largest source of configuration drift; automation (Grafana rollout operators) was the fix.
- Cluster management philosophy: single-cluster scaling and stabilisation work enabled clusters to self-tune, which in turn enabled "clusters as cattle, not pets" — adding or replacing clusters with minimal additional operational overhead.
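The "actionable scaling signal" point can be made concrete with a toy heuristic: with per-replica limits, the on-call can count saturated replicas and scale out by that number, which a tenant-aggregate limit never tells you. All numbers and names below are illustrative.

```python
# Hypothetical ingester fleet with per-replica inflight-request counts.
FLEET = [
    {"name": "ingester-0", "inflight": 950},
    {"name": "ingester-1", "inflight": 1000},
    {"name": "ingester-2", "inflight": 400},
]
PER_REPLICA_INFLIGHT_LIMIT = 1000  # illustrative, set from benchmarks

def saturated_replicas(fleet, limit, headroom=0.8):
    """Return replicas running above `headroom` of their per-replica
    limit. The length of this list is a direct scale-out signal;
    a fleet-wide rate limit would hide which replicas are hot."""
    return [r["name"] for r in fleet if r["inflight"] >= headroom * limit]

print(saturated_replicas(FLEET, PER_REPLICA_INFLIGHT_LIMIT))
```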
Relationship to the rest of the wiki¶
- Parent system: systems/airbnb-observability-platform is the umbrella platform (instrumentation → collection → storage → visualization → alerting). This page covers the storage plane specifically, the multi-cluster, multi-tenant fleet that persists and serves metrics; airbnb-observability-platform covers the instrumentation (OTLP, StatsD) + aggregation (systems/vmagent) + alerting framework + Change Reports that wrap around it.
- Storage-compatible with systems/prometheus — PromQL is the query language exposed to dashboards and alerts.
- Federation via systems/promxy — the single-logical-Prometheus view for applications.
Seen in¶
- sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — the canonical post introducing the fault-tolerant storage system. Covers tenancy (tenant-per-application with shuffle sharding + consolidated control plane), single-cluster reliability (per-replica guardrails, query sharding, compaction sharding, zone-awareness), multi-cluster architecture (clusterization, progressive rollout, federation via Promxy with custom extensions), and key learnings (5–10× federated-query cost, deployment consistency, clusters-as-cattle).
Related¶
- systems/airbnb-observability-platform
- systems/prometheus
- systems/promxy
- systems/vmagent
- systems/grafana
- concepts/shuffle-sharding
- concepts/active-multi-cluster-blast-radius
- concepts/cross-cluster-federated-query-cost
- concepts/blast-radius
- concepts/tenant-isolation
- patterns/tenant-per-application
- patterns/progressive-cluster-rollout
- patterns/workload-segregated-clusters