
Airbnb fault-tolerant metrics storage

The storage-plane subsystem inside systems/airbnb-observability-platform — the multi-cluster, multi-tenant time-series storage fleet that persists and serves Airbnb's metrics. Airbnb moved from a hosted metrics provider to this internally operated system. This is the system described in the 2026-04-21 post Building a fault-tolerant metrics storage system at Airbnb by Rishabh Kumar.

Scale

  • ~1,000 services → ~1,000 tenants (one-to-one; see patterns/tenant-per-application).
  • 50M samples/sec peak ingestion.
  • 1.3B active time series on average.
  • 2.5 PB of logical time-series data stored.
  • SLO target: p99 query execution time < 30 s with 10,000 dashboards + 500,000 alerts in flight.
  • Availability: >99.9% with multi-cluster federation.

Architecture — four load-bearing ideas

1. Shuffle sharding for tenant isolation

Every tenant (application / service) is mapped to a subset of storage nodes on both the write and read paths:

  • Writes: ingesters hash the tenant → K-node shuffle set. A tenant's DDoS / ingestion burst can saturate at most its K-node shuffle set, not the whole fleet.
  • Reads: Grafana queries hash the tenant → K-node subset of query workers. A runaway query (e.g., >5,000 series or 500 MB+ payload) impacts at most K workers.
  • Diagram from the post: four shards, five tenants A–E. DDoS from tenant A can take down shards 1 & 2 (affecting overlapping tenants B & C) but tenants D & E remain healthy on shards 3 & 4.

See concepts/shuffle-sharding for the general primitive.
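The tenant → K-node mapping above can be sketched with a deterministic per-tenant ranking of nodes (a common way to implement shuffle sharding; the hash choice and node names here are illustrative, not Airbnb's actual implementation):

```python
import hashlib

def shuffle_shard(tenant: str, nodes: list[str], k: int) -> list[str]:
    """Pick a deterministic k-node subset for a tenant by ranking every
    node on a per-tenant hash. Different tenants get mostly disjoint
    subsets, so one tenant's burst saturates at most its own k nodes."""
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{tenant}|{n}".encode()).hexdigest(),
    )
    return ranked[:k]

# Illustrative fleet: 16 nodes, shuffle sets of size 4.
nodes = [f"node-{i}" for i in range(16)]
shard_a = shuffle_shard("tenant-a", nodes, k=4)
shard_b = shuffle_shard("tenant-b", nodes, k=4)
```

The same function serves both paths: ingesters use it to pick write replicas, and the query frontend uses it to pick query workers, so isolation holds on reads and writes alike.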

2. Tenant-per-application with per-tenant guardrails

Alternatives rejected:

  • Tenant-per-team — team ownership changes too often; unstable.

Chosen: tenant-per-application. With ~1,000 services, this is a logical and stable grouping. Benefits:

  • Precise attribution of metric growth to specific applications.
  • Foundation for future chargeback.
  • Meaningful per-tenant guardrails.

Guardrails exposed vs. derived:

  • Exposed: series limit.
  • Derived from the series limit: ingestion rate, ingestion burst size.
  • Other per-tenant knobs: rule count, evaluation interval / offset, max series per query, max chunks per query.
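The exposed-vs.-derived split can be sketched as a small derivation function. The scrape interval and burst factor below are assumptions for illustration; the post does not state Airbnb's actual constants, only that ingestion rate and burst size are derived from the one exposed knob:

```python
def derive_guardrails(series_limit: int,
                      samples_per_series_per_sec: float = 1 / 15,  # assumed 15 s scrape interval
                      burst_factor: float = 2.0) -> dict:          # assumed burst headroom
    """Derive per-tenant ingestion limits from the single exposed knob
    (the series limit), so operators tune one number per tenant."""
    rate = series_limit * samples_per_series_per_sec
    return {
        "max_series": series_limit,
        "ingestion_rate_samples_per_sec": rate,
        "ingestion_burst_size": int(rate * burst_factor),
    }
```

Keeping only one knob exposed means the control plane can recompute the derived limits whenever a tenant's series limit changes, rather than operators editing three numbers that can drift out of sync.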

Automated control plane: new service creation auto-enrolls as a tenant; config changes apply via single deployment rather than component-by-component manual editing.

See patterns/tenant-per-application.

3. Single-cluster reliability, then multi-cluster federation

Phase 1 — run a single cluster reliably.

  • Writes stabilisation: benchmark components, set per-replica limits based on ingest rate + inflight requests, set write guardrails starting with max time series per tenant (one-month lookback period, continuously adjusted).
  • Reads stabilisation: query sharding to normalise load on query workers; per-tenant read guardrails (fetched series per query, fetched chunks per query); separate evaluation query path from ad-hoc query path (dashboards, CI jobs) because the criticality is different; autoscaling the read tier for intraday throughput variation.
  • Compaction stabilisation: for large tenants, compaction workloads sharded such that each worker processes up to 8M series, ensuring reads always see compacted data.
  • Zone-awareness: both write and read stateful components deployed across three zones, so a zonal outage or node rotation is survivable.
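The compaction-sharding cap above reduces to simple arithmetic: a large tenant's series count determines how many compaction workers it needs. A minimal sketch (worker count only; the actual shard assignment is not described in the post):

```python
import math

def compaction_workers(active_series: int,
                       series_per_worker: int = 8_000_000) -> int:
    """Number of compaction workers for a tenant, with each worker
    capped at ~8M series as described in the post."""
    return max(1, math.ceil(active_series / series_per_worker))

# A tenant at the fleet-wide average of 1.3B active series would need
# compaction sharded across ~163 workers at the 8M-series cap.
workers = compaction_workers(1_300_000_000)
```

Capping per-worker load this way keeps compaction from falling behind on the largest tenants, which is what guarantees reads always see compacted data.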

Phase 2 — move to multi-cluster for blast-radius reasons.

Concerns that drove the shift:

  • Bad queries / bad deploys can degrade an entire cluster → "flying blind" risk for everyone on that cluster.
  • Desire for regional flexibility.

Clusterization strategy:

  • Dedicated clusters for specialized workloads (compute and mesh infrastructure).
  • Multiple application clusters for general workloads.
  • Failure in one cluster does not cascade to others.

Rollout strategy: progressive cluster rollout — test clusters → internal clusters → application clusters → infrastructure clusters — sequenced by criticality, not size. This is what achieves >99.9% availability in practice. See patterns/progressive-cluster-rollout.

4. Federation via Promxy with custom query-fanout

Cross-cluster querying is served by a Promxy-based federation proxy with Airbnb-specific additions:

  • Native histogram support for correct p99 / p95 over histogram series that span clusters.
  • Query-fanout optimization to narrow fanout to only the relevant cluster(s), reducing wasted work.

See systems/promxy for the proxy; concepts/cross-cluster-federated-query-cost for the 5–10× cost amplification observation that forced adjustments in tenant-consolidation strategy.
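The fanout-narrowing idea can be sketched as a tenant → cluster placement lookup consulted before fanning out. The map and names below are hypothetical; the post describes the optimization but not its implementation:

```python
# Hypothetical placement data: which cluster(s) hold each tenant's series.
TENANT_CLUSTERS: dict[str, set[str]] = {
    "checkout-service": {"app-cluster-1"},
    "mesh-proxy": {"mesh-cluster"},
}

def clusters_for_query(tenant: str, all_clusters: set[str]) -> set[str]:
    """Narrow a federated query to only the clusters that hold the
    tenant's data; fall back to full fanout when placement is unknown."""
    return TENANT_CLUSTERS.get(tenant, all_clusters)
```

Given the 5–10× cost amplification of cross-cluster queries noted below, narrowing most queries to a single cluster is what keeps the federation layer affordable.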

Deployment

  • Stateful apps: deployed via Grafana's Kubernetes OSS rollout operators, enabling coordinated multi-AZ rolls across StatefulSets within a namespace. Replaced a manual, sequential process that took days.
  • Rollout operators made compatible with Airbnb's cloud infra, respecting pod disruption budget requirements while rollouts and node operations happen concurrently.
  • Automation here is what enabled the "treat clusters as cattle, not pets" operational philosophy — new clusters can be added or replaced with minimal operational overhead.

Key design choices / lessons

  • Per-replica limits provide actionable scaling signals for fleet management (vs. rate-based or tenant-aggregate limits which don't tell an on-call what to scale).
  • Tenant-level controls shield the system from disruptive behaviour — both malicious (DDoS) and accidental (runaway query, cardinality explosion).
  • Multi-zone stateful deployments enhance both fault tolerance and deployment agility (you can drain a zone for maintenance during a rollout).
  • Cross-cluster queries are 5–10× more expensive than single-cluster queries — a handful of expensive federated queries were enough to cause read reliability issues across clusters. This forced adjustments to tenant-consolidation strategy.
  • Deployment consistency matters: manual stateful-app deploys were the single largest source of configuration drift; automation (Grafana rollout operators) was the fix.
  • Cluster management philosophy: single-cluster scaling and stabilisation work enabled clusters to self-tune, which in turn enabled "clusters as cattle, not pets" — adding or replacing clusters with minimal additional operational overhead.

Relationship to the rest of the wiki

  • Parent system: systems/airbnb-observability-platform is the umbrella platform (instrumentation → collection → storage → visualization → alerting). This page covers the storage plane specifically — the multi-cluster, multi-tenant fleet that persists and serves metrics; airbnb-observability-platform covers the instrumentation (OTLP, StatsD) + aggregation (systems/vmagent) + alerting framework + Change Reports that wrap around it.
  • Storage-compatible with systems/prometheus — PromQL is the query language exposed to dashboards and alerts.
  • Federation via systems/promxy — the single-logical-Prometheus view for applications.

Seen in

  • sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — the canonical post introducing the fault-tolerant storage system. Covers tenancy (tenant-per-application with shuffle sharding + consolidated control plane), single-cluster reliability (per-replica guardrails, query sharding, compaction sharding, zone-awareness), multi-cluster architecture (clusterization, progressive rollout, federation via Promxy with custom extensions), and key learnings (5–10× federated-query cost, deployment consistency, clusters-as-cattle).