
Airbnb fault-tolerant metrics storage

The storage-plane subsystem inside systems/airbnb-observability-platform — the multi-cluster, multi-tenant time-series storage fleet that persists and serves Airbnb's metrics. Airbnb moved from a hosted metrics provider to this internally operated system. This is the system described in the 2026-04-21 post Building a fault-tolerant metrics storage system at Airbnb by Rishabh Kumar.

Scale

  • ~1,000 services → ~1,000 tenants (one-to-one; see patterns/tenant-per-application).
  • 50M samples/sec peak ingestion.
  • 1.3B active time series on average.
  • 2.5 PB of logical time-series data stored.
  • SLO target: p99 query execution time < 30 s with 10,000 dashboards + 500,000 alerts in flight.
  • Availability: >99.9% with multi-cluster federation.

Architecture — four load-bearing ideas

1. Shuffle sharding for tenant isolation

Every tenant (application / service) is mapped to a subset of storage nodes on both the write and read paths:

  • Writes: ingesters hash the tenant → K-node shuffle set. A tenant's DDoS / ingestion burst can saturate at most its K-node shuffle set, not the whole fleet.
  • Reads: Grafana queries hash the tenant → K-node subset of query workers. A runaway query (e.g., >5,000 series or 500 MB+ payload) impacts at most K workers.
  • Diagram from the post: four shards, five tenants A–E. DDoS from tenant A can take down shards 1 & 2 (affecting overlapping tenants B & C) but tenants D & E remain healthy on shards 3 & 4.

See concepts/shuffle-sharding for the general primitive.
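The tenant → K-node mapping above can be sketched with a deterministic per-tenant ranking of nodes (a common way to implement shuffle sharding; the hash choice and node names here are illustrative, not Airbnb's actual implementation):

```python
import hashlib

def shuffle_shard(tenant: str, nodes: list[str], k: int) -> list[str]:
    """Pick a deterministic k-node subset for a tenant by ranking every
    node on a per-tenant hash. Different tenants get mostly disjoint
    subsets, so one tenant's burst saturates at most its own k nodes."""
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{tenant}|{n}".encode()).hexdigest(),
    )
    return ranked[:k]

# Illustrative fleet: 16 nodes, shuffle sets of size 4.
nodes = [f"node-{i}" for i in range(16)]
shard_a = shuffle_shard("tenant-a", nodes, k=4)
shard_b = shuffle_shard("tenant-b", nodes, k=4)
```

The same function serves both paths: ingesters use it to pick write replicas, and the query frontend uses it to pick query workers, so isolation holds on reads and writes alike.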

2. Tenant-per-application with per-tenant guardrails

Alternatives rejected:

  • Tenant-per-team — team ownership changes too often; unstable.

Chosen: tenant-per-application. With ~1,000 services, this is a logical and stable grouping. Benefits:

  • Precise attribution of metric growth to specific applications.
  • Foundation for future chargeback.
  • Meaningful per-tenant guardrails.

Guardrails exposed vs. derived:

  • Exposed: series limit.
  • Derived from the series limit: ingestion rate, ingestion burst size.
  • Other per-tenant knobs: rule count, evaluation interval / offset, max series per query, max chunks per query.
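The exposed-vs.-derived split can be sketched as a small derivation function. The scrape interval and burst factor below are assumptions for illustration; the post does not state Airbnb's actual constants, only that ingestion rate and burst size are derived from the one exposed knob:

```python
def derive_guardrails(series_limit: int,
                      samples_per_series_per_sec: float = 1 / 15,  # assumed 15 s scrape interval
                      burst_factor: float = 2.0) -> dict:          # assumed burst headroom
    """Derive per-tenant ingestion limits from the single exposed knob
    (the series limit), so operators tune one number per tenant."""
    rate = series_limit * samples_per_series_per_sec
    return {
        "max_series": series_limit,
        "ingestion_rate_samples_per_sec": rate,
        "ingestion_burst_size": int(rate * burst_factor),
    }
```

Keeping only one knob exposed means the control plane can recompute the derived limits whenever a tenant's series limit changes, rather than operators editing three numbers that can drift out of sync.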

Automated control plane: new service creation auto-enrolls as a tenant; config changes apply via single deployment rather than component-by-component manual editing.

See patterns/tenant-per-application.

3. Single-cluster reliability, then multi-cluster federation

Phase 1 — run a single cluster reliably.

  • Writes stabilisation: benchmark components, set per-replica limits based on ingest rate + inflight requests, set write guardrails starting with max time series per tenant (one-month lookback period, continuously adjusted).
  • Reads stabilisation: query sharding to normalise load on query workers; per-tenant read guardrails (fetched series per query, fetched chunks per query); separate evaluation query path from ad-hoc query path (dashboards, CI jobs) because the criticality is different; autoscaling the read tier for intraday throughput variation.
  • Compaction stabilisation: for large tenants, compaction workloads sharded such that each worker processes up to 8M series, ensuring reads always see compacted data.
  • Zone-awareness: both write and read stateful components deployed across three zones, so a zonal outage or node rotation is survivable.
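The compaction-sharding cap above reduces to simple arithmetic: a large tenant's series count determines how many compaction workers it needs. A minimal sketch (worker count only; the actual shard assignment is not described in the post):

```python
import math

def compaction_workers(active_series: int,
                       series_per_worker: int = 8_000_000) -> int:
    """Number of compaction workers for a tenant, with each worker
    capped at ~8M series as described in the post."""
    return max(1, math.ceil(active_series / series_per_worker))

# A tenant at the fleet-wide average of 1.3B active series would need
# compaction sharded across ~163 workers at the 8M-series cap.
workers = compaction_workers(1_300_000_000)
```

Capping per-worker load this way keeps compaction from falling behind on the largest tenants, which is what guarantees reads always see compacted data.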

Phase 2 — move to multi-cluster for blast-radius reasons.

Concerns that drove the shift:

  • Bad queries / bad deploys can degrade an entire cluster → "flying blind" risk for everyone on that cluster.
  • Desire for regional flexibility.

Clusterization strategy:

  • Dedicated clusters for specialized workloads (compute and mesh infrastructure).
  • Multiple application clusters for general workloads.
  • Failure in one cluster does not cascade to others.

Rollout strategy: progressive cluster rollout — test clusters → internal clusters → application clusters → infrastructure clusters — sequenced by criticality, not size. This is what achieves >99.9% availability in practice. See patterns/progressive-cluster-rollout.

4. Federation via Promxy with custom query-fanout

Cross-cluster querying is served by a Promxy-based federation proxy with Airbnb-specific additions:

  • Native histogram support for correct p99 / p95 over histogram series that span clusters.
  • Query-fanout optimization to narrow fanout to only the relevant cluster(s), reducing wasted work.

See systems/promxy for the proxy; concepts/cross-cluster-federated-query-cost for the 5–10× cost amplification observation that forced adjustments in tenant-consolidation strategy.
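The fanout-narrowing idea can be sketched as a tenant → cluster placement lookup consulted before fanning out. The map and names below are hypothetical; the post describes the optimization but not its implementation:

```python
# Hypothetical placement data: which cluster(s) hold each tenant's series.
TENANT_CLUSTERS: dict[str, set[str]] = {
    "checkout-service": {"app-cluster-1"},
    "mesh-proxy": {"mesh-cluster"},
}

def clusters_for_query(tenant: str, all_clusters: set[str]) -> set[str]:
    """Narrow a federated query to only the clusters that hold the
    tenant's data; fall back to full fanout when placement is unknown."""
    return TENANT_CLUSTERS.get(tenant, all_clusters)
```

Given the 5–10× cost amplification of cross-cluster queries noted below, narrowing most queries to a single cluster is what keeps the federation layer affordable.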

Deployment

  • Stateful apps: deployed via Grafana's Kubernetes OSS rollout operators, enabling coordinated multi-AZ rolls across StatefulSets within a namespace. Replaced a manual, sequential process that took days.
  • Rollout operators made compatible with Airbnb's cloud infra, respecting pod disruption budget requirements while rollouts and node operations happen concurrently.
  • Automation here is what enabled the "treat clusters as cattle, not pets" operational philosophy — new clusters can be added or replaced with minimal operational overhead.

Key design choices / lessons

  • Per-replica limits provide actionable scaling signals for fleet management (vs. rate-based or tenant-aggregate limits which don't tell an on-call what to scale).
  • Tenant-level controls shield the system from disruptive behaviour — both malicious (DDoS) and accidental (runaway query, cardinality explosion).
  • Multi-zone stateful deployments enhance both fault tolerance and deployment agility (you can drain a zone for maintenance during a rollout).
  • Cross-cluster queries are 5–10× more expensive than single-cluster queries — a handful of expensive federated queries were enough to cause read reliability issues across clusters. This forced adjustments to tenant-consolidation strategy.
  • Deployment consistency matters: manual stateful-app deploys were the single largest source of configuration drift; automation (Grafana rollout operators) was the fix.
  • Cluster management philosophy: single-cluster scaling and stabilisation work enabled clusters to self-tune, which in turn enabled "clusters as cattle, not pets" — adding or replacing clusters with minimal additional operational overhead.

Relationship to the rest of the wiki

  • Parent system: systems/airbnb-observability-platform is the umbrella platform (instrumentation → collection → storage → visualization → alerting). This page covers the storage plane specifically — the multi-cluster, multi-tenant fleet that persists and serves metrics; airbnb-observability-platform covers the instrumentation (OTLP, StatsD) + aggregation (systems/vmagent) + alerting framework + Change Reports that wrap around it.
  • Storage-compatible with systems/prometheus — PromQL is the query language exposed to dashboards and alerts.
  • Federation via systems/promxy — the single-logical-Prometheus view for applications.

Seen in

  • sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system — the canonical post introducing the fault-tolerant storage system. Covers tenancy (tenant-per-application with shuffle sharding + consolidated control plane), single-cluster reliability (per-replica guardrails, query sharding, compaction sharding, zone-awareness), multi-cluster architecture (clusterization, progressive rollout, federation via Promxy with custom extensions), and key learnings (5–10× federated-query cost, deployment consistency, clusters-as-cattle).