AIRBNB 2026-04-21 Tier 2

Airbnb — Building a fault-tolerant metrics storage system at Airbnb

Summary

Airbnb's Observability team moved their metrics stack from a hosted vendor to an internally operated system that ingests 50M samples/sec and persists 2.5 PB of logical time-series data across 1.3B active time series. Author Rishabh Kumar walks through four design challenges and their solutions: (1) multi-tenancy (rejected tenant-per-team; chose tenant-per-application for ~1,000 services; shuffle sharding for per-tenant read/write isolation; a consolidated control plane for auto-onboarding), (2) single-cluster reliability (per-replica limits, query sharding, compaction sharding at 8M series/worker, zone-aware stateful components across three zones), (3) multi-cluster federation (dedicated clusters for specialised workloads, progressive rollout from test → internal → application → infrastructure clusters, custom Promxy extensions for cross-cluster querying), and (4) operational philosophy (Grafana Kubernetes rollout operators replace multi-day manual deploys; clusters as cattle, not pets; cross-cluster queries are 5–10× costlier than single-cluster and must be actively managed).

Key takeaways

  1. Tenant-per-application beats tenant-per-team because team ownership of apps changes frequently, while services are long-lived and attribute metric growth precisely. ~1,000 services = ~1,000 tenants. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)

  2. Shuffle sharding isolates both reads and writes: each tenant hashes to a subset of ingesters and a subset of query workers. A DDoS or bad query from tenant A can knock out A's shuffle set without impacting tenants D & E on disjoint shards. Visual: 4 shards, 5 tenants, attacker-affected shards ∩ victim shards = the blast radius. See concepts/shuffle-sharding.
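
The assignment described above can be sketched as a deterministic tenant-to-shard mapping. This is a minimal illustration, not Airbnb's implementation; the shard count, subset size `k`, and hash choice are assumptions.

```python
import hashlib

def shuffle_shard(tenant: str, num_shards: int, k: int) -> frozenset:
    """Deterministically pick k of num_shards for a tenant.

    Ranks every shard by hash(tenant, shard) and takes the top k,
    so the same tenant always maps to the same shard subset without
    any shared state between routers.
    """
    ranked = sorted(
        range(num_shards),
        key=lambda s: hashlib.sha256(f"{tenant}:{s}".encode()).digest(),
    )
    return frozenset(ranked[:k])

# Blast radius of a noisy tenant A on a victim tenant D is the
# intersection of their shuffle sets (often empty or a single shard).
shards_a = shuffle_shard("tenant-a", num_shards=4, k=2)
shards_d = shuffle_shard("tenant-d", num_shards=4, k=2)
blast_radius = shards_a & shards_d
```

With many shards and small `k`, most tenant pairs share few or no shards, which is what bounds the blast radius in the diagram.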

  3. Expose few guardrails, derive the rest. Airbnb exposes per-tenant series limits and derives ingestion rate + burst size from them. Fewer knobs = less operator confusion when a tenant hits a limit. See patterns/tenant-per-application.
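
A minimal sketch of "expose few knobs, derive the rest": only the per-tenant series limit is configured, and the ingestion rate and burst size fall out of it. The derivation factors here are hypothetical, not Airbnb's actual formulas.

```python
from dataclasses import dataclass

@dataclass
class TenantLimits:
    max_series: int  # the single exposed knob

    @property
    def max_ingest_rate(self) -> int:
        # Hypothetical derivation: assume each series reports
        # roughly one sample every 10 seconds.
        return self.max_series // 10

    @property
    def burst_size(self) -> int:
        # Allow short bursts of 2x the sustained rate.
        return self.max_ingest_rate * 2

limits = TenantLimits(max_series=1_000_000)
# limits.max_ingest_rate -> 100_000 samples/sec
# limits.burst_size      -> 200_000 samples
```

The operator only ever reasons about one number per tenant; the derived limits stay mutually consistent by construction.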

  4. Query payload variability is the hard thing to benchmark. Write components benchmark cleanly against samples-per-second. Read components don't — payload size (series count / chunks / MB) dominates cost. Guardrails: max 5,000 series / max 500 MB per query are the order-of-magnitude limits beyond which Airbnb saw degradation.
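
A guardrail check along these lines is straightforward; the limits come from the post, but the admission logic and function names are hypothetical.

```python
MAX_SERIES_PER_QUERY = 5_000          # order-of-magnitude limit from the post
MAX_BYTES_PER_QUERY = 500 * 1024**2   # 500 MB

def admit_query(series_count: int, payload_bytes: int) -> tuple[bool, str]:
    """Reject queries whose estimated payload would degrade the read path."""
    if series_count > MAX_SERIES_PER_QUERY:
        return False, f"query touches {series_count} series (max {MAX_SERIES_PER_QUERY})"
    if payload_bytes > MAX_BYTES_PER_QUERY:
        return False, f"payload {payload_bytes} bytes exceeds {MAX_BYTES_PER_QUERY}"
    return True, "ok"
```

Note the check keys on payload size, not query rate, which is exactly why write-side samples-per-second benchmarks don't transfer to the read path.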

  5. Separate evaluation query path from ad-hoc query path. Alert rule evaluation is critical-path; dashboard / CI queries are bursty and less critical. Splitting them prevents a dashboard user from degrading alerting.

  6. Compaction at scale is workload-sharded. For very large tenants, compaction workers process up to 8M series each, ensuring data being read is always compacted (vs. being partway through compaction, which degrades query performance).
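
Sharding a large tenant's compaction at 8M series per worker reduces to a ceiling division over the series space; the contiguous-range partitioning shown here is an assumption about how the split might look, not the post's scheme.

```python
SERIES_PER_WORKER = 8_000_000

def compaction_shards(total_series: int) -> list[range]:
    """Split a tenant's series ID space into per-worker ranges."""
    workers = -(-total_series // SERIES_PER_WORKER)  # ceiling division
    return [
        range(i * SERIES_PER_WORKER,
              min((i + 1) * SERIES_PER_WORKER, total_series))
        for i in range(workers)
    ]

# A 20M-series tenant needs 3 workers: 8M + 8M + 4M.
shards = compaction_shards(20_000_000)
```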

  7. Zone-aware stateful components across three zones is the baseline for fault tolerance — survives zonal outages, survives node rotation during deploys.

  8. Multi-cluster architecture for blast radius. Dedicated clusters per specialised workload (compute, mesh) + multiple application clusters. Failure in one doesn't cascade. See concepts/active-multi-cluster-blast-radius.

  9. Progressive cluster rollout by criticality: test → internal → application → infrastructure. The last tier carries the highest-criticality workloads where a regression would cause "flying blind" — they roll out last, with the longest soak. This is the mechanism that hits the >99.9% availability target. See patterns/progressive-cluster-rollout.

  10. Cross-cluster (federated) queries are 5–10× more expensive than queries within a single cluster. A handful of expensive federated queries was enough to cause read reliability issues across multiple clusters. This forced adjustments to the tenant-consolidation strategy around hot read patterns. See concepts/cross-cluster-federated-query-cost.

  11. Deployment consistency is the long-term cost of multi-cluster. Manual stateful-app deploys were too slow (days) and caused config drift. Grafana's Kubernetes OSS rollout operators, customised to respect Airbnb's pod-disruption-budget requirements, replaced the manual process. Seamless deploys also reduce configuration drift across clusters.

  12. Custom Promxy for cross-cluster querying: Airbnb added native histogram support and query fanout optimization on top of the OSS systems/promxy. Upstream Promxy didn't handle native histograms at the time, breaking p99/p95 correctness across clusters; the fanout optimization narrows dispatch to relevant cluster(s) rather than broadcasting to every backend.
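
The fanout optimization can be sketched as routing on a tenant-to-cluster map instead of broadcasting to every backend. Everything here (the map structure, the fallback behaviour, the names) is a hypothetical illustration, not Promxy's or Airbnb's actual implementation.

```python
# Hypothetical tenant -> cluster routing table kept by the query layer.
TENANT_CLUSTERS = {
    "service-a": {"app-cluster-1"},
    "service-b": {"app-cluster-2", "infra-cluster"},
}
ALL_CLUSTERS = {"app-cluster-1", "app-cluster-2",
                "infra-cluster", "internal-cluster"}

def fanout_targets(tenants: set[str]) -> set[str]:
    """Dispatch only to clusters holding the queried tenants;
    fall back to a full broadcast for unknown tenants."""
    targets: set[str] = set()
    for t in tenants:
        targets |= TENANT_CLUSTERS.get(t, ALL_CLUSTERS)
    return targets

# A query scoped to service-a hits one cluster instead of all four.
```

Keeping that routing table in sync with actual tenant placement is exactly the failure mode the caveats below flag as undocumented.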

  13. Clusters as cattle, not pets. Single-cluster scaling and stabilisation work made clusters self-tune, enabling new clusters to be added or replaced with minimal operational overhead — the operational north-star philosophy for the multi-cluster era.

Systems / concepts / patterns extracted

Operational numbers

  • 50M samples/sec ingestion (peak).
  • 1.3B active time series on average.
  • 2.5 PB logical time-series data stored.
  • ~1,000 services = ~1,000 tenants.
  • 10,000 dashboards supported.
  • 500,000 alerts in flight.
  • p99 query execution time < 30 s — SLO target.
  • >99.9% availability — achieved with multi-cluster federation.
  • >5,000 series / >500 MB per query — the payload size at which query performance degrades noticeably (numbers will vary per querying system).
  • 8M series per compaction worker — sharding unit for large-tenant compaction.
  • 3 zones — zonal deployment baseline for stateful components.
  • 5–10× cost amplification on cross-cluster queries vs. single-cluster.
  • Days → seamless — deployment duration for stateful apps, before → after Grafana K8s rollout operators + Airbnb pod-disruption-budget compatibility work.

Caveats

  • The post is a high-level survey of the storage system's design — it doesn't name the actual TSDB implementation (VictoriaMetrics? Mimir? Cortex? A custom fork?). We can infer from context (vmagent + Promxy + PromQL + "treated as cattle" ops philosophy) that it's VictoriaMetrics / VM-compatible, but this is not stated in the post.
  • No sample architecture diagram included in the extracted markdown (the raw file has "Press enter or click to view image in full size" placeholders). Key diagrams referenced: the shuffle-sharding tenant-impact diagram, the progressive-cluster-rollout diagram.
  • "5–10× more expensive" for federated queries is unitless — the post doesn't specify CPU-seconds vs. wall-clock vs. bytes- scanned. The multiplier is a rough order-of-magnitude claim, not a measured distribution.
  • No failure-mode details on the Promxy custom extensions — how the query-fanout optimization picks which clusters to fan out to, how tenant-cluster mapping is kept in sync with the routing layer, how partial-result errors surface.
  • No cost numbers — the post doesn't disclose cost-per-sample or the overall footprint dollar value. Scale numbers only.
  • No vendor-to-Airbnb migration timeline in this post — referenced in a separate 2026-03-17 migration retrospective (sources/2026-03-17-airbnb-observability-ownership-migration).
  • Per-tenant shuffle-set sizing (K value) not disclosed — only the diagram (4 shards, 5 tenants) as an illustrative example.
  • The ~1,000 services number is a snapshot; service count trends up with product growth and will change. Shuffle-sharding math only works well above ~100 tenants.
  • Autoscaling specifics (triggers, scale-up/down SLAs) not disclosed for the read tier.
