AIRBNB 2026-04-21 Tier 2

Airbnb — Building a fault-tolerant metrics storage system at Airbnb

Summary

Airbnb's Observability team moved their metrics stack from a hosted vendor to an internally operated system that ingests 50M samples/sec and persists 2.5 PB of logical time-series data across 1.3B active time series. Author Rishabh Kumar walks through four design challenges and their solutions: (1) multi-tenancy (rejected tenant-per-team; chose tenant-per-application for ~1,000 services; shuffle sharding for per-tenant read/write isolation; a consolidated control plane for auto-onboarding), (2) single-cluster reliability (per-replica limits, query sharding, compaction sharding at 8M series/worker, zone-aware stateful components across three zones), (3) multi-cluster federation (dedicated clusters for specialised workloads, progressive rollout from test → internal → application → infrastructure clusters, custom Promxy extensions for cross-cluster querying), and (4) operational philosophy (Grafana Kubernetes rollout operators replace multi-day manual deploys; clusters as cattle, not pets; cross-cluster queries are 5–10× costlier than single-cluster and must be actively managed).

Key takeaways

  1. Tenant-per-application beats tenant-per-team because team ownership of apps changes frequently, while services are long-lived and attribute metric growth precisely. ~1,000 services = ~1,000 tenants. (Source: sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system)

  2. Shuffle sharding isolates both reads and writes: each tenant hashes to a subset of ingesters and a subset of query workers. A DDoS or bad query from tenant A can knock out A's shuffle set without impacting tenants D & E on disjoint shards. Visual: 4 shards, 5 tenants, attacker-affected shards ∩ victim shards = the blast radius. See concepts/shuffle-sharding.
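
The assignment described above can be sketched as a deterministic tenant-to-shard mapping. This is a minimal illustration, not Airbnb's implementation; the shard count, subset size `k`, and hash choice are assumptions.

```python
import hashlib

def shuffle_shard(tenant: str, num_shards: int, k: int) -> frozenset:
    """Deterministically pick k of num_shards for a tenant.

    Ranks every shard by hash(tenant, shard) and takes the top k,
    so the same tenant always maps to the same shard subset without
    any shared state between routers.
    """
    ranked = sorted(
        range(num_shards),
        key=lambda s: hashlib.sha256(f"{tenant}:{s}".encode()).digest(),
    )
    return frozenset(ranked[:k])

# Blast radius of a noisy tenant A on a victim tenant D is the
# intersection of their shuffle sets (often empty or a single shard).
shards_a = shuffle_shard("tenant-a", num_shards=4, k=2)
shards_d = shuffle_shard("tenant-d", num_shards=4, k=2)
blast_radius = shards_a & shards_d
```

With many shards and small `k`, most tenant pairs share few or no shards, which is what bounds the blast radius in the diagram.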

  3. Expose few guardrails, derive the rest. Airbnb exposes per-tenant series limits and derives ingestion rate + burst size from them. Fewer knobs = less operator confusion when a tenant hits a limit. See patterns/tenant-per-application.
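
A minimal sketch of "expose few knobs, derive the rest": only the per-tenant series limit is configured, and the ingestion rate and burst size fall out of it. The derivation factors here are hypothetical, not Airbnb's actual formulas.

```python
from dataclasses import dataclass

@dataclass
class TenantLimits:
    max_series: int  # the single exposed knob

    @property
    def max_ingest_rate(self) -> int:
        # Hypothetical derivation: assume each series reports
        # roughly one sample every 10 seconds.
        return self.max_series // 10

    @property
    def burst_size(self) -> int:
        # Allow short bursts of 2x the sustained rate.
        return self.max_ingest_rate * 2

limits = TenantLimits(max_series=1_000_000)
# limits.max_ingest_rate -> 100_000 samples/sec
# limits.burst_size      -> 200_000 samples
```

The operator only ever reasons about one number per tenant; the derived limits stay mutually consistent by construction.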

  4. Query payload variability is the hard thing to benchmark. Write components benchmark cleanly against samples-per-second. Read components don't — payload size (series count / chunks / MB) dominates cost. Guardrails: max 5,000 series / max 500 MB per query are the order-of-magnitude limits beyond which Airbnb saw degradation.
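
A guardrail check along these lines is straightforward; the limits come from the post, but the admission logic and function names are hypothetical.

```python
MAX_SERIES_PER_QUERY = 5_000          # order-of-magnitude limit from the post
MAX_BYTES_PER_QUERY = 500 * 1024**2   # 500 MB

def admit_query(series_count: int, payload_bytes: int) -> tuple[bool, str]:
    """Reject queries whose estimated payload would degrade the read path."""
    if series_count > MAX_SERIES_PER_QUERY:
        return False, f"query touches {series_count} series (max {MAX_SERIES_PER_QUERY})"
    if payload_bytes > MAX_BYTES_PER_QUERY:
        return False, f"payload {payload_bytes} bytes exceeds {MAX_BYTES_PER_QUERY}"
    return True, "ok"
```

Note the check keys on payload size, not query rate, which is exactly why write-side samples-per-second benchmarks don't transfer to the read path.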

  5. Separate evaluation query path from ad-hoc query path. Alert rule evaluation is critical-path; dashboard / CI queries are bursty and less critical. Splitting them prevents a dashboard user from degrading alerting.

  6. Compaction at scale is workload-sharded. For very large tenants, compaction workers process up to 8M series each, ensuring data being read is always compacted (vs. being partway through compaction, which degrades query performance).
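
Sharding a large tenant's compaction at 8M series per worker reduces to a ceiling division over the series space; the contiguous-range partitioning shown here is an assumption about how the split might look, not the post's scheme.

```python
SERIES_PER_WORKER = 8_000_000

def compaction_shards(total_series: int) -> list[range]:
    """Split a tenant's series ID space into per-worker ranges."""
    workers = -(-total_series // SERIES_PER_WORKER)  # ceiling division
    return [
        range(i * SERIES_PER_WORKER,
              min((i + 1) * SERIES_PER_WORKER, total_series))
        for i in range(workers)
    ]

# A 20M-series tenant needs 3 workers: 8M + 8M + 4M.
shards = compaction_shards(20_000_000)
```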

  7. Zone-aware stateful components across three zones is the baseline for fault tolerance — survives zonal outages, survives node rotation during deploys.

  8. Multi-cluster architecture for blast radius. Dedicated clusters per specialised workload (compute, mesh) + multiple application clusters. Failure in one doesn't cascade. See concepts/active-multi-cluster-blast-radius.

  9. Progressive cluster rollout by criticality: test → internal → application → infrastructure. The last tier carries the highest-criticality workloads where a regression would cause "flying blind" — they roll out last, with the longest soak. This is the mechanism that hits the >99.9% availability target. See patterns/progressive-cluster-rollout.

  10. Cross-cluster (federated) queries are 5–10× more expensive than queries within a single cluster. A handful of expensive federated queries was enough to cause read reliability issues across multiple clusters. This forced adjustments to the tenant-consolidation strategy around hot read patterns. See concepts/cross-cluster-federated-query-cost.

  11. Deployment consistency is the long-term cost of multi-cluster. Manual stateful-app deploys were too slow (days) and caused config drift. Grafana's Kubernetes OSS rollout operators, customised to respect Airbnb's pod-disruption-budget requirements, replaced the manual process. Seamless deploys also reduce configuration drift across clusters.

  12. Custom Promxy for cross-cluster querying: Airbnb added native histogram support and query fanout optimization on top of the OSS systems/promxy. Upstream Promxy didn't handle native histograms at the time, breaking p99/p95 correctness across clusters; the fanout optimization narrows dispatch to relevant cluster(s) rather than broadcasting to every backend.
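
The fanout optimization can be sketched as routing on a tenant-to-cluster map instead of broadcasting to every backend. Everything here (the map structure, the fallback behaviour, the names) is a hypothetical illustration, not Promxy's or Airbnb's actual implementation.

```python
# Hypothetical tenant -> cluster routing table kept by the query layer.
TENANT_CLUSTERS = {
    "service-a": {"app-cluster-1"},
    "service-b": {"app-cluster-2", "infra-cluster"},
}
ALL_CLUSTERS = {"app-cluster-1", "app-cluster-2",
                "infra-cluster", "internal-cluster"}

def fanout_targets(tenants: set[str]) -> set[str]:
    """Dispatch only to clusters holding the queried tenants;
    fall back to a full broadcast for unknown tenants."""
    targets: set[str] = set()
    for t in tenants:
        targets |= TENANT_CLUSTERS.get(t, ALL_CLUSTERS)
    return targets

# A query scoped to service-a hits one cluster instead of all four.
```

Keeping that routing table in sync with actual tenant placement is exactly the failure mode the caveats below flag as undocumented.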

  13. Clusters as cattle, not pets. Single-cluster scaling and stabilisation work made clusters self-tune, enabling new clusters to be added or replaced with minimal operational overhead — the operational north-star philosophy for the multi-cluster era.

Systems / concepts / patterns extracted

Operational numbers

  • 50M samples/sec ingestion (peak).
  • 1.3B active time series on average.
  • 2.5 PB logical time-series data stored.
  • ~1,000 services = ~1,000 tenants.
  • 10,000 dashboards supported.
  • 500,000 alerts in flight.
  • p99 query execution time < 30 s — SLO target.
  • >99.9% availability — achieved with multi-cluster federation.
  • >5,000 series / >500 MB per query — the payload size at which query performance degrades noticeably (numbers will vary per querying system).
  • 8M series per compaction worker — sharding unit for large-tenant compaction.
  • 3 zones — zonal deployment baseline for stateful components.
  • 5–10× cost amplification on cross-cluster queries vs. single-cluster.
  • Days → seamless — deployment duration for stateful apps, before → after Grafana K8s rollout operators + Airbnb pod-disruption-budget compatibility work.

Caveats

  • The post is a high-level survey of the storage system's design — it doesn't name the actual TSDB implementation (VictoriaMetrics? Mimir? Cortex? A custom fork?). We can infer from context (vmagent + Promxy + PromQL + "treated as cattle" ops philosophy) that it's VictoriaMetrics / VM-compatible, but this is not stated in the post.
  • No sample architecture diagram included in the extracted markdown (the raw file has "Press enter or click to view image in full size" placeholders). Key diagrams referenced: the shuffle-sharding tenant-impact diagram, the progressive-cluster-rollout diagram.
  • "5–10× more expensive" for federated queries is unitless — the post doesn't specify CPU-seconds vs. wall-clock vs. bytes- scanned. The multiplier is a rough order-of-magnitude claim, not a measured distribution.
  • No failure-mode details on the Promxy custom extensions — how the query-fanout optimization picks which clusters to fan out to, how tenant-cluster mapping is kept in sync with the routing layer, how partial-result errors surface.
  • No cost numbers — the post doesn't disclose cost-per-sample or the overall footprint dollar value. Scale numbers only.
  • No vendor-to-Airbnb migration timeline in this post — referenced in a separate 2026-03-17 migration retrospective (sources/2026-03-17-airbnb-observability-ownership-migration).
  • Per-tenant shuffle-set sizing (K value) not disclosed — only the diagram (4 shards, 5 tenants) as an illustrative example.
  • The ~1,000 services number is a snapshot; service count trends up with product growth and will change. Shuffle-sharding math only works well above ~100 tenants.
  • Autoscaling specifics (triggers, scale-up/down SLAs) not disclosed for the read tier.
