
Airbnb

Airbnb Engineering blog. Tier-2 source on the sysdesign-wiki. Historically strong on marketplace infra, Kubernetes tooling, data platform, and service mesh; recent posts cover dynamic configuration, developer platform, incident tooling, and (2026-05) the OSS 1.0 release of the Viaduct GraphQL multi-tenant runtime.

Key systems

  • systems/airbnb-knowledge-graph-infrastructure — internally managed, multi-tenant knowledge graph platform built on JanusGraph + DynamoDB + OpenSearch; namespace-isolated tenants include identity graph (7B nodes, 11B edges), inventory knowledge graph, fraud detection, and data lineage; management service handles schema enforcement + index lifecycle + Thrift API generation; internal JanusGraph fork with custom transaction strategy (DynamoDB conditional writes), parallel multi-slice fetching, and distributed tracing
  • systems/viaduct — Airbnb's GraphQL-based "data-oriented service mesh": a multi-tenant runtime that hosts independently developed and tested tenant modules, each owning a portion of the schema. Used internally for years; OSS 1.0 released 2026-05-13 to Maven Central with full @StableApi / @ExperimentalApi / @InternalApi discipline + Kotlin binary compatibility validator in CI + Dokka docs + open RFC community process. The third topology on the wiki for decentralized development of a central GraphQL schema, positioned as complementary to Apollo Federation (Viaduct instances can participate as Federation subgraphs, collapsing per-team server cost while preserving cross-org composition). Load-bearing framing: "Federation distributes development by distributing servers. Viaduct distributes development by distributing modules."
  • systems/airbnb-skipper — embedded Java/Kotlin workflow engine providing durable execution as a library dependency rather than an external orchestration cluster; uses the host service's existing database (MySQL / UDS / DynamoDB) for workflow state, replays the workflow method on crash while short-circuiting previously completed actions via stored results; 5-annotation programming model (@WorkflowMethod, @StateField, @SignalMethod, @Execute(checkpoint = true), @Compensate); in production >1 year across 15+ teams (insurance, payments, media, infrastructure, incentives, wallet) with peak 10 000 workflows / second on DynamoDB
  • systems/airbnb-uds — Airbnb's internal Unified Data Store (stub); named as one of Skipper's pluggable persistence backends alongside MySQL
  • systems/sitar — internal dynamic configuration platform (control plane + data plane + sidecar agent + GitHub-based config workflow)
  • systems/airbnb-observability-platform — in-house Prometheus/PromQL metrics platform (1,000 services, 300M timeseries, 3,100 dashboards, 300K+ alerts) replacing a vendor stack after a ~5-year migration; OTLP collection + vmagent streaming aggregation at 100M+ samples/sec; reliability plane (2026-05-05): dedicated-but-managed K8s clusters + custom Envoy L7 ingress tier (independent of Istio) + meta-monitoring HA Prometheus–Alertmanager pairs terminated by a dead-man's switch on AWS SNS + CloudWatch
  • systems/airbnb-metrics-storage — the storage plane under the observability platform: multi-cluster, multi-tenant time-series storage fleet at 50M samples/sec / 1.3B active series / 2.5 PB logical data; tenant-per-application with shuffle sharding on read + write paths; three-zone stateful deploys; multi-cluster federation via custom Promxy with native-histogram support + query-fanout optimization; progressive cluster rollout (test → internal → app → infra) for >99.9% availability
  • systems/vmagent — VictoriaMetrics agent used as Airbnb's sharded two-tier (router + aggregator) streaming-aggregation tier
  • systems/himeji — centralized authorization system enforcing access at the data layer; write-time relation denormalization for fast read-time permission checks
  • systems/airbnb-destination-recommendation — transformer-based sequence model predicting user travel destinations; user actions as tokens (summed city + region + days-to-today embeddings); multi-task region + city heads; serves autosuggest + abandoned-search email notifications
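The replay behaviour described for systems/airbnb-skipper (re-run the workflow method after a crash, short-circuiting actions whose results are already stored) can be illustrated with a minimal sketch. This is not Skipper's API: `ReplayContext`, `execute`, `booking_workflow`, and the action keys are hypothetical names, and a plain dict stands in for the host service's database.

```python
from typing import Any, Callable, Dict, List


class ReplayContext:
    """Hypothetical helper: stores each action's result under a stable key."""

    def __init__(self, store: Dict[str, Any]) -> None:
        self.store = store              # stands in for the host service's database
        self.executed: List[str] = []   # keys of actions actually run in this attempt

    def execute(self, key: str, action: Callable[[], Any]) -> Any:
        # Short-circuit: an action that completed in an earlier attempt
        # returns its stored result instead of running again.
        if key in self.store:
            return self.store[key]
        result = action()
        # A crash between running the action and storing its result would make
        # the action run again on replay, hence at-least-once execution.
        self.store[key] = result
        self.executed.append(key)
        return result


def booking_workflow(ctx: ReplayContext, fail_after_charge: bool = False) -> str:
    """Illustrative workflow; the steps are placeholders, not real Airbnb logic."""
    reservation = ctx.execute("reserve", lambda: "res-1")
    charge = ctx.execute("charge", lambda: "chg-1")
    if fail_after_charge:
        raise RuntimeError("simulated crash")
    return ctx.execute("notify", lambda: f"sent:{reservation}:{charge}")


def run_demo():
    store: Dict[str, Any] = {}
    first = ReplayContext(store)
    try:
        booking_workflow(first, fail_after_charge=True)
    except RuntimeError:
        pass  # the process "crashed" after two completed actions
    second = ReplayContext(store)  # fresh attempt, same persisted state
    result = booking_workflow(second)
    return first.executed, second.executed, result
```

In this sketch the second attempt re-enters the workflow method from the top, but only the `notify` action runs; `reserve` and `charge` are served from the store. The workflow body must be deterministic for the keys to line up across attempts, which matches the determinism invariant noted in the Skipper article summary.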

Key patterns / concepts

Recent articles

  • 2026-06-09 — sources/2026-06-09-airbnb-scaling-beyond-one-data-architecture (Patrick Lam, Namrata Lamba, Jamie Stober on "Scaling beyond one: How Airbnb evolved its data architecture for a multi-product world" — framework for evolving a decade-old offline data warehouse from single-product (Homes) to three-product (Homes, Experiences, Services). Three foundational principles: no hybrid models, consistent identifier naming, namespace organization. Domain-driven modeling choice: product-facing domains chose separate models (listings, availability, location, guests), cross-cutting domains chose monolithic (messaging, payments, support). Data debt migration via dual-pipeline deprecation.)
  • 2026-05-19 — sources/2026-05-19-airbnb-scaling-identity-graph-unified-knowledge-graph-infrastructure (Lucen Zhao, Shukun Yang, Ashish Jain on "Scaling Airbnb's identity graph with a unified knowledge graph infrastructure" — internal multi-tenant graph platform on JanusGraph + DynamoDB replacing third-party vendor. Identity graph at 7B nodes / 11B edges / 5M edges per day / 4–8 hop queries. Migration delivered 10× write QPS, significant P99 latency reduction, elimination of periodic reboots. Key optimizations: DynamoDB conditional-write transactions, parallel getMultiSlices, client-side Gremlin query rewriting, distributed tracing integration.)
  • 2026-05-13 — sources/2026-05-13-airbnb-viaduct-1-0-and-the-future-of-airbnbs-data-mesh (Ryan Tanner, Raymie Stata, Adam Miskiewicz on "Viaduct 1.0 and the future of Airbnb's data mesh" — OSS 1.0 release of Viaduct, Airbnb's GraphQL-based data-oriented service mesh used internally for years. Architectural contribution: third topology for decentralized development of a central GraphQL schema alongside UBFF (one service, one module) and Apollo Federation (many services, one module each): Viaduct is few runtimes, many tenant modules per runtime. Load-bearing framing quote: "Federation distributes development by distributing servers. Viaduct distributes development by distributing modules." Tenant module contract is intentionally minimal: directory + SDL + resolvers; the platform handles execution / scaling / integration. Complementary to Federation, not an alternative: a Viaduct instance can participate as a subgraph in a federated supergraph, so a "large organization where hundreds of teams contribute to the overall graph" can run "a smaller number of Viaduct instances, each hosting many closely related tenant modules" and let Federation compose them — collapsing per-team server cost (factor of M for M modules per instance) while preserving cross-org composition. OSS 1.0 readiness substrate: @StableApi / @ExperimentalApi / @InternalApi annotations across all public surfaces + Kotlin binary compatibility validator in CI + Maven Central publication + Dokka-generated API docs + open community RFC process (first instance: the Connections RFC on GitHub). GraphQLConf 2026 talk teasers signpost forthcoming engineering-retrospective content on multi-tenant gateway observability with built-in ownership tags + cost-aware tracing (Vickey Yeh), gateway sharding for blast-radius reduction (Linquan Zhang & Cetin Sahin), probabilistic correctness testing on Viaduct (James Bellenger), and LLM-driven @generateMock data generation (Michael Rebello) — ingestion candidates when the recordings/write-ups appear.)
  • 2026-05-05 — sources/2026-05-05-airbnb-monitoring-reliably-at-scale (Abdurrahman J. Allawala on "Monitoring reliably at scale" — Airbnb's Observability team breaks circular dependencies in its metrics platform along three axes: (1) dedicated-but-managed Kubernetes clusters for observability workloads (concepts/dedicated-but-managed-infrastructure + patterns/dedicated-observability-kubernetes-clusters) — "just right" middle option between shared-production (couples observability to its targets) and self-run K8s (too much ops burden on the small team); Cloud team still administers; coordinated-change discipline enforced; (2) custom Envoy L7 ingress tier for telemetry, independent of the shared Istio mesh (patterns/custom-l7-proxy-for-telemetry-over-service-mesh), with header-based tenant routing mapping ~1,000 services → cluster backends; motivated by "orders of magnitude more observability traffic than business traffic" + circular dependency of mesh-metrics on the mesh + two-way noisy-neighbour hazard with Airbnb.com traffic; adds an eighth Envoy role on the wiki (telemetry-ingress); extensibility hooks for metric mirroring + fine-grained ACLs; (3) meta-monitoring (concepts/meta-monitoring) — dedicated systems/prometheus + systems/alertmanager HA pairs pinned to nodes/AZs disjoint from the primary stack with pair-level anti-affinity; terminated by a dead-man's switch (patterns/heartbeat-absence-as-alert-trigger) — always-firing alert → SNS → CloudWatch rate alarm on an AWS control plane distinct from the K8s-hosted stack; design bar stated verbatim: "treat monitoring as a production system whose availability must exceed that of what it observes." Compute-vs-networking own-vs-adopt asymmetry articulated explicitly — Kubernetes adopted because the shared foundation fits; networking owned because telemetry's requirements (prioritisation / isolation / custom routing) diverge from what a business-traffic-shaped mesh can cleanly provide.)

  • 2026-04-28 — sources/2026-04-28-airbnb-skipper-building-airbnbs-embedded-workflow-engine (Skipper: embedded Java/Kotlin workflow engine for durable execution; library-in-service shape rather than external orchestration cluster (concepts/embedded-workflow-engine); shares host service's DB for workflow state (MySQL / UDS / DynamoDB); 5-annotation programming model (patterns/workflow-primitives-as-annotated-classes); state-field replay not event history; near-zero happy-path overhead via delayed timeout task; determinism invariant + at-least-once action execution; @Compensate reverse-order walk-back elevates saga compensations to first-class primitive (concepts/workflow-compensation-action); signals via @SignalMethod + durable waitUntil { cond }; in production >1 year across 15+ teams — peak 10 000 workflows / second on DynamoDB; multi-hour Media Foundation video-processing jobs survive pod restarts; Infrastructure team uses it for durable Flink job lifecycle; explicit rejection of external clusters for Tier 0 services to avoid new critical dependencies)

  • 2026-04-21 — sources/2026-04-21-airbnb-building-a-fault-tolerant-metrics-storage-system (storage-plane deep-dive: 50M samples/sec, 1.3B series, 2.5 PB; shuffle-sharding for per-tenant read/write isolation; tenant-per-application for ~1,000 services; single-cluster reliability → multi-cluster federation via custom Promxy; progressive cluster rollout for >99.9% availability; 5–10× federated-query cost tax; Grafana K8s rollout operators replacing multi-day manual deploys; clusters as cattle not pets)
  • 2026-04-16 — sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline (StatsD → OTLP migration via shared-library dual-write; two-tier vmagent streaming aggregation at 100M+ samples/sec; delta temporality for top emitters; zero-injection for sparse counters)
  • 2026-04-14 — sources/2026-04-14-airbnb-privacy-first-connections (privacy-first identity model for social Experiences: User ID ↔ many context-scoped Profile IDs, Himeji authorization with write-time relation denormalization, AI-assisted audit+refactor migration)
  • 2026-03-17 — sources/2026-03-17-airbnb-observability-ownership-migration (5-year vendor → in-house Prometheus/PromQL metrics migration; intent-preserving translation, metadata engine, alerts-as-code, own the interaction layer)
  • 2026-03-12 — sources/2026-03-12-airbnb-destination-recommendation-transformer (transformer-based destination recommendation model; user actions as tokens with summed city + region + days-to-today embeddings; 14 training examples per booking = 7 active + 7 dormant to balance short-term and long-term intent; multi-task region + city heads to inject geolocation hierarchy; autosuggest + abandoned-search email applications, A/B wins in non-English-primary regions)
  • 2026-03-04 — sources/2026-03-04-airbnb-alert-backtesting-change-reports (deep-dive on the Reliability XP alert-authoring platform: local-first dev + Change Reports + bulk alert backtesting hooking Prometheus's rules/manager.go; per-backtest K8s pod isolation; 300K alerts migrated, 90% alert-noise reduction, month → afternoon iteration cycle)
  • 2026-02-18 — sources/2026-02-18-airbnb-sitar-dynamic-configuration (Sitar: dynamic config platform architecture)
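The shuffle sharding named in the metrics-storage entries (tenant-per-application, with per-tenant isolation on read and write paths) can be sketched as a deterministic per-tenant choice of a small node subset. This is a generic illustration of the technique, not Airbnb's implementation: `shuffle_shard`, the node names, and the shard size are assumptions made for the example.

```python
import hashlib
import random
from typing import List


def shuffle_shard(tenant: str, nodes: List[str], shard_size: int) -> List[str]:
    """Pick a stable pseudo-random subset of nodes for one tenant.

    Hypothetical helper: the tenant name seeds the selection, so the same
    tenant always maps to the same subset for a given node list.
    """
    digest = hashlib.sha256(tenant.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # seeded locally; global random state is untouched
    return sorted(rng.sample(nodes, shard_size))


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(12)]  # illustrative fleet
    for tenant in ("payments", "search", "messaging"):
        print(tenant, shuffle_shard(tenant, fleet, 3))
```

Because each tenant draws its own subset, two tenants usually overlap on only some of their nodes, so one tenant overloading its shard degrades only part of any other tenant's capacity. A production system would also need to handle node membership changes, which this sketch ignores.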
Last updated · 542 distilled / 1,571 read