Airbnb
Airbnb Engineering blog. Tier-2 source on the sysdesign-wiki. Historically strong on marketplace infra, Kubernetes tooling, data platform, and service mesh; recent posts cover dynamic configuration, developer platform, and incident tooling.
Key systems
- systems/sitar — internal dynamic configuration platform (control plane + data plane + sidecar agent + GitHub-based config workflow)
- systems/airbnb-observability-platform — in-house Prometheus/PromQL metrics platform (1,000 services, 300M timeseries, 3,100 dashboards, 300K+ alerts) replacing a vendor stack after a ~5-year migration; OTLP collection + vmagent streaming aggregation at 100M+ samples/sec
- systems/vmagent — VictoriaMetrics agent used as Airbnb's sharded two-tier (router + aggregator) streaming-aggregation tier
- systems/himeji — centralized authorization system enforcing access at the data layer; write-time relation denormalization for fast read-time permission checks
- systems/airbnb-destination-recommendation — transformer-based sequence model predicting user travel destinations; user actions as tokens (summed city + region + days-to-today embeddings); multi-task region + city heads; serves autosuggest + abandoned-search email notifications
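The token construction noted for systems/airbnb-destination-recommendation (one embedding per user action, formed by summing city, region, and days-to-today embeddings) can be sketched roughly as follows. This is a minimal illustration, not Airbnb's implementation: the embedding tables, their values, the four-dimensional width, and the `days_bucket` bucketing scheme are all hypothetical, and a real model would learn the tables during training.

```python
from typing import Dict, List, Tuple

# Hypothetical embedding tables (assumption: tiny, hand-written vectors).
CITY_EMB: Dict[str, List[float]] = {
    "paris": [1.0, 0.0, 0.5, 0.0],
    "lyon": [0.0, 1.0, 0.0, 0.5],
}
REGION_EMB: Dict[str, List[float]] = {
    "ile-de-france": [0.0, 2.0, 0.0, 0.25],
    "auvergne-rhone-alpes": [2.0, 0.0, 0.25, 0.0],
}
DAYS_EMB: Dict[str, List[float]] = {
    "week": [0.5, 0.5, 0.5, 0.5],
    "month": [0.25, 0.25, 0.25, 0.25],
    "older": [0.0, 0.0, 0.0, 0.0],
}


def days_bucket(days_to_today: int) -> str:
    """Map a day count to a coarse bucket (hypothetical bucketing)."""
    if days_to_today < 7:
        return "week"
    if days_to_today < 30:
        return "month"
    return "older"


def action_token(city: str, region: str, days_to_today: int) -> List[float]:
    """Build one token as the elementwise sum of its attribute embeddings."""
    parts = (CITY_EMB[city], REGION_EMB[region], DAYS_EMB[days_bucket(days_to_today)])
    return [sum(values) for values in zip(*parts)]


def encode_history(actions: List[Tuple[str, str, int]]) -> List[List[float]]:
    """Order actions chronologically (oldest first) and embed each one."""
    ordered = sorted(actions, key=lambda a: a[2], reverse=True)
    return [action_token(city, region, days) for city, region, days in ordered]
```

For example, `action_token("paris", "ile-de-france", 3)` sums the three vectors for Paris, Île-de-France, and the "week" bucket into a single token; `encode_history` then produces the sequence a transformer encoder would consume.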
Key patterns / concepts
- patterns/staged-rollout — first-class platform feature in Sitar (env / zone / pod-%)
- patterns/sidecar-agent — per-pod config fetcher with local cache
- patterns/git-based-config-workflow — PRs as the default config change path; emergency portal as override
- concepts/control-plane-data-plane-separation — explicit "decide" vs "deliver" split in Sitar
- concepts/observability — own-the-interaction-layer thesis; vendor pricing / feedback-loop motivations for going in-house
- concepts/metric-type-metadata — _otel_metric_type_-driven engine that replaces Prometheus naming-based type inference
- patterns/intent-preserving-query-translation — map query intent (e.g., canonical histogram for any p95), not literal queries
- patterns/alerts-as-code — Reliability XP alert framework with autocomplete, backtesting, and diffing
- patterns/alert-backtesting — replay proposed alerts against historical metric data at PR-diff granularity, with "noisiness" scoring + per-alert inspection; hooks into Prometheus's rule manager
- patterns/achievable-target-first-migration — start migration with a tractable, well-aligned service, not the hardest one
- concepts/identity-decoupling — User ID vs. per-context Profile IDs as a privacy primitive; different types, not just different values
- concepts/least-privileged-access — enforced at the data layer via Himeji, not bolted on per endpoint
- patterns/audit-then-refactor-migration — audit scripts → team ownership map → manual review → AI-assisted refactor → type safety, used for the User/Profile ID migration
- patterns/dual-write-migration — shared metrics library dual-emits StatsD + OTLP to migrate ~40% of services with one config change
- patterns/zero-injection-counter — vmagent tweak that fixes Prometheus rate() undercounting of sparse counters
- concepts/streaming-aggregation — in-transit metric aggregation (vmagent routers + aggregators) to collapse per-instance cardinality before storage
- concepts/metric-temporality — delta vs. cumulative; Airbnb moved top-cardinality emitters to delta to bound SDK memory
- concepts/user-action-as-token — language-modeling framing for recommendation: chronological user actions as transformer tokens; per-action embedding = sum of attribute embeddings (city / region / days-to-today)
- patterns/active-dormant-user-training-split — generate N+M training examples per positive outcome — N recent with full history, M dormant with long-term history only — to keep a single model accurate for both recently-active and long-dormant users (Airbnb: 14 examples per booking = 7 active + 7 dormant)
- patterns/hierarchical-multitask-geo-prediction — attach multiple prediction heads at different geographic-hierarchy levels (region + city) and train jointly so the encoder learns the taxonomy via auxiliary-task regularization
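To make patterns/zero-injection-counter more concrete, the sketch below shows one plausible reading of the problem and the fix. When a sparse counter's series first appears with a non-zero value, a delta-based calculation has no earlier sample to subtract from, so the first increment is lost; emitting a synthetic zero just before the first sample recovers it. The function names, the `interval` parameter, and the simplified `simple_increase` are assumptions for illustration. Prometheus's real `rate()` and `increase()` also extrapolate across the window, and the actual vmagent change is not reproduced here.

```python
from typing import List, Tuple

# (timestamp_seconds, cumulative_counter_value)
Sample = Tuple[float, float]


def simple_increase(samples: List[Sample]) -> float:
    """Sum deltas between consecutive samples.

    Simplified stand-in for a counter increase: no extrapolation, and a drop
    in value is treated as a counter reset that restarts from zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def inject_zero(samples: List[Sample], interval: float) -> List[Sample]:
    """Prepend a zero sample one interval before the first real sample.

    Hypothetical helper: skipped when the series is empty or already starts at 0.
    """
    if not samples:
        return []
    first_ts, first_val = samples[0]
    if first_val == 0:
        return list(samples)
    return [(first_ts - interval, 0.0)] + list(samples)


# A counter that was incremented once and then never again.
sparse = [(60.0, 1.0)]
lost = simple_increase(sparse)                          # the single increment is invisible
recovered = simple_increase(inject_zero(sparse, 15.0))  # the increment is counted
```

With a single sample there is no pair to difference, so `lost` is zero, while the injected zero gives the first sample a baseline and `recovered` equals the true count of one.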
Recent articles
- 2026-04-16 — sources/2026-04-16-airbnb-statsd-to-otel-metrics-pipeline (StatsD → OTLP migration via shared-library dual-write; two-tier vmagent streaming aggregation at 100M+ samples/sec; delta temporality for top emitters; zero-injection for sparse counters)
- 2026-04-14 — sources/2026-04-14-airbnb-privacy-first-connections (privacy-first identity model for social Experiences: User ID ↔ many context-scoped Profile IDs, Himeji authorization with write-time relation denormalization, AI-assisted audit+refactor migration)
- 2026-03-17 — sources/2026-03-17-airbnb-observability-ownership-migration (5-year vendor → in-house Prometheus/PromQL metrics migration; intent-preserving translation, metadata engine, alerts-as-code, own the interaction layer)
- 2026-03-12 — sources/2026-03-12-airbnb-destination-recommendation-transformer (transformer-based destination recommendation model; user actions as tokens with summed city + region + days-to-today embeddings; 14 training examples per booking = 7 active + 7 dormant to balance short-term and long-term intent; multi-task region + city heads to inject geolocation hierarchy; autosuggest + abandoned-search email applications, A/B wins in non-English-primary regions)
- 2026-03-04 — sources/2026-03-04-airbnb-alert-backtesting-change-reports (deep-dive on the Reliability XP alert-authoring platform: local-first dev + Change Reports + bulk alert backtesting hooking Prometheus's rules/manager.go; per-backtest K8s pod isolation; 300K alerts migrated, 90% alert-noise reduction, month → afternoon iteration cycle)
- 2026-02-18 — sources/2026-02-18-airbnb-sitar-dynamic-configuration (Sitar: dynamic config platform architecture)
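The alert-backtesting idea summarized in the 2026-03-04 entry (replaying a proposed alert against historical metric data and scoring how noisy it would have been) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the threshold-plus-duration rule shape, the `count_firings` helper, and the use of firing count as the "noisiness" score. The actual platform hooks into Prometheus's rule manager and is not modeled.

```python
from typing import List


def count_firings(values: List[float], threshold: float, for_steps: int) -> int:
    """Count how often a rule would have started firing over historical data.

    The hypothetical rule fires once a value has stayed above `threshold`
    for `for_steps` consecutive samples; each uninterrupted breach counts once.
    """
    firings = 0
    streak = 0
    for value in values:
        if value > threshold:
            streak += 1
            if streak == for_steps:
                firings += 1
        else:
            streak = 0
    return firings


# Historical samples for one metric (made-up data).
history = [1, 9, 1, 9, 9, 9, 1, 9, 1, 9, 9]

# Compare the current rule with a proposed change, as a diff-style report might.
current_noise = count_firings(history, threshold=5, for_steps=1)
proposed_noise = count_firings(history, threshold=5, for_steps=2)
```

In this made-up history the current rule would have fired four times and the proposed rule twice, which is the kind of before/after comparison a reviewer could inspect before merging an alert change.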