Expedia Group
Expedia Group Tech blog (medium.com/expedia-group-tech). Tier-3 source on the sysdesign-wiki. Mix of product-PR, hiring, and substantive architecture content; the architecture posts skew toward data platform (Iceberg / Spark / Trino), streaming (Kafka Streams), ML platform (Feast / vector DBs / embeddings), search/ranking experimentation (interleaving as an A/B alternative), LLM-assisted reliability engineering (STAR — Service Telemetry Analyzer), and testing/observability practices.
Key systems
- systems/expedia-embedding-store — Expedia ML Platform team's centralized embedding platform: vector-database-backed service with standardized APIs, Feast-managed collection metadata, three ingestion modes (batch via Feast materialization on Spark, Insert API, on-the-fly model invocation), online/offline storage with offline-→-online restore, and similarity + hybrid search surfaces. Canonical wiki reference for the centralized-embedding-platform pattern and for Feast being used as an embedding-collection registry rather than only a feature-view registry.
- systems/feast — used by the Embedding Store as its metadata layer (associated service + model + version per collection) — extends Feast's usage scope from features to embeddings.
- systems/apache-iceberg — the open table format for which the 2025-09-30 `MERGE INTO` vs `INSERT OVERWRITE` post prescribes a row-level update strategy; Expedia is one of the named Iceberg shops whose public guidance informs the canonical patterns.
- systems/kafka / systems/kafka-streams — the streaming substrate behind the 2025-11-11 sub-topology / partition-colocation post. Expedia runs Kafka Streams in production for near-real-time pipelines; the post is a production-debugging retrospective that articulates a subtle Kafka Streams guarantee (partition colocation is sub-topology-scoped, not topic-count-scoped) and a structural fix (patterns/shared-state-store-as-topology-unifier).
- systems/trino / systems/trino-gateway — the distributed SQL engine + its open-source query gateway that Expedia runs in production (2026-03-24 post). Expedia operates a workload-segregated Trino cluster fleet (Adhoc / ETL / BI) fronted by Trino Gateway, which routes queries via rules that inspect tables touched, query body, and `X-Trino-Source` headers (Tableau / Looker detection). Expedia also contributed four operator-UX features upstream — a UI for routing-rule management, a source filter on the history page, three-state cluster-health display, and a full-query-text window — promoting Trino Gateway from CLI-managed to UI-managed.
- systems/expedia-lodging-ranker — the ranking algorithm behind Expedia's lodging search results. Architecture undisclosed; the 2026-02-17 post exposes only its experimentation harness: interleaving-based evaluation with click + booking attribution, lift reported at user level, significance via winning-indicator t-test as a fast substitute for bootstrap percentile. Synthetic regressions (random-property pinning to slots 5–10, top-slot reshuffling) detected by interleaving within days; A/B testing on CVR uplift fails to detect the pinning regression even at full sample size.
- systems/expedia-star — STAR (Service Telemetry Analyzer), a FastAPI web service (with Celery + Redis task queue) that reads Datadog metrics and calls Expedia's internal generative-AI proxy to produce automated root-cause analysis for degraded services. Deliberately not an agent — no function calling, no MCP, no RAG, no memory — a fixed four-step chain (collect telemetry → per-metric analysis → aggregated RCA → return insights) with role / chaining / generated-knowledge prompt engineering. Five named use cases: incident investigation, post-incident RCA, Kubernetes/JVM troubleshooting runbooks, performance optimization, and failure-injection evaluation. Tailored to Expedia's infrastructure layer (Kubernetes + JVM). Canonical wiki instance of patterns/static-prompt-chain-over-agent-loop and the wiki's first in-production Langfuse-traced incident-RCA pipeline outside Yelp BAA.
- systems/expedia-generative-ai-proxy — Expedia's internal LLM choke point: centralised authn/authz, multi-model access, rate-limited. Upstream of every STAR prompt.
Key patterns / concepts
ML platform / embeddings (2026-01-06 Embedding Store post)
- patterns/centralized-embedding-platform — single org-wide vector-embedding service behind standardized APIs; the Expedia Embedding Store is the canonical wiki instance.
- patterns/embedding-ingestion-modes — batch (Feast materialization on Spark) + Insert API + on-the-fly model invocation; the three complementary producer lanes.
- patterns/dual-write-online-offline — every ingest lands in both online (vector DB) and offline (historical repository) stores regardless of mode; enables offline-→-online restore.
- concepts/embedding-collection — the organizational unit of the vector DB; schema + model + distance metric pinned; the platform's governance unit.
- concepts/hybrid-search — vector similarity combined with structured metadata filters (`price < 100`, `category = electronics`); wiki-distinct from concepts/hybrid-retrieval-bm25-vectors (BM25 + dense-vector fusion).
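The hybrid-search idea above (vector similarity narrowed by structured metadata filters) can be sketched in a few lines. This is a minimal in-memory illustration, not the Embedding Store's API: `Item`, `hybrid_search`, and the sample collection are hypothetical names, and filtering before ranking is an assumption, since the source does not say whether the platform pre-filters or post-filters.

```python
import math
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Item:
    """One stored embedding plus its structured metadata (hypothetical shape)."""
    item_id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_search(items: list[Item], query_vector: list[float],
                  predicate: Callable[[dict], bool], top_k: int = 3) -> list[str]:
    # Assumption: apply the metadata filter first, then rank the survivors.
    candidates = [it for it in items if predicate(it.metadata)]
    ranked = sorted(candidates,
                    key=lambda it: cosine_similarity(query_vector, it.vector),
                    reverse=True)
    return [it.item_id for it in ranked[:top_k]]


COLLECTION = [
    Item("a", [1.0, 0.0], {"price": 50, "category": "electronics"}),
    Item("b", [0.9, 0.1], {"price": 150, "category": "electronics"}),
    Item("c", [0.0, 1.0], {"price": 20, "category": "electronics"}),
    Item("d", [1.0, 0.1], {"price": 30, "category": "books"}),
]


def cheap_electronics(meta: dict) -> bool:
    # Mirrors the example filters: price < 100, category = electronics.
    return meta["price"] < 100 and meta["category"] == "electronics"


result = hybrid_search(COLLECTION, [1.0, 0.0], cheap_electronics)
```

Items `b` and `d` are close to the query vector but never reach the ranking step because the filter excludes them; that is the difference from plain similarity search.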
Iceberg / data lake (2025-09-30 post)
- patterns/merge-into-over-insert-overwrite — prefer `MERGE INTO` (row-level) over `INSERT OVERWRITE` (partition-level) on Iceberg; MOR + compaction as the paired strategy.
- concepts/merge-on-read — the Iceberg write-optimized strategy (delta files merged at query time) that makes `MERGE INTO` a win for CDC / SCD / incremental workloads.
- concepts/copy-on-write-merge — the compaction strategy that keeps MOR healthy; also an Iceberg update strategy in its own right.
- concepts/slowly-changing-dimension — named Expedia use case motivating `MERGE INTO`.
- concepts/change-data-capture — the other named motivating workload.
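The trade-off behind merge-on-read plus compaction can be shown with a toy model. The sketch below is not Iceberg's file layout or API: `MorTable` and its methods are hypothetical, and a list of in-memory change batches stands in for Iceberg's delete and data files. It only shows why row-level writes are cheap, why reads pay the merge cost, and why compaction resets that cost.

```python
from dataclasses import dataclass, field


@dataclass
class MorTable:
    """Toy merge-on-read table: a compacted base plus pending change batches."""
    base: dict = field(default_factory=dict)    # key -> row
    deltas: list = field(default_factory=list)  # each delta: key -> row, or None for a delete

    def merge_into(self, changes: dict) -> None:
        # Row-level write: record only the changed rows; the base is not rewritten.
        self.deltas.append(dict(changes))

    def read(self) -> dict:
        # Readers merge every pending delta on top of the base, in write order.
        view = dict(self.base)
        for delta in self.deltas:
            for key, row in delta.items():
                if row is None:
                    view.pop(key, None)
                else:
                    view[key] = row
        return view

    def compact(self) -> None:
        # Compaction folds the deltas into a new base so reads stay cheap.
        self.base = self.read()
        self.deltas = []


table = MorTable(base={1: {"status": "booked"}, 2: {"status": "booked"}})
table.merge_into({2: {"status": "cancelled"}, 3: {"status": "booked"}})  # CDC-style upsert
table.merge_into({1: None})                                              # row-level delete

before = table.read()
pending = len(table.deltas)
table.compact()
after = table.read()
```

Without `compact()`, every read replays a growing list of deltas, which is the caveat the entries above attach to any MOR deployment.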
Kafka Streams (2025-11-11 post)
- concepts/sub-topology — the Kafka Streams structural unit whose boundaries the partition assignor actually honors; the load-bearing concept in the 2025-11-11 post.
- concepts/partition-colocation — cross-topic colocation guarantee Expedia expected but didn't get until they unified their sub-topologies.
- concepts/cache-locality — the property a per-instance local Guava cache needed but lost when keys were sprayed across instances.
- patterns/shared-state-store-as-topology-unifier — Expedia's named fix: a Kafka Streams state store attached to both branches specifically to force sub-topology unification.
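A small simulation makes the sub-topology point concrete. Kafka Streams identifies tasks by sub-topology and partition, and a task reads the same partition number from every source topic of its sub-topology. The sketch below assumes a plain round-robin of tasks over instances, which is a deliberate simplification and not the real assignor; the function names and topic names are hypothetical.

```python
from itertools import cycle


def build_tasks(subtopologies: list[list[str]], num_partitions: int) -> dict:
    """Map task id (sub_topology, partition) to the topic-partitions it consumes."""
    tasks = {}
    for sub_id, topics in enumerate(subtopologies):
        for p in range(num_partitions):
            tasks[(sub_id, p)] = [(topic, p) for topic in topics]
    return tasks


def assign_round_robin(tasks: dict, instances: list[str]) -> dict:
    """Simplified assignment: hand out tasks in sorted order, one instance at a time."""
    owner = {}
    for task_id, instance in zip(sorted(tasks), cycle(instances)):
        for topic_partition in tasks[task_id]:
            owner[topic_partition] = instance
    return owner


def colocated_partitions(owner: dict, topic_a: str, topic_b: str,
                         num_partitions: int) -> list[int]:
    """Partition numbers whose topic_a and topic_b copies share an instance."""
    return [p for p in range(num_partitions)
            if owner[(topic_a, p)] == owner[(topic_b, p)]]


INSTANCES = ["instance-0", "instance-1"]

# Two independent branches: each topic lives in its own sub-topology.
split = assign_round_robin(build_tasks([["topic-a"], ["topic-b"]], 3), INSTANCES)

# A shared state store joins both branches into a single sub-topology.
unified = assign_round_robin(build_tasks([["topic-a", "topic-b"]], 3), INSTANCES)

split_colocated = colocated_partitions(split, "topic-a", "topic-b", 3)
unified_colocated = colocated_partitions(unified, "topic-a", "topic-b", 3)
```

In the split case the two topics produce separate tasks, so nothing ties partition 0 of one topic to partition 0 of the other. In the unified case one task owns both, so colocation holds by construction, which is what the shared-state-store fix relies on.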
Trino / query-engine fleet (2026-03-24 post)
- patterns/query-gateway — the SQL-engine-fleet realisation of single-endpoint + workload-aware routing; Trino Gateway is the canonical instance.
- patterns/workload-segregated-clusters — Adhoc / ETL / BI (optionally + metadata) cluster segregation, each tuned to its own workload shape (concurrency × query-complexity profile); named by Expedia as "a common pattern for organizations using Trino at scale."
- patterns/routing-rules-as-config — rules as name + description + condition + actions, evaluated per query against `trinoQueryProperties` + `request` surfaces; UI-managed after Expedia's contribution.
- patterns/no-downtime-cluster-upgrade — blue/green or canary cluster swap behind the gateway; one of the four headline gateway advantages.
- concepts/workload-aware-routing — route based on query shape (tables touched, body text, source application), not round-robin.
- concepts/single-endpoint-abstraction — one URL for the fleet; enables every other property the gateway provides.
- concepts/cluster-health-check — HEALTHY / UNHEALTHY / PENDING trichotomy consumed by the Gateway's `RoutingManager`; the "PENDING" state is load-bearing for distinguishing starting-up from broken.
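The routing ideas in this section (rules with a name, description, condition, and action; routing on tables, body, and source; a three-state health check) fit together roughly as follows. This is a hypothetical Python model, not Trino Gateway's rule syntax or its `RoutingManager`: the class names, the first-match evaluation order, the `LARGE_TABLES` set, and the choice of target group for each rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Query:
    body: str
    tables: set
    source: str = ""  # value of the X-Trino-Source header


@dataclass
class RoutingRule:
    name: str
    description: str
    condition: Callable[[Query], bool]
    routing_group: str  # the rule's action: which cluster group to target


LARGE_TABLES = {"warehouse.bookings_history"}  # hypothetical table name

RULES = [
    RoutingRule("bi-source", "Tableau / Looker traffic goes to the BI cluster",
                lambda q: any(tool in q.source for tool in ("Tableau", "Looker")), "bi"),
    RoutingRule("metadata-offload", "Cheap metadata queries go to a small cluster",
                lambda q: q.body.strip().lower() in ("select version()", "show catalogs"),
                "metadata"),
    RoutingRule("large-table-isolation", "Queries on very large tables are isolated",
                lambda q: bool(q.tables & LARGE_TABLES), "etl"),
]


def route(query: Query, rules: list, health: dict, default: str = "adhoc") -> str:
    # Assumption: first matching rule wins, and only HEALTHY groups are routable.
    # PENDING (still starting) and UNHEALTHY (broken) groups are both skipped.
    for rule in rules:
        if rule.condition(query) and health.get(rule.routing_group) == "HEALTHY":
            return rule.routing_group
    return default


health = {"bi": "HEALTHY", "metadata": "HEALTHY", "etl": "HEALTHY", "adhoc": "HEALTHY"}

dashboard = Query("select * from sales", {"sales"}, source="Tableau Desktop")
metadata = Query("SHOW CATALOGS", set())
heavy = Query("select count(*) from warehouse.bookings_history",
              {"warehouse.bookings_history"})

routed = [route(q, RULES, health) for q in (dashboard, metadata, heavy)]
routed_while_bi_starts = route(dashboard, RULES, {**health, "bi": "PENDING"})
```

Keeping the condition as data attached to a named rule is what lets a UI list, edit, and describe rules without touching gateway code.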
Search / ranking experimentation (2026-02-17 post)
- concepts/interleaving-testing — the technique: mix ranking A and ranking B into one displayed list per user, attribute events back to the source ranking. Canonical wiki reference. Expedia's lodging-search team uses it as an accelerated screening layer upstream of A/B testing for ranking changes.
- concepts/lift-metric — the direction-of-preference metric aggregating per-search winning variants into a single scalar. Reported per-user by default; tracked independently for clicks and bookings.
- concepts/winning-indicator-t-test — Expedia's significance test: t-test on the distribution of per-search winning indicators against zero; "virtually the same results" as bootstrap percentile at "considerably faster" cost.
- concepts/bootstrap-percentile-method — the non-parametric baseline Expedia's t-test substitutes for at production scale.
- concepts/test-sensitivity — the headline axis on which interleaving wins vs A/B: detects random-property-pinning regression in a few days where A/B on CVR fails at full sample size.
- concepts/conversion-rate-uplift — the magnitude metric A/B reports but interleaving doesn't; still required downstream for launch decisions.
- patterns/interleaved-ranking-evaluation — the end-to-end loop (produce two rankings → interleave → attribute events → lift per click/booking → t-test significance → promote or kill).
- patterns/t-test-over-bootstrap — the generalised significance-speedup pattern extracted from Expedia's specific substitution.
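The winning-indicator test can be sketched with the standard library alone. Several parts are assumptions rather than Expedia's definitions: the indicator is coded as +1 / -1 / 0, lift is taken as the mean indicator, the computation is per search rather than per user, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, stdev


def winning_indicator(events_a: int, events_b: int) -> int:
    """+1 if ranking B collected more attributed events in a search, -1 if A did, 0 on a tie."""
    return (events_b > events_a) - (events_a > events_b)


def lift(indicators: list[int]) -> float:
    # Assumed definition: average direction of preference across searches.
    return mean(indicators)


def t_test_against_zero(indicators: list[int]) -> tuple[float, float]:
    """One-sample t statistic for 'mean indicator == 0', with an approximate two-sided p-value."""
    n = len(indicators)
    spread = stdev(indicators)
    if spread == 0:
        raise ValueError("all indicators are identical; the t statistic is undefined")
    t_stat = mean(indicators) / (spread / math.sqrt(n))
    # Normal approximation to the t distribution (acceptable for large n only).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Synthetic data: (clicks attributed to A, clicks attributed to B) for each search.
searches = [(0, 1)] * 60 + [(1, 0)] * 40
indicators = [winning_indicator(a, b) for a, b in searches]

observed_lift = lift(indicators)
t_stat, p_value = t_test_against_zero(indicators)
```

A bootstrap percentile interval would resample `indicators` many times and recompute the lift each time; the t-test reaches a verdict from one pass over the data, which is the speed argument the entries above describe.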
LLM-assisted reliability engineering (2026-04-28 STAR post)
- patterns/static-prompt-chain-over-agent-loop — STAR is the canonical wiki instance: deliberately ship a fixed multi-step prompt chain rather than an agent loop, because the agent's failure modes are "less understood" at Expedia's current evaluation maturity.
- patterns/multi-step-rca-workflow — STAR's four-step shape (collect telemetry → per-metric analysis → aggregated RCA → insights + recommendations) as a generalised RCA pattern.
- concepts/prompt-chaining — the orchestration primitive STAR uses; canonicalised on the wiki via STAR's ingest.
- concepts/role-prompting — per-step persona/expertise framing (per-metric analyst at step 2, RCA engineer at step 3).
- concepts/generated-knowledge-prompting — step-3 aggregated RCA consumes step-2 per-metric analyses as generated knowledge.
- concepts/token-heavy-system — STAR's self-label; first wiki canonicalisation of the concept. Sized via GPT-4o tokenizer + 4k per-response cap.
- concepts/back-of-the-envelope-estimation — first-class design discipline for LLM feasibility analysis; STAR walks the facts + assumptions + enforced-limits framework explicitly.
- concepts/time-to-know-vs-time-to-recover — the KPIs STAR optimises for on its incident-investigation use case. TTK/TTR named verbatim by Expedia.
- concepts/automated-root-cause-analysis — the discipline STAR realises with LLMs; wiki-adjacent to Meta's heuristic-based Presto RCA analyzers.
- concepts/chaos-engineering — complement at Expedia for the failure-injection evaluation use case.
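The fixed four-step chain can be sketched as ordinary function composition with no tool calls, retrieval, or memory. This is an illustration of the pattern, not STAR's code: the prompt wording, function names, and sample telemetry are hypothetical, and a stub stands in for the call to the generative-AI proxy.

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, completion out


def analyze_metric(llm: LLM, name: str, series: list) -> str:
    # Step 2, role prompting: frame the model as a specialist for one metric.
    prompt = (f"You are an SRE specialising in {name} metrics.\n"
              f"Describe any anomalies in this series: {series}")
    return llm(prompt)


def aggregate_rca(llm: LLM, analyses: dict) -> str:
    # Step 3, generated-knowledge prompting: earlier outputs become the context.
    findings = "\n".join(f"- {name}: {text}" for name, text in analyses.items())
    prompt = ("You are a root-cause-analysis engineer.\n"
              "Using the per-metric findings below, state the most likely root cause "
              "and recommendations:\n" + findings)
    return llm(prompt)


def run_chain(llm: LLM, collect: Callable[[], dict]) -> dict:
    telemetry = collect()                                  # step 1: collect telemetry
    analyses = {name: analyze_metric(llm, name, series)    # step 2: per-metric analysis
                for name, series in telemetry.items()}
    rca = aggregate_rca(llm, analyses)                     # step 3: aggregated RCA
    return {"per_metric": analyses, "rca": rca}            # step 4: return insights


calls = []


def stub_llm(prompt: str) -> str:
    """Stand-in for the model call; records prompts so the chain can be inspected."""
    calls.append(prompt)
    return f"finding {len(calls)}"


def collect_sample_telemetry() -> dict:
    # Hypothetical values; a real collector would query the metrics backend.
    return {"jvm_heap_used": [0.61, 0.72, 0.95], "container_restarts": [0, 0, 3]}


report = run_chain(stub_llm, collect_sample_telemetry)
```

The number of model calls is fixed by the telemetry (one per metric plus one aggregation), so token cost and failure modes can be reasoned about ahead of time, unlike an agent loop that decides its own next step.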
Recent articles
- 2026-04-28 — sources/2026-04-28-expedia-expedias-service-telemetry-analyzer (STAR (Service Telemetry Analyzer) — Expedia's FastAPI + Celery + Redis web service for automated root-cause analysis. Reads Datadog metrics via the Datadog API; calls the internal generative-AI proxy (authn/authz, multi-model, rate-limited) to run a four-step workflow — collect telemetry → per-metric analysis → aggregated RCA → insights + recommendations. Named prompt techniques: role prompting, chaining, generated-knowledge prompting. Named deliberate exclusions (the load-bearing architectural statement): no function calling / tool use, no MCP, no RAG, no short- or long-term memory, no conversational UI, no streaming platform — "avoids the additional and currently less understood failure modes of an agent." First-class token-heavy-system framing; BOTE against GPT-4o tokenizer with 4k per-response cap as the anchor. V0 used FastAPI `async/await` + background tasks; V1 migrated to Celery/Redis "as part of scaling up" — explicitly not Kafka because the traffic shape is request-response. Ingested signals: inbound/outbound traffic + errors, HTTP/gRPC/GraphQL latency, container CPU/memory, Kubernetes (restarts, probe failures), JVM (heap, GC) — tailored to Expedia's JVM-on-Kubernetes fleet. Five use cases: incident investigation (TTK/TTR reduction), post-incident RCA, troubleshooting runbooks (container-restart example linked), performance optimization (JVM heap spike), failure-injection recommendation + analysis (complements Expedia's chaos engineering platform). Evaluation is qualitative / SME-gated, traced in Langfuse. Roadmap: MCP tool use, dependency-graph context, conversational UI, per-modality model selection. Design retrospective; no production numbers disclosed. Canonicalises patterns/static-prompt-chain-over-agent-loop + patterns/multi-step-rca-workflow on the wiki.)
- 2026-03-24 — sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway (Operating Trino at scale behind Trino Gateway: gateway gives clients a single URL + workload-aware routing rules + blue/green-or-canary cluster upgrades + transparent capacity changes. Canonical segregation pattern is Adhoc / ETL / BI clusters — each tuned to a specific concurrency × query-complexity profile — with the gateway matching query shape to cluster shape. Three named routing-rule shapes walked in detail: large-table isolation, metadata-query offload (`select version()` / `show catalogs` → single-node metadata cluster; reduces BI dashboard extract failures), BI-source routing (`X-Trino-Source` header contains "Tableau" / "Looker" → BI-optimised cluster). Expedia's four upstream UX contributions: UI for routing-rule view/edit (PR #433), source filter on history page (PR #551), three-state cluster-health display (HEALTHY / UNHEALTHY / PENDING, PR #601), full query text window (PR #740). Feature / contribution retrospective, no production numbers disclosed.)
- 2026-02-17 — sources/2026-02-17-expedia-interleaving-for-accelerated-testing (Lodging-search team's interleaving framework as an accelerated alternative to A/B testing for ranking changes. Per-search click + booking events attributed back to source ranking A or B; per-search winning variant aggregated into a lift metric (user-level default). Significance via t-test on winning indicators substituting for bootstrap percentile method — "virtually the same results ... considerably faster." Sensitivity gain demonstrated on deliberately-deteriorated treatments: random-property pinning to positions 5–10 and top-slot reshuffling — interleaving detects the regression within days, A/B on CVR fails to detect pinning even at full sample size; click events significant "after the first day." Introduces systems/expedia-lodging-ranker as the subject-of-measurement system; patterns/interleaved-ranking-evaluation as the end-to-end pattern; patterns/t-test-over-bootstrap as the generalised significance-speedup pattern. Method / experimentation retrospective; no disclosed QPS / CVR / ranker architecture.)
- 2026-01-06 — sources/2026-01-06-expedia-powering-vector-embedding-capabilities (ML Platform team's centralized Embedding Store Service: vector-DB-backed, systems/feast-managed collection metadata (associated service + model/version), three ingestion modes — batch via Feast materialization on Spark, real-time Insert API, on-the-fly model invocation — all dual-written to online + offline stores with SQL-gated offline-→-online restore; similarity search + hybrid search (vector + metadata filter) query surfaces. First wiki instance of Feast used as an embedding-collection registry. Platform-design overview; no production numbers disclosed.)
- 2025-11-11 — sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation (Kafka Streams production-debugging case: two-topic cache-sharing pipeline expected cross-topic partition colocation from identical partition counts + similar keys; observed identical keys processed on different instances in production; root cause was two implicitly-separate sub-topologies; fix was a shared state store attached to both branches to force sub-topology unification — named as a general pattern. No benchmarks, but the architectural lesson is cleanly stated: topology design directly influences partition assignment.)
- 2025-09-30 — sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite (Iceberg best-practices primer: `MERGE INTO` (row-level, MOR) as the default; `INSERT OVERWRITE` (partition-level) reserved for genuine full-partition rewrites; compaction as the load-bearing caveat for any MOR deployment.)