Expedia Group¶
Expedia Group Tech blog (medium.com/expedia-group-tech). Tier-3 source on the sysdesign-wiki. Mix of product-PR, hiring, and substantive architecture content; the architecture posts skew toward data platform (Iceberg / Spark / Trino), streaming (Kafka Streams), ML platform (Feast / vector DBs / embeddings), search/ranking experimentation (interleaving as an A/B alternative), and testing/observability practices.
Key systems¶
- systems/expedia-embedding-store — Expedia ML Platform team's centralized embedding platform: vector-database-backed service with standardized APIs, Feast-managed collection metadata, three ingestion modes (batch via Feast materialization on Spark, Insert API, on-the-fly model invocation), online/offline storage with offline-→-online restore, and similarity + hybrid search surfaces. Canonical wiki reference for the centralized-embedding-platform pattern and for Feast being used as an embedding-collection registry rather than only a feature-view registry.
- systems/feast — used by the Embedding Store as its metadata layer (associated service + model + version per collection) — extends Feast's usage scope from features to embeddings.
- systems/apache-iceberg — the open table format for which the 2025-09-30 `MERGE INTO` vs `INSERT OVERWRITE` post prescribes a row-level update strategy; Expedia is one of the named Iceberg shops whose public guidance informs the canonical patterns.
- systems/kafka / systems/kafka-streams — the streaming substrate behind the 2025-11-11 sub-topology / partition-colocation post. Expedia runs Kafka Streams in production for near-real-time pipelines; the post is a production-debugging retrospective that articulates a subtle Kafka Streams guarantee (partition colocation is sub-topology-scoped, not topic-count-scoped) and a structural fix (patterns/shared-state-store-as-topology-unifier).
- systems/trino / systems/trino-gateway — the distributed SQL engine + its open-source query gateway that Expedia runs in production (2026-03-24 post). Expedia operates a workload-segregated Trino cluster fleet (Adhoc / ETL / BI) fronted by Trino Gateway, which routes queries via rules that inspect tables touched, query body, and `X-Trino-Source` headers (Tableau / Looker detection). Expedia also contributed four operator-UX features upstream — a UI for routing-rule management, a source filter on the history page, three-state cluster-health display, and a full-query-text window — promoting Trino Gateway from CLI-managed to UI-managed.
- systems/expedia-lodging-ranker — the ranking algorithm behind Expedia's lodging search results. Architecture undisclosed; the 2026-02-17 post exposes only its experimentation harness: interleaving-based evaluation with click + booking attribution, lift reported at user level, significance via winning-indicator t-test as a fast substitute for bootstrap percentile. Synthetic regressions (random-property pinning to slots 5–10, top-slot reshuffling) detected by interleaving within days; A/B testing on CVR uplift fails to detect the pinning regression even at full sample size.
Key patterns / concepts¶
ML platform / embeddings (2026-01-06 Embedding Store post)¶
- patterns/centralized-embedding-platform — single org-wide vector-embedding service behind standardized APIs; the Expedia Embedding Store is the canonical wiki instance.
- patterns/embedding-ingestion-modes — batch (Feast materialization on Spark) + Insert API + on-the-fly model invocation; the three complementary producer lanes.
- patterns/dual-write-online-offline — every ingest lands in both online (vector DB) and offline (historical repository) stores regardless of mode; enables offline-→-online restore.
- concepts/embedding-collection — the organizational unit of the vector DB; schema + model + distance metric pinned; the platform's governance unit.
- concepts/hybrid-search — vector similarity combined with structured metadata filters (`price < 100`, `category = electronics`); wiki-distinct from concepts/hybrid-retrieval-bm25-vectors (BM25 + dense-vector fusion).
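
The dual-write, restore, and hybrid-search ideas above can be sketched with an in-memory toy. This is not the Embedding Store's API: the `Collection` class, its method names, the choice of cosine similarity, and the filter-then-rank order are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Collection:
    """Hypothetical collection: the dimension is pinned when it is created."""
    name: str
    dim: int
    online: dict = field(default_factory=dict)   # stands in for the vector DB
    offline: list = field(default_factory=list)  # stands in for the historical store

    def insert(self, key, vector, metadata):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        record = (key, list(vector), dict(metadata))
        self.online[key] = record     # write 1: serving path
        self.offline.append(record)   # write 2: append-only history

    def restore_online(self):
        """Rebuild the online store from offline history (latest write per key wins)."""
        self.online = {record[0]: record for record in self.offline}

    def hybrid_search(self, query, k, where=lambda meta: True):
        """Keep records whose metadata passes `where`, then rank them by similarity."""
        candidates = [r for r in self.online.values() if where(r[2])]
        candidates.sort(key=lambda r: cosine(query, r[1]), reverse=True)
        return [r[0] for r in candidates[:k]]


def cheap_electronics(meta):
    # Mirrors the filter example in the text: price < 100, category = electronics.
    return meta["price"] < 100 and meta["category"] == "electronics"


products = Collection(name="products", dim=2)
products.insert("tv", [1.0, 0.0], {"price": 450, "category": "electronics"})
products.insert("cable", [0.9, 0.1], {"price": 12, "category": "electronics"})
products.insert("mug", [0.8, 0.2], {"price": 9, "category": "kitchen"})

hits = products.hybrid_search([1.0, 0.0], k=2, where=cheap_electronics)
```

Here `tv` is the closest vector but fails the price filter, so only `cable` survives. Because every insert also lands in `offline`, calling `restore_online()` after losing the online store brings all three records back.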
Iceberg / data lake (2025-09-30 post)¶
- patterns/merge-into-over-insert-overwrite — prefer `MERGE INTO` (row-level) over `INSERT OVERWRITE` (partition-level) on Iceberg; MOR + compaction as the paired strategy.
- concepts/merge-on-read — the Iceberg write-optimized strategy (delta files merged at query time) that makes `MERGE INTO` a win for CDC / SCD / incremental workloads.
- concepts/copy-on-write-merge — the compaction strategy that keeps MOR healthy; also an Iceberg update strategy in its own right.
- concepts/slowly-changing-dimension — named Expedia use case motivating `MERGE INTO`.
- concepts/change-data-capture — the other named motivating workload.
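
The row-level versus partition-level distinction can be illustrated with a small simulation over lists of dicts. It models only the semantics, not Iceberg or Spark themselves; the function names and sample rows are hypothetical, and the overwrite function assumes dynamic partition overwrite (only partitions present in the incoming batch are replaced).

```python
def insert_overwrite(table, incoming, partition_key):
    """Partition-level: every partition present in `incoming` is replaced wholesale."""
    touched = {row[partition_key] for row in incoming}
    kept = [row for row in table if row[partition_key] not in touched]
    return kept + list(incoming)


def merge_into(table, incoming, match_key):
    """Row-level: update rows that match on `match_key`, insert the rest."""
    merged = {row[match_key]: row for row in table}
    for row in incoming:
        merged[row[match_key]] = row  # matched: update; not matched: insert
    return list(merged.values())


table = [
    {"id": 1, "day": "2025-09-29", "status": "booked"},
    {"id": 2, "day": "2025-09-29", "status": "booked"},
    {"id": 3, "day": "2025-09-30", "status": "booked"},
]

# A CDC-style batch: one changed row inside an existing partition.
changes = [{"id": 2, "day": "2025-09-29", "status": "cancelled"}]

overwritten = insert_overwrite(table, changes, partition_key="day")
merged = merge_into(table, changes, match_key="id")
```

With a batch that carries only the changed row, the overwrite drops row 1 along with the rest of its partition, so a correct partition-level job would have to recompute the whole partition. The merge touches row 2 alone, which is why it suits CDC and SCD workloads where each batch changes a small fraction of rows.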
Kafka Streams (2025-11-11 post)¶
- concepts/sub-topology — the Kafka Streams structural unit whose boundaries the partition assignor actually honors; the load-bearing concept in the 2025-11-11 post.
- concepts/partition-colocation — cross-topic colocation guarantee Expedia expected but didn't get until they unified their sub-topologies.
- concepts/cache-locality — the property a per-instance local Guava cache needed but lost when keys were sprayed across instances.
- patterns/shared-state-store-as-topology-unifier — Expedia's named fix: a Kafka Streams state store attached to both branches specifically to force sub-topology unification.
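
Why a shared state store changes colocation can be sketched by modelling a topology as a graph and treating each connected component as one sub-topology. This is a simplified model rather than Kafka Streams code: the node names are hypothetical, and the only behaviour assumed is that tasks are created per sub-topology and partition, so two source topics share a task only when they sit in the same component.

```python
def sub_topologies(edges):
    """Group nodes into connected components; each component models one sub-topology."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def colocated(edges, topic_a, topic_b):
    """True when both source topics feed the same sub-topology, hence the same tasks."""
    return any(topic_a in group and topic_b in group for group in sub_topologies(edges))


# Two branches that share only an in-process cache: nothing in the topology links them.
separate = [("topic-a", "processor-a"), ("topic-b", "processor-b")]

# The same branches, each also attached to one shared state store.
unified = separate + [("processor-a", "shared-store"), ("processor-b", "shared-store")]

before = colocated(separate, "topic-a", "topic-b")
after = colocated(unified, "topic-a", "topic-b")
```

In the `separate` graph the two topics fall into different components, so partition 0 of each topic belongs to a different task and can be assigned to different instances. Attaching both processors to `shared-store` merges the components, which is the unification effect the pattern relies on.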
Trino / query-engine fleet (2026-03-24 post)¶
- patterns/query-gateway — the SQL-engine-fleet realisation of single-endpoint + workload-aware routing; Trino Gateway is the canonical instance.
- patterns/workload-segregated-clusters — Adhoc / ETL / BI (optionally + metadata) cluster segregation, each tuned to its own workload shape (concurrency × query-complexity profile); named by Expedia as "a common pattern for organizations using Trino at scale."
- patterns/routing-rules-as-config — rules as `name + description + condition + actions`, evaluated per query against `trinoQueryProperties` + `request` surfaces; UI-managed since Expedia's contribution.
- patterns/no-downtime-cluster-upgrade — blue/green or canary cluster swap behind the gateway; one of the four headline gateway advantages.
- concepts/workload-aware-routing — route based on query shape (tables touched, body text, source application), not round-robin.
- concepts/single-endpoint-abstraction — one URL for the fleet; enables every other property the gateway provides.
- concepts/cluster-health-check — HEALTHY / UNHEALTHY / PENDING trichotomy consumed by the Gateway's `RoutingManager`; the "PENDING" state is load-bearing for distinguishing starting-up from broken.
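
A rough sketch of how rule-based routing and the three-state health check fit together follows. It is not Trino Gateway's rule syntax or API: the `Rule`, `Query`, and `route` names, the first-match-wins evaluation, the routing-group names, and the table `lake.events` are all assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Health(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    PENDING = "PENDING"


@dataclass
class Query:
    body: str
    tables: frozenset
    source: str  # value of the X-Trino-Source header


@dataclass
class Rule:
    name: str
    description: str
    condition: Callable[[Query], bool]
    routing_group: str  # the rule's action in this sketch: choose a routing group


RULES = [
    Rule("metadata", "Send trivial metadata queries to a small cluster",
         lambda q: q.body.strip().lower() in {"select version()", "show catalogs"},
         "metadata"),
    Rule("bi-source", "Send BI tools to the BI cluster",
         lambda q: any(tool in q.source for tool in ("Tableau", "Looker")),
         "bi"),
    Rule("large-table", "Isolate queries that touch a known large table",
         lambda q: "lake.events" in q.tables,  # hypothetical table name
         "etl"),
]


def route(query, clusters, default_group="adhoc"):
    """Pick the group of the first matching rule, then the first HEALTHY cluster in it."""
    group = next((r.routing_group for r in RULES if r.condition(query)), default_group)
    for name, (cluster_group, health) in clusters.items():
        if cluster_group == group and health is Health.HEALTHY:
            return name
    return None  # no healthy cluster in the chosen group


clusters = {
    "bi-blue": ("bi", Health.PENDING),  # still starting up: skipped, but not broken
    "bi-green": ("bi", Health.HEALTHY),
    "adhoc-1": ("adhoc", Health.HEALTHY),
    "meta-1": ("metadata", Health.HEALTHY),
}

tableau = route(Query("select * from sales", frozenset({"dw.sales"}), "Tableau Desktop"), clusters)
version = route(Query("SELECT version()", frozenset(), "trino-cli"), clusters)
adhoc = route(Query("select 1", frozenset(), "trino-cli"), clusters)
```

Clients see one `route` entry point regardless of how many clusters sit behind it, which is the single-endpoint property. Keeping PENDING separate from UNHEALTHY lets an operator tell a cluster that is still starting (`bi-blue`) from one that has failed, while routing ignores both.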
Search / ranking experimentation (2026-02-17 post)¶
- concepts/interleaving-testing — the technique: mix ranking A and ranking B into one displayed list per user, attribute events back to the source ranking. Canonical wiki reference. Expedia's lodging-search team uses it as an accelerated screening layer upstream of A/B testing for ranking changes.
- concepts/lift-metric — the direction-of-preference metric aggregating per-search winning variants into a single scalar. Reported per-user by default; tracked independently for clicks and bookings.
- concepts/winning-indicator-t-test — Expedia's significance test: t-test on the distribution of per-search winning indicators against zero; "virtually the same results" as bootstrap percentile at "considerably faster" cost.
- concepts/bootstrap-percentile-method — the non-parametric baseline Expedia's t-test substitutes for at production scale.
- concepts/test-sensitivity — the headline axis on which interleaving wins vs A/B: detects random-property-pinning regression in a few days where A/B on CVR fails at full sample size.
- concepts/conversion-rate-uplift — the magnitude metric A/B reports but interleaving doesn't; still required downstream for launch decisions.
- patterns/interleaved-ranking-evaluation — the end-to-end loop (produce two rankings → interleave → attribute events → lift per click/booking → t-test significance → promote or kill).
- patterns/t-test-over-bootstrap — the generalised significance-speedup pattern extracted from Expedia's specific substitution.
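
The evaluation loop can be sketched end to end under some stated assumptions. The source does not say which interleaving scheme Expedia uses, so team-draft interleaving is swapped in here as a common choice; the +1 / 0 / -1 encoding of the winning indicator and the use of its mean as a direction-of-preference summary are likewise assumptions, not Expedia's published lift formula.

```python
import math
import random
import statistics


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; remember which ranker contributed each displayed item."""
    shown, owner = [], {}
    picks = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}
    total = len(set(ranking_a) | set(ranking_b))
    while len(shown) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = min(picks, key=picks.get)
        else:
            team = rng.choice(["A", "B"])
        candidate = next((item for item in sources[team] if item not in owner), None)
        if candidate is None:  # this team has nothing new left to offer
            team = "B" if team == "A" else "A"
            candidate = next(item for item in sources[team] if item not in owner)
        owner[candidate] = team
        shown.append(candidate)
        picks[team] += 1
    return shown, owner


def winning_indicator(events, owner):
    """+1 if B earned more events in this search, -1 if A did, 0 on a tie."""
    a = sum(1 for item in events if owner.get(item) == "A")
    b = sum(1 for item in events if owner.get(item) == "B")
    return (b > a) - (a > b)


def t_statistic(indicators):
    """One-sample t statistic for H0: the mean winning indicator is zero."""
    n = len(indicators)
    mean = statistics.fmean(indicators)
    sd = statistics.stdev(indicators)  # needs n >= 2 and non-constant data
    return mean / (sd / math.sqrt(n))


rng = random.Random(0)
shown, owner = team_draft_interleave(["h1", "h2", "h3"], ["h3", "h1", "h4"], rng)

# Hypothetical per-search indicators collected over ten searches.
indicators = [1, 1, 0, 1, -1, 1, 1, 0, 1, 1]
mean_preference = statistics.fmean(indicators)
t = t_statistic(indicators)
```

The t statistic would be compared against a critical value for `n - 1` degrees of freedom. It is a single pass over the indicators, whereas a bootstrap percentile interval resamples the same data many times, which is the cost difference behind the substitution described above.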
Recent articles¶
- 2026-03-24 — sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway (Operating Trino at scale behind Trino Gateway: gateway gives clients a single URL + workload-aware routing rules + blue/green-or-canary cluster upgrades + transparent capacity changes. Canonical segregation pattern is Adhoc / ETL / BI clusters — each tuned to a specific concurrency × query-complexity profile — with the gateway matching query shape to cluster shape. Three named routing-rule shapes walked in detail: large-table isolation, metadata-query offload (`select version()` / `show catalogs` → single-node metadata cluster; reduces BI dashboard extract failures), BI-source routing (`X-Trino-Source` header contains "Tableau" / "Looker" → BI-optimised cluster). Expedia's four upstream UX contributions: UI for routing-rule view/edit (PR #433), source filter on history page (PR #551), three-state cluster-health display (HEALTHY / UNHEALTHY / PENDING, PR #601), full query text window (PR #740). Feature / contribution retrospective, no production numbers disclosed.)
- 2026-02-17 — sources/2026-02-17-expedia-interleaving-for-accelerated-testing (Lodging-search team's interleaving framework as an accelerated alternative to A/B testing for ranking changes. Per-search click + booking events attributed back to source ranking A or B; per-search winning variant aggregated into a lift metric (user-level default). Significance via t-test on winning indicators substituting for bootstrap percentile method — "virtually the same results ... considerably faster." Sensitivity gain demonstrated on deliberately-deteriorated treatments: random-property pinning to positions 5–10 and top-slot reshuffling — interleaving detects the regression within days, A/B on CVR fails to detect pinning even at full sample size; click events significant "after the first day." Introduces systems/expedia-lodging-ranker as the subject-of-measurement system; patterns/interleaved-ranking-evaluation as the end-to-end pattern; patterns/t-test-over-bootstrap as the generalised significance-speedup pattern. Method / experimentation retrospective; no disclosed QPS / CVR / ranker architecture.)
- 2026-01-06 — sources/2026-01-06-expedia-powering-vector-embedding-capabilities (ML Platform team's centralized Embedding Store Service: vector-DB-backed, systems/feast-managed collection metadata (associated service + model/version), three ingestion modes — batch via Feast materialization on Spark, real-time Insert API, on-the-fly model invocation — all dual-written to online + offline stores with SQL-gated offline-→-online restore; similarity search + hybrid search (vector + metadata filter) query surfaces. First wiki instance of Feast used as an embedding-collection registry. Platform-design overview; no production numbers disclosed.)
- 2025-11-11 — sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation (Kafka Streams production-debugging case: two-topic cache-sharing pipeline expected cross-topic partition colocation from identical partition counts + similar keys; observed identical keys processed on different instances in production; root cause was two implicitly-separate sub-topologies; fix was a shared state store attached to both branches to force sub-topology unification — named as a general pattern. No benchmarks, but the architectural lesson is cleanly stated: topology design directly influences partition assignment.)
- 2025-09-30 — sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite (Iceberg best-practices primer: `MERGE INTO` (row-level, MOR) as the default; `INSERT OVERWRITE` (partition-level) reserved for genuine full-partition rewrites; compaction as the load-bearing caveat for any MOR deployment.)