Expedia Group

Expedia Group Tech blog (medium.com/expedia-group-tech). Tier-3 source on the sysdesign-wiki. Mix of product-PR, hiring, and substantive architecture content; the architecture posts skew toward data platform (Iceberg / Spark / Trino), streaming (Kafka Streams), ML platform (Feast / vector DBs / embeddings), search/ranking experimentation (interleaving as an A/B alternative), and testing/observability practices.

Key systems

  • systems/expedia-embedding-store — Expedia ML Platform team's centralized embedding platform: vector-database-backed service with standardized APIs, Feast-managed collection metadata, three ingestion modes (batch via Feast materialization on Spark, Insert API, on-the-fly model invocation), online/offline storage with offline-→-online restore, and similarity + hybrid search surfaces. Canonical wiki reference for the centralized-embedding-platform pattern and for Feast being used as an embedding-collection registry rather than only a feature-view registry.
  • systems/feast — used by the Embedding Store as its metadata layer (associated service + model + version per collection) — extends Feast's usage scope from features to embeddings.
  • systems/apache-iceberg — the open table format the 2025-09-30 MERGE INTO vs INSERT OVERWRITE post prescribes row-level update strategy for; Expedia is one of the named Iceberg shops whose public guidance informs the canonical patterns.
  • systems/kafka / systems/kafka-streams — the streaming substrate behind the 2025-11-11 sub-topology / partition-colocation post. Expedia runs Kafka Streams in production for near-real-time pipelines; the post is a production-debugging retrospective that articulates a subtle Kafka Streams guarantee (partition colocation is sub-topology-scoped, not topic-count-scoped) and a structural fix (patterns/shared-state-store-as-topology-unifier).
  • systems/trino / systems/trino-gateway — the distributed SQL engine + its open-source query gateway that Expedia runs in production (2026-03-24 post). Expedia operates a workload-segregated Trino cluster fleet (Adhoc / ETL / BI) fronted by Trino Gateway, which routes queries via rules that inspect tables touched, query body, and X-Trino-Source headers (Tableau / Looker detection). Expedia also contributed four operator-UX features upstream — a UI for routing-rule management, a source filter on the history page, three-state cluster-health display, and a full-query-text window — promoting Trino Gateway from CLI-managed to UI-managed.
  • systems/expedia-lodging-ranker — the ranking algorithm behind Expedia's lodging search results. Architecture undisclosed; the 2026-02-17 post exposes only its experimentation harness: interleaving-based evaluation with click + booking attribution, lift reported at user level, significance via winning-indicator t-test as a fast substitute for bootstrap percentile. Synthetic regressions (random-property pinning to slots 5–10, top-slot reshuffling) detected by interleaving within days; A/B testing on CVR uplift fails to detect the pinning regression even at full sample size.

Key patterns / concepts

ML platform / embeddings (2026-01-06 Embedding Store post)
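
The Embedding Store's two query surfaces, similarity search and hybrid search (vector + metadata filter), can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Embedding` record, the `hybrid_search` function, the sample collection, and the brute-force cosine scan (a real vector database would use an approximate-nearest-neighbour index). It is not Expedia's API.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Embedding:
    # Hypothetical record shape; the post does not disclose the real schema.
    key: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(collection, query, top_k=2, where=None):
    """Filter on metadata first, then rank survivors by cosine similarity.

    With where=None this degenerates to plain similarity search.
    """
    where = where or {}
    candidates = [
        e for e in collection
        if all(e.metadata.get(k) == v for k, v in where.items())
    ]
    ranked = sorted(candidates, key=lambda e: cosine(query, e.vector), reverse=True)
    return [e.key for e in ranked[:top_k]]


# Illustrative data only.
COLLECTION = [
    Embedding("hotel-1", [1.0, 0.0], {"region": "emea"}),
    Embedding("hotel-2", [0.9, 0.1], {"region": "apac"}),
    Embedding("hotel-3", [0.0, 1.0], {"region": "emea"}),
]
```

Calling `hybrid_search(COLLECTION, [1.0, 0.0])` ranks purely by similarity, while passing `where={"region": "emea"}` restricts the candidate set before ranking.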

Iceberg / data lake (2025-09-30 post)

Kafka Streams (2025-11-11 post)

  • concepts/sub-topology — the Kafka Streams structural unit whose boundaries the partition assignor actually honors; the load-bearing concept in the 2025-11-11 post.
  • concepts/partition-colocation — cross-topic colocation guarantee Expedia expected but didn't get until they unified their sub-topologies.
  • concepts/cache-locality — the property a per-instance local Guava cache needed but lost when keys were sprayed across instances.
  • patterns/shared-state-store-as-topology-unifier — Expedia's named fix: a Kafka Streams state store attached to both branches specifically to force sub-topology unification.
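
The sub-topology point above can be made concrete with a toy model in Python. This is a sketch under stated assumptions, not the Kafka Streams assignor: it models a task as one (sub-topology, partition) pair that owns that partition of every source topic in its sub-topology, and it hands tasks to instances with a plain round-robin. The names `build_tasks`, `assign`, and `colocated`, the topic names, and the instance names are all hypothetical.

```python
from itertools import cycle


def build_tasks(sub_topologies, partitions):
    """One task per (sub-topology, partition).

    Each task owns partition p of every source topic in its sub-topology.
    """
    return [
        (sub_id, p, topics)
        for sub_id, topics in enumerate(sub_topologies)
        for p in range(partitions)
    ]


def assign(tasks, instances):
    """Toy round-robin assignment; returns {(topic, partition): instance}.

    The real assignor is more sophisticated; this only illustrates that
    the unit being placed is the task, not the topic partition.
    """
    owner = {}
    for (_sub_id, p, topics), inst in zip(tasks, cycle(instances)):
        for topic in topics:
            owner[(topic, p)] = inst
    return owner


def colocated(owner, topic_a, topic_b, partitions):
    """True if partition p of both topics lands on the same instance for all p."""
    return all(owner[(topic_a, p)] == owner[(topic_b, p)] for p in range(partitions))


INSTANCES = ["instance-0", "instance-1"]

# Two implicitly separate sub-topologies, one source topic each.
split = assign(build_tasks([("topic-a",), ("topic-b",)], 3), INSTANCES)

# One sub-topology reading both topics (e.g. after both branches share a state store).
unified = assign(build_tasks([("topic-a", "topic-b")], 3), INSTANCES)
```

In the split layout, partition 0 of `topic-a` and partition 0 of `topic-b` belong to different tasks, so nothing ties them to the same instance; any colocation there would be incidental. In the unified layout a single task owns both, so they move together by construction.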

Trino / query-engine fleet (2026-03-24 post)

  • patterns/query-gateway — the SQL-engine-fleet realisation of single-endpoint + workload-aware routing; Trino Gateway is the canonical instance.
  • patterns/workload-segregated-clusters — Adhoc / ETL / BI (optionally + metadata) cluster segregation, each tuned to its own workload shape (concurrency × query-complexity profile); named by Expedia as "a common pattern for organizations using Trino at scale."
  • patterns/routing-rules-as-config — rules as name + description + condition + actions, evaluated per query against trinoQueryProperties + request surfaces; UI-managed since Expedia's contribution.
  • patterns/no-downtime-cluster-upgrade — blue/green or canary cluster swap behind the gateway; one of the four headline gateway advantages.
  • concepts/workload-aware-routing — route based on query shape (tables touched, body text, source application), not round-robin.
  • concepts/single-endpoint-abstraction — one URL for the fleet; enables every other property the gateway provides.
  • concepts/cluster-health-check — HEALTHY / UNHEALTHY / PENDING trichotomy consumed by the Gateway's RoutingManager; the "PENDING" state is load-bearing for distinguishing starting-up from broken.
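
A rough Python sketch of how the pieces above fit together: rules with a name, description, condition, and action, evaluated per query, with cluster health gating the result. This is a stand-in for illustration, not Trino Gateway's rule syntax or its RoutingManager. The `Query` and `Rule` types, the routing-group names, the `LARGE_TABLES` set, first-match evaluation, and the fallback to a default group when the target is not HEALTHY are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Query:
    body: str
    source: str = ""      # value of the X-Trino-Source header, if any
    tables: tuple = ()    # tables the query touches


@dataclass
class Rule:
    name: str
    description: str
    condition: Callable[[Query], bool]
    routing_group: str    # the rule's "action": which cluster group to use


# Hypothetical table list and routing-group names.
LARGE_TABLES = {"lake.events"}
METADATA_QUERIES = {"select version()", "show catalogs"}

RULES = [
    Rule("metadata-offload", "Send metadata queries to a small metadata cluster",
         lambda q: q.body.strip().rstrip(";").strip().lower() in METADATA_QUERIES,
         "metadata"),
    Rule("bi-source", "Send Tableau / Looker traffic to the BI cluster",
         lambda q: any(tool in q.source for tool in ("Tableau", "Looker")),
         "bi"),
    Rule("large-table", "Isolate queries that touch very large tables",
         lambda q: bool(LARGE_TABLES & set(q.tables)),
         "large-table"),
]

ALL_HEALTHY = {g: "HEALTHY" for g in ("metadata", "bi", "large-table", "adhoc")}


def route(query, health, default="adhoc"):
    """Return the first matching rule's group if that group is HEALTHY.

    PENDING (starting up) and UNHEALTHY (broken) groups are both skipped here;
    the distinction matters to operators, not to this toy router.
    """
    for rule in RULES:
        if rule.condition(query) and health.get(rule.routing_group) == "HEALTHY":
            return rule.routing_group
    return default
```

For example, `route(Query("SHOW CATALOGS"), ALL_HEALTHY)` selects the metadata group, and a query whose source header contains "Tableau" selects the BI group unless that group is still PENDING, in which case this sketch falls through to the default.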

Search / ranking experimentation (2026-02-17 post)

  • concepts/interleaving-testing — the technique: mix ranking A and ranking B into one displayed list per user, attribute events back to the source ranking. Canonical wiki reference. Expedia's lodging-search team uses it as an accelerated screening layer upstream of A/B testing for ranking changes.
  • concepts/lift-metric — the direction-of-preference metric aggregating per-search winning variants into a single scalar. Reported per-user by default; tracked independently for clicks and bookings.
  • concepts/winning-indicator-t-test — Expedia's significance test: t-test on the distribution of per-search winning indicators against zero; "virtually the same results" as bootstrap percentile at "considerably faster" cost.
  • concepts/bootstrap-percentile-method — the non-parametric baseline Expedia's t-test substitutes for at production scale.
  • concepts/test-sensitivity — the headline axis on which interleaving wins vs A/B: detects random-property-pinning regression in a few days where A/B on CVR fails at full sample size.
  • concepts/conversion-rate-uplift — the magnitude metric A/B reports but interleaving doesn't; still required downstream for launch decisions.
  • patterns/interleaved-ranking-evaluation — the end-to-end loop (produce two rankings → interleave → attribute events → lift per click/booking → t-test significance → promote or kill).
  • patterns/t-test-over-bootstrap — the generalised significance-speedup pattern extracted from Expedia's specific substitution.
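
The loop described above (interleave, attribute events, compute lift, test significance) can be sketched in Python. Several choices here are assumptions rather than details from the post: team-draft interleaving as the mixing scheme, a winning indicator coded as +1 / -1 / 0, lift defined as the mean indicator, search-level rather than user-level aggregation, and a normal approximation for the p-value. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def team_draft(rank_a, rank_b, rng):
    """Interleave two rankings; return (shown, credit), credit maps item -> 'A' or 'B'."""
    sources = {"A": rank_a, "B": rank_b}
    picks = {"A": 0, "B": 0}
    shown, credit = [], {}
    total = len(set(rank_a) | set(rank_b))
    while len(shown) < total:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = min(picks, key=picks.get)
        else:
            team = rng.choice("AB")
        remaining = [x for x in sources[team] if x not in credit]
        if not remaining:  # this ranking is exhausted, let the other one pick
            team = "B" if team == "A" else "A"
            remaining = [x for x in sources[team] if x not in credit]
        item = remaining[0]
        shown.append(item)
        credit[item] = team
        picks[team] += 1
    return shown, credit


def winning_indicator(credit, events):
    """+1 if ranking B earned more events in this search, -1 if A did, 0 on a tie."""
    a = sum(1 for e in events if credit.get(e) == "A")
    b = sum(1 for e in events if credit.get(e) == "B")
    return (b > a) - (a > b)


def lift_and_t(indicators):
    """One-sample t statistic of the indicators against zero.

    Returns (lift, t, p). Requires at least two searches with some variance.
    """
    n = len(indicators)
    lift = mean(indicators)
    se = stdev(indicators) / math.sqrt(n)
    t = lift / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # normal approximation, fine for large n
    return lift, t, p


# Illustrative single search: clicks and bookings would each get their own indicator.
shown, credit = team_draft(["h1", "h2", "h3", "h4"], ["h3", "h1", "h5", "h2"],
                           random.Random(0))
```

Clicks and bookings would be run through `winning_indicator` separately, giving one lift and one significance result per event type. The bootstrap percentile method would instead resample the indicators many times and read off the interval, which is where the speed difference comes from.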

Recent articles

  • 2026-03-24 — sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway (Operating Trino at scale behind Trino Gateway: gateway gives clients a single URL + workload-aware routing rules + blue/green-or-canary cluster upgrades + transparent capacity changes. Canonical segregation pattern is Adhoc / ETL / BI clusters — each tuned to a specific concurrency × query-complexity profile — with the gateway matching query shape to cluster shape. Three named routing-rule shapes walked in detail: large-table isolation, metadata-query offload (select version() / show catalogs → single-node metadata cluster; reduces BI dashboard extract failures), BI-source routing (X-Trino-Source header contains "Tableau" / "Looker" → BI-optimised cluster). Expedia's four upstream UX contributions: UI for routing-rule view/edit (PR #433), source filter on history page (PR #551), three-state cluster-health display (HEALTHY / UNHEALTHY / PENDING, PR #601), full query text window (PR #740). Feature / contribution retrospective, no production numbers disclosed.)
  • 2026-02-17 — sources/2026-02-17-expedia-interleaving-for-accelerated-testing (Lodging-search team's interleaving framework as an accelerated alternative to A/B testing for ranking changes. Per-search click + booking events attributed back to source ranking A or B; per-search winning variant aggregated into a lift metric (user-level default). Significance via t-test on winning indicators, substituting for the bootstrap percentile method: "virtually the same results ... considerably faster." Sensitivity gain demonstrated on deliberately-deteriorated treatments (random-property pinning to positions 5–10 and top-slot reshuffling): interleaving detects the regression within days, while A/B on CVR fails to detect pinning even at full sample size; click events significant "after the first day." Introduces systems/expedia-lodging-ranker as the subject-of-measurement system; patterns/interleaved-ranking-evaluation as the end-to-end pattern; patterns/t-test-over-bootstrap as the generalised significance-speedup pattern. Method / experimentation retrospective; no disclosed QPS / CVR / ranker architecture.)
  • 2026-01-06 — sources/2026-01-06-expedia-powering-vector-embedding-capabilities (ML Platform team's centralized Embedding Store Service: vector-DB-backed, systems/feast-managed collection metadata (associated service + model/version), three ingestion modes (batch via Feast materialization on Spark, real-time Insert API, on-the-fly model invocation), all dual-written to online + offline stores with SQL-gated offline-→-online restore; similarity search + hybrid search (vector + metadata filter) query surfaces. First wiki instance of Feast used as an embedding-collection registry. Platform-design overview; no production numbers disclosed.)
  • 2025-11-11 — sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation (Kafka Streams production-debugging case: two-topic cache-sharing pipeline expected cross-topic partition colocation from identical partition counts + similar keys; observed identical keys processed on different instances in production; root cause was two implicitly-separate sub-topologies; fix was a shared state store attached to both branches to force sub-topology unification — named as a general pattern. No benchmarks, but the architectural lesson is cleanly stated: topology design directly influences partition assignment.)
  • 2025-09-30 — sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite (Iceberg best-practices primer: MERGE INTO (row-level, MOR) as the default; INSERT OVERWRITE (partition-level) reserved for genuine full-partition rewrites; compaction as the load-bearing caveat for any MOR deployment.)