Expedia Group

Expedia Group Tech blog (medium.com/expedia-group-tech). Tier-3 source on the sysdesign-wiki. Mix of product-PR, hiring, and substantive architecture content; the architecture posts skew toward data platform (Iceberg / Spark / Trino), streaming (Kafka Streams), ML platform (Feast / vector DBs / embeddings), search/ranking experimentation (interleaving as an A/B alternative), LLM-assisted reliability engineering (STAR — Service Telemetry Analyzer), and testing/observability practices.

Key systems

  • systems/expedia-embedding-store — Expedia ML Platform team's centralized embedding platform: vector-database-backed service with standardized APIs, Feast-managed collection metadata, three ingestion modes (batch via Feast materialization on Spark, Insert API, on-the-fly model invocation), online/offline storage with offline-→-online restore, and similarity + hybrid search surfaces. Canonical wiki reference for the centralized-embedding-platform pattern and for Feast being used as an embedding-collection registry rather than only a feature-view registry.
  • systems/feast — used by the Embedding Store as its metadata layer (associated service + model + version per collection) — extends Feast's usage scope from features to embeddings.
  • systems/apache-iceberg — the open table format the 2025-09-30 MERGE INTO vs INSERT OVERWRITE post prescribes row-level update strategy for; Expedia is one of the named Iceberg shops whose public guidance informs the canonical patterns.
  • systems/kafka / systems/kafka-streams — the streaming substrate behind the 2025-11-11 sub-topology / partition-colocation post. Expedia runs Kafka Streams in production for near-real-time pipelines; the post is a production-debugging retrospective that articulates a subtle Kafka Streams guarantee (partition colocation is sub-topology-scoped, not topic-count-scoped) and a structural fix (patterns/shared-state-store-as-topology-unifier).
  • systems/trino / systems/trino-gateway — the distributed SQL engine + its open-source query gateway that Expedia runs in production (2026-03-24 post). Expedia operates a workload-segregated Trino cluster fleet (Adhoc / ETL / BI) fronted by Trino Gateway, which routes queries via rules that inspect tables touched, query body, and X-Trino-Source headers (Tableau / Looker detection). Expedia also contributed four operator-UX features upstream — a UI for routing-rule management, a source filter on the history page, three-state cluster-health display, and a full-query-text window — promoting Trino Gateway from CLI-managed to UI-managed.
  • systems/expedia-lodging-ranker — the ranking algorithm behind Expedia's lodging search results. Architecture undisclosed; the 2026-02-17 post exposes only its experimentation harness: interleaving-based evaluation with click + booking attribution, lift reported at user level, significance via winning-indicator t-test as a fast substitute for bootstrap percentile. Synthetic regressions (random-property pinning to slots 5–10, top-slot reshuffling) detected by interleaving within days; A/B testing on CVR uplift fails to detect the pinning regression even at full sample size.
  • systems/expedia-star — STAR (Service Telemetry Analyzer), a FastAPI web service (with Celery + Redis task queue) that reads Datadog metrics and calls Expedia's internal generative-AI proxy to produce automated root-cause analysis for degraded services. Deliberately not an agent — no function calling, no MCP, no RAG, no memory — a fixed four-step chain (collect telemetry → per-metric analysis → aggregated RCA → return insights) with role / chaining / generated-knowledge prompt engineering. Five named use cases: incident investigation, post-incident RCA, Kubernetes/JVM troubleshooting runbooks, performance optimization, and failure-injection evaluation. Tailored to Expedia's infrastructure layer (Kubernetes + JVM). Canonical wiki instance of patterns/static-prompt-chain-over-agent-loop and the wiki's first in-production Langfuse-traced incident-RCA pipeline outside Yelp BAA.
  • systems/expedia-generative-ai-proxy — Expedia's internal LLM choke point: centralised authn/authz, multi-model access, rate-limited. Upstream of every STAR prompt.
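
The fixed four-step chain described for STAR can be sketched as plain sequential code. This is a minimal illustration under stated assumptions, not Expedia's implementation: `run_static_chain`, `fetch_metrics`, `complete`, and the prompt wording are all hypothetical, and the Datadog and generative-AI-proxy clients are injected as callables because the post does not describe their interfaces. Only role prompting and chaining are shown.

```python
from typing import Callable, Dict, List

# Hypothetical signatures: the real Datadog client and generative-AI proxy
# client are not described in the post, so both are injected as callables.
FetchMetrics = Callable[[str], Dict[str, List[float]]]
Complete = Callable[[str], str]

ROLE = "You are a site reliability engineer analyzing service telemetry."


def run_static_chain(service: str, fetch_metrics: FetchMetrics, complete: Complete) -> dict:
    """Fixed four-step chain: no tool calls, no retrieval, no memory."""
    # Step 1: collect telemetry for the degraded service.
    telemetry = fetch_metrics(service)

    # Step 2: one prompt per metric (role prompting on every call).
    per_metric = {
        name: complete(f"{ROLE}\nMetric: {name}\nValues: {values}\nDescribe any anomaly.")
        for name, values in telemetry.items()
    }

    # Step 3: chain the per-metric findings into a single aggregated RCA prompt.
    findings = "\n".join(f"- {name}: {text}" for name, text in per_metric.items())
    rca = complete(f"{ROLE}\nFindings:\n{findings}\nGive the most likely root cause.")

    # Step 4: return insights to the caller.
    return {"service": service, "per_metric": per_metric, "rca": rca}
```

Because the control flow is fixed, the number of model calls is always the number of metrics plus one, which is what makes the token budget estimable up front.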

Key patterns / concepts

ML platform / embeddings (2026-01-06 Embedding Store post)

Iceberg / data lake (2025-09-30 post)

Kafka Streams (2025-11-11 post)

  • concepts/sub-topology — the Kafka Streams structural unit whose boundaries the partition assignor actually honors; the load-bearing concept in the 2025-11-11 post.
  • concepts/partition-colocation — cross-topic colocation guarantee Expedia expected but didn't get until they unified their sub-topologies.
  • concepts/cache-locality — the property a per-instance local Guava cache needed but lost when keys were sprayed across instances.
  • patterns/shared-state-store-as-topology-unifier — Expedia's named fix: a Kafka Streams state store attached to both branches specifically to force sub-topology unification.
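
Why a shared state store unifies the two branches can be illustrated with a toy model. The sketch below is not Kafka Streams code: it treats a topology as a graph and computes sub-topologies as connected components, where nodes are joined either by a stream edge or by being attached to the same state store. The function name, topic names, and store name are hypothetical.

```python
from collections import defaultdict


def sub_topologies(edges, store_links):
    """Group nodes into sub-topologies, modelled as connected components.

    edges:       (upstream, downstream) pairs between topics and processors
    store_links: (processor, store_name) pairs; processors that share a
                 store end up in the same component
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for upstream, downstream in edges:
        union(upstream, downstream)

    by_store = defaultdict(list)
    for node, store in store_links:
        find(node)  # register the node even if it has no edges
        by_store[store].append(node)
    for nodes in by_store.values():
        for other in nodes[1:]:
            union(nodes[0], other)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Two independent branches: two sub-topologies, so no cross-topic colocation.
EDGES = [("topic-a", "process-a"), ("topic-b", "process-b")]
separate = sub_topologies(EDGES, [])

# Attaching one store to both processors merges them into one sub-topology.
unified = sub_topologies(
    EDGES, [("process-a", "shared-store"), ("process-b", "shared-store")]
)
```

In this model `separate` has two groups and `unified` has one, which mirrors the post's lesson that colocation follows sub-topology boundaries rather than matching partition counts.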

Trino / query-engine fleet (2026-03-24 post)

  • patterns/query-gateway — the SQL-engine-fleet realisation of single-endpoint + workload-aware routing; Trino Gateway is the canonical instance.
  • patterns/workload-segregated-clusters — Adhoc / ETL / BI (optionally + metadata) cluster segregation, each tuned to its own workload shape (concurrency × query-complexity profile); named by Expedia as "a common pattern for organizations using Trino at scale."
  • patterns/routing-rules-as-config — rules as name + description + condition + actions, evaluated per query against trinoQueryProperties + request surfaces; UI-managed since Expedia's upstream contribution.
  • patterns/no-downtime-cluster-upgrade — blue/green or canary cluster swap behind the gateway; one of the four headline gateway advantages.
  • concepts/workload-aware-routing — route based on query shape (tables touched, body text, source application), not round-robin.
  • concepts/single-endpoint-abstraction — one URL for the fleet; enables every other property the gateway provides.
  • concepts/cluster-health-check — HEALTHY / UNHEALTHY / PENDING trichotomy consumed by the Gateway's RoutingManager; the "PENDING" state is load-bearing for distinguishing starting-up from broken.
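
The rule shape (name + description + condition + action) and the three routing examples from the post can be sketched as follows. This is an illustrative model only, not Trino Gateway's rule syntax or evaluation semantics: the `Query` and `Rule` classes, the group names, the table name, and the first-match-wins ordering are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Query:
    body: str
    source: str = ""                      # value of the X-Trino-Source header
    tables: FrozenSet[str] = frozenset()  # tables the query touches


@dataclass(frozen=True)
class Rule:
    name: str
    description: str
    condition: Callable[[Query], bool]
    routing_group: str                    # the rule's action: target cluster group


LARGE_TABLES = frozenset({"lake.bookings_history"})  # hypothetical table name
METADATA_QUERIES = {"select version()", "show catalogs"}

RULES: List[Rule] = [
    Rule(
        "metadata-offload",
        "Send cheap metadata queries to the metadata cluster",
        lambda q: q.body.strip().rstrip(";").lower() in METADATA_QUERIES,
        "metadata",
    ),
    Rule(
        "bi-source",
        "Send Tableau / Looker traffic to the BI cluster",
        lambda q: any(tool in q.source.lower() for tool in ("tableau", "looker")),
        "bi",
    ),
    Rule(
        "large-table-isolation",
        "Isolate queries that touch known large tables",
        lambda q: bool(q.tables & LARGE_TABLES),
        "large-table",
    ),
]


def route(query: Query, rules: List[Rule] = RULES, default_group: str = "adhoc") -> str:
    """Return the routing group of the first matching rule (a simplification)."""
    for rule in rules:
        if rule.condition(query):
            return rule.routing_group
    return default_group
```

For example, `route(Query("SHOW CATALOGS;"))` lands on the metadata group, while a query carrying `source="Tableau Desktop"` lands on the BI group; anything unmatched falls through to the default.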

Search / ranking experimentation (2026-02-17 post)

  • concepts/interleaving-testing — the technique: mix ranking A and ranking B into one displayed list per user, attribute events back to the source ranking. Canonical wiki reference. Expedia's lodging-search team uses it as an accelerated screening layer upstream of A/B testing for ranking changes.
  • concepts/lift-metric — the direction-of-preference metric aggregating per-search winning variants into a single scalar. Reported per-user by default; tracked independently for clicks and bookings.
  • concepts/winning-indicator-t-test — Expedia's significance test: t-test on the distribution of per-search winning indicators against zero; "virtually the same results" as bootstrap percentile at "considerably faster" cost.
  • concepts/bootstrap-percentile-method — the non-parametric baseline Expedia's t-test substitutes for at production scale.
  • concepts/test-sensitivity — the headline axis on which interleaving wins vs A/B: detects random-property-pinning regression in a few days where A/B on CVR fails at full sample size.
  • concepts/conversion-rate-uplift — the magnitude metric A/B reports but interleaving doesn't; still required downstream for launch decisions.
  • patterns/interleaved-ranking-evaluation — the end-to-end loop (produce two rankings → interleave → attribute events → lift per click/booking → t-test significance → promote or kill).
  • patterns/t-test-over-bootstrap — the generalised significance-speedup pattern extracted from Expedia's specific substitution.
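
The evaluation loop above can be sketched end to end with the standard library. Several choices here are assumptions rather than Expedia's published method: the post does not name its merging scheme, so team-draft interleaving is used as one common option; ranking B is treated as the candidate, so a positive indicator means B won the search; and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
import random
import statistics


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; return the merged list and each item's source team."""
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice("AB")
        item = next((x for x in sources[side] if x not in team), None)
        if item is None:  # this team's ranking is exhausted, use the other one
            side = "B" if side == "A" else "A"
            item = next(x for x in sources[side] if x not in team)
        merged.append(item)
        team[item] = side
        picks[side] += 1
    return merged, team


def winning_indicator(team, events):
    """+1 if B earned more attributed events in this search, -1 if A, 0 on a tie."""
    a = sum(1 for item in events if team.get(item) == "A")
    b = sum(1 for item in events if team.get(item) == "B")
    return (b > a) - (a > b)


def winning_indicator_t_test(indicators):
    """One-sample t statistic of the indicators against zero."""
    mean = statistics.fmean(indicators)
    std_err = statistics.stdev(indicators) / math.sqrt(len(indicators))
    t_stat = mean / std_err
    # Normal approximation to the two-sided p-value (assumes a large sample).
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))
    return mean, t_stat, p_value


def bootstrap_percentile_ci(indicators, rng, resamples=1000, alpha=0.05):
    """Percentile confidence interval for the mean indicator, by resampling."""
    n = len(indicators)
    means = sorted(
        statistics.fmean(rng.choices(indicators, k=n)) for _ in range(resamples)
    )
    lower = means[int(resamples * alpha / 2)]
    upper = means[int(resamples * (1 - alpha / 2)) - 1]
    return lower, upper
```

The t-test needs one pass over the indicators, while the bootstrap resamples the full data set `resamples` times, which is the cost difference behind the "considerably faster" claim.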

LLM-assisted reliability engineering (2026-04-28 STAR post)

Recent articles

  • 2026-04-28 — sources/2026-04-28-expedia-expedias-service-telemetry-analyzer (STAR (Service Telemetry Analyzer) — Expedia's FastAPI + Celery + Redis web service for automated root-cause analysis. Reads Datadog metrics via the Datadog API; calls the internal generative-AI proxy (authn/authz, multi-model, rate-limited) to run a four-step workflow — collect telemetry → per-metric analysis → aggregated RCA → insights + recommendations. Named prompt techniques: role prompting, chaining, generated-knowledge prompting. Named deliberate exclusions (the load-bearing architectural statement): no function calling / tool use, no MCP, no RAG, no short- or long-term memory, no conversational UI, no streaming platform — "avoids the additional and currently less understood failure modes of an agent." First-class token-heavy-system framing; a back-of-the-envelope (BOTE) token estimate against the GPT-4o tokenizer, with a 4k per-response cap as the anchor. V0 used FastAPI async/await + background tasks; V1 migrated to Celery/Redis "as part of scaling up" — explicitly not Kafka because the traffic shape is request-response. Ingested signals: inbound/outbound traffic + errors, HTTP/gRPC/GraphQL latency, container CPU/memory, Kubernetes (restarts, probe failures), JVM (heap, GC) — tailored to Expedia's JVM-on-Kubernetes fleet. Five use cases: incident investigation (TTK/TTR reduction), post-incident RCA, troubleshooting runbooks (container-restart example linked), performance optimization (JVM heap spike), failure-injection recommendation + analysis (complements Expedia's chaos engineering platform). Evaluation is qualitative / SME-gated, traced in Langfuse. Roadmap: MCP tool use, dependency-graph context, conversational UI, per-modality model selection. Design retrospective; no production numbers disclosed. Canonicalises patterns/static-prompt-chain-over-agent-loop + patterns/multi-step-rca-workflow on the wiki.)

  • 2026-03-24 — sources/2026-03-24-expedia-operating-trino-at-scale-with-trino-gateway (Operating Trino at scale behind Trino Gateway: gateway gives clients a single URL + workload-aware routing rules + blue/green-or-canary cluster upgrades + transparent capacity changes. Canonical segregation pattern is Adhoc / ETL / BI clusters — each tuned to a specific concurrency × query-complexity profile — with the gateway matching query shape to cluster shape. Three named routing-rule shapes walked in detail: large-table isolation, metadata-query offload (select version() / show catalogs → single-node metadata cluster; reduces BI dashboard extract failures), BI-source routing (X-Trino-Source header contains "Tableau" / "Looker" → BI-optimised cluster). Expedia's four upstream UX contributions: UI for routing-rule view/edit (PR #433), source filter on history page (PR #551), three-state cluster-health display (HEALTHY / UNHEALTHY / PENDING, PR #601), full query text window (PR #740). Feature / contribution retrospective, no production numbers disclosed.)

  • 2026-02-17 — sources/2026-02-17-expedia-interleaving-for-accelerated-testing (Lodging-search team's interleaving framework as an accelerated alternative to A/B testing for ranking changes. Per-search click + booking events attributed back to source ranking A or B; per-search winning variant aggregated into a lift metric (user-level default). Significance via t-test on winning indicators substituting for bootstrap percentile method: "virtually the same results ... considerably faster." Sensitivity gain demonstrated on deliberately-deteriorated treatments: random-property pinning to positions 5–10 and top-slot reshuffling — interleaving detects the regression within days, A/B on CVR fails to detect pinning even at full sample size; click events significant "after the first day." Introduces systems/expedia-lodging-ranker as the subject-of-measurement system; patterns/interleaved-ranking-evaluation as the end-to-end pattern; patterns/t-test-over-bootstrap as the generalised significance-speedup pattern. Method / experimentation retrospective; no disclosed QPS / CVR / ranker architecture.)
  • 2026-01-06 — sources/2026-01-06-expedia-powering-vector-embedding-capabilities (ML Platform team's centralized Embedding Store Service: vector-DB-backed, systems/feast-managed collection metadata (associated service + model/version), three ingestion modes — batch via Feast materialization on Spark, real-time Insert API, on-the-fly model invocation — all dual-written to online + offline stores with SQL-gated offline-→-online restore; similarity search + hybrid search (vector + metadata filter) query surfaces. First wiki instance of Feast used as an embedding-collection registry. Platform-design overview; no production numbers disclosed.)
  • 2025-11-11 — sources/2025-11-11-expedia-kafka-streams-sub-topology-partition-colocation (Kafka Streams production-debugging case: two-topic cache-sharing pipeline expected cross-topic partition colocation from identical partition counts + similar keys; observed identical keys processed on different instances in production; root cause was two implicitly-separate sub-topologies; fix was a shared state store attached to both branches to force sub-topology unification — named as a general pattern. No benchmarks, but the architectural lesson is cleanly stated: topology design directly influences partition assignment.)
  • 2025-09-30 — sources/2025-09-30-expedia-prefer-merge-into-over-insert-overwrite (Iceberg best-practices primer: MERGE INTO (row-level, MOR) as the default; INSERT OVERWRITE (partition-level) reserved for genuine full-partition rewrites; compaction as the load-bearing caveat for any MOR deployment.)
Last updated · 542 distilled / 1,571 read