Expedia — Expedia's Service Telemetry Analyzer¶
Summary¶
Expedia's ML + reliability engineering teams built STAR (Service Telemetry Analyzer), a web-based FastAPI service that ingests observability metrics from Datadog and calls an internal generative-AI proxy to produce a root-cause analysis for a service that is degraded or experiencing an outage. The article is a design retrospective: it walks through the deliberately minimal architecture (no agent loop, no MCP tool use, no RAG, no long-term memory), the prompt-chaining choice, the token-heavy-system sizing approach, the Celery + Redis migration away from FastAPI background tasks, and five named use cases — incident investigation, post-incident RCA, troubleshooting, performance optimization, and failure-injection evaluation.
Key takeaways¶
- Start deliberately below agent altitude. STAR is explicitly not an agent: no function-calling / tool use, no short-term or long-term memory, no RAG, no conversational interface. The design aim is "a) simple, b) precise (to a certain extent, considering the potential hallucinations of the models), and c) that avoids the additional and currently less understood failure modes of an agent." STAR walks a predefined multi-step process with domain-specific prompts. Canonical wiki counter-example to the 2025–26 agent-everywhere trend (Source: this post).
- Prompt chaining over tool-augmented reasoning. Implementation is programmatic dialogue — the assistant's output at step N becomes part of the user prompt at step N+1. No MCP servers, no function-calling. The named prompt-engineering techniques are role prompting, chaining, and generated-knowledge prompting — linked to the original Anthropic / Prompting Guide definitions. (Source: this post.)
- Multi-step workflow is fixed, not planned. STAR's pipeline is: (1) collect telemetry, (2) analyse each metric + metadata stream with domain-specific prompts and rules, (3) aggregate all analyses and run a final RCA step, (4) return insights + recommendations. The ordering is hardcoded; the LLM is a reasoning component inside a deterministic workflow, not the orchestrator. (Source: this post.)
- Ingested data is tailored to Expedia's stack. STAR reads Datadog metrics covering the four RED/USE axes — inbound/outbound traffic + errors, HTTP / gRPC / GraphQL latency, container CPU / memory saturation — plus Kubernetes-level signals (container restarts, probe failures) and JVM metrics (heap, GC). The tailoring is explicit: "most services are backend JVM applications running on a Kubernetes-based compute platform." Generalisation to non-JVM / non-Kubernetes stacks would require a different analyzer set. (Source: this post.)
- Token budgeting is an up-front design exercise. STAR is a token-heavy system: each workflow run sends fixed system prompts + chain-of-prompts + prompt content dependent on previous responses. Expedia did back-of-the-envelope estimation grounded in "facts, assumptions, and enforced limits" — using OpenAI's GPT-4o tokenizer as the unit of measure and capping each response to 4k tokens. The cap is the load-bearing assumption — without it, the estimation is unbounded. (Source: this post.)
- Rate limits drive a Celery + Redis migration. Both Datadog and the internal GenAI proxy rate-limit. STAR initially used FastAPI's async/await + background tasks; it moved to Celery with Redis as broker + result backend "as part of scaling up". Rationale verbatim: "This architecture aligns with STAR's request-response flow, and we don't need a streaming platform like Kafka, at least for now." Canonical small instance of "choose the simplest backend that matches the traffic shape" — request-response → Celery/Redis, not a stream processor. (Source: this post.)
- Five use cases, one engine. Incident investigation (time-to-know / time-to-recover reduction), post-incident RCA (initial draft for human review), troubleshooting (pre-documented runbooks encoded as STAR workflows; example result: https://gist.github.com/nikos912000/1e489021b406f682d70c14f3ebbad917), performance optimization (a JVM heap-spike case), and failure-injection recommendation + analysis — STAR as a complement to Expedia's chaos-engineering platform (which previously lacked an automatic experiment-result evaluator). (Source: this post.)
- Evaluation is qualitative + SME-driven, traced in Langfuse. "Given the complexity of this domain, we mostly rely on qualitative human assessment which includes subject matter experts (SMEs) and users. We also use Langfuse for prompt management, evaluation, and tracing. The results so far have been promising." STAR joins Yelp BAA as the second wiki instance of Langfuse at production altitude. (Source: this post.)
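As a concrete anchor for the Langfuse point, a minimal sketch (not Expedia's code) of what tracing one STAR-style workflow step could look like with the Langfuse Python SDK's `observe` decorator (v2-style import); the function names and the proxy stand-in are hypothetical:

```python
# Hypothetical sketch of Langfuse tracing around one workflow step.
# `observe` is the Langfuse Python SDK decorator (v2-style import);
# everything else (function names, the proxy stand-in) is illustrative.
from langfuse.decorators import observe

def call_genai_proxy(prompt: str) -> str:
    """Stand-in for Expedia's internal generative-AI proxy (not public)."""
    raise NotImplementedError

@observe()  # records inputs, outputs, timing, and nesting as a Langfuse trace
def analyse_metric(metric_name: str, series: list[float]) -> str:
    prompt = f"Analyse the {metric_name} series for anomalies: {series}"
    return call_genai_proxy(prompt)
```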
Architecture — what STAR is and isn't¶
Web layer¶
- FastAPI — synchronous and async/await request handlers; initially background tasks for long-running analyses (a minimal sketch follows this list).
- Generative-AI proxy — Expedia's internal choke-point for LLM calls. Handles authn/authz and offers multiple models that are "constantly evaluate[d]" for quality + cost + perf. Stub-page: systems/expedia-generative-ai-proxy.
- Datadog — Expedia's chosen metrics platform; STAR reads metrics via the Datadog API.
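A minimal sketch of the V0 web-layer shape, assuming nothing beyond public FastAPI APIs; the endpoint path, payload, and workflow function are hypothetical:

```python
# Illustrative sketch of the V0 web layer (not Expedia's code): an async
# FastAPI handler accepts an analysis request, runs the long workflow as
# a background task, and returns immediately.
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def run_star_workflow(service: str, window_minutes: int) -> None:
    """Hypothetical long-running pipeline: collect -> analyse -> aggregate."""
    raise NotImplementedError

@app.post("/analyses")
async def create_analysis(service: str, window_minutes: int,
                          background: BackgroundTasks) -> dict:
    background.add_task(run_star_workflow, service, window_minutes)
    return {"status": "accepted", "service": service}
```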
Workflow layer (fixed)¶
collect telemetry ──▶ per-metric analysis ──▶ aggregate RCA
  (Datadog API)        (domain-specific        (final reasoning
                        prompts + rules)        step)
                                                  │
                                                  ▼
                                            insights +
                                            recommendations
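A minimal sketch of this fixed chain (all names hypothetical): the code, not the model, owns the step ordering, and each step's answer is spliced into the next step's user prompt, which is the prompt-chaining technique the post names.

```python
# Minimal sketch of the fixed prompt chain (all names hypothetical).
# The code decides the step order; the LLM is a reasoning component
# inside it, not the orchestrator.
def call_llm(system: str, user: str) -> str:
    """Stand-in for a call through the internal generative-AI proxy."""
    raise NotImplementedError

def run_workflow(telemetry: dict[str, str]) -> str:
    analyses = []
    for metric, data in telemetry.items():  # step 2: per-metric analysis
        analyses.append(call_llm(
            system=f"You are an SRE analysing {metric} for a JVM/Kubernetes service.",
            user=f"Telemetry:\n{data}\nSummarise anomalies and likely causes.",
        ))
    # step 3: aggregate all per-metric analyses into the final RCA step
    return call_llm(
        system="You are an SRE writing a root-cause analysis.",
        user="Per-metric findings:\n" + "\n".join(analyses)
             + "\nProduce a root-cause hypothesis and recommendations.",
    )
```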
Model layer¶
- LLMs via the generative-AI proxy, selected per task.
- Named open direction: "it would be more effective to use specialised models for the different modalities of telemetry data and slower reasoning models for the final RCA." — the multi-model mixture is called out as a future efficiency lever, not something STAR ships today.
Scaling migration (request-response async)¶
- V0: FastAPI async/await + FastAPI background tasks — "we initially used certain features from FastAPI such as async/await and background tasks".
- V1: Celery + Redis as broker + result backend — "As part of scaling up, we moved to Celery with Redis acting as the broker and result backend to store the state and results of tasks."
- Explicitly not Kafka: "This architecture aligns with STAR's request-response flow, and we don't need a streaming platform like Kafka, at least for now."
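A sketch of the V1 shape behind these quotes, using only public Celery APIs; the task body, Redis URLs, and the `rate_limit` guard are assumptions:

```python
# Sketch of the V1 task layer (illustrative, not Expedia's code): Celery
# with Redis as both broker and result backend, matching the
# request-response shape without a streaming platform.
from celery import Celery

app = Celery(
    "star",
    broker="redis://localhost:6379/0",   # message broker
    backend="redis://localhost:6379/1",  # stores task state + results
)

@app.task(rate_limit="10/m")  # crude guard for rate-limited upstreams (assumption)
def analyse_service(service: str, window_minutes: int) -> dict:
    """Hypothetical task wrapping the full STAR workflow for one service."""
    raise NotImplementedError
```

A caller then enqueues with `analyse_service.delay(...)` and polls the returned `AsyncResult`; Redis holds both the queue and the task state, which is the request-response fit the post describes.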
Deliberate exclusions¶
| Feature | Status | Rationale |
|---|---|---|
| Function calling / tool use | No | Agent failure modes "less understood" |
| MCP servers | No | Same — STAR is not agentic |
| RAG | No | No runtime knowledge base; prompts are static templates |
| Short-term / long-term memory | No | No conversational state |
| Conversational UI | No | API-only; workflows are pre-defined |
| Streaming telemetry platform (Kafka) | No | Request-response traffic shape |
The exclusions are load-bearing — STAR is positioned as the minimal system that still delivers a useful RCA draft, and the authors explicitly name this as a deliberate choice ahead of a future iteration towards a more agentic architecture.
Ingested data — Expedia's analyzer targets¶
| Axis | Signals |
|---|---|
| Traffic | inbound + outbound request rate, error rate |
| Latency | HTTP + gRPC + GraphQL |
| Saturation | container-level CPU, memory |
| Kubernetes | container restarts, probe failures |
| JVM | heap usage, GC |
Rationale for the JVM + Kubernetes bias: "our heterogeneous tech stack and the higher degree of standardization at the infrastructure layer" — the infrastructure layer is where the most signal is normalised across the fleet.
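For concreteness, pulling one of these signals might look like the following sketch against the public Datadog API client for Python; the query string, metric names, and tags are illustrative, not Expedia's:

```python
# Illustrative pull of one saturation signal via the Datadog metrics API
# (datadog-api-client for Python); queries and tags are assumptions.
import time

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi

configuration = Configuration()  # reads DD_API_KEY / DD_APP_KEY from the environment
with ApiClient(configuration) as api_client:
    metrics = MetricsApi(api_client)
    now = int(time.time())
    response = metrics.query_metrics(
        _from=now - 3600,  # last hour of datapoints
        to=now,
        query="avg:jvm.heap_memory{service:checkout} by {pod}",
    )
    for series in response.series or []:
        print(series.metric, len(series.pointlist or []), "points")
```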
Operational numbers / caveats¶
- Per-response cap: 4k tokens (the load-bearing number for the back-of-the-envelope token-budget estimation; sketched below).
- Context-window sizing: "differs between models and has been increasing over time" — the estimation is re-done per model.
- Scale: "still small and the number of metrics per workflow is fixed" — no QPS, no per-analysis wall-clock times, no accuracy numbers disclosed.
- Evaluation: qualitative only, SME-gated; no regression metrics quoted.
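The 4k cap is what makes the estimation bounded; a back-of-the-envelope sketch in the post's spirit, where every number except the response cap is an invented assumption:

```python
# Back-of-the-envelope token budget for one workflow run. All counts are
# illustrative assumptions, not Expedia's figures, except the 4k response cap.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # the post's unit of measure

system_prompt = "You are an SRE analysing telemetry for a JVM/Kubernetes service."
SYSTEM_TOKENS = len(enc.encode(system_prompt))

METRICS_PER_RUN = 12      # assumption: fixed number of metrics per workflow
TELEMETRY_TOKENS = 1_500  # assumption: encoded datapoints per metric
RESPONSE_CAP = 4_000      # enforced limit from the post (the load-bearing number)

# Each per-metric step sends system prompt + telemetry; the final RCA step
# sends system prompt + all (capped) per-metric responses.
per_metric_in = METRICS_PER_RUN * (SYSTEM_TOKENS + TELEMETRY_TOKENS)
rca_in = SYSTEM_TOKENS + METRICS_PER_RUN * RESPONSE_CAP
total_out = (METRICS_PER_RUN + 1) * RESPONSE_CAP

print(f"worst-case tokens/run ≈ {per_metric_in + rca_in + total_out:,}")
```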
Five use cases walked¶
- Incident investigation. The headline use case — reducing time to know / time to recover by rapid analysis + hypothesis evaluation over observability data. "We applied STAR to several services that experienced outages."
- Post-incident RCA. Teams file a ticket for post-incident review; STAR runs over the affected service + time window and produces an initial draft for human supplementation.
- Troubleshooting runbooks. Expedia's reliability engineering org already documented procedures in an internal reliability hub; STAR encodes these as workflows driven by metric data (a hypothetical sketch of such an encoding follows this list). The first addition: container-restart troubleshooting for Kubernetes — sample output at https://gist.github.com/nikos912000/1e489021b406f682d70c14f3ebbad917.
- Performance optimization. Recent + still-evaluated use case — a service with sudden JVM heap spikes; STAR's analysis was reviewed and taken forward by the service owners.
- Failure injection recommendation + analysis. Named complement to Expedia's chaos engineering platform (introduced in a 2018 Expedia post). The original platform "lacked a mechanism for the automatic evaluation of experimental results" — STAR fills that gap.
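For the troubleshooting case, a purely hypothetical illustration of what encoding a runbook as a STAR workflow could look like: a declarative mapping from signals to a domain-specific prompt. None of these names come from the post.

```python
# Purely hypothetical: the container-restart runbook as workflow config.
# Signal names and prompt wording are invented for illustration.
CONTAINER_RESTART_RUNBOOK = {
    "signals": [
        "kubernetes.containers.restarts",
        "kubernetes.liveness_probe.failures",
        "container.memory.usage",
        "jvm.heap_memory",
    ],
    "analysis_prompt": (
        "A container is restarting. Given the probe, container-memory, and "
        "JVM heap series below, distinguish OOM kills, failed liveness "
        "probes, and crash loops, and recommend next steps."
    ),
}
```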
Next steps (named in post)¶
- Identify high-leverage use cases.
- Improve testing + evaluation.
- Move from the static pipeline toward a more sophisticated multi-agent architecture.
- Add MCP tool use for dynamic data access.
- Add more context: service documentation, metadata, service dependency graph.
- Expose a conversational interface.
All of these map STAR's current deliberate exclusions back onto a roadmap. The wiki treats STAR as a canonical instance of "start below agent altitude; graduate when the precision / failure-mode envelope permits".
Caveats¶
- No production numbers disclosed (QPS, wall-clock, accuracy, cost per analysis, token consumption).
- No model names disclosed — the proxy abstracts model choice; Expedia's language is "advanced off-the-shelf AI models".
- Evaluation is qualitative; no held-out accuracy / FP / FN numbers.
- Subject-matter scope is deliberately narrow: RED/USE metrics + JVM + Kubernetes — not logs, not traces (yet), not custom business metrics.
- The Anthropic role-prompting + Prompting Guide chaining + generated-knowledge references are load-bearing; the post cites them via direct external link, not as internal Expedia concepts.
Source¶
- Original: https://medium.com/expedia-group-tech/expedias-service-telemetry-analyzer-60f2f96c5351?source=rss----38998a53046f---4
- Raw markdown: raw/expedia/2026-04-28-expedias-service-telemetry-analyzer-5171d656.md
Related¶
- systems/expedia-star — the subject system.
- systems/expedia-generative-ai-proxy — Expedia's internal LLM proxy (choke-point, authn/authz, multi-model).
- systems/fastapi — the web framework STAR runs on.
- systems/celery — the task queue STAR migrated to.
- systems/redis — broker + result backend.
- systems/datadog — source of ingested metrics.
- systems/langfuse — prompt management + eval + tracing.
- systems/model-context-protocol — the tool-use protocol STAR deliberately doesn't use.
- systems/kubernetes — the deployment platform; source of container-restart / probe-failure signals.
- concepts/automated-root-cause-analysis — the concept STAR implements at Expedia altitude.
- concepts/prompt-chaining — STAR's named orchestration technique.
- concepts/role-prompting — one of STAR's named prompt engineering techniques.
- concepts/generated-knowledge-prompting — another.
- concepts/context-engineering — the broader discipline STAR participates in with deliberately minimal surface.
- concepts/token-heavy-system — STAR's canonical self-label.
- concepts/time-to-know-vs-time-to-recover — the operational KPIs STAR optimises for.
- concepts/back-of-the-envelope-estimation — Expedia's token sizing discipline.
- concepts/chaos-engineering — complement at Expedia for failure-injection analysis.
- patterns/static-prompt-chain-over-agent-loop — the generalised pattern STAR canonicalises.
- patterns/multi-step-rca-workflow — STAR's 4-step workflow shape.
- companies/expedia