Expedia STAR (Service Telemetry Analyzer)¶
Definition¶
STAR (Service Telemetry Analyzer) is Expedia's web service for investigating service degradations and outages using observability metrics and LLMs. It is a FastAPI service that reads metrics from Datadog and calls Expedia's internal generative-AI proxy to run a fixed multi-step workflow — telemetry collection → per-metric analysis → aggregated root-cause analysis → insights + recommendations.
STAR is deliberately not an agent: no function calling, no MCP tool use, no short- or long-term memory, no RAG, no conversational UI. The design goal is "a) simple, b) precise (to a certain extent, considering the potential hallucinations of the models), and c) that avoids the additional and currently less understood failure modes of an agent" (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer).
Architecture¶
Web tier¶
- FastAPI — API layer + web server.
- Celery + Redis — task queue + broker + result backend. STAR V0 used FastAPI's async/await + background tasks; V1 migrated to Celery/Redis "as part of scaling up" to decouple analysis latency from the HTTP request and to absorb rate-limiter round-trips against Datadog + the GenAI proxy.
- Not Kafka: "This architecture aligns with STAR's request-response flow, and we don't need a streaming platform like Kafka, at least for now."
Integrations¶
- Datadog — Expedia's chosen metrics platform; STAR reads metrics via the Datadog API.
- Generative-AI proxy — internal LLM choke point. Handles authn/authz and exposes multiple models that Expedia "constantly evaluate[s]" for quality + cost + performance. STAR is model-agnostic at the application layer; the proxy abstracts model choice.
- Langfuse — prompt management, evaluation, and tracing.
Workflow¶
collect telemetry ──▶ per-metric analysis ──▶ aggregate RCA
  (Datadog API)       (domain prompts +       (reasoning
                       rules, per signal)      model run)
                                                    │
                                                    ▼
                                              insights +
                                              recommendations
The ordering is hardcoded. The LLM is a reasoning component inside a deterministic workflow — it is not the orchestrator. See patterns/multi-step-rca-workflow for the generalised pattern.
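A minimal sketch of that deterministic ordering, with the LLM as a plain callable inside the pipeline rather than the orchestrator. The prompt wording is hypothetical; only the step order comes from the source.

```python
def run_star_workflow(metrics: dict, llm) -> dict:
    """Fixed-order pipeline: the code decides the steps, the LLM only reasons."""
    # 1. per-metric analysis, one LLM call per ingested signal
    per_metric = {
        name: llm(f"Analyze this {name} telemetry: {series}")
        for name, series in metrics.items()
    }
    # 2. aggregated root-cause analysis over all per-metric findings
    rca = llm(
        "Given these findings, identify the root cause: "
        + "; ".join(per_metric.values())
    )
    # 3. insights + recommendations derived from the RCA
    advice = llm(f"Recommend remediations for: {rca}")
    return {"per_metric": per_metric, "rca": rca, "recommendations": advice}
```

Swapping the `llm` callable for a different model (or a stub in tests) changes nothing about the control flow, which is the property the non-agent design buys.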
Ingested signals¶
| Axis | Signals |
|---|---|
| Traffic | inbound + outbound request rate, error rate |
| Latency | HTTP + gRPC + GraphQL |
| Saturation | container CPU, container memory |
| Kubernetes | container restarts, probe failures |
| JVM | heap usage, GC |
Rationale: "most services are backend JVM applications running on a Kubernetes-based compute platform." Generalising to non-JVM / non-Kubernetes stacks would require different analyzer prompts.
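Reading a signal like container CPU reduces to a Datadog v1 timeseries query. The endpoint and `from`/`to`/`query` parameters below are Datadog's documented query API; the metric name, tags, and timestamps are illustrative, and authentication (not shown) goes in the `DD-API-KEY` / `DD-APPLICATION-KEY` request headers.

```python
from urllib.parse import urlencode

DD_QUERY_ENDPOINT = "https://api.datadoghq.com/api/v1/query"

def build_metrics_query(query: str, start: int, end: int) -> str:
    """Build a Datadog timeseries-query URL (start/end are Unix seconds)."""
    return DD_QUERY_ENDPOINT + "?" + urlencode(
        {"from": start, "to": end, "query": query}
    )

# hypothetical saturation signal for one service
url = build_metrics_query(
    "avg:container.cpu.usage{service:bookings} by {pod}", 1700000000, 1700003600
)
```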
Prompt engineering¶
STAR names three explicit techniques:
- Role prompting — per-step persona / expertise framing.
- Prompt chaining — programmatic dialogue: each assistant reply is folded into the next user prompt.
- Generated-knowledge prompting — elicit intermediate domain facts before the final RCA step.
No RAG, no function calling, no MCP. Prompts are static templates with data interpolated per workflow run.
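The prompt-chaining mechanic ("each assistant reply is folded into the next user prompt") reduces to a small loop over static templates. The templates and the model stub here are illustrative, assuming one `{previous}` interpolation slot per template.

```python
def run_chain(templates: list, model) -> list:
    """Programmatic dialogue: no tools, no memory beyond the chain itself.

    Each template is a static string with a `{previous}` slot; the
    assistant's reply to step i becomes the interpolated data for step i+1.
    """
    previous = ""
    replies = []
    for template in templates:
        prompt = template.format(previous=previous)
        reply = model(prompt)
        replies.append(reply)
        previous = reply  # fold the reply into the next user prompt
    return replies
```

Role prompting and generated-knowledge prompting slot into the same loop as template content: an earlier template asks for intermediate domain facts, and a later one consumes them via `{previous}`.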
Token budgeting¶
STAR is a token-heavy system. Expedia sized it with a back-of-the-envelope estimate against the GPT-4o tokenizer: fixed-length prompts (the system prompt plus the chain of prompts) plus variable-length prompts driven by prior responses. The load-bearing assumption is that each response is capped at 4k tokens. Without the cap the estimate is unbounded; with it, the whole pipeline's context usage becomes a finite sum.
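The estimate can be written as a short worst-case sum: each step's input is its fixed prompt plus every earlier (capped) response, and its output is at most the cap. The fixed-prompt sizes below are made up; only the 4k cap comes from the source.

```python
RESPONSE_CAP = 4_000  # per-response token cap from the article

def worst_case_tokens(fixed_prompt_tokens: list, cap: int = RESPONSE_CAP) -> int:
    """Upper bound on total tokens consumed by an n-step prompt chain.

    Step i's input = its fixed prompt + all earlier capped responses;
    its output contributes at most `cap`. The cap is what makes this a
    finite sum: without it, earlier responses could grow later prompts
    without bound.
    """
    total = 0
    carried = 0  # capped responses folded into later prompts
    for fixed in fixed_prompt_tokens:
        total += fixed + carried  # input tokens for this step
        total += cap              # output tokens, at most the cap
        carried += cap
    return total

# hypothetical 3-step chain with 1k-token fixed prompts
budget = worst_case_tokens([1_000, 1_000, 1_000])
```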
Deliberate exclusions¶
| Feature | Status | Rationale |
|---|---|---|
| Function calling / tool use | No | Agent failure modes "less understood" |
| MCP servers | No | STAR is not agentic (future roadmap) |
| RAG | No | No runtime knowledge base |
| Short- / long-term memory | No | No conversational state |
| Conversational UI | No | API-only; fixed workflows |
| Streaming platform (Kafka) | No | Request-response traffic shape |
The exclusions are the load-bearing architectural statement: STAR is "an early iteration" that aims to demonstrate value before taking on agent-level failure modes. This makes STAR a canonical wiki instance of patterns/static-prompt-chain-over-agent-loop.
Use cases at Expedia¶
- Incident investigation. Reduce time to know / time to recover. Applied to several services during outages.
- Post-incident RCA. Produce initial draft for post-incident review tickets; SMEs supplement.
- Troubleshooting runbooks. Reliability engineering's existing runbooks re-encoded as STAR workflows. First addition: Kubernetes container-restart troubleshooting (sample output gist linked in source).
- Performance optimization. Recent experimental use case — a JVM heap-spike incident; STAR's analysis was reviewed and acted on by service owners.
- Failure-injection recommendation + analysis. STAR as a complement to Expedia's chaos engineering platform, providing the automatic experiment-result evaluator the platform previously lacked.
Evaluation¶
- Qualitative + SME-gated — no held-out accuracy / FP / FN numbers disclosed.
- Langfuse used for prompt management, evaluation, and tracing.
- "The results so far have been promising."
Roadmap (named in source)¶
Each item maps an existing deliberate exclusion onto a future step:
- Use specialised models per telemetry modality + slower reasoning models for the final RCA (vs single-model today).
- Add MCP tool use for dynamic data access.
- Add more context: service documentation, service metadata, service dependency graph.
- Expose a conversational interface.
- Improve testing + evaluation.
The roadmap is architecturally honest: STAR will graduate towards an agent only after the evaluation envelope makes the trade-off worthwhile.
Caveats¶
- No QPS / wall-clock / cost / accuracy numbers published.
- Model names not disclosed (abstracted by the GenAI proxy).
- Scope is RED/USE metrics + JVM + Kubernetes — not logs, not traces (yet), not custom business metrics.
Seen in¶
- sources/2026-04-28-expedia-expedias-service-telemetry-analyzer — canonical source for STAR's architecture, workflow, prompt engineering, token budgeting, Celery migration, five use cases, evaluation approach, and roadmap.
Related¶
- systems/expedia-generative-ai-proxy — STAR's LLM choke point.
- systems/fastapi / systems/celery / systems/redis — web + task-queue + broker substrate.
- systems/datadog — metrics source.
- systems/langfuse — prompt management + eval + tracing.
- systems/kubernetes — deployment platform + source of container-restart / probe-failure signals.
- systems/model-context-protocol — tool-use protocol STAR deliberately doesn't use today.
- concepts/automated-root-cause-analysis
- concepts/prompt-chaining
- concepts/role-prompting
- concepts/generated-knowledge-prompting
- concepts/context-engineering
- concepts/token-heavy-system
- concepts/time-to-know-vs-time-to-recover
- concepts/chaos-engineering
- patterns/static-prompt-chain-over-agent-loop
- patterns/multi-step-rca-workflow
- companies/expedia