
Expedia STAR (Service Telemetry Analyzer)

Definition

STAR (Service Telemetry Analyzer) is Expedia's web service for investigating service degradations and outages using observability metrics and LLMs. It is a FastAPI service that reads metrics from Datadog and calls Expedia's internal generative-AI proxy to run a fixed multi-step workflow — telemetry collection → per-metric analysis → aggregated root-cause analysis → insights + recommendations.

STAR is deliberately not an agent: no function calling, no MCP tool use, no short- or long-term memory, no RAG, no conversational UI. The design goal is "a) simple, b) precise (to a certain extent, considering the potential hallucinations of the models), and c) that avoids the additional and currently less understood failure modes of an agent" (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer).

Architecture

Web tier

  • FastAPI — API layer + web server.
  • Celery + Redis — task queue + broker + result backend. STAR V0 used FastAPI's async/await + background tasks; V1 migrated to Celery/Redis "as part of scaling up" to decouple analysis latency from the HTTP request and to absorb rate-limiter round-trips against Datadog + the GenAI proxy.
  • Not Kafka: "This architecture aligns with STAR's request-response flow, and we don't need a streaming platform like Kafka, at least for now."

Integrations

  • Datadog — Expedia's chosen metrics platform; STAR reads metrics via the Datadog API.
  • Generative-AI proxy — internal LLM choke point. Handles authn/authz and exposes multiple models that Expedia "constantly evaluate[s]" for quality + cost + performance. STAR is model-agnostic at the application layer; the proxy abstracts model choice.
  • Langfuse — prompt management, evaluation, and tracing.
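As a concrete illustration of the Datadog integration, the sketch below builds a request against Datadog's v1 timeseries query endpoint using only the standard library. The metric query, service tag, and environment-variable names are illustrative assumptions, not Expedia's actual configuration.

```python
import os
import urllib.parse

DD_SITE = "https://api.datadoghq.com"  # assumption: the default US1 Datadog site

def build_metric_query(query: str, from_ts: int, to_ts: int) -> tuple[str, dict]:
    """Build the URL and auth headers for Datadog's GET /api/v1/query endpoint."""
    params = urllib.parse.urlencode({"from": from_ts, "to": to_ts, "query": query})
    url = f"{DD_SITE}/api/v1/query?{params}"
    headers = {
        "DD-API-KEY": os.environ.get("DD_API_KEY", ""),
        "DD-APPLICATION-KEY": os.environ.get("DD_APP_KEY", ""),
    }
    return url, headers

# Hypothetical saturation query for a single service over one hour.
url, headers = build_metric_query(
    "avg:kubernetes.cpu.usage.total{service:checkout}",
    from_ts=1_700_000_000,
    to_ts=1_700_003_600,
)
```

An actual client would send this with any HTTP library and parse the returned JSON series before handing datapoints to the per-metric analysis step.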

Workflow

collect telemetry ──▶ per-metric analysis ──▶ aggregated RCA ──▶ insights +
  (Datadog API)       (domain prompts +       (reasoning          recommendations
                       rules, per signal)      model run)

The ordering is hardcoded. The LLM is a reasoning component inside a deterministic workflow — it is not the orchestrator. See patterns/multi-step-rca-workflow for the generalised pattern.
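The hardcoded ordering can be sketched as a plain function chain in which the LLM is just a callee, never the orchestrator. The function names, prompt wording, and stubbed `call_llm` below are illustrative assumptions, not Expedia's implementation.

```python
from dataclasses import dataclass

@dataclass
class MetricAnalysis:
    metric: str
    finding: str

def call_llm(prompt: str) -> str:
    """Stand-in for a call through Expedia's GenAI proxy; stubbed here."""
    return f"analysis of: {prompt[:40]}"

def analyze_metric(name: str, datapoints: list[float]) -> MetricAnalysis:
    # Step 2: per-metric analysis, one domain prompt per signal.
    prompt = f"Analyze {name}: {datapoints}"
    return MetricAnalysis(metric=name, finding=call_llm(prompt))

def run_workflow(telemetry: dict[str, list[float]]) -> str:
    # Step 1: telemetry collection is assumed done (the dict argument).
    analyses = [analyze_metric(n, pts) for n, pts in telemetry.items()]
    # Step 3: aggregated RCA over all per-metric findings (reasoning model).
    summary = "\n".join(f"{a.metric}: {a.finding}" for a in analyses)
    # Step 4: insights + recommendations derived from the RCA.
    return call_llm(f"Root-cause the incident given:\n{summary}")

report = run_workflow({"error_rate": [0.1, 0.9], "cpu": [0.5, 0.95]})
```

Note that control flow lives entirely in `run_workflow`; swapping the model behind `call_llm` changes nothing about the ordering, which is the point of the pattern.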

Ingested signals

| Axis       | Signals                                      |
|------------|----------------------------------------------|
| Traffic    | inbound + outbound request rate, error rate  |
| Latency    | HTTP + gRPC + GraphQL                        |
| Saturation | container CPU, container memory              |
| Kubernetes | container restarts, probe failures           |
| JVM        | heap usage, GC                               |

Rationale: "most services are backend JVM applications running on a Kubernetes-based compute platform." Generalising to non-JVM / non-Kubernetes stacks would require different analyzer prompts.
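One way to picture the JVM/Kubernetes coupling is as a static signal matrix driving the per-metric analyzers. The axis and signal names below mirror the table above; the encoding itself is an illustrative assumption, not Expedia's config.

```python
# Illustrative encoding of STAR's ingested-signal matrix. Supporting a
# non-JVM / non-Kubernetes stack would mean swapping entries here AND
# writing different analyzer prompts for the new signals.
SIGNALS: dict[str, list[str]] = {
    "traffic": ["inbound_request_rate", "outbound_request_rate", "error_rate"],
    "latency": ["http_latency", "grpc_latency", "graphql_latency"],
    "saturation": ["container_cpu", "container_memory"],
    "kubernetes": ["container_restarts", "probe_failures"],
    "jvm": ["heap_usage", "gc"],
}

def signals_for(axes: list[str]) -> list[str]:
    """Flatten the signal list for the requested axes, in table order."""
    return [s for axis in axes for s in SIGNALS[axis]]
```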

Prompt engineering

STAR names three explicit prompt-engineering techniques in its source write-up. There is no RAG, no function calling, and no MCP: prompts are static templates with data interpolated per workflow run.
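"Static templates with data interpolated per run" can be as simple as `string.Template` substitution. The template wording and field names below are illustrative, not Expedia's actual prompts (those live in Langfuse).

```python
from string import Template

# A static per-metric prompt template; nothing is retrieved or tool-called
# at runtime -- only these fields are interpolated per workflow run.
PER_METRIC_TEMPLATE = Template(
    "You are analyzing the $metric signal of service $service.\n"
    "Datapoints (last $window): $datapoints\n"
    "Flag anomalies and rate severity low/medium/high."
)

prompt = PER_METRIC_TEMPLATE.substitute(
    metric="error_rate",
    service="checkout",
    window="1h",
    datapoints=[0.01, 0.02, 0.41],
)
```

Because the template set is closed and fixed, prompt changes are code/config changes, which keeps the system reviewable in a way an agent's dynamically assembled context is not.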

Token budgeting

STAR is a token-heavy system. Expedia sized it via back-of-the-envelope estimation against the GPT-4o tokenizer — fixed-length prompts (system prompt + chain-of-prompts) + variable-length prompts driven by prior responses. The load-bearing assumption: each response capped at 4k tokens. Without the cap the estimation is unbounded; the cap turns the whole pipeline's context usage into a finite sum.
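The finite-sum property can be shown with a few lines of arithmetic. Only the 4k response cap comes from the source; every other figure below is a placeholder assumption, not Expedia's numbers.

```python
# Back-of-the-envelope token budget for one workflow run.
RESPONSE_CAP = 4_000       # hard cap on every model response (from source)
SYSTEM_PROMPT = 1_200      # fixed-length system prompt (assumed)
PER_METRIC_PROMPT = 800    # fixed per-signal template (assumed)
N_METRICS = 10             # signals analyzed per run (assumed)

def max_tokens_per_run() -> int:
    # Per-metric steps: fixed prompt + capped response, once per signal.
    per_metric = N_METRICS * (SYSTEM_PROMPT + PER_METRIC_PROMPT + RESPONSE_CAP)
    # Aggregation step: its variable-length input is bounded by the N capped
    # responses it consumes -- this is what makes the total a finite sum.
    aggregate_in = SYSTEM_PROMPT + N_METRICS * RESPONSE_CAP
    return per_metric + aggregate_in + RESPONSE_CAP

budget = max_tokens_per_run()  # 105_200 with the placeholder figures above
```

Remove `RESPONSE_CAP` and `aggregate_in` has no bound, which is exactly the unbounded-estimation failure the text describes.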

Deliberate exclusions

| Feature                      | Status | Rationale                              |
|------------------------------|--------|----------------------------------------|
| Function calling / tool use  | No     | Agent failure modes "less understood"  |
| MCP servers                  | No     | STAR is not agentic (future roadmap)   |
| RAG                          | No     | No runtime knowledge base              |
| Short- / long-term memory    | No     | No conversational state                |
| Conversational UI            | No     | API-only; fixed workflows              |
| Streaming platform (Kafka)   | No     | Request-response traffic shape         |

The exclusions are the load-bearing architectural statement: STAR is "an early iteration" that aims to demonstrate value before taking on agent-level failure modes. This makes STAR a canonical wiki instance of patterns/static-prompt-chain-over-agent-loop.

Use cases at Expedia

  1. Incident investigation. Reduce time to know / time to recover. Applied to several services during outages.
  2. Post-incident RCA. Produce initial draft for post-incident review tickets; SMEs supplement.
  3. Troubleshooting runbooks. Reliability engineering's existing runbooks re-encoded as STAR workflows. First addition: Kubernetes container-restart troubleshooting (sample output gist linked in source).
  4. Performance optimization. Recent experimental use case — a JVM heap-spike incident; STAR's analysis was reviewed and acted on by service owners.
  5. Failure-injection recommendation + analysis. STAR as a complement to Expedia's chaos engineering platform, providing the automatic experiment-result evaluator the platform previously lacked.

Evaluation

  • Qualitative + SME-gated — no held-out accuracy / FP / FN numbers disclosed.
  • Langfuse used for prompt management, evaluation, and tracing.
  • "The results so far have been promising."

Roadmap (named in source)

Each item maps an existing deliberate exclusion onto a future step:

  • Use specialised models per telemetry modality + slower reasoning models for the final RCA (vs single-model today).
  • Add MCP tool use for dynamic data access.
  • Add more context: service documentation, service metadata, service dependency graph.
  • Expose a conversational interface.
  • Improve testing + evaluation.

The roadmap is architecturally honest: STAR will graduate towards an agent only after the evaluation envelope makes the trade-off worthwhile.

Caveats

  • No QPS / wall-clock / cost / accuracy numbers published.
  • Model names not disclosed (abstracted by the GenAI proxy).
  • Scope is RED/USE metrics + JVM + Kubernetes — not logs, not traces (yet), not custom business metrics.
