
Expedia STAR (Service Telemetry Analyzer)

Definition

STAR (Service Telemetry Analyzer) is Expedia's web service for investigating service degradations and outages using observability metrics and LLMs. It is a FastAPI service that reads metrics from Datadog and calls Expedia's internal generative-AI proxy to run a fixed multi-step workflow — telemetry collection → per-metric analysis → aggregated root-cause analysis → insights + recommendations.

STAR is deliberately not an agent: no function calling, no MCP tool use, no short- or long-term memory, no RAG, no conversational UI. The design goal is "a) simple, b) precise (to a certain extent, considering the potential hallucinations of the models), and c) that avoids the additional and currently less understood failure modes of an agent" (Source: sources/2026-04-28-expedia-expedias-service-telemetry-analyzer).

Architecture

Web tier

  • FastAPI — API layer + web server.
  • Celery + Redis — task queue + broker + result backend. STAR V0 used FastAPI's async/await + background tasks; V1 migrated to Celery/Redis "as part of scaling up" to decouple analysis latency from the HTTP request and to absorb rate-limiter round-trips against Datadog + the GenAI proxy.
  • Not Kafka: "This architecture aligns with STAR's request-response flow, and we don't need a streaming platform like Kafka, at least for now."

Integrations

  • Datadog — Expedia's chosen metrics platform; STAR reads metrics via the Datadog API.
  • Generative-AI proxy — internal LLM choke point. Handles authn/authz and exposes multiple models that Expedia "constantly evaluate[s]" for quality + cost + performance. STAR is model-agnostic at the application layer; the proxy abstracts model choice.
  • Langfuse — prompt management, evaluation, and tracing.
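As a concrete illustration of the Datadog integration, the sketch below builds a request against Datadog's v1 timeseries query endpoint using only the standard library. The metric query, service tag, and environment-variable names are illustrative assumptions, not Expedia's actual configuration.

```python
import os
import urllib.parse

DD_SITE = "https://api.datadoghq.com"  # assumption: the default US1 Datadog site

def build_metric_query(query: str, from_ts: int, to_ts: int) -> tuple[str, dict]:
    """Build the URL and auth headers for Datadog's GET /api/v1/query endpoint."""
    params = urllib.parse.urlencode({"from": from_ts, "to": to_ts, "query": query})
    url = f"{DD_SITE}/api/v1/query?{params}"
    headers = {
        "DD-API-KEY": os.environ.get("DD_API_KEY", ""),
        "DD-APPLICATION-KEY": os.environ.get("DD_APP_KEY", ""),
    }
    return url, headers

# Hypothetical saturation query for a single service over one hour.
url, headers = build_metric_query(
    "avg:kubernetes.cpu.usage.total{service:checkout}",
    from_ts=1_700_000_000,
    to_ts=1_700_003_600,
)
```

An actual client would send this with any HTTP library and parse the returned JSON series before handing datapoints to the per-metric analysis step.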

Workflow

collect telemetry ──▶ per-metric analysis ──▶ aggregated RCA ──▶ insights +
  (Datadog API)       (domain prompts +       (reasoning          recommendations
                       rules, per signal)      model run)

The ordering is hardcoded. The LLM is a reasoning component inside a deterministic workflow — it is not the orchestrator. See patterns/multi-step-rca-workflow for the generalised pattern.
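The hardcoded ordering can be sketched as a plain function chain in which the LLM is just a callee, never the orchestrator. The function names, prompt wording, and stubbed `call_llm` below are illustrative assumptions, not Expedia's implementation.

```python
from dataclasses import dataclass

@dataclass
class MetricAnalysis:
    metric: str
    finding: str

def call_llm(prompt: str) -> str:
    """Stand-in for a call through Expedia's GenAI proxy; stubbed here."""
    return f"analysis of: {prompt[:40]}"

def analyze_metric(name: str, datapoints: list[float]) -> MetricAnalysis:
    # Step 2: per-metric analysis, one domain prompt per signal.
    prompt = f"Analyze {name}: {datapoints}"
    return MetricAnalysis(metric=name, finding=call_llm(prompt))

def run_workflow(telemetry: dict[str, list[float]]) -> str:
    # Step 1: telemetry collection is assumed done (the dict argument).
    analyses = [analyze_metric(n, pts) for n, pts in telemetry.items()]
    # Step 3: aggregated RCA over all per-metric findings (reasoning model).
    summary = "\n".join(f"{a.metric}: {a.finding}" for a in analyses)
    # Step 4: insights + recommendations derived from the RCA.
    return call_llm(f"Root-cause the incident given:\n{summary}")

report = run_workflow({"error_rate": [0.1, 0.9], "cpu": [0.5, 0.95]})
```

Note that control flow lives entirely in `run_workflow`; swapping the model behind `call_llm` changes nothing about the ordering, which is the point of the pattern.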

Ingested signals

| Axis       | Signals                                      |
|------------|----------------------------------------------|
| Traffic    | inbound + outbound request rate, error rate  |
| Latency    | HTTP + gRPC + GraphQL                        |
| Saturation | container CPU, container memory              |
| Kubernetes | container restarts, probe failures           |
| JVM        | heap usage, GC                               |

Rationale: "most services are backend JVM applications running on a Kubernetes-based compute platform." Generalising to non-JVM / non-Kubernetes stacks would require different analyzer prompts.
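One way to picture the JVM/Kubernetes coupling is as a static signal matrix driving the per-metric analyzers. The axis and signal names below mirror the table above; the encoding itself is an illustrative assumption, not Expedia's config.

```python
# Illustrative encoding of STAR's ingested-signal matrix. Supporting a
# non-JVM / non-Kubernetes stack would mean swapping entries here AND
# writing different analyzer prompts for the new signals.
SIGNALS: dict[str, list[str]] = {
    "traffic": ["inbound_request_rate", "outbound_request_rate", "error_rate"],
    "latency": ["http_latency", "grpc_latency", "graphql_latency"],
    "saturation": ["container_cpu", "container_memory"],
    "kubernetes": ["container_restarts", "probe_failures"],
    "jvm": ["heap_usage", "gc"],
}

def signals_for(axes: list[str]) -> list[str]:
    """Flatten the signal list for the requested axes, in table order."""
    return [s for axis in axes for s in SIGNALS[axis]]
```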

Prompt engineering

STAR names three explicit prompt-engineering techniques in its source write-up. There is no RAG, no function calling, and no MCP: prompts are static templates with data interpolated per workflow run.
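"Static templates with data interpolated per run" can be as simple as `string.Template` substitution. The template wording and field names below are illustrative, not Expedia's actual prompts (those live in Langfuse).

```python
from string import Template

# A static per-metric prompt template; nothing is retrieved or tool-called
# at runtime -- only these fields are interpolated per workflow run.
PER_METRIC_TEMPLATE = Template(
    "You are analyzing the $metric signal of service $service.\n"
    "Datapoints (last $window): $datapoints\n"
    "Flag anomalies and rate severity low/medium/high."
)

prompt = PER_METRIC_TEMPLATE.substitute(
    metric="error_rate",
    service="checkout",
    window="1h",
    datapoints=[0.01, 0.02, 0.41],
)
```

Because the template set is closed and fixed, prompt changes are code/config changes, which keeps the system reviewable in a way an agent's dynamically assembled context is not.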

Token budgeting

STAR is a token-heavy system. Expedia sized it via back-of-the-envelope estimation against the GPT-4o tokenizer — fixed-length prompts (system prompt + chain-of-prompts) + variable-length prompts driven by prior responses. The load-bearing assumption: each response capped at 4k tokens. Without the cap the estimation is unbounded; the cap turns the whole pipeline's context usage into a finite sum.
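The finite-sum property can be shown with a few lines of arithmetic. Only the 4k response cap comes from the source; every other figure below is a placeholder assumption, not Expedia's numbers.

```python
# Back-of-the-envelope token budget for one workflow run.
RESPONSE_CAP = 4_000       # hard cap on every model response (from source)
SYSTEM_PROMPT = 1_200      # fixed-length system prompt (assumed)
PER_METRIC_PROMPT = 800    # fixed per-signal template (assumed)
N_METRICS = 10             # signals analyzed per run (assumed)

def max_tokens_per_run() -> int:
    # Per-metric steps: fixed prompt + capped response, once per signal.
    per_metric = N_METRICS * (SYSTEM_PROMPT + PER_METRIC_PROMPT + RESPONSE_CAP)
    # Aggregation step: its variable-length input is bounded by the N capped
    # responses it consumes -- this is what makes the total a finite sum.
    aggregate_in = SYSTEM_PROMPT + N_METRICS * RESPONSE_CAP
    return per_metric + aggregate_in + RESPONSE_CAP

budget = max_tokens_per_run()  # 105_200 with the placeholder figures above
```

Remove `RESPONSE_CAP` and `aggregate_in` has no bound, which is exactly the unbounded-estimation failure the text describes.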

Deliberate exclusions

| Feature                      | Status | Rationale                              |
|------------------------------|--------|----------------------------------------|
| Function calling / tool use  | No     | Agent failure modes "less understood"  |
| MCP servers                  | No     | STAR is not agentic (future roadmap)   |
| RAG                          | No     | No runtime knowledge base              |
| Short- / long-term memory    | No     | No conversational state                |
| Conversational UI            | No     | API-only; fixed workflows              |
| Streaming platform (Kafka)   | No     | Request-response traffic shape         |

The exclusions are the load-bearing architectural statement: STAR is "an early iteration" that aims to demonstrate value before taking on agent-level failure modes. This makes STAR a canonical wiki instance of patterns/static-prompt-chain-over-agent-loop.

Use cases at Expedia

  1. Incident investigation. Reduce time to know / time to recover. Applied to several services during outages.
  2. Post-incident RCA. Produce initial draft for post-incident review tickets; SMEs supplement.
  3. Troubleshooting runbooks. Reliability engineering's existing runbooks re-encoded as STAR workflows. First addition: Kubernetes container-restart troubleshooting (sample output gist linked in source).
  4. Performance optimization. Recent experimental use case — a JVM heap-spike incident; STAR's analysis was reviewed and acted on by service owners.
  5. Failure-injection recommendation + analysis. STAR as a complement to Expedia's chaos engineering platform, providing the automatic experiment-result evaluator the platform previously lacked.

Evaluation

  • Qualitative + SME-gated — no held-out accuracy / FP / FN numbers disclosed.
  • Langfuse used for prompt management, evaluation, and tracing.
  • "The results so far have been promising."

Roadmap (named in source)

Each item maps an existing deliberate exclusion onto a future step:

  • Use specialised models per telemetry modality + slower reasoning models for the final RCA (vs single-model today).
  • Add MCP tool use for dynamic data access.
  • Add more context: service documentation, service metadata, service dependency graph.
  • Expose a conversational interface.
  • Improve testing + evaluation.

The roadmap is architecturally honest: STAR will graduate towards an agent only after the evaluation envelope makes the trade-off worthwhile.

Caveats

  • No QPS / wall-clock / cost / accuracy numbers published.
  • Model names not disclosed (abstracted by the GenAI proxy).
  • Scope is RED/USE metrics + JVM + Kubernetes — not logs, not traces (yet), not custom business metrics.
