---
title: Expedia’s Service Telemetry Analyzer
source: Expedia Group Tech
source_slug: expedia
url: https://medium.com/expedia-group-tech/expedias-service-telemetry-analyzer-60f2f96c5351?source=rss----38998a53046f---4
published: 2026-04-28
fetched: 2026-04-28T14:01:34+00:00
ingested: true
---

# Expedia’s Service Telemetry Analyzer

## A system that facilitates investigation of service degradations and outages using service telemetry data and AI


Photo by Evangelos Mpikakis on Unsplash.

The recent advancements in the artificial intelligence space make us re-evaluate how work is done, from programming to designing systems and even operating them in production. While there is considerable focus on automating programming, one area that could undergo a transformation is how we monitor and operate our systems and services.

A few of us came together and designed Expedia’s® Service Telemetry Analyzer (STAR), an early iteration of a system that facilitates investigation of service degradations and outages using service telemetry data and AI models and techniques.

Expedia’s Service Telemetry Analyzer (STAR)

The early product offering includes:

  * Execution of multi-step workflows.
  * Integration of software and systems engineering knowledge, including application and infrastructure, cloud, containerization, and orchestration patterns, into diagnostic workflows for complex distributed systems.
  * Application of domain-specific prompt engineering for metric and root cause analysis.
  * Utilization of advanced off-the-shelf AI models.
  * Implementation of prompt engineering techniques, including [role prompting](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts), [prompt chaining](https://www.promptingguide.ai/techniques/prompt_chaining), and [generated knowledge prompting](https://www.promptingguide.ai/techniques/knowledge).



## Design

The product offering is a web-based service that provides an application programming interface (API). While AI agents and chatbots are gaining traction, we aimed to start with something that is a) simple, b) reasonably precise, considering the models’ potential to hallucinate, and c) free of the additional, currently less understood failure modes of an agent. As this field evolves, we will continue to iterate on the design.

Therefore, there is limited [context engineering](https://www.philschmid.de/context-engineering) beyond domain-specific prompts; for instance, there is no support for [function calling](https://platform.openai.com/docs/guides/function-calling) / [tool use](https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview), short-term and long-term memory, or [retrieval-augmented generation](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) (RAG). The system provides vertical domain-specific workflows. It adheres to a predefined multi-step process, with an emphasis on automating scenarios encountered by our engineers and enhancing the system’s precision. If you are interested in tool use with [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) servers for software development, you can read more in my [public blogpost](https://nikoskatirtzis.substack.com/p/experimenting-with-the-model-context).

### Web service

The architecture is relatively straightforward, comprising an API layer and a web server built with [FastAPI](https://fastapi.tiangolo.com/). This service manages requests to Expedia’s chosen metrics platform (Datadog) and to the internal generative AI proxy, including authentication/authorization.


Web architecture behind STAR

### AI models

The service invokes models via Expedia’s generative AI proxy. The proxy offers access to different models, which we constantly evaluate for quality of results, cost, and performance implications. We are also exploring using different models for the various tasks in STAR. Using large language models (LLMs) for every task was convenient for the prototype, but it would be more effective to use specialised models for the different modalities of telemetry data and slower reasoning models for the final root cause analysis (RCA).

### Prompt chaining

Part of the implementation involves prompt chaining, which facilitates a programmatic dialogue between the user and the assistant.


Prompt chaining; programmatic dialogue between the user and the assistant
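A minimal sketch of what such a chain might look like in Python, with `call_model` standing in for the call to the generative AI proxy; both the stub function and the prompt wording are hypothetical:

```python
def call_model(prompt: str) -> str:
    """Stub for the generative AI proxy call (hypothetical)."""
    return f"analysis of: {prompt[:40]}"


def chained_analysis(metric_summaries: list[str]) -> str:
    # Step 1: analyze each metric summary with its own prompt.
    per_metric = [call_model(f"Analyze this metric: {m}") for m in metric_summaries]
    # Step 2: feed the intermediate answers back into a final prompt,
    # so each step builds on the model's previous responses.
    combined = "\n".join(per_metric)
    return call_model(
        f"Given these analyses, identify the likely root cause:\n{combined}"
    )
```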

### Multi-step workflows

Overall, STAR provides multi-step workflows, which are visualized below. Specifically:

  1. It collects telemetry data.
  2. It analyzes these metrics and the associated metadata using AI models and domain-specific prompts and rules.
  3. It aggregates all analyses and conducts a final root cause analysis.
  4. It returns insights and recommendations.




Multi-step Reasoning Process implemented in STAR
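The four steps above can be sketched as a simple orchestration. The metric names, the trivial threshold rule standing in for the model-driven analysis, and the `Insight` structure are all illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class Insight:
    summary: str
    recommendations: list[str]


def collect_telemetry(service: str, window: tuple[str, str]) -> dict[str, list[float]]:
    # Step 1: fetch time series from the metrics platform (stubbed here).
    return {"cpu.usage": [0.4, 0.9, 0.95], "jvm.heap": [0.6, 0.6, 0.6]}


def analyze_metric(name: str, points: list[float]) -> str:
    # Step 2: per-metric analysis; a trivial threshold rule replaces the
    # AI model and domain-specific prompt used in the real system.
    return f"{name}: anomalous" if max(points) > 0.9 else f"{name}: normal"


def run_workflow(service: str, window: tuple[str, str]) -> Insight:
    metrics = collect_telemetry(service, window)
    analyses = [analyze_metric(n, p) for n, p in metrics.items()]
    # Step 3: aggregate all analyses into a final root cause analysis.
    anomalous = [a for a in analyses if a.endswith("anomalous")]
    summary = "; ".join(anomalous) or "no anomalies detected"
    # Step 4: return insights and recommendations.
    recs = ["review CPU limits"] if anomalous else []
    return Insight(summary=summary, recommendations=recs)
```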

### Ingested data

Our initial focus for observability metrics was on infrastructure components, with particular emphasis on Kubernetes and the JVM, for two reasons: our heterogeneous tech stack and the higher degree of standardization at the infrastructure layer.

The default analyzer now ingests metrics including inbound and outbound traffic and errors, latency across various protocols like HTTP, gRPC, and GraphQL, and saturation monitored through container-level CPU and memory usage.

Additionally, the system ingests Kubernetes metrics, such as container restarts and probe failures, as well as JVM metrics for heap usage and garbage collection. This set of signals is tailored to our environment, where most services are backend JVM applications running on a Kubernetes-based compute platform.
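To make this concrete, the mapping below sketches how such a signal set might be expressed as metric queries. The query strings are hypothetical Datadog-style examples, not the actual queries used by STAR:

```python
# Illustrative default signal set for a JVM service on Kubernetes.
# Query strings are hypothetical Datadog-style examples; "$svc" marks
# where the target service name would be substituted.
DEFAULT_METRICS: dict[str, str] = {
    "http.requests": "sum:trace.http.request.hits{service:$svc}.as_rate()",
    "http.errors": "sum:trace.http.request.errors{service:$svc}.as_rate()",
    "http.latency.p99": "p99:trace.http.request.duration{service:$svc}",
    "container.cpu": "avg:container.cpu.usage{service:$svc}",
    "container.memory": "avg:container.memory.usage{service:$svc}",
    "k8s.restarts": "sum:kubernetes.containers.restarts{service:$svc}",
    "jvm.heap": "avg:jvm.heap_memory{service:$svc}",
    "jvm.gc.pause": "avg:jvm.gc.parnew.time{service:$svc}",
}
```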

## Implementation details

While designing this system we faced a set of interesting problems, which may be useful to the reader.

### The nuances of token-heavy systems

When we first designed STAR, LLM tooling was limited. Since STAR is a token-heavy system, we followed a systematic approach to back-of-the-envelope estimation, grounded in facts, assumptions, and enforced limits, to understand its feasibility and implications.

We estimated the number of tokens using [OpenAI’s GPT-4o tokenizer](https://platform.openai.com/tokenizer), taking into account every payload sent as context to the models. This included fixed-length system prompts and chain-of-prompt templates, as well as prompts whose length depends on previous responses. To control the number of tokens, we capped each response at 4k tokens and used that cap for estimation purposes.
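A back-of-the-envelope estimate along these lines might look like the sketch below. The ~4-characters-per-token heuristic is a rough stand-in for the GPT-4o tokenizer the team actually used; only the 4k response cap comes from the article:

```python
MAX_RESPONSE_TOKENS = 4_000  # cap enforced per model response


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # (A stand-in for running the text through a real tokenizer.)
    return max(1, len(text) // 4)


def estimate_workflow_tokens(system_prompt: str, chain_prompts: list[str]) -> int:
    """Worst-case token budget for one multi-step workflow run."""
    total = estimate_tokens(system_prompt)
    for prompt in chain_prompts:
        # Each step sends its own prompt plus, in the worst case, a
        # capped response from the previous step as added context.
        total += estimate_tokens(prompt) + MAX_RESPONSE_TOKENS
    return total
```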

Based on this analysis and the relatively static nature of the system, we concluded that we can fit within the [context window](https://www.ibm.com/think/topics/context-window) of the models. Note that the context window differs between models and has been increasing over time.

### Datadog and generative AI proxy limits

Both Datadog and Expedia’s Generative AI proxy have rate limiting in place. Even though the scale is still small and the number of metrics per workflow is fixed, we accommodate these limitations using common resiliency patterns, while also leveraging asynchronous operations and batch processing.
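One common combination of these resiliency patterns is retrying with exponential backoff and jitter while capping concurrency with a semaphore. The sketch below is a generic illustration of that pattern, not STAR’s actual implementation; `RateLimitError` is a hypothetical stand-in for an upstream 429 response:

```python
import asyncio
import random


class RateLimitError(Exception):
    """Raised when Datadog or the AI proxy rejects a call (illustrative)."""


SEMAPHORE = asyncio.Semaphore(5)  # cap concurrent upstream calls


async def with_backoff(call, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a rate-limited async call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            async with SEMAPHORE:
                return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Double the delay each attempt, plus a small random jitter
            # so retries from concurrent tasks do not synchronize.
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```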

### Architectural evolution

This service is mostly I/O bound, but we still have synchronous operations. Each analysis is independent, yet we need to provide a response on the status of the analysis to the user. For this, we initially used certain features from FastAPI such as [async/await](https://fastapi.tiangolo.com/async/) and [background tasks](https://fastapi.tiangolo.com/tutorial/background-tasks/). As part of scaling up, we moved to [Celery](https://docs.celeryq.dev/en/latest/getting-started/introduction.html) with Redis acting as the broker and result backend to store the state and results of tasks. This architecture aligns with STAR’s request-response flow, and we don’t need a streaming platform like Kafka, at least for now.

## Use cases

Numerous use cases could emerge for such a system. Below is a summary of how we have utilized STAR so far.

### Incident investigation

This is the primary use case and the rationale behind STAR’s design. Our objective with this service was to minimize the time to know (TTK) and the time to recover (TTR). By enabling rapid analysis of observability data and evaluation of hypotheses, the service proved to be a valuable time-saving tool. We applied STAR to several services that experienced outages.

### Post-incident root cause analysis

Following an incident, teams file a ticket for post-incident review. By running STAR for the affected service(s) and the time-window of the incident, we can provide an initial analysis. This can then be reviewed and supplemented by human expertise.

### Troubleshooting

Engineers spend a significant amount of time troubleshooting systems. Over time, Expedia’s reliability engineering group has documented troubleshooting steps in the company’s internal reliability hub. A logical step was to implement guides relying on metric data as workflows in STAR.

Our first addition was the process our engineers normally follow for troubleshooting container restarts in our Kubernetes-based compute platform. An indicative analysis result is available at <https://gist.github.com/nikos912000/1e489021b406f682d70c14f3ebbad917>.

### Performance optimization

This is a recent use case that we are still evaluating. An Expedia service faced an issue where the JVM memory heap usage would suddenly spike. Such occurrences can be problematic; while container restarts can temporarily mitigate them, they expose long-standing issues that may lead to incidents.

Running STAR for this service provided a valuable analysis which was then reviewed and taken forward by the owners of the service.

### Failure injection recommendation and analysis

Another idea involves recommending failures to inject and analyzing the impact of injected failures utilizing Expedia’s [chaos engineering platform](/expedia-group-tech/chaos-engineering-at-expedia-group-e51a0288ee2). When we developed this platform, we lacked a mechanism for the automatic evaluation of experimental results. STAR could serve as a complementary tool to this platform.

## Evaluation

We are still in the early stages of evaluating the system. Given the complexity of this domain, we mostly rely on qualitative human assessment which includes subject matter experts (SMEs) and users. We also use [Langfuse](https://langfuse.com/docs) for prompt management, evaluation, and tracing. The results so far have been promising.

## Next steps

As we iterate on this early prototype, our emphasis is on identifying high-leverage use cases, improving testing and evaluation, and adapting to this rapidly evolving field. As mentioned earlier, this is still a static system rather than a sophisticated multi-agent architecture, lacking core elements of context engineering. It may benefit from tool use through MCP servers and from additional context such as service documentation, metadata, or the dependency graph of the targeted service. In the future, we could also expose a conversational interface.
