---
title: Long Horizon: How Atlassian Built a Reasoning Engine for Complex AI Tasks
source: Atlassian Engineering
source_slug: atlassian
url: https://www.atlassian.com/blog/how-we-build/rovo-long-horizon-reasoning-engine
published: 2026-06-18
fetched: 2026-06-18T14:01:30+00:00
ingested: true
---

Last December, when the Rovo [hybrid orchestrator](https://www.atlassian.com/blog/atlassian-engineering/how-rovo-embraces-multi-agent-orchestration) was introduced, a hierarchical multi-agent system was introduced, where specialized subagents last introduced, one for Jira, one for Confluence, one for Slack, and so on — each handled their own domain. The hybrid orchestrator would decompose a user’s query, route subtasks to the right agent, and combine the results. This architecture was a strong fit for both sides of the equation: customers asking focused questions, and LLMs of the era reliably handling them in one or two tool calls. As frontier models matured and users began asking Rovo to plan, branch, and execute across many steps, the same patterns that made it fast and predictable became the bottleneck: rigid routing, shallow context, and little room for the model to reason about how to approach a problem.

## Why we needed a new architecture

As Rovo Chat matured, several architectural constraints of the Hybrid Orchestrator surfaced — each compounding as tasks grew in complexity.

### Building a Reasoning-Native Orchestrator

Each subagent ran in its own LLM context. The orchestrator routed work to them and received summaries back, but never saw raw tool outputs, intermediate reasoning, or error details. It made downstream decisions on a lossy, secondhand view of its own work, leading to missed connections across products, redundant searches, and no ability to recover gracefully when something failed mid-task.

### Limited iteration depth

Even when routing worked, the system was optimized for quick, two- to five-step tasks: search, read, answer. Its iteration budget was low, and its timeout was conservative. Tasks that required sustained reasoning across multiple data sources — _“analyze our sprint velocity across the last three quarters”_ or _“compare what different teams documented about this incident”_ — would frequently hit the ceiling before producing a thorough answer.

### Models outgrew the box

The multi-agent design wasn’t an accident — it was a workaround. Earlier-generation LLMs struggled with large, shared contexts and long tool catalogs; carving the problem into specialized subagents kept context manageable and tool selection accurate. Frontier models changed the equation on two fronts: they handle longer contexts and larger tool surfaces reliably, and a new class of reasoning models can plan, reflect, and iterate over many steps in a single coherent context. The old architecture couldn’t take advantage of either — it fragmented the very context these models needed to think in.

## The Long Horizon architecture

Long Horizon addresses these challenges with a fundamentally different design: one LLM, one context, one iterative loop.  
  
**Why “Long Horizon”?** In AI, a “horizon” defines how far into the future an agent can plan, reason, and execute tasks. While the previous Hybrid Orchestrator operated on a short, short-sighted horizon—built for quick, single-turn answers—this new architecture is designed for a **long horizon**. It gives the model the temporal and contextual space to plan complex, multi-step strategies, adapt to intermediate failures, and execute workflows spanning up to hundreds of iterations without losing its train of thought.

![](https://atlassianblog.wpengine.com/wp-content/uploads/2026/06/screenshot-2026-06-18-at-10.02.40-am.png)

### The core loop

The heart of Long Horizon is a simple, powerful loop:

  1. **Build system prompt** : Assemble the user’s context: organization, timezone, browsing history, conversation state, and any applicable skill templates.
  2. **Select and prepare tools** : Filter and flatten sub-agent tools into individual top-level actions available to the LLM.
  3. **Loop** (up to 150 iterations): 
     * Call the LLM with the full conversation state and available tools.
     * If the LLM returns tool call(s) → execute them (in parallel if independent), add results to context, continue.
     * If the LLM returns a final answer → deliver it to the user.
  4. **Respond:** Deliver results with citations and source links.


The model adapts its reasoning depth based on task complexity. Simple lookups get fast answers; multi-step research gets thorough reasoning.

#### 

### Adaptive reasoning

Long Horizon uses adaptive reasoning effort to match the depth of thinking to the complexity of each query. For straightforward lookups — “What’s the status of PROJ-123?” — the model applies minimal reasoning overhead and responds quickly. For multi-step research tasks — “Compare our sprint velocity across the last three quarters and identify trends” — it engages deeper reasoning to plan its approach, evaluate intermediate results, and synthesize a thorough answer.

This adaptive approach avoids the one-size-fits-all trade-off that most AI systems face: fast-but-shallow or thorough-but-slow. The same architecture handles both extremes by letting the model calibrate its own reasoning investment per turn.

### Flattened tool architecture

The single biggest architectural change in Long Horizon is how tools are exposed to the model. The previous Hybrid Orchestrator wrapped each product in its own subagent — JiraAgent, ConfluenceAgent, BitbucketAgent, and so on. The orchestrator’s job was to decompose a request and hand subtasks off to the right subagent; each subagent then picked its own product-specific actions. Because each hand-off was a separate LLM with its own prompt and context, information leaked at every hop: the subagent only saw a summarized version of the orchestrator’s task, the orchestrator only saw a summarized version of the subagent’s findings, and intermediate reasoning, tool errors, and raw tool outputs were compressed or dropped along the way. The orchestrator often had to re-ask, guess at what failed, or work from a lossy synopsis of what the subagent had actually seen — making it harder to recover from errors and harder to chain results across products. The architecture was also expensive to maintain: every model upgrade became an N-way migration, with each subagent needing its own re-tuning and re-evaluation pass for the new model.

Long Horizon replaces this with two ideas working together: flattening — collapsing every product’s capabilities into a single, uniformly named tool surface that the orchestrator’s LLM calls directly, so nothing is paraphrased through a second agent — and progressive disclosure — exposing that surface to the model on demand, so we don’t pay up-front for every tool’s schema as we connect more products or add atomic tools within a product.

### How it works

  1. **Flattening: one unified tool surface.** Every operation across our first-party products (Jira, Confluence, Bitbucket, Jira Service Management, Compass) and our third-party connectors (Google Calendar, Google Drive, Slack, GitHub, Microsoft Teams, and more) is exposed to the orchestrator as a typed, namespaced action — jira__search_issues, google_calendar__list_events, and so on — that the orchestrator’s LLM calls directly. The orchestrator now sees each tool’s raw arguments, raw response, and raw error — not a paraphrase from a subagent. When a call fails or returns something unexpected, the same LLM that decided to make the call also reads the failure and decides what to do next: retry with different arguments, fall back to a related tool, or surface the gap to the user. Recovery becomes part of the reasoning loop instead of information being summarized and potentially lost between separate stages.
  2. **Progressive disclosure: pay for what you use.** Sending the schemas for hundreds of flattened tools on every iteration would be expensive on every call, and would degrade the model’s tool-selection accuracy as the catalog grows.. Instead, each product namespace is collapsed to **two meta-tools** :


  * {product}__get_tool_schema — returns the full input schema for one specific tool on demand. Its description carries a compact one-line summary of every tool in that namespace, so the model can scan what’s available without paying the full schema cost.
  * {product}__invoke_tool — executes a tool in that namespace by name with its arguments.


Alongside each namespace we ship a **SKILL.md** — a short, hand-authored guide that captures the product-specific _business logic_ the model benefits from: which tool to reach for in which situation, how the product’s concepts map to user intent, the common multi-step recipes, and the gotchas. The skill encodes the per-product expertise that used to live implicitly inside a subagent prompt; the meta-tools provide the uniform invocation surface.

The most-used tools — search, todo-list management, file read/write, entity linking, memory retrieval, etc — stay flat at the top level because they’re called on nearly every turn and are worth keeping resident in the prompt.

Progressive disclosure does introduce a discovery step: for a tool the model hasn’t used recently, it first calls get_tool_schema and then invoke_tool. We accept that trade explicitly — the cost is real but bounded (schemas are fetched once per tool per task), and SKILL.md usually lets the model go straight to invoke_tool without a separate schema fetch.

### Why it helps

  * End-to-end visibility, end-to-end recovery. The same LLM that picks a tool also reads its raw response and its raw error, in the same context where it planned the call. Selection happens with the full task in view — not a subagent’s paraphrase of it — so the model picks the right tool more often the first time, and when something fails or returns the unexpected, it can retry, fall back, or pivot without first reconstructing what a subagent meant. The lossy hand-offs that used to mask failures are gone.
  * No per-product schema tax. The old architecture re-paid each product’s full schema bundle every time its subagent was invoked, and that bundle grew every time a product added a tool. Flattening removes that hidden tax entirely, and progressive disclosure means even within a product, we only pay for the specific tools the model reaches for in a given task.
  * One model migration, not N. A new frontier model is evaluated and rolled out once at the orchestrator; every product’s tools come along for the ride. What used to be a separate re-tuning and re-evaluation pass per subagent is now a single workstream.


### Context Window Management

A reasoning loop running for up to 150 iterations creates real pressure on the context window — older tool outputs accumulate, the token budget tightens, and the model’s attention has more to sift through. Long Horizon manages this along two dimensions: **compaction** within a single context, and **decomposition** across multiple contexts when a single one isn’t the right unit of work.

#### Context Compaction

Long Horizon runs a dedicated Context Compaction Service before each model call. When the conversation approaches the token limit, older tool outputs are trimmed or summarized while recent results are kept at full resolution. Pruned outputs aren’t discarded — they’re offloaded so the model can read them back on demand if it later needs the detail. This keeps long, multi-step runs within the context window without losing the reasoning the model has already done.

#### Task decomposition via child instances

Some tasks aren’t deep — they’re _wide_. A query like _“investigate last week’s checkout error spike across incidents, payment-service bugs, recent design docs, shipped PRs, and customer feedback”_ decomposes into five independent strands of research. Loading the full evidence for all five into a single context would overwhelm the window long before the model could synthesize an answer.

Instead, Long Horizon spawns a child instance of itself for each strand — each running the same one-LLM-one-context loop on its assigned subtask, with its own clean context and the most relevant skill (if available). Strands run concurrently inside a single user turn; the slowest strand, not the sum of them, sets the response time. The parent orchestrator receives a finished, self-contained result from each child and synthesizes them — never having to carry every intermediate finding from every strand in its own context.

![](https://atlassianblog.wpengine.com/wp-content/uploads/2026/06/screenshot-2026-06-18-at-10.05.24-am-scaled.png)

This is a fundamentally different role than the subagents in the previous architecture. Old subagents were product specialists _in the routing path_ — every tool call passed through one, and the orchestrator never saw what they saw. Child instances here are full Long Horizon reasoning loops, spawned on demand to own a complete piece of research and return it finished. Parallelism is a side effect; the primary motivation is keeping each context focused on what it needs to think about.

### Prompt assembly and caching

A long-horizon run sends the model the same system prompt, the same skill instructions, and a growing conversation history on every iteration. Across 150 iterations, that’s hundreds of thousands of tokens the provider would otherwise re-tokenize and re-process on each call. To avoid paying for them more than once, Long Horizon assembles every prompt from layers ordered from **most stable to most volatile** :

  1. **Static system prompt** — identical across every run
  2. **Stable session context** — organization, user, timezone, skill instructions (stable for the duration of a session)
  3. **Conversation history** — grows over time, but earlier turns are immutable once recorded
  4. **Turn-dependent context** — the current iteration’s tool results and reasoning state


Because each layer changes less frequently than the one after it, the longest possible prefix stays byte-identical from one model call to the next. OpenAI and Gemini reuse their implicit prefix cache automatically; for Anthropic, the assembler places explicit cache_control markers at the system, stable-context, and last-history boundaries to opt those prefixes into the cache. The result is that on most iterations, only the freshest tokens — typically a tool result and the model’s next reasoning step — actually need to be processed from scratch.

This isn’t a context-management technique — the model sees the same context with or without caching. It’s a cost and latency win that compounds with the number of iterations: the longer a task runs, the more the cache pays off.

## The skills system

Long Horizon includes a skills framework — pre-authored, domain-specific prompt templates for common research patterns. When a user’s query matches a skill (e.g., sprint planning, bug triage, performance review, competitive analysis), the corresponding skill injects a proven research strategy into the system prompt.

Each skill provides:

  * Step-by-step workflows that encode best practices
  * Domain-specific knowledge about which tools to call and in what order
  * Proven patterns for structuring the final output


Skills are individually feature-flagged, configurable per tenant, and loaded at runtime. They represent the accumulated knowledge of what works well for specific task types — condensed into reusable templates that guide the reasoning loop toward higher-quality outcomes.

## Observability for Long-Running Agent Tasks

A reasoning loop that spans dozens of iterations and tool calls produces trace data that traditional request-scoped logging was never designed for. We instrument Long Horizon with structured LLM tracing — capturing every orchestrator decision, tool invocation, latency breakdown, and token cost as a hierarchical trace tree. Engineers can drill from a top-level orchestrator span down through each reasoning iteration to pinpoint root causes: wrong tool selection, silent failures, retries that burned budget, or context window pressure building across iterations. This trajectory-level observability is what makes it possible to debug a 40-step research task the same way you’d debug a distributed microservice call — except the “services” are LLM reasoning steps and the “RPCs” are tool calls.

## Results

### Production comparison

Compared with the production Hybrid Orchestrator, Long Horizon showed statistically significant gains in user-facing quality while keeping the latency trade-off visible and measurable.

#### Task success — did the agent answer the question correctly?

**Metric**| **What is measures**| **Long horizon vs Prod**  
---|---|---  
Offline quality metrics| Accuracy on a curated set of hard, multi-tool queries scored by an LLM judge against reference answers.| **+8.5%**  
Chat Success Rate| Live A/B user satisfaction signal combining thumbs feedback, reformulation rates, and session outcomes.| **+0.83%** in the higher-reasoning configuration  
Task completion| Binary pass/fail — did the agent fully complete Confluence tasks (find page, retrieve content, create/edit)?| **+23%** relative  
  
  * End-to-end answer accuracy on a curated eval dataset of queries requiring 2+ tool calls across products (Jira, Confluence, Slack, etc.). Each query runs through the full agent loop, and an LLM judge scores the final response against reference criteria.


Offline evals show Long Horizon at 77% accuracy versus 71% for the production Hybrid Orchestrator plus model updates. Online results also showed stronger CSRv3, with the larger gains concentrated in more complex, tool-heavy queries. This points to the core benefit of the architecture: keeping reasoning, tool calls, raw results, and recovery decisions in one loop helps most when tasks require multiple steps.

###   
Latency Profile

Long Horizon introduces a different latency trade-off compared to our previous production architecture. For straightforward, no-tool queries, the baseline Time to First Byte (TTFB) is slightly higher under our latency-optimized configurations than it was with the original Hybrid Orchestrator. However, this represents a massive optimization over early engineering iterations of the engine, where initial latency overhead was a much more significant bottleneck.

We balanced this baseline latency shift across three key dimensions that focus heavily on the actual user experience:

  * **Perceived latency dropped by 37%** Instead of staring at a static loading indicator, users now receive real-time streaming progress updates as the system works through its reasoning steps. Making the model’s internal thought process transparent makes the wait feel significantly shorter and more engaging.
  * **Substantially Higher Answer Quality:** For complex queries, the previous system often delivered fast but ultimately inadequate answers. Long Horizon takes the necessary time to plan, cross-reference data sources, and deliver a comprehensive response that actually resolves the user’s true intent on the first try.
  * **Streamlined Execution Loop:** The flattened architecture completely eliminates redundant orchestration overhead. Because the model calls tools directly and processes raw responses or errors natively, we no longer waste valuable execution time on multi-layered abstractions or agent-to-agent paraphrasing.


The flattened architecture also reduces duplicated orchestration work: the model calls tools directly, sees raw responses and errors, and can recover without a second agent paraphrasing the result.

We believe the combination of visibly higher quality and transparent reasoning steps makes this an acceptable trade-off, and we are actively working to bring simple-query latency down further.

## How it compares

Dimension| Hybrid Orchestrator| Long Horizon  
---|---|---  
**Architecture**|  Coordinator + Specialists: LLM picks an agent → agent picks an action| One LLM with all tools: LLM directly calls flattened tool actions  
**LLM calls per tool**|  2 (orchestrator + sub-agent)| 1 (direct)  
**Iteration budget**|  Low single-digit| 100+  
**Tool disclosure**|  All tools every iteration| All tools flattened and available from the first iteration  
**Quality gates**|  None| Adaptive reasoning with complexity-aware depth  
**Context management**|  No active management| Explicit eviction at 95% token limit  
**Skills system**|  None| 14+ predefined skills with per-tenant overrides  
**Timeout**|  10 minutes| 20 minutes  
  
## What’s next

Long Horizon is now the foundation, not the finish line. Our roadmap focuses on three directions:

  * **Long-running tasks** — Moving beyond synchronous chat into durable, trackable tasks that Rovo executes on behalf of the user. Tasks will have structured plans, progress tracking, checkpoints for human review, and the ability to survive disconnections and resume later. This shifts Rovo from “better chatbot” to an agentic work platform where users assign work, not just ask questions.
  * **Faster simple queries** — We’re actively working to reduce TTFB for straightforward queries through smarter routing, faster models, and complexity-aware orchestration that can bypass the full reasoning loop when a simple answer will do.
  * **Deeper cross-product workflows** — With a unified context and an expressive tool surface, Long Horizon can orchestrate work across Jira, Confluence, Slack, Bitbucket, and connected third-party tools. We’re expanding its ability to not just find information across these products, but to take coordinated actions — creating issues, publishing pages, posting updates — as part of a single coherent workflow.


If you’re interested in how we approached a similar architectural challenge for deep research tasks, check out <https://www.atlassian.com/blog/atlassian-engineering/how-rovo-deep-research-works> and <https://www.atlassian.com/blog/artificial-intelligence/rovo-deep-research-v2>.